Re: [ceph-users] ceph.conf file update

2016-01-28 Thread M Ranga Swami Reddy
HI Matt,
Thank you..
But what I understood is that a ceph osd service restart will pick up the
ceph.conf from the monitor node.
Is this understanding correct?

Thanks
Swami

On Fri, Jan 29, 2016 at 11:46 AM, Matt Taylor  wrote:
> Hi Swami,
>
> You will need to deploy ceph.conf to all respective nodes in order for
> changes to be persistent.
>
> I would probably say this is better to from the location of where you did
> your initial "ceph-deploy" from.
>
> eg: ceph-deploy --username ceph --overwrite-conf admin
> cephnode{1..15}.somefancyhostname.com
>
> Thanks,
> Matt.
>
> On 29/01/2016 17:08, M Ranga Swami Reddy wrote:
>>
>> Hi All,
>> I have injected a few conf file changes using the "injectargs" (there
>> are mon_ and osd_.). Now I want to preserve them after the ceph mons
>> and osds reboot also.
>>   I have updated the ceph.conf file to preserve the changes after
>> restart of ceph mon services. do I need to update the ceph.conf file @
>> osd node also in-order to preserver the changes?  Or OSD nodes will
>> fetch the ceph.conf file from monitor nodes?
>>
>>
>> Thanks
>> Swami
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph.conf file update

2016-01-28 Thread Matt Taylor

Hi Swami,

You will need to deploy ceph.conf to all respective nodes in order for 
changes to be persistent.


I would probably say it is better to do this from the location where you
did your initial "ceph-deploy".


eg: ceph-deploy --username ceph --overwrite-conf admin 
cephnode{1..15}.somefancyhostname.com
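
For completeness, a minimal sketch of the whole workflow (the option name and
host names below are only placeholders; injectargs changes are runtime-only and
are lost on restart, while the daemons read their local ceph.conf when they
start, so the pushed file is what survives a reboot):
--
ceph tell osd.* injectargs '--osd_max_backfills 1'    # apply at runtime, cluster-wide
ceph-deploy --overwrite-conf config push cephnode{1..15}.somefancyhostname.com
ceph daemon osd.0 config get osd_max_backfills        # run on an OSD host to verify after restart
--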


Thanks,
Matt.

On 29/01/2016 17:08, M Ranga Swami Reddy wrote:

Hi All,
I have injected a few conf file changes using "injectargs" (there are mon_*
and osd_* options). Now I want to preserve them after the ceph mons and osds
reboot as well.
I have updated the ceph.conf file to preserve the changes after a restart of
the ceph mon services. Do I need to update the ceph.conf file on the OSD
nodes as well in order to preserve the changes, or will the OSD nodes fetch
the ceph.conf file from the monitor nodes?


Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph.conf file update

2016-01-28 Thread M Ranga Swami Reddy
Hi All,
I have injected a few conf file changes using "injectargs" (there are mon_*
and osd_* options). Now I want to preserve them after the ceph mons and osds
reboot as well.
I have updated the ceph.conf file to preserve the changes after a restart of
the ceph mon services. Do I need to update the ceph.conf file on the OSD
nodes as well in order to preserve the changes, or will the OSD nodes fetch
the ceph.conf file from the monitor nodes?


Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Striping feature gone after flatten with cloned images

2016-01-28 Thread Bill WONG
Hi Jason,

it works fine, but the flatten takes much longer than before..
i would like to know how to decide the --stripe-unit & --stripe-count to
gain the best performance.
i see you put a unit of 4K and a count of 16.. why?
--
 --stripe-unit 4K --stripe-count 16
--

thank you!

On Fri, Jan 29, 2016 at 11:15 AM, Jason Dillaman 
wrote:

> When you set "--stripe-count" to 1 and set the "--stripe-unit" to the
> object size, you have actually explicitly told the rbd CLI to not use
> "fancy" striping.  A better example would be something like:
>
> rbd clone --stripe-unit 4K --stripe-count 16 storage1/cloudlet-1@snap1
> storage1/cloudlet-1-clone
>
> --
>
> Jason Dillaman
>
>
> - Original Message -
>
> > From: "Bill WONG" 
> > To: "Jason Dillaman" 
> > Cc: "ceph-users" 
> > Sent: Thursday, January 28, 2016 10:08:38 PM
> > Subject: Re: [ceph-users] Striping feature gone after flatten with cloned
> > images
>
> > hi jason,
>
> > how i can make the stripping parameters st the time of clone creation?
> as i
> > have tested, which looks doesn't work properly..
> > the clone image still without stripping.. any idea?
> > --
> > rbd clone --stripe-unit 4096K --stripe-count 1 storage1/cloudlet-1@snap1
> > storage1/cloudlet-1-clone
> > rbd flatten storage1/cloudlet-1-clone
> > rbd info storage1/cloudlet-1-clone
> > rbd image 'cloudlet-1-clone':
> > size 1000 GB in 256000 objects
> > order 22 (4096 kB objects)
> > block_name_prefix: rbd_data.5ecd364dfe1f8
> > format: 2
> > features: layering
> > flags:
> > ---
>
> > On Fri, Jan 29, 2016 at 3:54 AM, Jason Dillaman < dilla...@redhat.com >
> > wrote:
>
> > > You must specify the clone's striping parameters at the time of its
> > > creation
> > > -- it is not inherited from the parent image.
> >
>
> > > --
> >
>
> > > Jason Dillaman
> >
>
> > > - Original Message -
> >
>
> > > > From: "Bill WONG" < wongahsh...@gmail.com >
> >
> > > > To: "ceph-users" < ceph-users@lists.ceph.com >
> >
> > > > Sent: Thursday, January 28, 2016 1:25:12 PM
> >
> > > > Subject: [ceph-users] Striping feature gone after flatten with cloned
> > > > images
> >
>
> > > > Hi,
> >
>
> > > > i have tested with the flatten:
> >
> > > > 1) make a snapshot of image
> >
> > > > 2) protect the snapshot
> >
> > > > 3) clone the snapshot
> >
> > > > 4) flatten the clone
> >
> > > > then i found issue:
> >
> > > > with the original image / snapshot or the clone before flatten, all
> are
> > > > with
> >
> > > > stripping feature, BUT after flattened the clone, then there is no
> more
> >
> > > > Stripping with the clone image...what is the issue? and how can
> enable
> > > > the
> >
> > > > striping feature?
> >
>
> > > > thank you!
> >
>
> > > > ___
> >
> > > > ceph-users mailing list
> >
> > > > ceph-users@lists.ceph.com
> >
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-28 Thread Somnath Roy
> wrote:

Hi,
Ceph needs to maintain a journal in case of filestore as underlying filesystem 
like XFS *doesn’t have* any transactional semantics. Ceph has to do a 
transactional write with data and metadata in the write path. It does in the 
following way.

"Ceph has to do a transactional write with data and metadata in the write path"
Why? Isn't that only to provide that to itself?

[Somnath] Yes, that is for Ceph..That’s 2 setattrs (for rbd) + PGLog/Info..

1. It creates a transaction object having multiple metadata operations and the 
actual payload write.

2. It is passed to Objectstore layer.

3. Objectstore can complete the transaction in sync or async (Filestore) way.

Depending on whether the write was flushed or not? How is that decided?
[Somnath] It depends on how the ObjectStore backend is written..Not
dynamic..Filestore is implemented in an async way, I think BlueStore is written
in a sync way (?)..


4.  Filestore dumps the entire Transaction object to the journal. It is a 
circular buffer and written to the disk sequentially with O_DIRECT | O_DSYNC 
way.

Just FYI, O_DIRECT doesn't really guarantee "no buffering", its purpose is 
just to avoid needless caching.
It should behave the way you want on Linux, but you must not rely on it since 
this guarantee is not portable.

[Somnath] O_DIRECT alone is not guaranteed, but with O_DSYNC it is guaranteed
to reach the disk..It may still be in the disk cache, but that is taken
care of by the disk..
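
For anyone who wants to check this on their own journal devices, a quick sketch
(the device name is only a placeholder; disabling the volatile cache is only
sensible if the drive has no power-loss protection):
--
hdparm -W /dev/sdX     # report whether the volatile write cache is enabled
hdparm -W0 /dev/sdX    # disable it
--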

5. Once the journal write is successful, the write is acknowledged to the
client. Reads of this data are not allowed yet as it has still not been written
to the actual location in the filesystem.

Now you are providing a guarantee for something nobody really needs. There is 
no guarantee with traditional filesystems of not returning dirty unwritten 
data. The guarantees are on writes, not reads. It might be easier to do it this 
way if you plan for some sort of concurrent access to the same data from 
multiple readers (that don't share the cache) - but is that really the case 
here if it's still the same OSD that serves the data?
Do the journals absorb only the unbuffered IO or all IO?

And what happens currently if I need to read the written data right away? When 
do I get it then?

[Somnath] Well, this is debatable, but currently reads are blocked till entire 
Tx execution is completed (not after doing syncfs)..Journal absorbs all the IO..

6. The actual execution of the transaction is done in parallel for the 
filesystem that can do check pointing like BTRFS. For the filesystem like 
XFS/ext4 the journal is write ahead i.e Tx object will be written to journal 
first and then the Tx execution will happen.

7. Tx execution is done in parallel by the filestore worker threads. The 
payload write is a buffered write and a sync thread within filestore is 
periodically calling ‘syncfs’ to persist data/metadata to the actual location.

8. Before each ‘syncfs’ call it determines the seq number up to which data is
persisted and trims the transaction objects from the journal up to that point.
This will make room for more writes in the journal. If the journal is full,
writes will be stuck.

9. If the OSD crashes after the write is acknowledged, the Tx will be replayed from 
the last successful backend commit seq number (maintained in a file after 
‘syncfs’).


You can just completely rip at least 6-9 out and mirror what the client sends 
to the filesystem with the same effect (and without journal). Who cares how the 
filesystem implements it then, everybody can choose the filesystem that matches 
the workload (e.g. the one they use already on a physical volume they are 
migrating from).
It's a sensible solution to a non existing problem...

[Somnath] Maybe, but different clients have different requirements; I guess you
can’t design the OSD based on what the client will do..One has to make every
effort to keep the OSD crash consistent IMO..
Probably, it would be better if filestore gave the user a choice whether to use
the journal or not based on the client’s need….If the client can live without
being consistent, so be it..


So, as you can see, it’s not a flaw but a necessity to have a journal for 
filestore in case of rbd workload as it can do partial overwrites. It is not 
needed for full writes like for objects and that’s the reason Sage came up with 
new store which will not be doing double writes for Object workload.
The keyvaluestore backend also doesn’t have any journal as it is relying on 
backend like leveldb/rocksdb for that.

Regarding Jan’s point for block vs file journal, IMO the only advantage of 
journal being a block device is filestore can do aio writes on that.

You also

Re: [ceph-users] Striping feature gone after flatten with cloned images

2016-01-28 Thread Jason Dillaman
When you set "--stripe-count" to 1 and set the "--stripe-unit" to the object 
size, you have actually explicitly told the rbd CLI to not use "fancy" 
striping.  A better example would be something like:

rbd clone --stripe-unit 4K --stripe-count 16 storage1/cloudlet-1@snap1 
storage1/cloudlet-1-clone
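
To verify that the clone actually picked up fancy striping, 'rbd info' on the
new image should list the striping feature and report the stripe unit/count,
e.g.:
--
rbd info storage1/cloudlet-1-clone
--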

--

Jason Dillaman 


- Original Message - 

> From: "Bill WONG" 
> To: "Jason Dillaman" 
> Cc: "ceph-users" 
> Sent: Thursday, January 28, 2016 10:08:38 PM
> Subject: Re: [ceph-users] Striping feature gone after flatten with cloned
> images

> hi jason,

> how i can make the stripping parameters st the time of clone creation? as i
> have tested, which looks doesn't work properly..
> the clone image still without stripping.. any idea?
> --
> rbd clone --stripe-unit 4096K --stripe-count 1 storage1/cloudlet-1@snap1
> storage1/cloudlet-1-clone
> rbd flatten storage1/cloudlet-1-clone
> rbd info storage1/cloudlet-1-clone
> rbd image 'cloudlet-1-clone':
> size 1000 GB in 256000 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.5ecd364dfe1f8
> format: 2
> features: layering
> flags:
> ---

> On Fri, Jan 29, 2016 at 3:54 AM, Jason Dillaman < dilla...@redhat.com >
> wrote:

> > You must specify the clone's striping parameters at the time of its
> > creation
> > -- it is not inherited from the parent image.
> 

> > --
> 

> > Jason Dillaman
> 

> > - Original Message -
> 

> > > From: "Bill WONG" < wongahsh...@gmail.com >
> 
> > > To: "ceph-users" < ceph-users@lists.ceph.com >
> 
> > > Sent: Thursday, January 28, 2016 1:25:12 PM
> 
> > > Subject: [ceph-users] Striping feature gone after flatten with cloned
> > > images
> 

> > > Hi,
> 

> > > i have tested with the flatten:
> 
> > > 1) make a snapshot of image
> 
> > > 2) protect the snapshot
> 
> > > 3) clone the snapshot
> 
> > > 4) flatten the clone
> 
> > > then i found issue:
> 
> > > with the original image / snapshot or the clone before flatten, all are
> > > with
> 
> > > stripping feature, BUT after flattened the clone, then there is no more
> 
> > > Stripping with the clone image...what is the issue? and how can enable
> > > the
> 
> > > striping feature?
> 

> > > thank you!
> 

> > > ___
> 
> > > ceph-users mailing list
> 
> > > ceph-users@lists.ceph.com
> 
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Striping feature gone after flatten with cloned images

2016-01-28 Thread Bill WONG
hi jason,

how can i set the striping parameters at the time of clone creation? as i
have tested, it doesn't look to work properly..
the clone image is still without striping.. any idea?
--
rbd clone --stripe-unit 4096K --stripe-count 1 storage1/cloudlet-1@snap1
storage1/cloudlet-1-clone
rbd flatten storage1/cloudlet-1-clone
 rbd info storage1/cloudlet-1-clone
rbd image 'cloudlet-1-clone':
size 1000 GB in 256000 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.5ecd364dfe1f8
format: 2
features: layering
flags:
---



On Fri, Jan 29, 2016 at 3:54 AM, Jason Dillaman  wrote:

> You must specify the clone's striping parameters at the time of its
> creation -- it is not inherited from the parent image.
>
> --
>
> Jason Dillaman
>
>
> - Original Message -
>
> > From: "Bill WONG" 
> > To: "ceph-users" 
> > Sent: Thursday, January 28, 2016 1:25:12 PM
> > Subject: [ceph-users] Striping feature gone after flatten with cloned
> images
>
> > Hi,
>
> > i have tested with the flatten:
> > 1) make a snapshot of image
> > 2) protect the snapshot
> > 3) clone the snapshot
> > 4) flatten the clone
> > then i found issue:
> > with the original image / snapshot or the clone before flatten, all are
> with
> > stripping feature, BUT after flattened the clone, then there is no more
> > Stripping with the clone image...what is the issue? and how can enable
> the
> > striping feature?
>
> > thank you!
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] s3cmd list bucket ok, but get object failed for Ceph object

2016-01-28 Thread WD_Hwang
Hi:
  Does anyone know the reason why 's3cmd list bucket ok, but get object
failed for Ceph object'? For example:

(1)   List the bucket
$ s3cmd ls
2015-12-24 02:26  s3://DIR1


(2)   List the objects under the bucket 'DIR1'
$ s3cmd ls s3://DIR1
2015-12-25 08:17  3091   s3://DIR1/logo.png
2015-12-24 02:28   533   s3://DIR1/mkpool.sh


(3)   Get the object
$ s3cmd get s3://DIR1/mkpool.sh x
s3://DIR1/mkpool.sh -> x  [1 of 1]
ERROR: S3 error: 404 (Not Found):

  Any help would be much appreciated.
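
In case it helps with debugging, a few generic checks (bucket/object names taken
from the listing above; these only gather information and don't change anything):
--
s3cmd --debug get s3://DIR1/mkpool.sh x            # shows the exact request/response behind the 404
radosgw-admin object stat --bucket=DIR1 --object=mkpool.sh
radosgw-admin bucket check --bucket=DIR1 --check-objects
--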

WD

---
This email contains confidential or legally privileged information and is for 
the sole use of its intended recipient. 
Any unauthorized review, use, copying or distribution of this email or the 
content of this email is strictly prohibited.
If you are not the intended recipient, you may reply to the sender and should 
delete this e-mail immediately.
---
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-28 Thread Jan Schermer

> On 28 Jan 2016, at 23:19, Lionel Bouton  
> wrote:
> 
> Le 28/01/2016 22:32, Jan Schermer a écrit :
>> P.S. I feel very strongly that this whole concept is broken fundamentaly. We 
>> already have a journal for the filesystem which is time proven, well behaved 
>> and above all fast. Instead there's this reinvented wheel which supposedly 
>> does it better in userspace while not really avoiding the filesystem journal 
>> either. It would maybe make sense if OSD was storing the data on a block 
>> device directly, avoiding the filesystem altogether. But it would still do 
>> the same bloody thing and (no disrespect) ext4 does this better than Ceph 
>> ever will.
>> 
> 
> Hum I've seen this discussed previously but I'm not sure the fs journal could 
> be used as a Ceph journal.
> 
> First BTRFS doesn't have a journal per se, so you would not be able to use 
> xfs or ext4 journal on another device with journal=data setup to make write 
> bursts/random writes fast. And I won't go back to XFS or test ext4... I've 
> detected too much silent corruption by hardware with BTRFS to trust our data 
> to any filesystem not using CRC on reads (and in our particular case the 
> compression and speed are additional bonuses).

ZFS takes care of all those concerns... Most people are quite happy with 
ext2/3/4, oblivious to the fact they are losing bits here and there... and the 
world still spins the same :-)
I personally believe the task of not corrupting data doesn't belong in the 
fileystem layer but rather should be handled by RAID array, mdraid, RBD... ZFS 
does it because it handles those tasks too.

> 
> Second I'm not familiar with Ceph internals but OSDs must make sure that 
> their PGs are synced so I was under the impression that the OSD content for a 
> PG on the filesystem should always be guaranteed to be on all the other 
> active OSDs *or* their journals (so you wouldn't apply journal content unless 
> the other journals have already committed the same content). If you remove 
> the journals there's no intermediate on-disk "buffer" that can be used to 
> guarantee such a thing: one OSD will always have data that won't be 
> guaranteed to be on disk on the others. As I understand this you could say 
> that this is some form of 2-phase commit.

You can simply commit the data (to the filestore), and it would be in fact 
faster.
Client gets the write acknowledged when all the OSDs have the data - that 
doesn't change in this scenario. If one OSD gets ahead of the others and 
commits something the other OSDs do not before the whole cluster goes down then 
it doesn't hurt anything - you didn't acknowledge so the client has to replay 
if it cares, _NOT_ the OSDs.
The problem still exists, just gets shifted elsewhere. But the client (guest 
filesystem) already handles this.

> 
> I may be mistaken: there are structures in the filestore that *may* take on 
> this role but I'm not sure what their exact use is : the _TEMP dirs, 
> the omap and meta dirs. My guess is that they serve other purposes: it would 
> make sense to use the journals for this because the data is already there and 
> the commit/apply coherency barriers seem both trivial and efficient to use.
> 
> That's not to say that the journals are the only way to maintain the needed 
> coherency, just that they might be used to do so because once they are here, 
> this is a trivial extension of their use.
> 

In the context of cloud more and more people realize that clinging to things 
like "durability" and "consitency" is out of fashion. I think the future will 
take a different turn... I can't say I agree with that, though, I'm usually the 
one fixing those screw ups afterwards.


> Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-28 Thread Jan Schermer
Thanks for a great walkthrough explanation.
I am not really going to (and capable) of commenting on everything but.. see 
below

> On 28 Jan 2016, at 23:35, Somnath Roy  wrote:
> 
> Hi,
> Ceph needs to maintain a journal in case of filestore as underlying 
> filesystem like XFS *doesn’t have* any transactional semantics. Ceph has to 
> do a transactional write with data and metadata in the write path. It does in 
> the following way.

"Ceph has to do a transactional write with data and metadata in the write path"
Why? Isn't that only to provide that to itself?

>  
> 1. It creates a transaction object having multiple metadata operations and 
> the actual payload write.
>  
> 2. It is passed to Objectstore layer.
>  
> 3. Objectstore can complete the transaction in sync or async (Filestore) way.

Depending on whether the write was flushed or not? How is that decided?

>  
> 4.  Filestore dumps the entire Transaction object to the journal. It is a 
> circular buffer and written to the disk sequentially with O_DIRECT | O_DSYNC 
> way.

Just FYI, O_DIRECT doesn't really guarantee "no buffering", its purpose is 
just to avoid needless caching.
It should behave the way you want on Linux, but you must not rely on it since 
this guarantee is not portable.

> 5. Once journal write is successful , write is acknowledged to the client. 
> Read for this data is not allowed yet as it is still not been written to the 
> actual location in the filesystem.

Now you are providing a guarantee for something nobody really needs. There is 
no guarantee with traditional filesystems of not returning dirty unwritten 
data. The guarantees are on writes, not reads. It might be easier to do it this 
way if you plan for some sort of concurrent access to the same data from 
multiple readers (that don't share the cache) - but is that really the case 
here if it's still the same OSD that serves the data?
Do the journals absorb only the unbuffered IO or all IO?

And what happens currently if I need to read the written data right away? When 
do I get it then?

>  
> 6. The actual execution of the transaction is done in parallel for the 
> filesystem that can do check pointing like BTRFS. For the filesystem like 
> XFS/ext4 the journal is write ahead i.e Tx object will be written to journal 
> first and then the Tx execution will happen.
>  
> 7. Tx execution is done in parallel by the filestore worker threads. The 
> payload write is a buffered write and a sync thread within filestore is 
> periodically calling ‘syncfs’ to persist data/metadata to the actual location.
>  
> 8. Before each ‘syncfs’ call it determines the seq number till it is 
> persisted and trim the transaction objects from journal upto that point. This 
> will make room for more writes in the journal. If journal is full, write will 
> be stuck.
>  
> 9. If OSD is crashed after write is acknowledge, the Tx will be replayed from 
> the last successful backend commit seq number (maintained in a file after 
> ‘syncfs’).
>  

You can just completely rip at least 6-9 out and mirror what the client sends 
to the filesystem with the same effect (and without journal). Who cares how the 
filesystem implements it then, everybody can choose the filesystem that matches 
the workload (e.g. the one they use already on a physical volume they are 
migrating from).
It's a sensible solution to a non existing problem...



> So, as you can see, it’s not a flaw but a necessity to have a journal for 
> filestore in case of rbd workload as it can do partial overwrites. It is not 
> needed for full writes like for objects and that’s the reason Sage came up 
> with new store which will not be doing double writes for Object workload.
> The keyvaluestore backend also doesn’t have any journal as it is relying on 
> backend like leveldb/rocksdb for that.
>  
> Regarding Jan’s point for block vs file journal, IMO the only advantage of 
> journal being a block device is filestore can do aio writes on that.

You also don't have the filesystem journal. You can simply divide the whole 
block device into 4MB blocks and use them.
But my point was that you are getting even closer to reimplementing a filesystem 
in userspace, which is just nonsense.

>  
> Now, here is what SanDisk changed..
>  
> 1. In the write path Filestore has to do some throttling as journal can’t go 
> much further than the actual backend write (Tx execution). We have introduced 
> a dynamic throlling based on journal fill rate and a % increase from a config 
> option filestore_queue_max_bytes. This config option keeps track of 
> outstanding backend byte writes.
>  
> 2. Instead of buffered write we have introduced a O_DSYNC write during 
> transaction execution as it is reducing the amount of data syncfs has to 
> write and thus getting a more stable performance out.
>  
> 3. Main reason that we can’t allow journal to go further ahead as the Tx 
> object will not be deleted till the Tx executes. More behind the Tx execution 
> , more 

Re: [ceph-users] SSD Journal

2016-01-28 Thread Somnath Roy
Hi,
Ceph needs to maintain a journal in case of filestore as underlying filesystem 
like XFS *doesn’t have* any transactional semantics. Ceph has to do a 
transactional write with data and metadata in the write path. It does in the 
following way.

1. It creates a transaction object having multiple metadata operations and the 
actual payload write.

2. It is passed to Objectstore layer.

3. Objectstore can complete the transaction in sync or async (Filestore) way.

4.  Filestore dumps the entire Transaction object to the journal. It is a
circular buffer and is written to the disk sequentially in O_DIRECT | O_DSYNC
mode.

5. Once the journal write is successful, the write is acknowledged to the
client. Reads of this data are not allowed yet as it has still not been written
to the actual location in the filesystem.

6. The actual execution of the transaction is done in parallel for the 
filesystem that can do check pointing like BTRFS. For the filesystem like 
XFS/ext4 the journal is write ahead i.e Tx object will be written to journal 
first and then the Tx execution will happen.

7. Tx execution is done in parallel by the filestore worker threads. The 
payload write is a buffered write and a sync thread within filestore is 
periodically calling ‘syncfs’ to persist data/metadata to the actual location.

8. Before each ‘syncfs’ call it determines the seq number up to which data is
persisted and trims the transaction objects from the journal up to that point.
This will make room for more writes in the journal. If the journal is full,
writes will be stuck.

9. If the OSD crashes after the write is acknowledged, the Tx will be replayed from 
the last successful backend commit seq number (maintained in a file after 
‘syncfs’).
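
For reference, the journal/filestore knobs involved in steps 4, 7 and 8 can be
inspected on a running OSD through the admin socket; a small sketch (osd.0 is a
placeholder, and the exact perf counter names may differ between releases):
--
ceph daemon osd.0 config show | egrep 'journal_max_write|filestore_max_sync_interval|filestore_queue_max'
ceph daemon osd.0 perf dump | grep -i journal     # journal queue/latency counters
--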

So, as you can see, it’s not a flaw but a necessity to have a journal for 
filestore in case of rbd workload as it can do partial overwrites. It is not 
needed for full writes like for objects and that’s the reason Sage came up with 
new store which will not be doing double writes for Object workload.
The keyvaluestore backend also doesn’t have any journal as it is relying on 
backend like leveldb/rocksdb for that.

Regarding Jan’s point for block vs file journal, IMO the only advantage of 
journal being a block device is filestore can do aio writes on that.
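
As an aside, since the journal is written sequentially with O_DIRECT | O_DSYNC
(step 4 above), a rough way to judge whether an SSD is suitable as a journal
device is to measure its synchronous sequential write performance; a minimal fio
sketch (the device name is a placeholder, and this writes to the raw device, so
only run it on a disk you can wipe):
--
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based
--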

Now, here is what SanDisk changed..

1. In the write path Filestore has to do some throttling as the journal can’t go
much further ahead than the actual backend write (Tx execution). We have
introduced dynamic throttling based on the journal fill rate and a % increase
from the config option filestore_queue_max_bytes. This config option keeps track
of outstanding backend byte writes.

2. Instead of buffered write we have introduced a O_DSYNC write during 
transaction execution as it is reducing the amount of data syncfs has to write 
and thus getting a more stable performance out.

3. The main reason we can’t allow the journal to go much further ahead is that
the Tx object will not be deleted till the Tx executes. The further behind the
Tx execution falls, the more memory growth will happen. Presently, the Tx object
is deleted asynchronously (and thus takes more time), and we changed it to
delete it from the filestore worker thread itself.

4. The sync thread is optimized to do a fast sync. The extra last commit seq 
file is not maintained any more for *the write ahead journal* as this 
information can be found in journal header.

Here is the related pull requests..




https://github.com/ceph/ceph/pull/7271

https://github.com/ceph/ceph/pull/7303

https://github.com/ceph/ceph/pull/7278

https://github.com/ceph/ceph/pull/6743



Regarding bypassing filesystem and accessing block device directly, yes, that 
should be more clean/simple and efficient solution. With Sage’s Bluestore, Ceph 
is moving towards that very fast !!!

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tyler 
Bishop
Sent: Thursday, January 28, 2016 1:35 PM
To: Jan Schermer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] SSD Journal

What approach did sandisk take with this for jewel?






Tyler Bishop
Chief Technical Officer
513-299-7108 x10


tyler.bis...@beyondhosting.net



If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited.





From: "Jan Schermer" mailto:j...@schermer.cz>>
To: "Tyler Bishop" 
mailto:tyler.bis...@beyondhosting.net>>
Cc: "Bill WONG" mailto:wongahsh...@gmail.com>>, 
ceph-users@lists.ceph.com
Sent: Thursday, January 28, 2016 4:32:54 PM
Subject: Re: [ceph-users] SSD Journal

You can't run Ceph OSD without a journal. The journal is always there.
If you don't have a journal partition then there's a "journal" file on the OSD 
filesystem that does the same

Re: [ceph-users] SSD Journal

2016-01-28 Thread Lionel Bouton
Le 28/01/2016 22:32, Jan Schermer a écrit :
> P.S. I feel very strongly that this whole concept is broken
> fundamentaly. We already have a journal for the filesystem which is
> time proven, well behaved and above all fast. Instead there's this
> reinvented wheel which supposedly does it better in userspace while
> not really avoiding the filesystem journal either. It would maybe make
> sense if OSD was storing the data on a block device directly, avoiding
> the filesystem altogether. But it would still do the same bloody thing
> and (no disrespect) ext4 does this better than Ceph ever will.
>

Hum I've seen this discussed previously but I'm not sure the fs journal
could be used as a Ceph journal.

First BTRFS doesn't have a journal per se, so you would not be able to
use xfs or ext4 journal on another device with journal=data setup to
make write bursts/random writes fast. And I won't go back to XFS or test
ext4... I've detected too much silent corruption by hardware with BTRFS
to trust our data to any filesystem not using CRC on reads (and in our
particular case the compression and speed are additional bonuses).

Second I'm not familiar with Ceph internals but OSDs must make sure that
their PGs are synced so I was under the impression that the OSD content
for a PG on the filesystem should always be guaranteed to be on all the
other active OSDs *or* their journals (so you wouldn't apply journal
content unless the other journals have already committed the same
content). If you remove the journals there's no intermediate on-disk
"buffer" that can be used to guarantee such a thing: one OSD will always
have data that won't be guaranteed to be on disk on the others. As I
understand this you could say that this is some form of 2-phase commit.

I may be mistaken: there are structures in the filestore that *may* take
on this role but I'm not sure what their exact use is : the
_TEMP dirs, the omap and meta dirs. My guess is that they serve
other purposes: it would make sense to use the journals for this because
the data is already there and the commit/apply coherency barriers seem
both trivial and efficient to use.

That's not to say that the journals are the only way to maintain the
needed coherency, just that they might be used to do so because once
they are here, this is a trivial extension of their use.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-28 Thread Tyler Bishop
What approach did sandisk take with this for jewel? 







Tyler Bishop 
Chief Technical Officer 
513-299-7108 x10 



tyler.bis...@beyondhosting.net 


If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited. 




From: "Jan Schermer"  
To: "Tyler Bishop"  
Cc: "Bill WONG" , ceph-users@lists.ceph.com 
Sent: Thursday, January 28, 2016 4:32:54 PM 
Subject: Re: [ceph-users] SSD Journal 

You can't run Ceph OSD without a journal. The journal is always there. 
If you don't have a journal partition then there's a "journal" file on the OSD 
filesystem that does the same thing. If it's a partition then this file turns 
into a symlink. 

You will always be better off with a journal on a separate partition because of 
the way writeback cache in linux works (someone correct me if I'm wrong). 
The journal needs to flush to disk quite often, and linux is not always able to 
flush only the journal data. You can't defer metadata flushing forever and also 
doing fsync() makes all the dirty data flush as well. ext2/3/4 also flushes 
data to the filesystem periodically (5s is it, I think?) which will make the 
latency of the journal go through the roof momentarily. 
(I'll leave researching how exactly XFS does it to those who care about that 
"filesystem'o'thing"). 

P.S. I feel very strongly that this whole concept is broken fundamentally. We 
already have a journal for the filesystem which is time proven, well behaved 
and above all fast. Instead there's this reinvented wheel which supposedly does 
it better in userspace while not really avoiding the filesystem journal either. 
It would maybe make sense if OSD was storing the data on a block device 
directly, avoiding the filesystem altogether. But it would still do the same 
bloody thing and (no disrespect) ext4 does this better than Ceph ever will. 





On 28 Jan 2016, at 20:01, Tyler Bishop < tyler.bis...@beyondhosting.net > 
wrote: 

This is an interesting topic that i've been waiting for. 

Right now we run the journal as a partition on the data disk. I've build drives 
without journals and the write performance seems okay but random io performance 
is poor in comparison to what it should be. 





Tyler Bishop 
Chief Technical Officer 
513-299-7108 x10 

tyler.bis...@beyondhosting.net 

If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited. 





From: "Bill WONG" < wongahsh...@gmail.com > 
To: "ceph-users" < ceph-users@lists.ceph.com > 
Sent: Thursday, January 28, 2016 1:36:01 PM 
Subject: [ceph-users] SSD Journal 

Hi, 
i have tested with SSD Journal with SATA, it works perfectly.. now, i am 
testing with full SSD ceph cluster, now with full SSD ceph cluster, do i still 
need to have SSD as journal disk? 

[ assumed i do not have PCIe SSD Flash which is better performance than normal 
SSD disk] 

please give some ideas on full ssd ceph cluster ... thank you! 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-28 Thread Jan Schermer
You can't run Ceph OSD without a journal. The journal is always there.
If you don't have a journal partition then there's a "journal" file on the OSD 
filesystem that does the same thing. If it's a partition then this file turns 
into a symlink.
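
A quick way to see which case a given OSD is in (assuming the default data path
and osd id 0):
--
ls -l /var/lib/ceph/osd/ceph-0/journal   # regular file = journal on the OSD fs, symlink = separate journal device
--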

You will always be better off with a journal on a separate partition because of 
the way writeback cache in linux works (someone correct me if I'm wrong).
The journal needs to flush to disk quite often, and linux is not always able to 
flush only the journal data. You can't defer metadata flushing forever and also 
doing fsync() makes all the dirty data flush as well. ext2/3/4 also flushes 
data to the filesystem periodically (5s is it, I think?) which will make the 
latency of the journal go through the roof momentarily.
(I'll leave researching how exactly XFS does it to those who care about that 
"filesystem'o'thing").

P.S. I feel very strongly that this whole concept is broken fundamentally. We 
already have a journal for the filesystem which is time proven, well behaved 
and above all fast. Instead there's this reinvented wheel which supposedly does 
it better in userspace while not really avoiding the filesystem journal either. 
It would maybe make sense if OSD was storing the data on a block device 
directly, avoiding the filesystem altogether. But it would still do the same 
bloody thing and (no disrespect) ext4 does this better than Ceph ever will.


> On 28 Jan 2016, at 20:01, Tyler Bishop  wrote:
> 
> This is an interesting topic that i've been waiting for.
> 
> Right now we run the journal as a partition on the data disk.  I've build 
> drives without journals and the write performance seems okay but random io 
> performance is poor in comparison to what it should be.
> 
>  
>  
> Tyler Bishop
> Chief Technical Officer
> 513-299-7108 x10
> tyler.bis...@beyondhosting.net
> If you are not the intended recipient of this transmission you are notified 
> that disclosing, copying, distributing or taking any action in reliance on 
> the contents of this information is strictly prohibited.
>  
> 
> From: "Bill WONG" 
> To: "ceph-users" 
> Sent: Thursday, January 28, 2016 1:36:01 PM
> Subject: [ceph-users] SSD Journal
> 
> Hi,
> i have tested with SSD Journal with SATA, it works perfectly.. now, i am 
> testing with full SSD ceph cluster, now with full SSD ceph cluster, do i 
> still need to have SSD as journal disk? 
> 
> [ assumed i do not have PCIe SSD Flash which is better performance than 
> normal SSD disk]
> 
> please give some ideas on full ssd ceph cluster ... thank you!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Jason Dillaman
As far as I am aware, QEMU/RBD supports discard in IDE, SCSI, and virtio-scsi 
controllers (if enabled).

-- 

Jason Dillaman 


- Original Message -
> From: "koukou73gr" 
> To: ceph-users@lists.ceph.com
> Sent: Thursday, January 28, 2016 2:51:11 PM
> Subject: Re: [ceph-users] Ceph + Libvirt + QEMU-KVM
> 
> On 01/28/2016 03:44 PM, Simon Ironside wrote:
> 
> > Btw, using virtio-scsi devices as above and discard='unmap' above
> > enables TRIM support. This means you can use fstrim or mount file
> > systems with discard inside the VM to free up unused space in the image.
> 
> Doesn't discard require the pc-q35-rhel7 (or equivalent) guest machine
> type, which in turn shoves a non-removable SATA AHCI device in the guest
> which can't be frozen and thus disables guest live migration?
> 
> Has there been any change regarding this in a recent QEMU version?
> 
> 
> Thanks,
> 
> -K.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Simon Ironside

Hi,

On 28/01/16 19:51, koukou73gr wrote:


Doesn't discard require the pc-q35-rhel7 (or equivalent) guest machine
type, which in turn shoves a non-removable SATA AHCI device in the guest
which can't be frozen and thus disables guest live migration?


I have no trouble with live migration and using discard. The guest 
machine type is:


hvm

Cheers,
Simon.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Tech Talk - High-Performance Production Databases on Ceph

2016-01-28 Thread Patrick McGarry
Hey cephers,

Here are the links to both the video and the slides from the Ceph Tech
Talk today. Thanks again to Thorvald and Medallia for stepping forward
to present.

Video: https://youtu.be/OqlC7S3cUKs

Slides: 
http://www.slideshare.net/Inktank_Ceph/2016jan28-high-performance-production-databases-on-ceph-57620014


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Striping feature gone after flatten with cloned images

2016-01-28 Thread Jason Dillaman
You must specify the clone's striping parameters at the time of its creation -- 
it is not inherited from the parent image.

-- 

Jason Dillaman 


- Original Message - 

> From: "Bill WONG" 
> To: "ceph-users" 
> Sent: Thursday, January 28, 2016 1:25:12 PM
> Subject: [ceph-users] Striping feature gone after flatten with cloned images

> Hi,

> i have tested with the flatten:
> 1) make a snapshot of image
> 2) protect the snapshot
> 3) clone the snapshot
> 4) flatten the clone
> then i found issue:
> with the original image / snapshot or the clone before flatten, all are with
> stripping feature, BUT after flattened the clone, then there is no more
> Stripping with the clone image...what is the issue? and how can enable the
> striping feature?

> thank you!

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread koukou73gr
On 01/28/2016 03:44 PM, Simon Ironside wrote:

> Btw, using virtio-scsi devices as above and discard='unmap' above
> enables TRIM support. This means you can use fstrim or mount file
> systems with discard inside the VM to free up unused space in the image.

Doesn't discard require the pc-q35-rhel7 (or equivalent) guest machine
type, which in turn shoves a non-removable SATA AHCI device in the guest
which can't be frozen and thus disables guest live migration?

Has there been any change regarding this in a recent QEMU version?


Thanks,

-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Delete a bucket with 14 millions objects

2016-01-28 Thread Marius Vaitiekunas
Hi,

Could anybody give a hint on how to delete a bucket with lots of files (about
14 million)? I've unsuccessfully tried:
# radosgw-admin bucket rm --bucket=big-bucket --purge-objects
--yes-i-really-mean-it
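
If radosgw-admin keeps failing, one (slow) fallback is to delete the objects
through the S3 API and remove the empty bucket afterwards; a rough sketch
assuming s3cmd is already configured for this user (one delete per call, so it
will take a long time for 14 million objects, and it breaks on keys containing
whitespace):
--
s3cmd ls s3://big-bucket | awk '{print $4}' | xargs -n 1 -P 8 s3cmd del
s3cmd rb s3://big-bucket
--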


-- 
Marius Vaitiekūnas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-28 Thread Tyler Bishop
This is an interesting topic that i've been waiting for. 

Right now we run the journal as a partition on the data disk. I've built drives 
without journals and the write performance seems okay but random io performance 
is poor in comparison to what it should be. 







Tyler Bishop 
Chief Technical Officer 
513-299-7108 x10 



tyler.bis...@beyondhosting.net 


If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited. 




From: "Bill WONG"  
To: "ceph-users"  
Sent: Thursday, January 28, 2016 1:36:01 PM 
Subject: [ceph-users] SSD Journal 

Hi, 
i have tested with SSD Journal with SATA, it works perfectly.. now, i am 
testing with full SSD ceph cluster, now with full SSD ceph cluster, do i still 
need to have SSD as journal disk? 

[ assumed i do not have PCIe SSD Flash which is better performance than normal 
SSD disk] 

please give some ideas on full ssd ceph cluster ... thank you! 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SSD Journal

2016-01-28 Thread Bill WONG
Hi,

i have tested with an SSD Journal with SATA, and it works perfectly.. now i am
testing with a full SSD ceph cluster; with a full SSD ceph cluster, do i
still need to have an SSD as the journal disk?

[ assumed i do not have PCIe SSD Flash which is better performance than
normal SSD disk]

please give some ideas on full ssd ceph cluster ... thank you!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Antw: Re: Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Bill WONG
hi Steffen,

you meant live VM migration with a ceph disk? that should be discussed on the
qemu-kvm list, and i have tested it, it works fine.

ceph has rbd snap and we can use it, but some qemu-kvm features which we
developed are based on qemu snapshots, so they require qcow2 or a qemu
snapshot..
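
For anyone following along, the rbd-level snapshot workflow looks roughly like
this (pool/image names follow the qemu-img example quoted below; only a sketch,
clone/flatten behave as discussed in the striping thread):
--
rbd snap create storage1/CentOS7-3@snap1
rbd snap protect storage1/CentOS7-3@snap1
rbd clone storage1/CentOS7-3@snap1 storage1/CentOS7-3-clone
rbd snap rollback storage1/CentOS7-3@snap1
--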

On Fri, Jan 29, 2016 at 2:01 AM, Steffen Weißgerber 
wrote:

>
>
> >>> Bill WONG  schrieb am Donnerstag, 28. Januar
> 2016 um
> 09:30:
> > Hi Marius,
> >
>
> Hello,
>
> > with ceph rdb, it looks can support qcow2 as well as per its
> document: -
> > http://docs.ceph.com/docs/master/rbd/qemu-rbd/
> > --
> > Important The raw data format is really the only sensible format
> option to
> > use with RBD. Technically, you could use other QEMU-supported formats
> (such
> > as qcow2 or vmdk), but doing so would add additional overhead, and
> would
> > also render the volume unsafe for virtual machine live migration
> when
> > caching (see below) is enabled.
>
> Normally my question would be off topic in this list, but I asked it
> already in the qemu list
> and got no answer:
>
> Is there documentation available on how to do live migration on rbd
> disks
> with the qemu-monitor?
>
> > ---
> >
> > without having qcow2, the qemu-kvm cannot make snapshot and other
> > features anyone have ideas or experiences on this?
> > thank you!
> >
>
> Why using qemu snapshots when rbd-snapshots are available?
>
> Regards
>
> Steffen
>
> >
> > On Thu, Jan 28, 2016 at 3:54 PM, Marius Vaitiekunas <
> > mariusvaitieku...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> With ceph rbd you should use raw image format. As i know qcow2 is
> not
> >> supported.
> >>
> >> On Thu, Jan 28, 2016 at 6:21 AM, Bill WONG 
> wrote:
> >>
> >>> Hi Simon,
> >>>
> >>> i have installed ceph package into the compute node, but it looks
> qcow2
> >>> format is unable to create.. it show error with : Could not write
> qcow2
> >>> header: Invalid argument
> >>>
> >>> ---
> >>> qemu-img create -f qcow2 rbd:storage1/CentOS7-3 10G
> >>> Formatting 'rbd:storage1/CentOS7-3', fmt=qcow2 size=10737418240
> >>> encryption=off cluster_size=65536 lazy_refcounts=off
> refcount_bits=16
> >>> qemu-img: rbd:storage1/CentOS7-3: Could not write qcow2 header:
> Invalid
> >>> argument
> >>> ---
> >>>
> >>> any ideas?
> >>> thank you!
> >>>
> >>> On Thu, Jan 28, 2016 at 1:01 AM, Simon Ironside
> 
> >>> wrote:
> >>>
>  On 27/01/16 16:51, Bill WONG wrote:
> 
>  i have ceph cluster and KVM in different machine the qemu-kvm
> > (CentOS7) is dedicated compute node installed with qemu-kvm +
> libvirtd
> > only, there should be no /etc/ceph/ceph.conf
> >
> 
>  Likewise, my compute nodes are separate machines from the
> OSDs/monitors
>  but the compute nodes still have the ceph package installed and
>  /etc/ceph/ceph.conf present. They just aren't running any ceph
> daemons.
> 
>  I give the compute nodes their own ceph key with write access to
> the
>  pool for VM storage and read access to the monitors. I can then
> use ceph
>  status, rbd create, qemu-img etc directly on the compute nodes.
> 
>  Cheers,
>  Simon.
> 
> >>>
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>>
> >>
> >>
> >> --
> >> Marius Vaitiekūnas
> >>
>
> --
> Klinik-Service Neubrandenburg GmbH
> Allendestr. 30, 17036 Neubrandenburg
> Amtsgericht Neubrandenburg, HRB 2457
> Geschaeftsfuehrerin: Gudrun Kappich
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Striping feature gone after flatten with cloned images

2016-01-28 Thread Bill WONG
Hi,

i have tested with the flatten:
1) make a snapshot of image
2) protect the snapshot
3) clone the snapshot
4) flatten the clone
then i found an issue:
with the original image / snapshot, or with the clone before the flatten, all
have the striping feature, BUT after flattening the clone there is no more
striping on the clone image... what is the issue? and how can i enable the
striping feature?

thank you!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Antw: Re: Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Steffen Weißgerber


>>> Bill WONG  schrieb am Donnerstag, 28. Januar
2016 um
09:30:
> Hi Marius,
> 

Hello,

> with ceph rdb, it looks can support qcow2 as well as per its
document: -
> http://docs.ceph.com/docs/master/rbd/qemu-rbd/ 
> --
> Important The raw data format is really the only sensible format
option to
> use with RBD. Technically, you could use other QEMU-supported formats
(such
> as qcow2 or vmdk), but doing so would add additional overhead, and
would
> also render the volume unsafe for virtual machine live migration
when
> caching (see below) is enabled.

Normally my question would be off topic in this list, but I asked it
already in the qemu list
and got no answer:

Is there documentation available on how to do live migration on rbd disks
with the qemu-monitor?
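
(Not qemu-monitor specific, but for what it's worth: since RBD is shared
storage, a plain libvirt live migration works without copying any disk data; a
sketch with placeholder domain/host names:
--
virsh migrate --live --persistent --undefinesource vm1 qemu+ssh://desthost/system
--
)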

> ---
> 
> without having qcow2, the qemu-kvm cannot make snapshot and other
> features anyone have ideas or experiences on this?
> thank you!
> 

Why use qemu snapshots when rbd snapshots are available?

Regards

Steffen

> 
> On Thu, Jan 28, 2016 at 3:54 PM, Marius Vaitiekunas <
> mariusvaitieku...@gmail.com> wrote:
> 
>> Hi,
>>
>> With ceph rbd you should use raw image format. As i know qcow2 is
not
>> supported.
>>
>> On Thu, Jan 28, 2016 at 6:21 AM, Bill WONG 
wrote:
>>
>>> Hi Simon,
>>>
>>> i have installed ceph package into the compute node, but it looks
qcow2
>>> format is unable to create.. it show error with : Could not write
qcow2
>>> header: Invalid argument
>>>
>>> ---
>>> qemu-img create -f qcow2 rbd:storage1/CentOS7-3 10G
>>> Formatting 'rbd:storage1/CentOS7-3', fmt=qcow2 size=10737418240
>>> encryption=off cluster_size=65536 lazy_refcounts=off
refcount_bits=16
>>> qemu-img: rbd:storage1/CentOS7-3: Could not write qcow2 header:
Invalid
>>> argument
>>> ---
>>>
>>> any ideas?
>>> thank you!
>>>
>>> On Thu, Jan 28, 2016 at 1:01 AM, Simon Ironside

>>> wrote:
>>>
 On 27/01/16 16:51, Bill WONG wrote:

 i have ceph cluster and KVM in different machine the qemu-kvm
> (CentOS7) is dedicated compute node installed with qemu-kvm +
libvirtd
> only, there should be no /etc/ceph/ceph.conf
>

 Likewise, my compute nodes are separate machines from the
OSDs/monitors
 but the compute nodes still have the ceph package installed and
 /etc/ceph/ceph.conf present. They just aren't running any ceph
daemons.

 I give the compute nodes their own ceph key with write access to
the
 pool for VM storage and read access to the monitors. I can then
use ceph
 status, rbd create, qemu-img etc directly on the compute nodes.

 Cheers,
 Simon.

>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>
>>>
>>
>>
>> --
>> Marius Vaitiekūnas
>>

-- 
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Tech Talk in 10 mins

2016-01-28 Thread Patrick McGarry
Just a reminder that this month’s Ceph Tech Talk is starting in about 10m.

http://ceph.com/ceph-tech-talks/

Come join us to hear about: “PostgreSQL on Ceph under Mesos/Aurora with Docker.”

-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Jason Dillaman
In order to be fast, the 'rbd du' command counts the existence of any data 
object as fully used on disk.  Therefore, if you have a 4MB object size, 
writing 1 byte to two objects would result in 'rbd du' reporting 8MB in-use.  
You can simulate the same result via 'rbd diff':

rbd diff --whole-object <image-spec> | awk '{ SUM += $2 } END { print
SUM/1024/1024 " MB" }'  (should be similar to rbd du output -- let me know if
it's not)

-- vs --

rbd diff <image-spec> | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }' 
(removes sparse space)

Another possibility is if you wrote a bunch of zeros to the RBD image, it will 
no longer be sparsely allocated -- but converting to qcow2 will re-sparsify the 
resulting image.
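
For what it's worth, a sketch of that conversion (pool/image names from the
thread, the output path is just a placeholder; qemu-img should skip the zeroed
regions when writing the qcow2):
--
qemu-img convert -f raw -O qcow2 rbd:storage1/CentOS3 /tmp/CentOS3.qcow2
qemu-img info /tmp/CentOS3.qcow2
--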

-- 

Jason Dillaman 


- Original Message - 

> From: "Bill WONG" 
> To: "Jason Dillaman" 
> Cc: "Mihai Gheorghe" , "ceph-users"
> 
> Sent: Thursday, January 28, 2016 12:34:36 PM
> Subject: Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

> hi jason,

> i got it. thank you!
> i have a questions about the thin provisioning, suppose Ceph should be using
> thin provisioning by default, but how come the rbd du command show the disk
> usage is almost 19G for a 20G allocation, as when i use the expert to a .img
> from ceph to external local disk, it's about 4G image only, and the OS
> actually is about 1-2G if provisioned via local disk by using qcow2 format.
> any comments or ideas?

> thank you!

> On Thu, Jan 28, 2016 at 11:51 PM, Jason Dillaman < dilla...@redhat.com >
> wrote:

> > The way to interpret that output is that the HEAD revision of "CentOS3" has
> > about a 700MB delta from the previous snapshot (i.e. 700MB + 18G are used
> > by
> > this image and its snapshot). There probably should be an option in the rbd
> > CLI to generate the full usage for a particular image and all of its
> > snapshots. Right now the only way to do that is to perform a du on the
> > whole
> > pool.
> 

> > --
> 

> > Jason Dillaman
> 

> > - Original Message -
> 

> > > From: "Bill WONG" < wongahsh...@gmail.com >
> 
> > > To: "Mihai Gheorghe" < mcaps...@gmail.com >
> 
> > > Cc: "ceph-users" < ceph-users@lists.ceph.com >
> 
> > > Sent: Thursday, January 28, 2016 6:17:05 AM
> 
> > > Subject: Re: [ceph-users] Ceph + Libvirt + QEMU-KVM
> 

> > > Hi Mihai,
> 

> > > it looks rather strange in ceph snapshot, the size of snapshot is bigger
> > > than
> 
> > > the size of original images..
> 
> > > Original Image actual used size is 684M w/ provisioned 20G
> 
> > > but the snapshot actual used size is ~18G w/ provisioned 20G
> 

> > > any ideas?
> 

> > > ==
> 
> > > [root@compute2 ~]# rbd du CentOS3 -p storage1
> 
> > > warning: fast-diff map is not enabled for CentOS3. operation may be slow.
> 
> > > NAME PROVISIONED USED
> 
> > > CentOS3 20480M 684M
> 

> > > [root@compute2 ~]# rbd du CentOS3@snap1 -p storage1
> 
> > > warning: fast-diff map is not enabled for CentOS3. operation may be slow.
> 
> > > NAME PROVISIONED USED
> 
> > > CentOS3@snap1 20480M 18124M
> 
> > > ==
> 
> > > qemu-img info rbd:storage1/CentOS3
> 
> > > image: rbd:storage1/CentOS3
> 
> > > file format: raw
> 
> > > virtual size: 20G (21474836480 bytes)
> 
> > > disk size: unavailable
> 
> > > cluster_size: 4194304
> 
> > > Snapshot list:
> 
> > > ID TAG VM SIZE DATE VM CLOCK
> 
> > > snap1 snap1 20G 1970-01-01 08:00:00 00:00:00.000
> 

> > > On Thu, Jan 28, 2016 at 7:09 PM, Mihai Gheorghe < mcaps...@gmail.com >
> > > wrote:
> 

> > > > As far as i know, snapshotting with qemu will download a copy of the
> > > > image
> 
> > > > on
> 
> > > > local storage and then upload it into ceph. At least this is the
> > > > default
> 
> > > > behaviour when taking a snapshot in openstack of a running instance. I
> 
> > > > don't
> 
> > > > see why it would be any different with qemu-kvm. You must use the rbd
> > > > snap
> 
> > > > feature to make a copy on write clone of the image.
> 
> > >
> 
> > > > On 28 Jan 2016 12:59, "Bill WONG" < wongahsh...@gmail.com > wrote:
> 
> > >
> 

> > > > > Hi Simon,
> 
> > > >
> 
> > >
> 

> > > > > how you manage to preform snapshot with the raw format in qemu-kvm
> > > > > VMs?
> 
> > > >
> 
> > >
> 
> > > > > and i found some issues with libvirt virsh commands with ceph:
> 
> > > >
> 
> > >
> 
> > > > > --
> 
> > > >
> 
> > >
> 
> > > > > 1) create storage pool with ceph via virsh
> 
> > > >
> 
> > >
> 
> > > > > 2) create a vol via virsh - virsh vol-create-as rbdpool VM1 100G
> 
> > > >
> 
> > >
> 

> > > > > problem is here.. if we directly create vol via qemu-img create -f
> > > > > rbd
> 
> > > > > rbd:rbdpool/VM1 100G, then virsh is unable to find the vol. - virsh
> 
> > > > > vol-list
> 
> > > > > rbdpool command is unable to list the vol, it looks such commands -
> > > > > rbd,
> 
> > > > > virsh and qemu-img creating images are not synced with each other...
> 
> > > >
> 
> > >
> 

> > > > > cloud you please let me know how you use the ceph as backend storage
> > > > > of
> 
> > > > > qemu-kvm, as if i google it, most of 

Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Bill WONG
Hi Jason,

I got it, thank you!
I have a question about thin provisioning. Ceph should be using thin
provisioning by default, so how come the rbd du command shows the disk
usage as almost 19G for a 20G allocation? When I use the export of a .img
from ceph to an external local disk, it's only about a 4G image, and the
OS actually uses about 1-2G if provisioned on a local disk using the qcow2
format.
Any comments or ideas?

thank you!

On Thu, Jan 28, 2016 at 11:51 PM, Jason Dillaman 
wrote:

> The way to interpret that output is that the HEAD revision of "CentOS3"
> has about a 700MB delta from the previous snapshot (i.e. 700MB + 18G are
> used by this image and its snapshot).  There probably should be an option
> in the rbd CLI to generate the full usage for a particular image and all of
> its snapshots.  Right now the only way to do that is to perform a du on the
> whole pool.
>
> --
>
> Jason Dillaman
>
> - Original Message -
>
> > From: "Bill WONG" 
> > To: "Mihai Gheorghe" 
> > Cc: "ceph-users" 
> > Sent: Thursday, January 28, 2016 6:17:05 AM
> > Subject: Re: [ceph-users] Ceph + Libvirt + QEMU-KVM
>
> > Hi Mihai,
>
> > it looks rather strange in ceph snapshot, the size of snapshot is bigger
> than
> > the size of original images..
> > Original Image actual used size is 684M w/ provisioned 20G
> > but the snapshot actual used size is ~18G w/ provisioned 20G
>
> > any ideas?
>
> > ==
> > [root@compute2 ~]# rbd du CentOS3 -p storage1
> > warning: fast-diff map is not enabled for CentOS3. operation may be slow.
> > NAME PROVISIONED USED
> > CentOS3 20480M 684M
>
> > [root@compute2 ~]# rbd du CentOS3@snap1 -p storage1
> > warning: fast-diff map is not enabled for CentOS3. operation may be slow.
> > NAME PROVISIONED USED
> > CentOS3@snap1 20480M 18124M
> > ==
> > qemu-img info rbd:storage1/CentOS3
> > image: rbd:storage1/CentOS3
> > file format: raw
> > virtual size: 20G (21474836480 bytes)
> > disk size: unavailable
> > cluster_size: 4194304
> > Snapshot list:
> > ID TAG VM SIZE DATE VM CLOCK
> > snap1 snap1 20G 1970-01-01 08:00:00 00:00:00.000
>
> > On Thu, Jan 28, 2016 at 7:09 PM, Mihai Gheorghe < mcaps...@gmail.com >
> wrote:
>
> > > As far as i know, snapshotting with qemu will download a copy of the
> image
> > > on
> > > local storage and then upload it into ceph. At least this is the
> default
> > > behaviour when taking a snapshot in openstack of a running instance. I
> > > don't
> > > see why it would be any different with qemu-kvm. You must use the rbd
> snap
> > > feature to make a copy on write clone of the image.
> >
> > > On 28 Jan 2016 12:59, "Bill WONG" < wongahsh...@gmail.com > wrote:
> >
>
> > > > Hi Simon,
> > >
> >
>
> > > > how you manage to preform snapshot with the raw format in qemu-kvm
> VMs?
> > >
> >
> > > > and i found some issues with libvirt virsh commands with ceph:
> > >
> >
> > > > --
> > >
> >
> > > > 1) create storage pool with ceph via virsh
> > >
> >
> > > > 2) create a vol via virsh - virsh vol-create-as rbdpool VM1 100G
> > >
> >
>
> > > > problem is here.. if we directly create vol via qemu-img create -f
> rbd
> > > > rbd:rbdpool/VM1 100G, then virsh is unable to find the vol. - virsh
> > > > vol-list
> > > > rbdpool command is unable to list the vol, it looks such commands -
> rbd,
> > > > virsh and qemu-img creating images are not synced with each other...
> > >
> >
>
> > > > cloud you please let me know how you use the ceph as backend storage
> of
> > > > qemu-kvm, as if i google it, most of the ceph application is used for
> > > > OpenStack, but not simply pure qemu-kvm. as if setting up Glnace and
> > > > Cinder
> > > > is troublesome...
> > >
> >
>
> > > > thank you!
> > >
> >
>
> > > > On Thu, Jan 28, 2016 at 5:23 PM, Simon Ironside <
> sirons...@caffetine.org
> > > > >
> > > > wrote:
> > >
> >
>
> > > > > On 28/01/16 08:30, Bill WONG wrote:
> > > >
> > >
> >
>
> > > > > > without having qcow2, the qemu-kvm cannot make snapshot and other
> > > > >
> > > >
> > >
> >
> > > > > > features anyone have ideas or experiences on this?
> > > > >
> > > >
> > >
> >
> > > > > > thank you!
> > > > >
> > > >
> > >
> >
>
> > > > > I'm using raw too and create snapshots using "rbd snap create"
> > > >
> > >
> >
>
> > > > > Cheers,
> > > >
> > >
> >
> > > > > Simon
> > > >
> > >
> >
>
> > > > ___
> > >
> >
> > > > ceph-users mailing list
> > >
> >
> > > > ceph-users@lists.ceph.com
> > >
> >
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW :: bucket quota not enforced below 1

2016-01-28 Thread seapasu...@uchicago.edu
On top of this, in my attempts to create a read-only user I think I
found another issue:
radosgw-admin subuser create --subuser=s3test:fun --key-type=s3 
--gen-access-key --gen-secret

radosgw-admin subuser modify --subuser=s3test:fun --access="read"

{
"user_id": "s3test",
"display_name": "s3test",
"email": "",
"suspended": 0,
"max_buckets": 1,
"auid": 0,
"subusers": [
{
"id": "s3test:fun",
"permissions": "read"
}
],
"keys": [
{
"user": "s3test:fun",
"access_key": "N8Z8IJ1JK6A6ECB41VLV",

},
{
"user": "s3test",
"access_key": "ZREKTGN633R2U87OS8ZN",

}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": true,
"max_size_kb": -1,
"max_objects": 2
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"temp_url_keys": []
}

The above results in a set of keys that can only create buckets, not
delete them. I can't create objects with this user either, which is half
the goal, but I can still create a million buckets if I wanted to, which
could make things very painful for a primary user. Is there a way to set it
so that the subuser cannot create buckets either?


On 1/28/16 10:14 AM, seapasu...@uchicago.edu wrote:
Ah thanks for the clarification. Sorry. so even setting max_buckets to 
0 will not prevent them from creating buckets:::


lacadmin@ko35-10:~$ radosgw-admin user modify --uid=s3test 
--max-buckets=0

{
"user_id": "s3test",
"display_name": "s3test",
"email": "",
"suspended": 0,
"max_buckets": 0,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "s3test:whoami",

},
{
"user": "s3test",

}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": true,
"max_size_kb": -1,
"max_objects": 2
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"temp_url_keys": []
}

-
-
-
In [5]: conn.get_canonical_user_id()
Out[5]: u's3test'
In [6]: conn.create_bucket('test')
Out[6]: 

In [7]: conn.create_bucket('test_one')
Out[7]: 

In [8]: conn.create_bucket('test_two')
Out[8]: 

In [9]: conn.create_bucket('test_three')
Out[9]: 

In [10]: conn.create_bucket('test_four')
Out[10]: 

In [11]: for bucket in conn.get_all_buckets():
   : print(bucket.name)
   :
test
test_four
test_one
test_three
test_two

In [12]: for bucket in conn.get_all_buckets():
   : conn.delete_bucket(bucket.name)



lacadmin@ko35-10:~$ radosgw-admin user modify --uid=s3test 
--max-buckets=1

{
"user_id": "s3test",
"display_name": "s3test",
"email": "",
"suspended": 0,
"max_buckets": 1,
-
-
-
In [15]: conn.create_bucket('s3test_one')
Out[15]: 

In [16]: conn.create_bucket('s3test_two')
---
S3ResponseError   Traceback (most recent call last)
 in ()
> 1 conn.create_bucket('s3test_two')

/usr/lib/python2.7/dist-packages/boto/s3/connection.pyc in 
create_bucket(self, bucket_name, headers, location, policy)

502 else:
503 raise self.provider.storage_response_error(
--> 504 response.status, response.reason, body)
505
506 def delete_bucket(self, bucket, headers=None):

S3ResponseError: S3ResponseError: 400 Bad Request
encoding="UTF-8"?>TooManyBuckets






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW :: bucket quota not enforced below 1

2016-01-28 Thread seapasu...@uchicago.edu
Ah, thanks for the clarification. Sorry. So even setting max_buckets to 0
will not prevent them from creating buckets:


lacadmin@ko35-10:~$ radosgw-admin user modify --uid=s3test --max-buckets=0
{
"user_id": "s3test",
"display_name": "s3test",
"email": "",
"suspended": 0,
"max_buckets": 0,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "s3test:whoami",

},
{
"user": "s3test",

}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": true,
"max_size_kb": -1,
"max_objects": 2
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"temp_url_keys": []
}

-
-
-
In [5]: conn.get_canonical_user_id()
Out[5]: u's3test'
In [6]: conn.create_bucket('test')
Out[6]: 

In [7]: conn.create_bucket('test_one')
Out[7]: 

In [8]: conn.create_bucket('test_two')
Out[8]: 

In [9]: conn.create_bucket('test_three')
Out[9]: 

In [10]: conn.create_bucket('test_four')
Out[10]: 

In [11]: for bucket in conn.get_all_buckets():
   : print(bucket.name)
   :
test
test_four
test_one
test_three
test_two

In [12]: for bucket in conn.get_all_buckets():
   : conn.delete_bucket(bucket.name)



lacadmin@ko35-10:~$ radosgw-admin user modify --uid=s3test --max-buckets=1
{
"user_id": "s3test",
"display_name": "s3test",
"email": "",
"suspended": 0,
"max_buckets": 1,
-
-
-
In [15]: conn.create_bucket('s3test_one')
Out[15]: 

In [16]: conn.create_bucket('s3test_two')
---
S3ResponseError   Traceback (most recent call last)
 in ()
> 1 conn.create_bucket('s3test_two')

/usr/lib/python2.7/dist-packages/boto/s3/connection.pyc in 
create_bucket(self, bucket_name, headers, location, policy)

502 else:
503 raise self.provider.storage_response_error(
--> 504 response.status, response.reason, body)
505
506 def delete_bucket(self, bucket, headers=None):

S3ResponseError: S3ResponseError: 400 Bad Request
encoding="UTF-8"?>TooManyBuckets




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw-admin bucket link: empty bucket instance id

2016-01-28 Thread Wido den Hollander
Hi,

I'm trying to link a bucket to a new user and this is failing for me.

The Ceph version is 0.94.5 (Hammer).

The bucket is called 'packer' and I can verify that it exists:

$ radosgw-admin bucket stats --bucket packer

{
"bucket": "packer",
"pool": ".rgw.buckets",
"index_pool": ".rgw.buckets",
"id": "ams02.5862567.3564",
"marker": "ams02.5862567.3564",
"owner": "X_beta",
"ver": "0#21975",
"master_ver": "0#0",
"mtime": "2015-08-04 12:31:06.00",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size_kb": 10737764,
"size_kb_actual": 10737836,
"num_objects": 27
},
"rgw.multimeta": {
"size_kb": 0,
"size_kb_actual": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}

Now when I try to link this bucket it fails:

$ radosgw-admin bucket link --bucket packer --uid 

"failure: (22) Invalid argument: empty bucket instance id"

It seems like this is a bug in the radosgw-admin tool where it doesn't
parse the --bucket argument properly.

Any ideas?

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object-map

2016-01-28 Thread Jason Dillaman
There are no known issues with object map in the hammer release.  Support for 
this feature is not yet available in krbd.  Infernalis adds the ability to 
dynamically enable/disable features which certainly lowers the bar for testing 
it out.  Infernalis also adds support for fast-diff (which is based on the 
object map), which will dramatically speed up diffs and disk usage calculations.
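
(For illustration, a minimal sketch of the dynamic enable path on an
Infernalis client; the image spec "rbd/myimage" is made up, and object-map
depends on exclusive-lock, fast-diff on object-map:)

rbd feature enable rbd/myimage exclusive-lock
rbd feature enable rbd/myimage object-map
rbd feature enable rbd/myimage fast-diff
rbd object-map rebuild rbd/myimage   # populate the map for pre-existing data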

-- 

Jason Dillaman 

- Original Message -
> From: "Wukongming" 
> To: ceph-de...@vger.kernel.org, ceph-users@lists.ceph.com
> Sent: Wednesday, January 27, 2016 4:06:17 AM
> Subject: [ceph-users] Object-map
> 
> Hi, all
> How is robust & stability when enable object-map feature in "Hammer"?
> 
> -
> wukongming ID: 12019
> Tel:0571-86760239
> Dept:2014 UIS2 ONEStor
> 
> -
> 本邮件及其附件含有杭州华三通信技术有限公司的保密信息,仅限于发送给上面地址中列出
> 的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、
> 或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本
> 邮件!
> This e-mail and its attachments contain confidential information from H3C,
> which is
> intended only for the person or entity whose address is listed above. Any use
> of the
> information contained herein in any way (including, but not limited to, total
> or partial
> disclosure, reproduction, or dissemination) by persons other than the
> intended
> recipient(s) is prohibited. If you receive this e-mail in error, please
> notify the sender
> by phone or email immediately and delete it!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Jason Dillaman
The way to interpret that output is that the HEAD revision of "CentOS3" has 
about a 700MB delta from the previous snapshot (i.e. 700MB + 18G are used by 
this image and its snapshot).  There probably should be an option in the rbd 
CLI to generate the full usage for a particular image and all of its snapshots. 
 Right now the only way to do that is to perform a du on the whole pool.
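
(For example, against the pool from the earlier output, something like the
following should list usage for every image and snapshot in the pool:)

rbd du -p storage1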

-- 

Jason Dillaman 

- Original Message - 

> From: "Bill WONG" 
> To: "Mihai Gheorghe" 
> Cc: "ceph-users" 
> Sent: Thursday, January 28, 2016 6:17:05 AM
> Subject: Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

> Hi Mihai,

> it looks rather strange in ceph snapshot, the size of snapshot is bigger than
> the size of original images..
> Original Image actual used size is 684M w/ provisioned 20G
> but the snapshot actual used size is ~18G w/ provisioned 20G

> any ideas?

> ==
> [root@compute2 ~]# rbd du CentOS3 -p storage1
> warning: fast-diff map is not enabled for CentOS3. operation may be slow.
> NAME PROVISIONED USED
> CentOS3 20480M 684M

> [root@compute2 ~]# rbd du CentOS3@snap1 -p storage1
> warning: fast-diff map is not enabled for CentOS3. operation may be slow.
> NAME PROVISIONED USED
> CentOS3@snap1 20480M 18124M
> ==
> qemu-img info rbd:storage1/CentOS3
> image: rbd:storage1/CentOS3
> file format: raw
> virtual size: 20G (21474836480 bytes)
> disk size: unavailable
> cluster_size: 4194304
> Snapshot list:
> ID TAG VM SIZE DATE VM CLOCK
> snap1 snap1 20G 1970-01-01 08:00:00 00:00:00.000

> On Thu, Jan 28, 2016 at 7:09 PM, Mihai Gheorghe < mcaps...@gmail.com > wrote:

> > As far as i know, snapshotting with qemu will download a copy of the image
> > on
> > local storage and then upload it into ceph. At least this is the default
> > behaviour when taking a snapshot in openstack of a running instance. I
> > don't
> > see why it would be any different with qemu-kvm. You must use the rbd snap
> > feature to make a copy on write clone of the image.
> 
> > On 28 Jan 2016 12:59, "Bill WONG" < wongahsh...@gmail.com > wrote:
> 

> > > Hi Simon,
> > 
> 

> > > how you manage to preform snapshot with the raw format in qemu-kvm VMs?
> > 
> 
> > > and i found some issues with libvirt virsh commands with ceph:
> > 
> 
> > > --
> > 
> 
> > > 1) create storage pool with ceph via virsh
> > 
> 
> > > 2) create a vol via virsh - virsh vol-create-as rbdpool VM1 100G
> > 
> 

> > > problem is here.. if we directly create vol via qemu-img create -f rbd
> > > rbd:rbdpool/VM1 100G, then virsh is unable to find the vol. - virsh
> > > vol-list
> > > rbdpool command is unable to list the vol, it looks such commands - rbd,
> > > virsh and qemu-img creating images are not synced with each other...
> > 
> 

> > > cloud you please let me know how you use the ceph as backend storage of
> > > qemu-kvm, as if i google it, most of the ceph application is used for
> > > OpenStack, but not simply pure qemu-kvm. as if setting up Glnace and
> > > Cinder
> > > is troublesome...
> > 
> 

> > > thank you!
> > 
> 

> > > On Thu, Jan 28, 2016 at 5:23 PM, Simon Ironside < sirons...@caffetine.org
> > > >
> > > wrote:
> > 
> 

> > > > On 28/01/16 08:30, Bill WONG wrote:
> > > 
> > 
> 

> > > > > without having qcow2, the qemu-kvm cannot make snapshot and other
> > > > 
> > > 
> > 
> 
> > > > > features anyone have ideas or experiences on this?
> > > > 
> > > 
> > 
> 
> > > > > thank you!
> > > > 
> > > 
> > 
> 

> > > > I'm using raw too and create snapshots using "rbd snap create"
> > > 
> > 
> 

> > > > Cheers,
> > > 
> > 
> 
> > > > Simon
> > > 
> > 
> 

> > > ___
> > 
> 
> > > ceph-users mailing list
> > 
> 
> > > ceph-users@lists.ceph.com
> > 
> 
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Simon Ironside

On 28/01/16 14:51, Bill WONG wrote:


perfect! thank you!


You're welcome :)


do you find a strange issue on the snapshot of rbd? the snapshot actual
size is bigger than the original image file.. if you use rbd du to check
the size..


I'm using the hammer release so "rbd du" doesn't work for me.

Cheers,
Simon.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data loss when flattening a cloned image on giant

2016-01-28 Thread Jason Dillaman
The parent / clone relationship is still established via the snapshot, just not 
the HEAD revision of the image.  Therefore, probably the easiest fix would be 
to ensure that a rollback would also copy that parent / clone relationship link 
to the HEAD revision from the snapshot.  Basically, since you rolled back past 
the flatten operation, we would also want the rollback operation to fully 
rollback the flatten (not just the block objects).

-- 

Jason Dillaman 


- Original Message -
> From: "wuxingyi" 
> To: dilla...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Sent: Thursday, January 28, 2016 4:52:57 AM
> Subject: RE: [ceph-users] data loss when flattening a cloned image on giant
> 
> Thank you for your quick reply :)
>  
> The first object of the cloned image has already lost after flattening, so it
> may be too late to restore the parent relationship during the rollback
> operation.
> 
> 
> 
> > Date: Tue, 26 Jan 2016 09:50:56 -0500
> > From: dilla...@redhat.com
> > To: wuxingyi...@outlook.com
> > CC: ceph-users@lists.ceph.com; wuxin...@letv.com
> > Subject: Re: [ceph-users] data loss when flattening a cloned image on giant
> >
> > Interesting find. This is an interesting edge case interaction between
> > snapshot, flatten, and rollback. I believe this was unintentionally fixed
> > by the deep-flatten feature added to infernalis. Probably the simplest fix
> > for giant would be to restore the parent image link during the rollback
> > (since that link is still established via the snapshot).
> >
> > --
> >
> > Jason Dillaman
> >
> >
> > - Original Message -
> >> From: "wuxingyi" 
> >> To: ceph-users@lists.ceph.com
> >> Cc: wuxin...@letv.com
> >> Sent: Tuesday, January 26, 2016 3:11:11 AM
> >> Subject: Re: [ceph-users] data loss when flattening a cloned image on
> >> giant
> >>
> >> really sorry for the bad format, I will put it here again.
> >>
> >> I found data lost when flattening a cloned image on giant(0.87.2). The
> >> problem can be easily reproduced by runing the following script:
> >> #!/bin/bash
> >> ceph osd pool create wuxingyi 1 1
> >> rbd create --image-format 2 wuxingyi/disk1.img --size 8
> >> #writing "FOOBAR" at offset 0
> >> python writetooffset.py disk1.img 0 FOOBAR
> >> rbd snap create wuxingyi/disk1.img@SNAPSHOT
> >> rbd snap protect wuxingyi/disk1.img@SNAPSHOT
> >>
> >> echo "start cloing"
> >> rbd clone wuxingyi/disk1.img@SNAPSHOT wuxingyi/CLONEIMAGE
> >>
> >> #writing "WUXINGYI" at offset 4M of cloned image
> >> python writetooffset.py CLONEIMAGE $((4*1048576)) WUXINGYI
> >> rbd snap create wuxingyi/CLONEIMAGE@CLONEDSNAPSHOT
> >>
> >> #modify at offset 4M of cloned image
> >> python writetooffset.py CLONEIMAGE $((4*1048576)) HEHEHEHE
> >>
> >> echo "start flattening CLONEIMAGE"
> >> rbd flatten wuxingyi/CLONEIMAGE
> >>
> >> echo "before rollback"
> >> rbd export wuxingyi/CLONEIMAGE && hexdump -C CLONEIMAGE
> >> rm CLONEIMAGE -f
> >> rbd snap rollback wuxingyi/CLONEIMAGE@CLONEDSNAPSHOT
> >> echo "after rollback"
> >> rbd export wuxingyi/CLONEIMAGE && hexdump -C CLONEIMAGE
> >> rm CLONEIMAGE -f
> >>
> >>
> >> where writetooffset.py is a simple python script writing specific data to
> >> the
> >> specific offset of the image:
> >> #!/usr/bin/python
> >> #coding=utf-8
> >> import sys
> >> import rbd
> >> import rados
> >>
> >> cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
> >> cluster.connect()
> >> ioctx = cluster.open_ioctx('wuxingyi')
> >> rbd_inst = rbd.RBD()
> >> image=rbd.Image(ioctx, sys.argv[1])
> >> image.write(sys.argv[3], int(sys.argv[2]))
> >>
> >> The output is something like:
> >>
> >> before rollback
> >> Exporting image: 100% complete...done.
> >>  46 4f 4f 42 41 52 00 00 00 00 00 00 00 00 00 00
> >> |FOOBAR..|
> >> 0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> ||
> >> *
> >> 0040 48 45 48 45 48 45 48 45 00 00 00 00 00 00 00 00
> >> |HEHEHEHE|
> >> 00400010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> ||
> >> *
> >> 0080
> >> Rolling back to snapshot: 100% complete...done.
> >> after rollback
> >> Exporting image: 100% complete...done.
> >>  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> ||
> >> *
> >> 0040 57 55 58 49 4e 47 59 49 00 00 00 00 00 00 00 00
> >> |WUXINGYI|
> >> 00400010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> ||
> >> *
> >> 0080
> >>
> >>
> >> We can easily fount that the first object of the image is definitely lost,
> >> and I found the data loss is happened when flattening, there is only a
> >> "head" version of the first object, actually a "snapid" version of the
> >> object should also be created and writed when flattening.
> >> But when running this scripts on upstream code, I cannot hit this problem.
> >> I
> >> look through the upstream code but could not find which commit fixes this
> >> bug. I also found the whole state m

Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Bill WONG
Hi Simon,

Perfect! Thank you!
Do you see a strange issue with rbd snapshots? The snapshot's actual
size is bigger than the original image file if you use rbd du to check
the size.


On Thu, Jan 28, 2016 at 9:44 PM, Simon Ironside 
wrote:

> On 28/01/16 12:56, Bill WONG wrote:
>
> you dump the snapshot to raw img, possible to exprt to qcow2 format?
>>
>
> Yes, I dump to raw because I'm able to get better and faster compression
> of the image myself than using the qcow2 format.
>
> You can export directly to qcow2 with qemu-img convert if you want:
> qemu-img convert -c -p -f rbd -O qcow2 \
> rbd:rbd/testvm@backup \
> testvm.qcow2
>
> and you meant create the VM using qcow2 with the local HDD storage, then
>> convert to rbd?
>> is it perfect, if you can provide more details... it's highly
>> appreciated.. thank you!
>>
>
> Ok . . .
>
> 1. Create the VM using a file-based qcow2 image then convert to rbd
>
> # Create a VM using a regular file
> virt-install --name testvm \
> --ram=1024 --vcpus=1 --os-variant=rhel7 \
> --controller scsi,model=virtio-scsi \
> --disk
> path=/var/lib/libvirt/images/testvm.qcow2,size=10,bus=scsi,format=qcow2 \
> --cdrom=/var/lib/libvirt/images/rhel7.iso \
> --nonetworks --graphics vnc
>
> # Complete your VM's setup then shut it down
>
> # Convert qcow2 image to rbd
> qemu-img convert -p -f qcow2 -O rbd \
> /var/lib/libvirt/images/testvm.qcow2 \
> rbd:rbd/testvm
>
> # Delete the qcow2 image, don't need it anymore
> rm -f /var/lib/libvirt/images/testvm.qcow2
>
> # Update the VM definition
> virsh edit testvm
>   # Find the  section referring to your original qcow2 image
>   # Delete it and replace with:
>
>   
> 
> 
>   
>   
>   
> 
> 
>   
> 
> 
>   
>
>   # Obvious use your own ceph monitor host name(s)
>   # Also change CEPH_USERNAME and SECRET_UUID to suit
>
> # Restart your VM, it'll now be using ceph storage directly.
>
> Btw, using virtio-scsi devices as above and discard='unmap' above enables
> TRIM support. This means you can use fstrim or mount file systems with
> discard inside the VM to free up unused space in the image.
>
> 2. Modify the XML produced by virt-install before the VM is started
>
> The process here is basically the same as above, the trick is to make the
> disk XML change before the VM is started for the first time so that it's
> not necessary to shut down the VM to copy from qcow2 file to rbd image.
>
> # Create RBD image for the VM
> qemu-img create -f rbd rbd:rbd/testvm 10G
>
> # Create a VM XML but don't start it
> virt-install --name testvm \
> --ram=1024 --vcpus=1 --os-variant=rhel7 \
> --controller scsi,model=virtio-scsi \
> --disk
> path=/var/lib/libvirt/images/deleteme.img,size=1,bus=scsi,format=raw \
> --cdrom=/var/lib/libvirt/images/rhel7.iso \
> --nonetworks --graphics vnc \
> --dry-run --print-step 1 > testvm.xml
>
> # Define the VM from XML
> virsh define testvm.xml
>
> # Update the VM definition
> virsh edit testvm
>   # Find the  section referring to your original deleteme image
>   # Delete it and replace it with RBD disk XML as in procedure 1.
>
> # Restart your VM, it'll now be using ceph storage directly.
>
> I think it's easier to understand what's going on with procedure 1 but
> once you're comfortable I suspect you'll end up using procedure 2, mainly
> because it saves having to shut down the VM and do the conversion and also
> because my compute nodes only have tiny local storage.
>
> It's also possible to script much of the above with the likes of virsh
> detach-disk and virsh attach-device to make the disk XML change.
>
> Cheers,
> Simon.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Simon Ironside

On 28/01/16 12:56, Bill WONG wrote:


you dump the snapshot to raw img, possible to exprt to qcow2 format?


Yes, I dump to raw because I'm able to get better and faster compression 
of the image myself than using the qcow2 format.


You can export directly to qcow2 with qemu-img convert if you want:
qemu-img convert -c -p -f rbd -O qcow2 \
rbd:rbd/testvm@backup \
testvm.qcow2


and you meant create the VM using qcow2 with the local HDD storage, then
convert to rbd?
is it perfect, if you can provide more details... it's highly
appreciated.. thank you!


Ok . . .

1. Create the VM using a file-based qcow2 image then convert to rbd

# Create a VM using a regular file
virt-install --name testvm \
--ram=1024 --vcpus=1 --os-variant=rhel7 \
--controller scsi,model=virtio-scsi \
--disk path=/var/lib/libvirt/images/testvm.qcow2,size=10,bus=scsi,format=qcow2 \
--cdrom=/var/lib/libvirt/images/rhel7.iso \
--nonetworks --graphics vnc

# Complete your VM's setup then shut it down

# Convert qcow2 image to rbd
qemu-img convert -p -f qcow2 -O rbd \
/var/lib/libvirt/images/testvm.qcow2 \
rbd:rbd/testvm

# Delete the qcow2 image, don't need it anymore
rm -f /var/lib/libvirt/images/testvm.qcow2

# Update the VM definition
virsh edit testvm
  # Find the  section referring to your original qcow2 image
  # Delete it and replace with:

  


  
  
  


  


  

  # Obviously, use your own ceph monitor host name(s)
  # Also change CEPH_USERNAME and SECRET_UUID to suit
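
  (A libvirt RBD disk element along these lines, with CEPH_USERNAME,
  SECRET_UUID and the monitor hostname left as placeholders to fill in,
  might look roughly like this:)

  <disk type='network' device='disk'>
    <driver name='qemu' type='raw' discard='unmap'/>
    <auth username='CEPH_USERNAME'>
      <secret type='ceph' uuid='SECRET_UUID'/>
    </auth>
    <source protocol='rbd' name='rbd/testvm'>
      <host name='ceph-mon1' port='6789'/>
    </source>
    <target dev='sda' bus='scsi'/>
  </disk>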

# Restart your VM, it'll now be using ceph storage directly.

Btw, using virtio-scsi devices as above and discard='unmap' above 
enables TRIM support. This means you can use fstrim or mount file 
systems with discard inside the VM to free up unused space in the image.
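
(Inside the guest, for example, either of these should work; the mount point
and device name are just examples:)

fstrim -v /                          # trim free space on the root filesystem once
mount -o discard /dev/sda1 /data     # or mount with online discard enabled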


2. Modify the XML produced by virt-install before the VM is started

The process here is basically the same as above, the trick is to make 
the disk XML change before the VM is started for the first time so that 
it's not necessary to shut down the VM to copy from qcow2 file to rbd image.


# Create RBD image for the VM
qemu-img create -f rbd rbd:rbd/testvm 10G

# Create a VM XML but don't start it
virt-install --name testvm \
--ram=1024 --vcpus=1 --os-variant=rhel7 \
--controller scsi,model=virtio-scsi \
--disk path=/var/lib/libvirt/images/deleteme.img,size=1,bus=scsi,format=raw \
--cdrom=/var/lib/libvirt/images/rhel7.iso \
--nonetworks --graphics vnc \
--dry-run --print-step 1 > testvm.xml

# Define the VM from XML
virsh define testvm.xml

# Update the VM definition
virsh edit testvm
  # Find the  section referring to your original deleteme image
  # Delete it and replace it with RBD disk XML as in procedure 1.

# Restart your VM, it'll now be using ceph storage directly.

I think it's easier to understand what's going on with procedure 1 but 
once you're comfortable I suspect you'll end up using procedure 2, 
mainly because it saves having to shut down the VM and do the conversion 
and also because my compute nodes only have tiny local storage.


It's also possible to script much of the above with the likes of virsh 
detach-disk and virsh attach-device to make the disk XML change.
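
(As a rough sketch of that scripting idea; the domain name, target device and
XML file name are made up:)

# Drop the placeholder file-backed disk from the persistent definition
virsh detach-disk testvm sda --config
# Attach an RBD-backed disk described in a small XML snippet
virsh attach-device testvm rbd-disk.xml --config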


Cheers,
Simon.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Typical architecture in RDB mode - Number of servers explained ?

2016-01-28 Thread Eneko Lacunza

Hi,

El 28/01/16 a las 13:53, Gaetan SLONGO escribió:

Dear Ceph users,

We are currently working on CEPH (RBD mode only). The technology is 
currently in "preview" state in our lab. We are currently diving into 
Ceph design... We know it requires at least 3 nodes (OSDs+Monitors 
inside) to work properly. But we would like to know if it makes sense 
to use 4 nodes ? I've heard this is not a good idea because all of the 
capacity of the 4 servers won't be available ?

Someone can confirm ?
There's no problem using 4 servers for OSDs; just don't put a monitor on
one of the four nodes. Always keep an odd number of monitors (3 or 5).


Monitors don't need to be on an OSD node, and in fact for medium and
large clusters it is recommended to have dedicated nodes for them.


Cheers
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
  943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Bill WONG
Hi Simon,

thank you!
You dump the snapshot to a raw img; is it possible to export to qcow2 format?
And did you mean to create the VM using qcow2 on local HDD storage, then
convert it to rbd?
It would be perfect if you could provide more details... it's highly
appreciated. Thank you!



On Thu, Jan 28, 2016 at 7:57 PM, Simon Ironside 
wrote:

> On 28/01/16 10:59, Bill WONG wrote:
>
> how you manage to preform snapshot with the raw format in qemu-kvm VMs?
>>
>
> # Create a snapshot called backup of testvm's image in the rbd pool:
> rbd snap create --read-only rbd/testvm@backup
>
> # Dump the snapshot to file
> rbd export rbd/testvm@backup testvm-backup.img
>
> # Delete the snapshot
> rbd snap rm rbd/testvm@backup
>
> cloud you please let me know how you use the ceph as backend storage of
>> qemu-kvm, as if i google it, most of the ceph application is used for
>> OpenStack, but not simply pure qemu-kvm.
>>
>
> I'm using pure libvirt/qemu-kvm too. Last I checked, virt-install doesn't
> support using rbd volumes directly. There's two ways I know of to get
> around this:
>
> 1. Create the VM using a file-based qcow2 image then convert to rbd
>
> 2. Modify the XML produced by virt-install before the VM is started
>
> I can provide detailed steps for these if you need them.
>
> Cheers,
> Simon
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Typical architecture in RDB mode - Number of servers explained ?

2016-01-28 Thread Gaetan SLONGO
Dear Ceph users, 

We are currently working on Ceph (RBD mode only). The technology is in 
"preview" state in our lab and we are diving into Ceph design... We know it 
requires at least 3 nodes (OSDs+Monitors inside) to work properly, but we 
would like to know if it makes sense to use 4 nodes. I've heard this is not 
a good idea because not all of the capacity of the 4 servers will be 
available. Can someone confirm? 

Thank you for advance, 

Best regards 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Jan Schermer
Yum repo is not an intensive workload, take a look at GFS2 or OCFS then.

Jan

> On 28 Jan 2016, at 13:23, Sándor Szombat  wrote:
> 
> Sorry I forget something: we use ceph for another things (for example docker 
> registry backend).
> And yes, cephfs maybe overkill but we seach for best solution now. And we 
> want to reduce the used tools count so we can solve this with Ceph it's good.
> 
> Thanks for your help and I will check these.
> 
> 2016-01-28 13:13 GMT+01:00 Jan Schermer  >:
> Yum repo doesn't really sound like something "mission critical" (in my 
> experience). You need to wait for the repo to update anyway before you can 
> use the package, so not something realtime either.
> CephFS is overkill for this.
> 
> I would either simply rsync the repo between the three machines via cron (and 
> set DNS to point to all three IPs they have). If you need something "more 
> realtime" than you can use for example incron.
> Or you can push new packages to all three and refresh them (assuming you have 
> some tools that do that already).
> Or if you want to use Ceph then you can create one rbd image that only gets 
> mounted on one of the hosts, and if that goes down you remount it elsewhere 
> (by hand, or via pacemaker, cron script...). I don't think Ceph makes sense 
> if that going to be the only use, though...
> 
> Maybe you could also push the packages to RadosGW, but I'm not familiar with 
> it that much, not sure how you build a repo that points there. This would 
> make sense but I have no idea if it's simple to do.
> 
> Jan
> 
> 
>> On 28 Jan 2016, at 13:03, Sándor Szombat > > wrote:
>> 
>> Hello,
>> 
>> yes I missunderstand things I things. Thanks for your help!
>> 
>> So the situation is next: We have a yum repo with rpm packages. We want to 
>> store these rpm's in Ceph. But we have three main node what can be able to 
>> install the other nodes. So we have to share these rpm packages between the 
>> three host. I check the CephFs, it will be the best solution for us, but it 
>> is in beta and beta products cant allowed for us. This is why I tryto find 
>> another, Ceph based solution. 
>> 
>> 
>> 
>> 2016-01-28 12:46 GMT+01:00 Jan Schermer > >:
>> This is somewhat confusing.
>> 
>> CephFS is a shared filesystem - you mount that on N hosts and they can 
>> access the data simultaneously.
>> RBD is a block device, this block device can be accesses from more than 1 
>> host, BUT you need to use a cluster aware filesystem (such as GFS2, OCFS).
>> 
>> Both CephFS and RBD use RADOS as a backend, which is responsible for data 
>> placement, high-availability and so on.
>> 
>> If you explain your scenario more we could suggest some options - do you 
>> really need to have the data accessible on more servers, or is a (short) 
>> outage acceptable when one server goes down? What type of data do you need 
>> to share and how will the data be accessed?
>> 
>> Jan
>> 
>> > On 28 Jan 2016, at 11:06, Sándor Szombat > > > wrote:
>> >
>> > Hello all!
>> >
>> > I check the Ceph FS, but it is beta now unfortunatelly. I start to meet 
>> > with the rdb. It is possible to create an image in a pool, mount it as a 
>> > block device (for example /dev/rbd0), and format this as HDD, and mount it 
>> > on 2 host? I tried to make this, and it's work but after mount the  
>> > /dev/rbd0 on the two host and I tried to put files into these mounted 
>> > folders it can't refresh automatically between hosts.
>> > So the main question: this will be a possible solution?
>> > (The task: we have 3 main node what can install the other nodes with 
>> > ansible, and we want to store our rpm's in ceph it is possible. This is 
>> > necessary because of high avability.)
>> >
>> > Thanks for your help!
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com 
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> > 
>> 
>> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Zoltan Arnold Nagy
This has recently been fixed and will be available in Mitaka (after a lot of 
people fighting this for years).

Details: https://review.openstack.org/#/c/205282/ 


> On 28 Jan 2016, at 12:09, Mihai Gheorghe  wrote:
> 
> As far as i know, snapshotting with qemu will download a copy of the image on 
> local storage and then upload it into ceph. At least this is the default 
> behaviour when taking a snapshot in openstack of a running instance. I don't 
> see why it would be any different with qemu-kvm. You must use the rbd snap 
> feature to make a copy on write clone of the image.
> 
> On 28 Jan 2016 12:59, "Bill WONG"  > wrote:
> Hi Simon,
> 
> how you manage to preform snapshot with the raw format in qemu-kvm VMs?
> and i found some issues with libvirt virsh commands with ceph:
> -- 
> 1) create storage pool with ceph via virsh 
> 2) create a vol via virsh - virsh vol-create-as rbdpool VM1 100G
> 
> problem is here.. if we directly create vol via qemu-img create -f rbd 
> rbd:rbdpool/VM1 100G, then virsh is unable to find the vol. - virsh vol-list 
> rbdpool command is unable to list the vol, it looks such commands - rbd, 
> virsh and qemu-img creating images are not synced with each other...
> 
> cloud you please let me know how you use the ceph as backend storage of 
> qemu-kvm, as if i google it, most of the ceph application is used for 
> OpenStack, but not simply pure qemu-kvm. as if setting up Glnace and Cinder 
> is troublesome...
> 
> thank you!
> 
> 
> On Thu, Jan 28, 2016 at 5:23 PM, Simon Ironside  > wrote:
> On 28/01/16 08:30, Bill WONG wrote:
> 
> without having qcow2, the qemu-kvm cannot make snapshot and other
> features anyone have ideas or experiences on this?
> thank you!
> 
> I'm using raw too and create snapshots using "rbd snap create"
> 
> Cheers,
> Simon
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Sándor Szombat
Sorry, I forgot something: we use ceph for other things as well (for example
as a docker registry backend).
And yes, CephFS may be overkill, but we are searching for the best solution
now. We also want to reduce the number of tools we use, so if we can solve
this with Ceph, that's good.

Thanks for your help and I will check these.

2016-01-28 13:13 GMT+01:00 Jan Schermer :

> Yum repo doesn't really sound like something "mission critical" (in my
> experience). You need to wait for the repo to update anyway before you can
> use the package, so not something realtime either.
> CephFS is overkill for this.
>
> I would either simply rsync the repo between the three machines via cron
> (and set DNS to point to all three IPs they have). If you need something
> "more realtime" than you can use for example incron.
> Or you can push new packages to all three and refresh them (assuming you
> have some tools that do that already).
> Or if you want to use Ceph then you can create one rbd image that only
> gets mounted on one of the hosts, and if that goes down you remount it
> elsewhere (by hand, or via pacemaker, cron script...). I don't think Ceph
> makes sense if that going to be the only use, though...
>
> Maybe you could also push the packages to RadosGW, but I'm not familiar
> with it that much, not sure how you build a repo that points there. This
> would make sense but I have no idea if it's simple to do.
>
> Jan
>
>
> On 28 Jan 2016, at 13:03, Sándor Szombat  wrote:
>
> Hello,
>
> yes I missunderstand things I things. Thanks for your help!
>
> So the situation is next: We have a yum repo with rpm packages. We want to
> store these rpm's in Ceph. But we have three main node what can be able to
> install the other nodes. So we have to share these rpm packages between the
> three host. I check the CephFs, it will be the best solution for us, but it
> is in beta and beta products cant allowed for us. This is why I tryto find
> another, Ceph based solution.
>
>
>
> 2016-01-28 12:46 GMT+01:00 Jan Schermer :
>
>> This is somewhat confusing.
>>
>> CephFS is a shared filesystem - you mount that on N hosts and they can
>> access the data simultaneously.
>> RBD is a block device, this block device can be accesses from more than 1
>> host, BUT you need to use a cluster aware filesystem (such as GFS2, OCFS).
>>
>> Both CephFS and RBD use RADOS as a backend, which is responsible for data
>> placement, high-availability and so on.
>>
>> If you explain your scenario more we could suggest some options - do you
>> really need to have the data accessible on more servers, or is a (short)
>> outage acceptable when one server goes down? What type of data do you need
>> to share and how will the data be accessed?
>>
>> Jan
>>
>> > On 28 Jan 2016, at 11:06, Sándor Szombat 
>> wrote:
>> >
>> > Hello all!
>> >
>> > I check the Ceph FS, but it is beta now unfortunatelly. I start to meet
>> with the rdb. It is possible to create an image in a pool, mount it as a
>> block device (for example /dev/rbd0), and format this as HDD, and mount it
>> on 2 host? I tried to make this, and it's work but after mount the
>> /dev/rbd0 on the two host and I tried to put files into these mounted
>> folders it can't refresh automatically between hosts.
>> > So the main question: this will be a possible solution?
>> > (The task: we have 3 main node what can install the other nodes with
>> ansible, and we want to store our rpm's in ceph it is possible. This is
>> necessary because of high avability.)
>> >
>> > Thanks for your help!
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Jan Schermer
Yum repo doesn't really sound like something "mission critical" (in my 
experience). You need to wait for the repo to update anyway before you can use 
the package, so not something realtime either.
CephFS is overkill for this.

I would either simply rsync the repo between the three machines via cron (and 
set DNS to point to all three IPs they have). If you need something "more 
realtime" then you can use, for example, incron.
Or you can push new packages to all three and refresh them (assuming you have 
some tools that do that already).
Or if you want to use Ceph then you can create one rbd image that only gets 
mounted on one of the hosts, and if that goes down you remount it elsewhere (by 
hand, or via pacemaker, cron script...). I don't think Ceph makes sense if 
that's going to be the only use, though...
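
(A minimal sketch of that single-mount approach; the pool, image name, size
and mount point are made up:)

rbd create repo/yumrepo --size 102400   # 100 GB image in a 'repo' pool
rbd map repo/yumrepo                    # shows up as e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /var/www/html/repo      # mount on exactly one host at a time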

Maybe you could also push the packages to RadosGW, but I'm not familiar with it 
that much, not sure how you build a repo that points there. This would make 
sense but I have no idea if it's simple to do.

Jan


> On 28 Jan 2016, at 13:03, Sándor Szombat  wrote:
> 
> Hello,
> 
> yes I missunderstand things I things. Thanks for your help!
> 
> So the situation is next: We have a yum repo with rpm packages. We want to 
> store these rpm's in Ceph. But we have three main node what can be able to 
> install the other nodes. So we have to share these rpm packages between the 
> three host. I check the CephFs, it will be the best solution for us, but it 
> is in beta and beta products cant allowed for us. This is why I tryto find 
> another, Ceph based solution. 
> 
> 
> 
> 2016-01-28 12:46 GMT+01:00 Jan Schermer  >:
> This is somewhat confusing.
> 
> CephFS is a shared filesystem - you mount that on N hosts and they can access 
> the data simultaneously.
> RBD is a block device, this block device can be accesses from more than 1 
> host, BUT you need to use a cluster aware filesystem (such as GFS2, OCFS).
> 
> Both CephFS and RBD use RADOS as a backend, which is responsible for data 
> placement, high-availability and so on.
> 
> If you explain your scenario more we could suggest some options - do you 
> really need to have the data accessible on more servers, or is a (short) 
> outage acceptable when one server goes down? What type of data do you need to 
> share and how will the data be accessed?
> 
> Jan
> 
> > On 28 Jan 2016, at 11:06, Sándor Szombat  > > wrote:
> >
> > Hello all!
> >
> > I check the Ceph FS, but it is beta now unfortunatelly. I start to meet 
> > with the rdb. It is possible to create an image in a pool, mount it as a 
> > block device (for example /dev/rbd0), and format this as HDD, and mount it 
> > on 2 host? I tried to make this, and it's work but after mount the  
> > /dev/rbd0 on the two host and I tried to put files into these mounted 
> > folders it can't refresh automatically between hosts.
> > So the main question: this will be a possible solution?
> > (The task: we have 3 main node what can install the other nodes with 
> > ansible, and we want to store our rpm's in ceph it is possible. This is 
> > necessary because of high avability.)
> >
> > Thanks for your help!
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Sándor Szombat
Hello,

Yes, I misunderstood some things, I think. Thanks for your help!

So the situation is this: We have a yum repo with rpm packages. We want to
store these rpm's in Ceph. But we have three main nodes that are able to
install the other nodes, so we have to share these rpm packages between the
three hosts. I checked CephFS and it would be the best solution for us, but
it is in beta and beta products aren't allowed for us. This is why I am
trying to find another, Ceph-based solution.



2016-01-28 12:46 GMT+01:00 Jan Schermer :

> This is somewhat confusing.
>
> CephFS is a shared filesystem - you mount that on N hosts and they can
> access the data simultaneously.
> RBD is a block device, this block device can be accesses from more than 1
> host, BUT you need to use a cluster aware filesystem (such as GFS2, OCFS).
>
> Both CephFS and RBD use RADOS as a backend, which is responsible for data
> placement, high-availability and so on.
>
> If you explain your scenario more we could suggest some options - do you
> really need to have the data accessible on more servers, or is a (short)
> outage acceptable when one server goes down? What type of data do you need
> to share and how will the data be accessed?
>
> Jan
>
> > On 28 Jan 2016, at 11:06, Sándor Szombat 
> wrote:
> >
> > Hello all!
> >
> > I check the Ceph FS, but it is beta now unfortunatelly. I start to meet
> with the rdb. It is possible to create an image in a pool, mount it as a
> block device (for example /dev/rbd0), and format this as HDD, and mount it
> on 2 host? I tried to make this, and it's work but after mount the
> /dev/rbd0 on the two host and I tried to put files into these mounted
> folders it can't refresh automatically between hosts.
> > So the main question: this will be a possible solution?
> > (The task: we have 3 main node what can install the other nodes with
> ansible, and we want to store our rpm's in ceph it is possible. This is
> necessary because of high avability.)
> >
> > Thanks for your help!
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Simon Ironside

On 28/01/16 10:59, Bill WONG wrote:


how you manage to preform snapshot with the raw format in qemu-kvm VMs?


# Create a snapshot called backup of testvm's image in the rbd pool:
rbd snap create --read-only rbd/testvm@backup

# Dump the snapshot to file
rbd export rbd/testvm@backup testvm-backup.img

# Delete the snapshot
rbd snap rm rbd/testvm@backup


cloud you please let me know how you use the ceph as backend storage of
qemu-kvm, as if i google it, most of the ceph application is used for
OpenStack, but not simply pure qemu-kvm.


I'm using pure libvirt/qemu-kvm too. Last I checked, virt-install 
doesn't support using rbd volumes directly. There's two ways I know of 
to get around this:


1. Create the VM using a file-based qcow2 image then convert to rbd

2. Modify the XML produced by virt-install before the VM is started

I can provide detailed steps for these if you need them.

Cheers,
Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Jan Schermer
This is somewhat confusing.

CephFS is a shared filesystem - you mount that on N hosts and they can access 
the data simultaneously.
RBD is a block device, this block device can be accesses from more than 1 host, 
BUT you need to use a cluster aware filesystem (such as GFS2, OCFS).

Both CephFS and RBD use RADOS as a backend, which is responsible for data 
placement, high-availability and so on.

If you explain your scenario more we could suggest some options - do you really 
need to have the data accessible on more servers, or is a (short) outage 
acceptable when one server goes down? What type of data do you need to share 
and how will the data be accessed?

Jan

> On 28 Jan 2016, at 11:06, Sándor Szombat  wrote:
> 
> Hello all! 
> 
> I check the Ceph FS, but it is beta now unfortunatelly. I start to meet with 
> the rdb. It is possible to create an image in a pool, mount it as a block 
> device (for example /dev/rbd0), and format this as HDD, and mount it on 2 
> host? I tried to make this, and it's work but after mount the  /dev/rbd0 on 
> the two host and I tried to put files into these mounted folders it can't 
> refresh automatically between hosts. 
> So the main question: this will be a possible solution?
> (The task: we have 3 main node what can install the other nodes with ansible, 
> and we want to store our rpm's in ceph it is possible. This is necessary 
> because of high avability.)
> 
> Thanks for your help!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Bill WONG
Hi Mihai,

It looks rather strange: in the ceph snapshot, the size of the snapshot is
bigger than the size of the original image.
The original image's actual used size is 684M w/ 20G provisioned,
but the snapshot's actual used size is ~18G w/ 20G provisioned.

Any ideas?

==
[root@compute2 ~]# rbd du CentOS3 -p storage1
warning: fast-diff map is not enabled for CentOS3. operation may be slow.
NAMEPROVISIONED USED
CentOS3  20480M 684M

[root@compute2 ~]# rbd du CentOS3@snap1 -p storage1
warning: fast-diff map is not enabled for CentOS3. operation may be slow.
NAME  PROVISIONED   USED
CentOS3@snap1  20480M 18124M
==
qemu-img info rbd:storage1/CentOS3
image: rbd:storage1/CentOS3
file format: raw
virtual size: 20G (21474836480 bytes)
disk size: unavailable
cluster_size: 4194304
Snapshot list:
IDTAG VM SIZEDATE   VM CLOCK
snap1 snap1   20G 1970-01-01 08:00:00   00:00:00.000


On Thu, Jan 28, 2016 at 7:09 PM, Mihai Gheorghe  wrote:

> As far as i know, snapshotting with qemu will download a copy of the image
> on local storage and then upload it into ceph. At least this is the default
> behaviour when taking a snapshot in openstack of a running instance. I
> don't see why it would be any different with qemu-kvm. You must use the rbd
> snap feature to make a copy on write clone of the image.
> On 28 Jan 2016 12:59, "Bill WONG"  wrote:
>
>> Hi Simon,
>>
>> how you manage to preform snapshot with the raw format in qemu-kvm VMs?
>> and i found some issues with libvirt virsh commands with ceph:
>> --
>> 1) create storage pool with ceph via virsh
>> 2) create a vol via virsh - virsh vol-create-as rbdpool VM1 100G
>>
>> problem is here.. if we directly create vol via qemu-img create -f rbd
>> rbd:rbdpool/VM1 100G, then virsh is unable to find the vol. - virsh
>> vol-list rbdpool command is unable to list the vol, it looks such commands
>> - rbd, virsh and qemu-img creating images are not synced with each other...
>>
>> cloud you please let me know how you use the ceph as backend storage of
>> qemu-kvm, as if i google it, most of the ceph application is used for
>> OpenStack, but not simply pure qemu-kvm. as if setting up Glnace and Cinder
>> is troublesome...
>>
>> thank you!
>>
>>
>> On Thu, Jan 28, 2016 at 5:23 PM, Simon Ironside 
>> wrote:
>>
>>> On 28/01/16 08:30, Bill WONG wrote:
>>>
>>> without having qcow2, the qemu-kvm cannot make snapshot and other
 features anyone have ideas or experiences on this?
 thank you!

>>>
>>> I'm using raw too and create snapshots using "rbd snap create"
>>>
>>> Cheers,
>>> Simon
>>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Mihai Gheorghe
As far as I know, snapshotting with qemu will download a copy of the image
to local storage and then upload it into Ceph. At least this is the default
behaviour when taking a snapshot of a running instance in OpenStack. I
don't see why it would be any different with qemu-kvm. You must use the rbd
snap feature to make a copy-on-write clone of the image.
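
Roughly, with placeholder pool/image names (only a sketch):

rbd snap create storage1/CentOS3@base
rbd snap protect storage1/CentOS3@base
rbd clone storage1/CentOS3@base storage1/CentOS3-clone1
rbd children storage1/CentOS3@base    # lists the clones hanging off the snapshot
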
On 28 Jan 2016 12:59, "Bill WONG"  wrote:

> Hi Simon,
>
> how you manage to preform snapshot with the raw format in qemu-kvm VMs?
> and i found some issues with libvirt virsh commands with ceph:
> --
> 1) create storage pool with ceph via virsh
> 2) create a vol via virsh - virsh vol-create-as rbdpool VM1 100G
>
> problem is here.. if we directly create vol via qemu-img create -f rbd
> rbd:rbdpool/VM1 100G, then virsh is unable to find the vol. - virsh
> vol-list rbdpool command is unable to list the vol, it looks such commands
> - rbd, virsh and qemu-img creating images are not synced with each other...
>
> cloud you please let me know how you use the ceph as backend storage of
> qemu-kvm, as if i google it, most of the ceph application is used for
> OpenStack, but not simply pure qemu-kvm. as if setting up Glnace and Cinder
> is troublesome...
>
> thank you!
>
>
> On Thu, Jan 28, 2016 at 5:23 PM, Simon Ironside 
> wrote:
>
>> On 28/01/16 08:30, Bill WONG wrote:
>>
>> without having qcow2, the qemu-kvm cannot make snapshot and other
>>> features anyone have ideas or experiences on this?
>>> thank you!
>>>
>>
>> I'm using raw too and create snapshots using "rbd snap create"
>>
>> Cheers,
>> Simon
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Bill WONG
Hi Simon,

how do you manage to perform snapshots with the raw format in qemu-kvm VMs?
Also, I found some issues with the libvirt virsh commands and ceph:
-- 
1) create storage pool with ceph via virsh
2) create a vol via virsh - virsh vol-create-as rbdpool VM1 100G

The problem is here: if we create the volume directly via qemu-img create
-f rbd rbd:rbdpool/VM1 100G, then virsh is unable to find it - the virsh
vol-list rbdpool command does not list the volume. It looks like images
created with rbd, virsh and qemu-img are not kept in sync with each other...

Could you please let me know how you use ceph as the backend storage for
plain qemu-kvm? When I google it, most of the ceph material is about
OpenStack rather than plain qemu-kvm, and setting up Glance and Cinder is
troublesome...

thank you!
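
(For reference, the exact steps look roughly like this; I suspect virsh
only re-reads the pool contents on a refresh, but that is just a guess on
my part:)

virsh vol-create-as rbdpool VM1 100G          # visible in virsh vol-list rbdpool
qemu-img create -f rbd rbd:rbdpool/VM2 100G   # not visible to virsh...
virsh pool-refresh rbdpool                    # ...until the pool is refreshed (guess)
virsh vol-list rbdpool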


On Thu, Jan 28, 2016 at 5:23 PM, Simon Ironside 
wrote:

> On 28/01/16 08:30, Bill WONG wrote:
>
> without having qcow2, the qemu-kvm cannot make snapshot and other
>> features anyone have ideas or experiences on this?
>> thank you!
>>
>
> I'm using raw too and create snapshots using "rbd snap create"
>
> Cheers,
> Simon
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Loris Cuoghi

On 28/01/2016 11:06, Sándor Szombat wrote:

Hello all!

I checked CephFS, but unfortunately it is still in beta, so I started
looking at RBD instead. Is it possible to create an image in a pool, map
it as a block device (for example /dev/rbd0), format it like a normal
disk, and mount it on 2 hosts? I tried this and it works, but after
mounting /dev/rbd0 on the two hosts and putting files into the mounted
folders, the contents do not refresh automatically between the hosts.
So the main question: is this a workable solution?
(The task: we have 3 main nodes that install the other nodes with
Ansible, and we want to store our RPMs in Ceph if possible. This is
necessary for high availability.)

Thanks for your help!


Hi Sándor,

sharing a block device and filesystem between two or more hosts is only
possible with a purpose-built filesystem: a shared-disk clustered
filesystem like OCFS2 or others listed here [0].


[0] 
https://en.wikipedia.org/wiki/Clustered_file_system#Shared-disk_file_system


Classic filesystems like EXT{2,3,4}, XFS, ... are meant to be mounted by
one host at a time; otherwise, corruption ensues.
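
A very rough sketch of what that could look like on top of an RBD image
(names are placeholders, and the OCFS2/o2cb cluster configuration itself
is not shown here):

# once, from any host
rbd create rbdpool/shared --size 102400
rbd map rbdpool/shared
mkfs.ocfs2 -L shared /dev/rbd0

# on every host that needs the data (o2cb cluster stack already configured)
rbd map rbdpool/shared
mount -t ocfs2 /dev/rbd0 /mnt/shared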


Loris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Sándor Szombat
Hello all!

I checked CephFS, but unfortunately it is still in beta, so I started
looking at RBD instead. Is it possible to create an image in a pool, map
it as a block device (for example /dev/rbd0), format it like a normal
disk, and mount it on 2 hosts? I tried this and it works, but after
mounting /dev/rbd0 on the two hosts and putting files into the mounted
folders, the contents do not refresh automatically between the hosts.
So the main question: is this a workable solution?
(The task: we have 3 main nodes that install the other nodes with
Ansible, and we want to store our RPMs in Ceph if possible. This is
necessary for high availability.)

Thanks for your help!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data loss when flattening a cloned image on giant

2016-01-28 Thread wuxingyi
Thank you for your quick reply :)
 
The first object of the cloned image has already been lost after
flattening, so it may be too late to restore the parent relationship
during the rollback operation.
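
For reference, the parent link of a clone can be checked like this (using
the pool/image names from the script quoted below):

rbd info wuxingyi/CLONEIMAGE | grep parent    # a clone shows a "parent:" line; after flatten it is gone
rbd children wuxingyi/disk1.img@SNAPSHOT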



> Date: Tue, 26 Jan 2016 09:50:56 -0500
> From: dilla...@redhat.com
> To: wuxingyi...@outlook.com
> CC: ceph-users@lists.ceph.com; wuxin...@letv.com
> Subject: Re: [ceph-users] data loss when flattening a cloned image on giant
>
> Interesting find. This is an interesting edge case interaction between 
> snapshot, flatten, and rollback. I believe this was unintentionally fixed by 
> the deep-flatten feature added to infernalis. Probably the simplest fix for 
> giant would be to restore the parent image link during the rollback (since 
> that link is still established via the snapshot).
>
> --
>
> Jason Dillaman
>
>
> - Original Message -
>> From: "wuxingyi" 
>> To: ceph-users@lists.ceph.com
>> Cc: wuxin...@letv.com
>> Sent: Tuesday, January 26, 2016 3:11:11 AM
>> Subject: Re: [ceph-users] data loss when flattening a cloned image on giant
>>
>> really sorry for the bad format, I will put it here again.
>>
>> I found data lost when flattening a cloned image on giant(0.87.2). The
>> problem can be easily reproduced by runing the following script:
>> #!/bin/bash
>> ceph osd pool create wuxingyi 1 1
>> rbd create --image-format 2 wuxingyi/disk1.img --size 8
>> #writing "FOOBAR" at offset 0
>> python writetooffset.py disk1.img 0 FOOBAR
>> rbd snap create wuxingyi/disk1.img@SNAPSHOT
>> rbd snap protect wuxingyi/disk1.img@SNAPSHOT
>>
>> echo "start cloing"
>> rbd clone wuxingyi/disk1.img@SNAPSHOT wuxingyi/CLONEIMAGE
>>
>> #writing "WUXINGYI" at offset 4M of cloned image
>> python writetooffset.py CLONEIMAGE $((4*1048576)) WUXINGYI
>> rbd snap create wuxingyi/CLONEIMAGE@CLONEDSNAPSHOT
>>
>> #modify at offset 4M of cloned image
>> python writetooffset.py CLONEIMAGE $((4*1048576)) HEHEHEHE
>>
>> echo "start flattening CLONEIMAGE"
>> rbd flatten wuxingyi/CLONEIMAGE
>>
>> echo "before rollback"
>> rbd export wuxingyi/CLONEIMAGE && hexdump -C CLONEIMAGE
>> rm CLONEIMAGE -f
>> rbd snap rollback wuxingyi/CLONEIMAGE@CLONEDSNAPSHOT
>> echo "after rollback"
>> rbd export wuxingyi/CLONEIMAGE && hexdump -C CLONEIMAGE
>> rm CLONEIMAGE -f
>>
>>
>> where writetooffset.py is a simple python script writing specific data to the
>> specific offset of the image:
>> #!/usr/bin/python
>> #coding=utf-8
>> import sys
>> import rbd
>> import rados
>>
>> cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>> cluster.connect()
>> ioctx = cluster.open_ioctx('wuxingyi')
>> rbd_inst = rbd.RBD()
>> image=rbd.Image(ioctx, sys.argv[1])
>> image.write(sys.argv[3], int(sys.argv[2]))
>>
>> The output is something like:
>>
>> before rollback
>> Exporting image: 100% complete...done.
>>  46 4f 4f 42 41 52 00 00 00 00 00 00 00 00 00 00
>> |FOOBAR..|
>> 0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ||
>> *
>> 0040 48 45 48 45 48 45 48 45 00 00 00 00 00 00 00 00
>> |HEHEHEHE|
>> 00400010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ||
>> *
>> 0080
>> Rolling back to snapshot: 100% complete...done.
>> after rollback
>> Exporting image: 100% complete...done.
>>  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ||
>> *
>> 0040 57 55 58 49 4e 47 59 49 00 00 00 00 00 00 00 00
>> |WUXINGYI|
>> 00400010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ||
>> *
>> 0080
>>
>>
>> We can easily see that the first object of the image is definitely lost,
>> and I found that the data loss happens when flattening: only a "head"
>> version of the first object exists, while a "snapid" version of the
>> object should also have been created and written during flattening.
>> But when running this script on upstream code, I cannot hit this problem. I
>> looked through the upstream code but could not find which commit fixes this
>> bug. I also found that the whole state machine dealing with RBD layering has
>> changed a lot since the giant release.
>>
>> Could you please give me some hints on which commits should I backport?
>> Thanks
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS fsync failed and read error

2016-01-28 Thread Yan, Zheng
On Thu, Jan 28, 2016 at 3:14 PM, FaHui Lin  wrote:
> Dear Mr. Zheng and all,
>
> I'm pretty sure the permission is OK because I could read/write normally
> before and even now I still can write a new file (via stdout redirection) ,
> as shown in my last mail. The problem happens when fsync is called (e.g.
> using vim) and fails and the file can no longer be opened (say via cat,
> less, cp, ... ).
>
> What I'd like to know are:
> 1) How can I find out more clues about this issues
> 2) Is there any way to regain my data on this CephFS? (something like
> fsck?).


this looks like a kernel client issue, no need for fsck

please enable kernel dynamic debug:
echo module ceph +p > /sys/kernel/debug/dynamic_debug/control
echo module libceph +p > /sys/kernel/debug/dynamic_debug/control

run vim and cat

gather the kernel log and send it to us.
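
(For example, something along these lines should capture the log and then
switch the extra logging off again:)

# reproduce the vim/cat failure first, then:
dmesg > cephfs-client-debug.log
echo module ceph -p > /sys/kernel/debug/dynamic_debug/control
echo module libceph -p > /sys/kernel/debug/dynamic_debug/control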

Regards
Yan, Zheng.








>
> By the way. Here is the setup for the authentication:
>
> [root@dl-disk4 ~]# ceph auth get client.user
> exported keyring for client.user
> [client.user]
> key = ..==
> caps mon = "allow r"
> caps osd = "allow rw pool=cephfs,cephfsmeta"
>
> [root@dl-disk4 ~]# cephfs-journal-tool journal inspect
> Overall journal integrity: OK
>
> Thank you.
>
> Best Regards,
> FaHui
>
>
>
> On 2016/1/28 02:51 PM, Yan, Zheng wrote:
>
> This seems like the user has no read/write permission to cephfs data pool.
>
> Regards
> Yan, Zheng
>
> On Thu, Jan 28, 2016 at 11:36 AM, FaHui Lin  wrote:
>
> Dear Ceph experts,
>
> I've got a problem with CephFS one day.
> When I use vim to edit a file on cephfs, it will show fsync failed, and
> later the file cannot be read/open anymore.
> Strangely there is no error I can spot on ceph logs, dmesg, etc.
> Here is an example below: (all machines in my ceph cluster have the same OS,
> kernel, and ceph version)
>
> [root@dl-disk4 ceph-dir]# uname -a
> Linux dl-disk4 3.10.0-327.4.4.el7.x86_64 #1 SMP Tue Jan 5 16:07:00 UTC 2016
> x86_64 x86_64 x86_64 GNU/Linux
>
> [root@dl-disk4 ceph-dir]# ceph version
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
>
> [root@dl-disk4 ceph-dir]# mount | grep cephfs
> xxx.xxx.xxx.xxx:6789:/ on /cephfs type ceph
> (rw,relatime,name=user,secret=)
>
> [root@dl-disk4 ceph-dir]# ceph -s
> cluster 13c231fc-837e-48bb-b4d4-8a0ce1c12a24
>  health HEALTH_WARN
> too many PGs per OSD (645 > max 300)
>  monmap e1: 3 mons at
> {dl-disk1=xxx.xxx.xxx.xxx:6789/0,dl-disk2=xxx.xxx.xxx.xxx:6789/0,dl-disk3=xxx.xxx.xxx.xxx:6789/0}
> election epoch 60, quorum 0,1,2 dl-disk1,dl-disk2,dl-disk3
>  mdsmap e76: 1/1/1 up {0=dl-disk4=up:active}
>  osdmap e307: 32 osds: 32 up, 32 in
>   pgmap v239602: 8288 pgs, 4 pools, 375 GB data, 1311 kobjects
> 924 GB used, 348 TB / 349 TB avail
> 8288 active+clean
> [root@dl-disk4 ceph-dir]# ceph health detail
> HEALTH_WARN too many PGs per OSD (645 > max 300)
> too many PGs per OSD (645 > max 300)
>
> [root@dl-disk4 ceph-dir]# echo "hello123" > /cephfs/test1
> [root@dl-disk4 ceph-dir]# cat /cephfs/test1
> hello123
>
> [root@dl-disk4 ~]# ll /cephfs/test1
> -rw-r--r-- 1 root root 9 Jan 28 02:27 /cephfs/test1
>
> [root@dl-disk4 ceph-dir]# vim /cephfs/test1
> (in vim)
> "/cephfs/test1"
> "/cephfs/test1" E667: Fsync failed
>
> [root@dl-disk4 ceph-dir]# cat /cephfs/test1
> cat: /cephfs/test1: Operation not permitted
>
> [root@dl-disk4 ceph-dir]# less /cephfs/test1
> (read error)
>
> [root@dl-disk4 ceph-dir]# strace /cephfs/test1
> execve("/cephfs/test1", ["/cephfs/test1"], [/* 32 vars */]) = -1 EACCES
> (Permission denied)
> write(2, "strace: exec: Permission denied\n", 32strace: exec: Permission
> denied
> ) = 32
> exit_group(1)   = ?
> +++ exited with 1 +++
> [root@dl-disk4 ceph-dir]# strace cat /cephfs/test1
> execve("/usr/bin/cat", ["cat", "/cephfs/test1"], [/* 32 vars */]) = 0
> brk(0)  = 0x977000
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
> 0x7f44e7a67000
> access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or
> directory)
> open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=129385, ...}) = 0
> mmap(NULL, 129385, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f44e7a47000
> close(3)= 0
> open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \34\2\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=2107816, ...}) = 0
> mmap(NULL, 3932736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) =
> 0x7f44e7486000
> mprotect(0x7f44e763c000, 2097152, PROT_NONE) = 0
> mmap(0x7f44e783c000, 24576, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f44e783c000
> mmap(0x7f44e7842000, 16960, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f44e7842000
> close(3)= 0
> mmap(NULL, 4096

Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Simon Ironside

On 28/01/16 08:30, Bill WONG wrote:


without having qcow2, the qemu-kvm cannot make snapshot and other
features anyone have ideas or experiences on this?
thank you!


I'm using raw too and create snapshots using "rbd snap create"
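
e.g. something along these lines (the image name is just an example, and
the VM should be shut down before rolling back):

rbd snap create storage1/vm-disk@before-change
rbd snap ls storage1/vm-disk
rbd snap rollback storage1/vm-disk@before-change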

Cheers,
Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Simon Ironside

On 28/01/16 03:37, Bill WONG wrote:

thank you! possible to show me what package you installed in compute
node for ceph?


Sure, here's the package selection from my kickstart script:

@^virtualization-host-environment
@base
@core
@virtualization-hypervisor
@virtualization-tools
@virtualization-platform
@virtualization-client
xorg-x11-xauth
ceph

I've got some other bits and pieces installed (like net-snmp, telnet 
etc) but the above are the core bits.


Simon.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rsync access to downloads.ceph.com

2016-01-28 Thread Wido den Hollander


On 28-01-16 01:23, Matt Taylor wrote:
> Hi Fred,
> 
> There's currently a mirror system being setup. There is currently 1
> primary mirror and 2 secondary mirrors:
> 
> download.ceph.com - America.
> eu.ceph.com - Europe.
> au.ceph.com - Australia.
> 
> All of which have http and rsync available, so pick your closest mirror.
> 

Indeed. The pull request with a script to mirror is open currently on
Github: https://github.com/ceph/ceph/pull/7384

You can find the code here:
https://github.com/wido/ceph/tree/mirroring/mirroring

That should soon be merged upstream into master.
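
In the meantime, a one-off manual sync against the nearest mirror should
be as simple as something like the following (the rsync module name is an
assumption on my part - check what the mirror actually exposes first):

rsync eu.ceph.com::                                  # list available modules
rsync -avrt --delete eu.ceph.com::ceph /var/local/mirror/ceph/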

Wido

> Thanks,
> Matt.
> 
> On 28/01/2016 09:50, Fred Newtz wrote:
>> Hello,
>>
>>
>> Does anyone know if it is possible to get rsync access to the ceph
>> repositories for creating local mirrors?  We would also be happy to host
>> a mirror for the repository as well if necessary.  We would not want to
>> put rsync in a cronjob and have it mirror frequently. We would only be
>> interested in a manual sync every 3 to 6 months.
>>
>> We have multiple machines and distributions that use CEPH for various
>> reasons.  The performance of the repository is usually sub-optimal for
>> us and it does increase the provisioning time of our servers.  Hosting a
>> local repository would solve this issue for us.
>>
>> Thanks,
>>
>> Fred
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reducing cluster size

2016-01-28 Thread Adrien Gillard
Yes, this is normal behaviour.

An OSD that is marked out has its placement groups temporarily remapped to
other OSDs. Removing it from the CRUSH map afterwards remaps them
permanently, so the placement groups end up being rebalanced twice.
To avoid this, change the CRUSH weight of the OSD first, so the data is
rebalanced only once: the reweight immediately remaps the placement groups
to their permanent locations.
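
Put together, shrinking by one OSD looks roughly like this (X is the OSD
id; stopping the daemon on its host depends on the release/init system, so
it is only hinted at in the comment):

ceph osd crush reweight osd.X 0   # drain it; wait for ceph -s to report active+clean again
ceph osd out X
# stop the ceph-osd daemon for X on its host
ceph osd crush remove osd.X
ceph auth del osd.X
ceph osd rm X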

There was an interesting topic about this subject on the mailing list a few
weeks ago [1], you can refer to it for more complete information.


[1] http://www.spinics.net/lists/ceph-users/msg24685.html


On Wed, Jan 27, 2016 at 10:55 PM, Mihai Gheorghe  wrote:

> To answer my own question, the osd must be first reweighted before setting
> it as out. So first 'ceph osd crush reweight osd.X 0' then 'ceph osd out X'
> and proceeding with removing the osd from crushmap and cluster,
>
> I don't know if this is the normal behaviour but shoudn't the reweighting
> be done automatically when setting an OSD out?
>
> 2016-01-27 12:40 GMT+02:00 Mihai Gheorghe :
>
>> Hi,
>>
>> I have a ceph cluster consisting of 4 hosts. Two of them have 3 SSD OSD
>> each and the other two 8 HDD OSD each. I have different crush rules for ssd
>> and hdd.
>>
>> Now when  i first made the cluster i only gave one ssd for journaling to
>> all 8 hdd osd on the host. The host has 10 sata ports. One is used for OS,
>> one for journaling and 8 for osd. Now i want to add another journal ssd on
>> the 2 hosts each. So i need to remove one hdd osd from each host,
>>
>> Following the docs, i set an osd as out and the cluster starts
>> rebalancing data. My problem is that it never achieves active+clean state.
>> I always end up with some pgs stuck unclean. If i bring the osd back in the
>> cluster returns to an active+clean state (with an error of too many pgs per
>> osd 431-max 300).
>>
>> I run a mds server aswell and radosgw.
>>
>> What could be the problem. How can i shrink the cluster to add 2 more
>> journals?!
>>
>> Should i restart the mons and osd after rebalancing?
>>
>> Thank you!
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Libvirt + QEMU-KVM

2016-01-28 Thread Bill WONG
Hi Marius,

With Ceph RBD, it looks like qcow2 could be supported as well, per the documentation:
http://docs.ceph.com/docs/master/rbd/qemu-rbd/
--
Important The raw data format is really the only sensible format option to
use with RBD. Technically, you could use other QEMU-supported formats (such
as qcow2 or vmdk), but doing so would add additional overhead, and would
also render the volume unsafe for virtual machine live migration when
caching (see below) is enabled.
---

Without qcow2, though, qemu-kvm cannot take snapshots or use some other
features... does anyone have ideas or experience with this?
thank you!
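
For now raw seems to be the way to go, e.g. creating the image as raw, or
converting an existing qcow2 file into a raw RBD image (file/image names
here are just examples):

qemu-img create -f raw rbd:storage1/CentOS7-3 20G
qemu-img convert -f qcow2 -O raw CentOS7-3.qcow2 rbd:storage1/CentOS7-3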


On Thu, Jan 28, 2016 at 3:54 PM, Marius Vaitiekunas <
mariusvaitieku...@gmail.com> wrote:

> Hi,
>
> With ceph rbd you should use raw image format. As i know qcow2 is not
> supported.
>
> On Thu, Jan 28, 2016 at 6:21 AM, Bill WONG  wrote:
>
>> Hi Simon,
>>
>> i have installed ceph package into the compute node, but it looks qcow2
>> format is unable to create.. it show error with : Could not write qcow2
>> header: Invalid argument
>>
>> ---
>> qemu-img create -f qcow2 rbd:storage1/CentOS7-3 10G
>> Formatting 'rbd:storage1/CentOS7-3', fmt=qcow2 size=10737418240
>> encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
>> qemu-img: rbd:storage1/CentOS7-3: Could not write qcow2 header: Invalid
>> argument
>> ---
>>
>> any ideas?
>> thank you!
>>
>> On Thu, Jan 28, 2016 at 1:01 AM, Simon Ironside 
>> wrote:
>>
>>> On 27/01/16 16:51, Bill WONG wrote:
>>>
>>> i have ceph cluster and KVM in different machine the qemu-kvm
 (CentOS7) is dedicated compute node installed with qemu-kvm + libvirtd
 only, there should be no /etc/ceph/ceph.conf

>>>
>>> Likewise, my compute nodes are separate machines from the OSDs/monitors
>>> but the compute nodes still have the ceph package installed and
>>> /etc/ceph/ceph.conf present. They just aren't running any ceph daemons.
>>>
>>> I give the compute nodes their own ceph key with write access to the
>>> pool for VM storage and read access to the monitors. I can then use ceph
>>> status, rbd create, qemu-img etc directly on the compute nodes.
>>>
>>> Cheers,
>>> Simon.
>>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Marius Vaitiekūnas
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com