[ceph-users] RGW Civetweb + CentOS7 boto errors

2016-01-29 Thread Ben Hines
After updating our RGW servers to CentOS 7 + civetweb, when hit with a fair
amount of load (20 gets/sec + a few puts/sec) I'm seeing 'BadStatusLine'
exceptions from boto relatively often.

It happens most when calling bucket.get_key() (about 10 times in 1000). Viewed
in Wireshark, these appear to be random TCP resets. It happens with both
Hammer and Infernalis.
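
While the root cause is being tracked down, a client-side retry can mask the
intermittent failures. The following is a minimal sketch, not boto's own API:
it assumes Python 3, where the exception lives in http.client (on Python 2 it
is httplib.BadStatusLine), and FlakyBucket is a hypothetical stand-in for a
boto bucket so the example runs without a gateway.

```python
import http.client
import time


def retry_on_bad_status(call, attempts=3, delay=0.0):
    """Run call() and retry when the HTTP layer raises BadStatusLine."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except http.client.BadStatusLine:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            time.sleep(delay)


class FlakyBucket:
    """Hypothetical stand-in for a bucket: the first `failures` calls fail."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def get_key(self, name):
        self.calls += 1
        if self.calls <= self.failures:
            raise http.client.BadStatusLine("")
        return "key:" + name


bucket = FlakyBucket(failures=2)
print(retry_on_bad_status(lambda: bucket.get_key("object-1")))
```

With a real bucket the call would be wrapped the same way, for example
retry_on_bad_status(lambda: bucket.get_key(name)). Retrying only hides the
resets; it does not explain them.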

These happen regardless of the civetweb num_threads or rgw num rados
handles setting. Has anyone seen something similar?

The servers don't appear to be running out of tcp sockets or similar but
perhaps there is some sysctl setting or other tuning that I should be
using.

I may try going back to apache + fastcgi as an experiment (if it still
works with Infernalis?)

thanks,

-Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd kernel mapping on 3.13

2016-01-29 Thread Deneau, Tom
Ah, yes I see this...
   feature set mismatch, my 4a042a42 < server's 104a042a42, missing 10
which looks like CEPH_FEATURE_CRUSH_V2
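
The missing bits can be worked out from the two masks in the kernel message.
This is a small sketch; the function name is made up for illustration, and the
mapping from bit position to feature name is the reading given above, not
something the code checks.

```python
def missing_feature_bits(client_features, server_features):
    """Return the bit positions set on the server but not on the client."""
    missing = server_features & ~client_features
    return [bit for bit in range(missing.bit_length()) if (missing >> bit) & 1]


client = 0x4a042a42      # "my" mask from the kernel log
server = 0x104a042a42    # "server's" mask from the kernel log

print(hex(server & ~client))                 # 0x1000000000
print(missing_feature_bits(client, server))  # [36]
```

Only one bit differs, bit 36, so a single feature is what the 3.13 client
lacks.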

Is there any workaround for that?
Or what ceph version would I have to back up to?

The cbt librbdfio benchmark worked fine (once I had installed librbd-dev on the 
client).

-- Tom

> -Original Message-
> From: Ilya Dryomov [mailto:idryo...@gmail.com]
> Sent: Friday, January 29, 2016 4:53 PM
> To: Deneau, Tom
> Cc: ceph-users; c...@lists.ceph.com
> Subject: Re: [ceph-users] rbd kernel mapping on 3.13
> 
> On Fri, Jan 29, 2016 at 11:43 PM, Deneau, Tom  wrote:
> > The commands shown below had successfully mapped rbd images in the past
> on kernel version 4.1.
> >
> > Now I need to map one on a system running the 3.13 kernel.
> > Ceph version is 9.2.0.  Rados bench operations work with no problem.
> > I get the same error message whether I use format 1 or format 2 or
> > --image-shared.
> > Is there something different I need to do with the 3.13 kernel?
> >
> > -- Tom
> >
> >   # rbd create --size 1000 --image-format 1 rbd/rbddemo
> >   # rbd info rbddemo
> > rbd image 'rbddemo':
> >   size 1000 MB in 250 objects
> >   order 22 (4096 kB objects)
> >   block_name_prefix: rb.0.4f08.77bd73c7
> >   format: 1
> >
> >   # rbd map rbd/rbddemo
> > rbd: sysfs write failed
> > rbd: map failed: (5) Input/output error
> 
> You are likely missing feature bits - 3.13 was released way before 9.2.0.
> The exact error is printed to the kernel log - do dmesg | tail or so.
> 
> Thanks,
> 
> Ilya


Re: [ceph-users] rbd kernel mapping on 3.13

2016-01-29 Thread Ilya Dryomov
On Fri, Jan 29, 2016 at 11:43 PM, Deneau, Tom  wrote:
> The commands shown below had successfully mapped rbd images in the past on 
> kernel version 4.1.
>
> Now I need to map one on a system running the 3.13 kernel.
> Ceph version is 9.2.0.  Rados bench operations work with no problem.
> I get the same error message whether I use format 1 or format 2 or 
> --image-shared.
> Is there something different I need to do with the 3.13 kernel?
>
> -- Tom
>
>   # rbd create --size 1000 --image-format 1 rbd/rbddemo
>   # rbd info rbddemo
> rbd image 'rbddemo':
>   size 1000 MB in 250 objects
>   order 22 (4096 kB objects)
>   block_name_prefix: rb.0.4f08.77bd73c7
>   format: 1
>
>   # rbd map rbd/rbddemo
> rbd: sysfs write failed
> rbd: map failed: (5) Input/output error

You are likely missing feature bits - 3.13 was released way before 9.2.0.
The exact error is printed to the kernel log - do dmesg | tail or so.

Thanks,

Ilya


[ceph-users] rbd kernel mapping on 3.13

2016-01-29 Thread Deneau, Tom
The commands shown below had successfully mapped rbd images in the past on 
kernel version 4.1.

Now I need to map one on a system running the 3.13 kernel.
Ceph version is 9.2.0.  Rados bench operations work with no problem.
I get the same error message whether I use format 1 or format 2 or 
--image-shared.
Is there something different I need to do with the 3.13 kernel?

-- Tom

  # rbd create --size 1000 --image-format 1 rbd/rbddemo
  # rbd info rbddemo
rbd image 'rbddemo':
  size 1000 MB in 250 objects
  order 22 (4096 kB objects)
  block_name_prefix: rb.0.4f08.77bd73c7
  format: 1

  # rbd map rbd/rbddemo
rbd: sysfs write failed
rbd: map failed: (5) Input/output error
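
The object count in the rbd info output follows from the image size and the
order: each object is 2^order bytes. A quick sketch of that arithmetic (the
function name is only illustrative, and sizes are taken as binary megabytes):

```python
def rbd_object_count(size_bytes, order):
    """Number of objects needed to back an image of the given size."""
    object_size = 1 << order                 # order 22 -> 4 MiB objects
    return -(-size_bytes // object_size)     # ceiling division


MB = 1024 * 1024
print(rbd_object_count(1000 * MB, 22))       # 250
```

This matches "size 1000 MB in 250 objects" with "order 22 (4096 kB objects)".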



Re: [ceph-users] Ceph Tech Talk - High-Performance Production Databases on Ceph

2016-01-29 Thread Gregory Farnum
This is super cool — thanks, Thorvald, for the realistic picture of
how databases behave on rbd!

On Thu, Jan 28, 2016 at 11:56 AM, Patrick McGarry  wrote:
> Hey cephers,
>
> Here are the links to both the video and the slides from the Ceph Tech
> Talk today. Thanks again to Thorvald and Medallia for stepping forward
> to present.
>
> Video: https://youtu.be/OqlC7S3cUKs
>
> Slides: 
> http://www.slideshare.net/Inktank_Ceph/2016jan28-high-performance-production-databases-on-ceph-57620014
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph


[ceph-users] storing bucket index in different pool than default

2016-01-29 Thread Krzysztof Księżyk

Hi,

When I show bucket info I see:


> [root@prod-ceph-01 /home/chris.ksiezyk]> radosgw-admin bucket stats -b bucket1
> {
>     "bucket": "bucket1",
>     "pool": ".rgw.buckets",
>     "index_pool": ".rgw.buckets.index",
>     "id": "default.4162.3",
>     "marker": "default.4162.3",
>     "owner": "user1",
>     "ver": "0#9442297",
>     "master_ver": "0#0",
>     "mtime": "2015-12-04 14:03:17.00",
>     "max_marker": "0#",
>     "usage": {
>         "rgw.main": {
>             "size_kb": 1082449749,
>             "size_kb_actual": 1092031396,
>             "num_objects": 4707779
>         }
>     },
>     "bucket_quota": {
>         "enabled": false,
>         "max_size_kb": -1,
>         "max_objects": -1
>     }
> }

Is there a way to store bucket and bucket index in newly created pools?

Kind regards -
Krzysztof Księżyk



Re: [ceph-users] SSD Journal

2016-01-29 Thread Anthony D'Atri
> Right now we run the journal as a partition on the data disk. I've built
> drives without journals and the write performance seems okay but random io
> performance is poor in comparison to what it should be.


Co-located journals have multiple issues:

o The disks are presented with double the number of write ops -- and lots of 
long seeks -- which impacts even read performance due to contention

o The atypical seek pattern can collide with disk firmware and factory config 
in ways that result in a very elevated level of read errors.



Re: [ceph-users] SSD Journal

2016-01-29 Thread Robert LeBlanc

Jan,

I know that Sage has worked through a lot of this and spent a lot of
time on it, so I'm somewhat inclined to say that if he says it needs
to be there, then it needs to be there. I, however, have been known to
stare at the trees so much that I miss the forest, and I understand
some of the points that you bring up about the data consistency and
recovery from the client perspective. One thing that might be helpful
is for you (or someone else) to get in the code and disable the
journal pieces (not sure how difficult this would be) and test it
against your theories. It seems like you have a deep and sincere
interest in seeing Ceph be successful. If your theory holds up, then
presenting the data and results will help others understand and be
more interested in it. It took me a few months of this kind of work
with the WeightedPriorityQueue, and I think the developers are
understanding the limitations of the PrioritizedQueue and how
WeightedPriorityQueue can overcome them with the battery of tests I've
done with a proof of concept. Theory and actual results can be
different, but results are generally more difficult to argue with.

Some of the decision about the journal may be based on RADOS and not
RBD. For instance, the decision may have been made that if a RADOS
write has been given to the cluster, it is to be assumed that the
write is durable without waiting for an ACK. I can't see why an
S3/RADOS client can't wait for an ACK from the web server/OSD, but I
haven't gotten into that area yet. That is something else to keep in
mind.

Lionel,

I don't think the journal is used for anything more than crash
consistency of the OSD. I don't believe the journal is used as a playback
instrument for bringing other OSDs into sync. An OSD that is out of
sync will write its updates to its journal to speed up the process,
but that is the extent. The OSD providing the update has to read the
updates to send from disk/page cache. My understanding is that the
journal is "never" read from, except after the OSD process crashes.

I'm happy to be corrected if I've misstated anything.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Jan 29, 2016 at 9:27 AM, Lionel Bouton
 wrote:
> On 29/01/2016 16:25, Jan Schermer wrote:
>
> [...]
>
> But if I understand correctly, there is indeed a log of the recent
> modifications in the filestore which is used when a PG is recovering
> because another OSD is lagging behind (not when Ceph reports a full
> backfill where I suppose all objects' versions of a PG are compared).
>
> That list of transactions becomes useful only when OSD crashes and comes
> back up - it needs to catch up somehow and this is one of the options. But
> do you really need the "content" of those transactions which is what the
> journal does?
> If you have no such list then you need to either rely on things like mtime
> of the object, or simply compare the hash of the objects (scrub).
>
>
> This didn't seem robust enough to me but I think I had forgotten about the
> monitors' role in maintaining coherency.
>
> Let's say you use a pool with size=3 and min_size=2. You begin with a PG
> with 3 active OSDs then you lose a first OSD for this PG and only two active
> OSDs remain: the clients still happily read and write to this PG and the
> downed OSD is now lagging behind.
> Then one of the remaining active OSDs disappears. Client I/O blocks because
> of min_size. Now the first downed (lagging) OSD comes back. At this point
> Ceph has everything it needs to recover (enough OSDs to reach min_size and
> all the data reported committed to disk to the client in the surviving OSD)
> but must decide which OSD actually has this valid data between the two.
>
> At this point I was under the impression that OSDs could determine this for
> themselves without any outside intervention. But reflecting on this
> situation I don't see how they could handle all cases by themselves (for
> example an active primary should be able to determine by itself that it must
> send the last 

Re: [ceph-users] Lost access when removing cache pool overlay

2016-01-29 Thread Gerd Jakobovitsch
Thank you for the response. It seems to me it is a transient situation. 
At this moment, I regained access to most, but not all buckets/index 
objects. But the overall performance dropped once again - I already have 
huge performance issues.


Regards.

On 29-01-2016 14:41, Robert LeBlanc wrote:


Does the client key have access to the base pool? Something similar bit
us when adding a caching tier. Since the cache tier may be proxying
all the I/O, the client may not have had access to the base pool and
it still worked ok. Once you removed the cache tier, it could no
longer access the pool.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Jan 29, 2016 at 8:47 AM, Gerd Jakobovitsch  wrote:

Dear all,

I had to move .rgw.buckets.index pool to another structure; therefore, I
created a new pool .rgw.buckets.index.new ; added the old pool as cache
pool, and flushed the data.

Up to this moment everything was ok. With radosgw -p  df, I saw the
objects moving to the new pool; the moved objects were ok, I could list
omap keys and so on.

When everything got moved, I removed the overlay cache pool. But at this
moment, the objects became unresponsive:

[(13:39:20) ceph@spchaog1 ~]$ rados -p .rgw.buckets.index listomapkeys
.dir.default.198764998.1
error getting omap key set .rgw.buckets.index/.dir.default.198764998.1: (5)
Input/output error

That happens to all objects. When trying to access the bucket through
radosgw, I also get problems:

[(13:16:01) root@spcogp1 ~]# radosgw-admin bucket stats --bucket="mybucket"
error getting bucket stats ret=-2

Looking at the disk, data seems to be there:

[(13:47:10) root@spcsnp1 ~]# ls
/var/lib/ceph/osd/ceph-23/current/34.1f_head/|grep 198764998.1
\.dir.default.198764998.1__head_8A7482FF__22

Does anyone have a hint? Could I have lost ownership of the objects?

Regards.

--

The information contained in this message is CONFIDENTIAL, protected by
legal privilege and by copyright. Disclosure, distribution, reproduction,
or any other use of the contents of this document requires the sender's
authorization, and violators are subject to legal sanctions. If you have
received this communication in error, please notify us immediately by
replying to this message.





Re: [ceph-users] Lost access when removing cache pool overlay

2016-01-29 Thread Robert LeBlanc

Does the client key have access to the base pool? Something similar bit
us when adding a caching tier. Since the cache tier may be proxying
all the I/O, the client may not have had access to the base pool and
it still worked ok. Once you removed the cache tier, it could no
longer access the pool.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Jan 29, 2016 at 8:47 AM, Gerd Jakobovitsch  wrote:
> Dear all,
>
> I had to move .rgw.buckets.index pool to another structure; therefore, I
> created a new pool .rgw.buckets.index.new ; added the old pool as cache
> pool, and flushed the data.
>
> Up to this moment everything was ok. With radosgw -p  df, I saw the
> objects moving to the new pool; the moved objects were ok, I could list
> omap keys and so on.
>
> When everything got moved, I removed the overlay cache pool. But at this
> moment, the objects became unresponsive:
>
> [(13:39:20) ceph@spchaog1 ~]$ rados -p .rgw.buckets.index listomapkeys
> .dir.default.198764998.1
> error getting omap key set .rgw.buckets.index/.dir.default.198764998.1: (5)
> Input/output error
>
> That happens to all objects. When trying to access the bucket through
> radosgw, I also get problems:
>
> [(13:16:01) root@spcogp1 ~]# radosgw-admin bucket stats --bucket="mybucket"
> error getting bucket stats ret=-2
>
> Looking at the disk, data seems to be there:
>
> [(13:47:10) root@spcsnp1 ~]# ls
> /var/lib/ceph/osd/ceph-23/current/34.1f_head/|grep 198764998.1
> \.dir.default.198764998.1__head_8A7482FF__22
>
> Does anyone have a hint? Could I have lost ownership of the objects?
>
> Regards.


Re: [ceph-users] SSD Journal

2016-01-29 Thread Lionel Bouton
On 29/01/2016 16:25, Jan Schermer wrote:
>
> [...]
>
>
> But if I understand correctly, there is indeed a log of the recent
> modifications in the filestore which is used when a PG is recovering
> because another OSD is lagging behind (not when Ceph reports a full
> backfill where I suppose all objects' versions of a PG are compared).
>
> That list of transactions becomes useful only when OSD crashes and comes back 
> up - it needs to catch up somehow and this is one of the options. But do you 
> really need the "content" of those transactions which is what the journal 
> does?
> If you have no such list then you need to either rely on things like mtime of 
> the object, or simply compare the hash of the objects (scrub).

This didn't seem robust enough to me but I think I had forgotten about
the monitors' role in maintaining coherency.

Let's say you use a pool with size=3 and min_size=2. You begin with a PG
with 3 active OSDs then you lose a first OSD for this PG and only two
active OSDs remain: the clients still happily read and write to this PG
and the downed OSD is now lagging behind.
Then one of the remaining active OSDs disappears. Client I/O blocks
because of min_size. Now the first downed (lagging) OSD comes back. At
this point Ceph has everything it needs to recover (enough OSDs to reach
min_size and all the data reported committed to disk to the client in
the surviving OSD) but must decide which OSD actually has this valid
data between the two.

At this point I was under the impression that OSDs could determine this
for themselves without any outside intervention. But reflecting on this
situation I don't see how they could handle all cases by themselves (for
example an active primary should be able to determine by itself that it
must send the last modifications to any other OSD, but that wouldn't work
if all OSDs go down for a PG: when coming back, all could be the last
primary from their point of view, with no robust way to decide which is
right without the monitors being involved).
The monitors maintain the status of each OSD for each PG if I'm not
mistaken, so I suppose the monitors' knowledge of the situation will be
used to determine which OSDs have the good data (the last min_size OSDs
up for each PG) and trigger the others to resync before the PG reaches
active+clean.
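
The size=3 / min_size=2 scenario above can be traced with a toy model. This is
only a sketch under loud assumptions: ToyPG and its methods are hypothetical
names, the per-OSD version counter stands in for whatever record Ceph really
keeps, and the global view of versions plays the role I am supposing for the
monitors. It is not Ceph's peering logic.

```python
class ToyPG:
    """Toy placement group: tracks which OSDs are up and what they have seen."""

    def __init__(self, osds, min_size):
        self.min_size = min_size
        self.up = set(osds)
        self.version = {osd: 0 for osd in osds}  # last write seen per OSD

    def write(self):
        """Apply one client write; return False if min_size blocks I/O."""
        if len(self.up) < self.min_size:
            return False
        newest = max(self.version.values()) + 1
        for osd in self.up:
            self.version[osd] = newest
        return True

    def fail(self, osd):
        self.up.discard(osd)

    def recover(self, osd):
        self.up.add(osd)

    def stale_osds(self):
        """Up OSDs that are behind the newest write known anywhere."""
        newest = max(self.version.values())
        return sorted(o for o in self.up if self.version[o] < newest)


pg = ToyPG(["osd.0", "osd.1", "osd.2"], min_size=2)
pg.write()             # all three OSDs take the write
pg.fail("osd.0")       # first OSD lost, two remain
pg.write()             # still accepted, osd.0 now lags behind
pg.fail("osd.1")       # second OSD lost
print(pg.write())      # False: only one OSD up, below min_size
pg.recover("osd.0")    # the lagging OSD returns
print(pg.stale_osds()) # ['osd.0'] must resync from osd.2
```

The point of the toy is the last line: deciding that osd.0 is the stale one
needs a view of versions that no single returning OSD has by itself.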

That said this doesn't address the other point: when the resync happens,
using the journal content of the primary could theoretically be faster
if the filestores are on spinning disks. I realize that recent writes in
the filestore might be in the kernel's cache (which would avoid the
costly seeks) and that using the journal instead would probably mean
that the OSDs maintain an in-memory index of all the IO transactions
still stored in the journal to be efficient so it isn't such a clear win.

Thanks a lot for the explanations.

Lionel


Re: [ceph-users] Striping feature gone after flatten with cloned images

2016-01-29 Thread Jason Dillaman
High queue depth, sequential, direct IO.  With appropriate striping settings, 
instead of all the sequential IO being processed by the same PG sequentially, 
the IO will be processed by multiple PGs in parallel.
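
To make that concrete, here is a small sketch of how a byte offset maps to an
object number under striping. The function name is illustrative and the
arithmetic follows the usual stripe unit / stripe count / object size layout;
treat it as a simplified model rather than the librbd code. Objects are then
hashed to PGs, so distinct objects usually, but not necessarily, land in
different PGs.

```python
def object_for_offset(offset, stripe_unit, stripe_count, object_size):
    """Return the object number that holds the byte at `offset`."""
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit          # which stripe unit overall
    stripe_no = block_no // stripe_count      # which row of stripe units
    stripe_pos = block_no % stripe_count      # which column (object in set)
    object_set = stripe_no // stripes_per_object
    return object_set * stripe_count + stripe_pos


KiB = 1024
MiB = 1024 * KiB
offsets = [i * 4 * KiB for i in range(16)]    # sixteen sequential 4K IOs

fancy = {object_for_offset(o, 4 * KiB, 16, 4 * MiB) for o in offsets}
plain = {object_for_offset(o, 4 * MiB, 1, 4 * MiB) for o in offsets}
print(len(fancy), len(plain))                 # 16 1
```

With --stripe-unit 4K --stripe-count 16 the sixteen IOs touch sixteen
different objects; with the default layout they all hit the same object, which
is why a high queue depth is needed to see the difference in fio.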

-- 

Jason Dillaman 


- Original Message -
> From: "Василий Ангапов" 
> To: "Jason Dillaman" 
> Cc: "Bill WONG" , "ceph-users" 
> 
> Sent: Friday, January 29, 2016 11:04:52 AM
> Subject: Re: [ceph-users] Striping feature gone after flatten with cloned 
> images
> 
> Btw, in terms of fio for example - how can I practically see the
> benefit of striping?
> 
Re: [ceph-users] Striping feature gone after flatten with cloned images

2016-01-29 Thread Василий Ангапов
Btw, in terms of fio for example - how can I practically see the
benefit of striping?

2016-01-29 22:45 GMT+08:00 Jason Dillaman :
> That was intended as an example of how to use fancy striping with RBD images. 
>  The stripe unit and count are knobs to tweak depending on your IO situation.
>
> Taking a step back, what are you actually trying to accomplish?  The flatten 
> operation doesn't necessarily fit the use case for this feature.  Instead, it 
> is useful for applications that issue many small, sequential IO operations.  
> If you set your stripe unit to be the IO size (and properly align it), the IO 
> load will be able to be distributed to more OSDs in parallel.
>
> --
>
> Jason Dillaman
>
>
> - Original Message -
>
>> From: "Bill WONG" 
>> To: "Jason Dillaman" 
>> Cc: "ceph-users" 
>> Sent: Friday, January 29, 2016 12:30:24 AM
>> Subject: Re: [ceph-users] Striping feature gone after flatten with cloned
>> images
>
>> Hi Jason,
>
>> it works fine but flatten take how longer than than before..
>> i would like to know how to decide the --stripe-unit & --stripe-count to gain
>> the best performance?
>> i see you put unit of 4K and count with 16.. why?
>> --
>> --stripe-unit 4K --stripe-count 16
>> --
>
>> thank you!
>
>> On Fri, Jan 29, 2016 at 11:15 AM, Jason Dillaman < dilla...@redhat.com >
>> wrote:
>
>> > When you set "--stripe-count" to 1 and set the "--stripe-unit" to the
>> > object
>> > size, you have actually explicitly told the rbd CLI to not use "fancy"
>> > striping. A better example would be something like:
>>
>
>> > rbd clone --stripe-unit 4K --stripe-count 16 storage1/cloudlet-1@snap1
>> > storage1/cloudlet-1-clone
>>
>
>> > --
>>
>
>> > Jason Dillaman
>>
>
>> > - Original Message -
>>
>
>> > > From: "Bill WONG" < wongahsh...@gmail.com >
>>
>> > > To: "Jason Dillaman" < dilla...@redhat.com >
>>
>> > > Cc: "ceph-users" < ceph-users@lists.ceph.com >
>>
>> > > Sent: Thursday, January 28, 2016 10:08:38 PM
>>
>> > > Subject: Re: [ceph-users] Striping feature gone after flatten with cloned
>>
>> > > images
>>
>
>> > > hi jason,
>>
>
>> > > how i can make the stripping parameters st the time of clone creation? as
>> > > i
>>
>> > > have tested, which doesn't seem to work properly..
>> > > the clone image is still without striping.. any idea?
>> > > --
>> > > rbd clone --stripe-unit 4096K --stripe-count 1 storage1/cloudlet-1@snap1
>> > > storage1/cloudlet-1-clone
>> > > rbd flatten storage1/cloudlet-1-clone
>> > > rbd info storage1/cloudlet-1-clone
>> > > rbd image 'cloudlet-1-clone':
>> > > size 1000 GB in 256000 objects
>> > > order 22 (4096 kB objects)
>> > > block_name_prefix: rbd_data.5ecd364dfe1f8
>> > > format: 2
>> > > features: layering
>> > > flags:
>> > > ---
>> > >
>> > > On Fri, Jan 29, 2016 at 3:54 AM, Jason Dillaman < dilla...@redhat.com >
>> > > wrote:
>> > >
>> > > > You must specify the clone's striping parameters at the time of its
>> > > > creation -- it is not inherited from the parent image.
>> > > >
>> > > > --
>> > > > Jason Dillaman
>> > > >
>> > > > - Original Message -
>> > > >
>> > > > > From: "Bill WONG" < wongahsh...@gmail.com >
>> > > > > To: "ceph-users" < ceph-users@lists.ceph.com >
>> > > > > Sent: Thursday, January 28, 2016 1:25:12 PM
>> > > > > Subject: [ceph-users] Striping feature gone after flatten with cloned
>> > > > > images
>> > > > >
>> > > > > Hi,
>> > > > >
>> > > > > i have tested with the flatten:
>> > > > > 1) make a snapshot of image
>> > > > > 2) protect the snapshot
>> > > > > 3) clone the snapshot
>> > > > > 4) flatten the clone
>> > > > >
>> > > > > then i found an issue:
>> > > > > the original image / snapshot and the clone before flatten all have
>> > > > > the striping feature, BUT after flattening the clone, there is no
>> > > > > more striping on the clone image... what is the issue? and how can
>> > > > > i enable the striping feature?
>> > > > >
>> > > > > thank you!


[ceph-users] Lost access when removing cache pool overlay

2016-01-29 Thread Gerd Jakobovitsch

Dear all,

I had to move .rgw.buckets.index pool to another structure; therefore, I 
created a new pool .rgw.buckets.index.new ; added the old pool as cache 
pool, and flushed the data.


Up to this moment everything was ok. With radosgw -p  df, I saw 
the objects moving to the new pool; the moved objects were ok, I could 
list omap keys and so on.


When everything got moved, I removed the overlay cache pool. But at this 
moment, the objects became unresponsive:


[(13:39:20) ceph@spchaog1 ~]$ rados -p .rgw.buckets.index listomapkeys 
.dir.default.198764998.1
error getting omap key set .rgw.buckets.index/.dir.default.198764998.1: 
(5) Input/output error


That happens to all objects. When trying to access the bucket 
through radosgw, I also get errors:


[(13:16:01) root@spcogp1 ~]# radosgw-admin bucket stats --bucket="mybucket"
error getting bucket stats ret=-2

Looking at the disk, data seems to be there:

[(13:47:10) root@spcsnp1 ~]# ls 
/var/lib/ceph/osd/ceph-23/current/34.1f_head/|grep 198764998.1

\.dir.default.198764998.1__head_8A7482FF__22

Does anyone have a hint? Could I have lost ownership of the objects?

Regards.





Re: [ceph-users] SSD Journal

2016-01-29 Thread Jan Schermer

> On 29 Jan 2016, at 16:00, Lionel Bouton  
> wrote:
> 
> On 29/01/2016 01:12, Jan Schermer wrote:
>> [...]
>>> Second I'm not familiar with Ceph internals but OSDs must make sure that 
>>> their PGs are synced so I was under the impression that the OSD content for 
>>> a PG on the filesystem should always be guaranteed to be on all the other 
>>> active OSDs *or* their journals (so you wouldn't apply journal content 
>>> unless the other journals have already committed the same content). If you 
>>> remove the journals there's no intermediate on-disk "buffer" that can be 
>>> used to guarantee such a thing: one OSD will always have data that won't be 
>>> guaranteed to be on disk on the others. As I understand this you could say 
>>> that this is some form of 2-phase commit.
>> You can simply commit the data (to the filestore), and it would be in fact 
>> faster.
>> Client gets the write acknowledged when all the OSDs have the data - that 
>> doesn't change in this scenario. If one OSD gets ahead of the others and 
>> commits something the other OSDs do not before the whole cluster goes down 
>> then it doesn't hurt anything - you didn't acknowledge so the client has to 
>> replay if it cares, _NOT_ the OSDs.
>> The problem still exists, just gets shifted elsewhere. But the client (guest 
>> filesystem) already handles this.
> 
> Hum, if one OSD gets ahead of the others there must be a way for the
> OSDs to resynchronize themselves. I assume that on resync for each PG
> OSDs probably compare something very much like a tx_id.

Why? Yes, it makes sense when you scrub them to have the same data, but the 
client doesn't care. If it were a hard drive the situation would be the same - maybe 
the data was written, maybe it was not. You have no way of knowing and you 
don't care - the filesystem (or even any sane database) handles this by design.
It's your choice whether to replay the tx or roll back because the client 
doesn't care either way - the block that you write (or don't) is either 
unallocated or contains either of the 2 versions of the data at that point. 
You clearly don't want to give the client 2 different versions of the data, so 
something like data=journal should be used and the data compared when the OSD comes 
back up... still nothing that requires a "ceph journal" though.

> 
> What I was expecting is that in the case of a small backlog the journal
> - containing the last modifications by design - was used during recovery
> to fetch all the recent transaction contents. It seemed efficient to me:
> especially on rotating media fetching data from the journal would avoid
> long seeks. The first alternative I can think of is maintaining a
> separate log of the recently modified objects in the filestore without
> the actual content of the modification. Then you can fetch the objects
> from the filestore as needed but this probably seeks all over the place.
> In the case of multiple PGs lagging behind on other OSDs, reading the
> local journal would be even better as you have even more chances of
> ordering reads to avoid seeks on the journal and much more seeks would
> happen on the filestore.
> 
> But if I understand correctly, there is indeed a log of the recent
> modifications in the filestore which is used when a PG is recovering
> because another OSD is lagging behind (not when Ceph reports a full
> backfill where I suppose all objects' versions of a PG are compared).

That list of transactions becomes useful only when OSD crashes and comes back 
up - it needs to catch up somehow and this is one of the options. But do you 
really need the "content" of those transactions which is what the journal does?
If you have no such list then you need to either rely on things like mtime of 
the object, or simply compare the hash of the objects (scrub). In the meantime 
you simply have to run from the other copies or stick to one copy of the data. 
But even if you stick to the "wrong" version it does no harm as long as you 
don't arbitrarily change that copy because the client didn't know what data 
ended on drive and must be (and is) prepared to use whatever you have.
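
A minimal sketch of the "compare the hash of the objects" option mentioned above, under loud assumptions: replicas are modelled as plain dictionaries, the object names are made up, and SHA-256 stands in for whatever checksum a real scrub would use. It only shows that divergent copies can be found without keeping the content of past transactions.

```python
import hashlib


def digest(data):
    """Hex digest of an object's bytes, or None if the replica lacks the object."""
    return None if data is None else hashlib.sha256(data).hexdigest()


def divergent_objects(replica_a, replica_b):
    """Names of objects whose content differs or that exist on only one replica."""
    names = set(replica_a) | set(replica_b)
    return sorted(
        name for name in names
        if digest(replica_a.get(name)) != digest(replica_b.get(name))
    )


# Hypothetical contents of the same PG on two OSDs after a crash.
osd_a = {"obj1": b"old", "obj2": b"same", "obj3": b"only-here"}
osd_b = {"obj1": b"new", "obj2": b"same"}

diverged = divergent_objects(osd_a, osd_b)
print(diverged)
```

Which copy wins for each divergent object is a separate policy decision; the comparison only tells you where to look.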


> 
> Lionel



Re: [ceph-users] Trying to understand the contents of .rgw.buckets.index

2016-01-29 Thread Gregory Farnum
On Fri, Jan 29, 2016 at 3:10 AM, Wido den Hollander  wrote:
>
>
> On 29-01-16 11:31, Micha Krause wrote:
>> Hi,
>>
>> I'm having problems listing the contents of an s3 bucket with ~2M objects.
>>
>> I already found the new bucket index sharding feature, but I'm
>> interested how these Indexes are stored.
>>
>> My index pool shows no space used, and all objects have 0B.
>>
>> root@mon01:~ # rados df -p .rgw.buckets.index
>> pool name KB  objects   clones degraded
>> unfound   rdrd KB   wrwr KB
>> .rgw.buckets.index0   620
>> 0   0 28177514336051356 228972310
>>
>> Why would sharding a 0B object make any difference?
>>
> The index is stored in the omap of the object which you can list with
> the 'rados' command.
>
> So it's not data inside the RADOS object, but in the omap key/value store.

...and this is an unfortunate accounting problem in terms of RADOS
pools, but a solution is very difficult technically so nobody's come
up with a good one. :(


Re: [ceph-users] SSD Journal

2016-01-29 Thread Lionel Bouton
On 29/01/2016 01:12, Jan Schermer wrote:
> [...]
>> Second I'm not familiar with Ceph internals but OSDs must make sure that 
>> their PGs are synced so I was under the impression that the OSD content for 
>> a PG on the filesystem should always be guaranteed to be on all the other 
>> active OSDs *or* their journals (so you wouldn't apply journal content 
>> unless the other journals have already committed the same content). If you 
>> remove the journals there's no intermediate on-disk "buffer" that can be 
>> used to guarantee such a thing: one OSD will always have data that won't be 
>> guaranteed to be on disk on the others. As I understand this you could say 
>> that this is some form of 2-phase commit.
> You can simply commit the data (to the filestore), and it would be in fact 
> faster.
> Client gets the write acknowledged when all the OSDs have the data - that 
> doesn't change in this scenario. If one OSD gets ahead of the others and 
> commits something the other OSDs do not before the whole cluster goes down 
> then it doesn't hurt anything - you didn't acknowledge so the client has to 
> replay if it cares, _NOT_ the OSDs.
> The problem still exists, just gets shifted elsewhere. But the client (guest 
> filesystem) already handles this.

Hum, if one OSD gets ahead of the others there must be a way for the
OSDs to resynchronize themselves. I assume that on resync for each PG
OSDs probably compare something very much like a tx_id.

What I was expecting is that in the case of a small backlog the journal
- containing the last modifications by design - was used during recovery
to fetch all the recent transaction contents. It seemed efficient to me:
especially on rotating media fetching data from the journal would avoid
long seeks. The first alternative I can think of is maintaining a
separate log of the recently modified objects in the filestore without
the actual content of the modification. Then you can fetch the objects
from the filestore as needed but this probably seeks all over the place.
In the case of multiple PGs lagging behind on other OSDs, reading the
local journal would be even better as you have even more chances of
ordering reads to avoid seeks on the journal and much more seeks would
happen on the filestore.

But if I understand correctly, there is indeed a log of the recent
modifications in the filestore which is used when a PG is recovering
because another OSD is lagging behind (not when Ceph reports a full
backfill where I suppose all objects' versions of a PG are compared).
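
To make the distinction above concrete, here is a toy sketch of choosing between log-based recovery and a full backfill. It is an assumption-laden model, not Ceph's actual PG log: versions are plain integers, the log holds only (version, object name) pairs without the modification content, and `recovery_plan` is a hypothetical helper.

```python
def recovery_plan(log, replica_version):
    """Decide how a lagging replica catches up.

    log: list of (version, object_name) in increasing version order,
         holding only which objects changed, not the written data.
    replica_version: last version the lagging replica is known to have.
    """
    if not log or replica_version >= log[-1][0]:
        return ("clean", [])
    oldest = log[0][0]
    if replica_version < oldest - 1:
        # The log no longer covers everything the replica missed.
        return ("backfill", [])
    missing = sorted({name for version, name in log if version > replica_version})
    return ("log-recovery", missing)


pg_log = [(5, "a"), (6, "b"), (7, "a")]
print(recovery_plan(pg_log, 5))
print(recovery_plan(pg_log, 2))
```

In the log-recovery case the objects themselves still have to be read from the filestore of an up-to-date OSD, which is the seek cost discussed above.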

Lionel


Re: [ceph-users] Striping feature gone after flatten with cloned images

2016-01-29 Thread Jason Dillaman
That was intended as an example of how to use fancy striping with RBD images.  
The stripe unit and count are knobs to tweak depending on your IO situation.

Taking a step back, what are you actually trying to accomplish?  The flatten 
operation doesn't necessarily fit the use case for this feature.  Instead, it 
is useful for applications that issue many small, sequential IO operations.  If 
you set your stripe unit to be the IO size (and properly align it), the IO load 
will be able to be distributed to more OSDs in parallel.
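
A small sketch of why this helps, assuming the usual RAID-0 style mapping of stripe units onto objects and a 4 MiB object size (order 22, as in the rbd info output quoted in this thread). The function names are made up for illustration and this is not the librbd code.

```python
KiB = 1024
MiB = 1024 * KiB


def map_offset(offset, stripe_unit, stripe_count, object_size):
    """Map an image byte offset to (object number, offset inside that object)."""
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit        # index of the stripe unit in the image
    stripe_no = block_no // stripe_count    # row in the current object set
    stripe_pos = block_no % stripe_count    # column, i.e. object within the set
    object_set = stripe_no // stripes_per_object
    object_no = object_set * stripe_count + stripe_pos
    offset_in_object = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, offset_in_object


def objects_touched(n_ios, io_size, stripe_unit, stripe_count, object_size=4 * MiB):
    """Objects hit by n_ios sequential IOs of io_size bytes starting at offset 0."""
    return sorted({
        map_offset(i * io_size, stripe_unit, stripe_count, object_size)[0]
        for i in range(n_ios)
    })


# 16 sequential 4K writes: default layout vs. --stripe-unit 4K --stripe-count 16
default_layout = objects_touched(16, 4 * KiB, 4 * MiB, 1)
fancy_layout = objects_touched(16, 4 * KiB, 4 * KiB, 16)
print(default_layout)
print(fancy_layout)
```

With the default layout all sixteen writes land in one object; with a 4K stripe unit and a count of 16 they spread over sixteen objects, which is where the parallelism comes from.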

-- 

Jason Dillaman 


- Original Message - 

> From: "Bill WONG" 
> To: "Jason Dillaman" 
> Cc: "ceph-users" 
> Sent: Friday, January 29, 2016 12:30:24 AM
> Subject: Re: [ceph-users] Striping feature gone after flatten with cloned
> images

> Hi Jason,

> it works fine, but flatten now takes longer than before..
> i would like to know how to decide the --stripe-unit & --stripe-count to gain
> the best performance?
> i see you put unit of 4K and count with 16.. why?
> --
> --stripe-unit 4K --stripe-count 16
> --

> thank you!

> On Fri, Jan 29, 2016 at 11:15 AM, Jason Dillaman < dilla...@redhat.com >
> wrote:
>
> > When you set "--stripe-count" to 1 and set the "--stripe-unit" to the
> > object size, you have actually explicitly told the rbd CLI to not use
> > "fancy" striping. A better example would be something like:
> >
> > rbd clone --stripe-unit 4K --stripe-count 16 storage1/cloudlet-1@snap1
> > storage1/cloudlet-1-clone
> >
> > --
> > Jason Dillaman
> >
> > - Original Message -
> >
> > > From: "Bill WONG" < wongahsh...@gmail.com >
> > > To: "Jason Dillaman" < dilla...@redhat.com >
> > > Cc: "ceph-users" < ceph-users@lists.ceph.com >
> > > Sent: Thursday, January 28, 2016 10:08:38 PM
> > > Subject: Re: [ceph-users] Striping feature gone after flatten with cloned
> > > images
> > >
> > > hi jason,
> > >
> > > how can i set the striping parameters at the time of clone creation? as
> > > i have tested, it doesn't seem to work properly..
> > > the clone image is still without striping.. any idea?
> > > --
> > > rbd clone --stripe-unit 4096K --stripe-count 1 storage1/cloudlet-1@snap1
> > > storage1/cloudlet-1-clone
> > > rbd flatten storage1/cloudlet-1-clone
> > > rbd info storage1/cloudlet-1-clone
> > > rbd image 'cloudlet-1-clone':
> > > size 1000 GB in 256000 objects
> > > order 22 (4096 kB objects)
> > > block_name_prefix: rbd_data.5ecd364dfe1f8
> > > format: 2
> > > features: layering
> > > flags:
> > > ---
> > >
> > > On Fri, Jan 29, 2016 at 3:54 AM, Jason Dillaman < dilla...@redhat.com >
> > > wrote:
> > >
> > > > You must specify the clone's striping parameters at the time of its
> > > > creation -- it is not inherited from the parent image.
> > > >
> > > > --
> > > > Jason Dillaman
> > > >
> > > > - Original Message -
> > > >
> > > > > From: "Bill WONG" < wongahsh...@gmail.com >
> > > > > To: "ceph-users" < ceph-users@lists.ceph.com >
> > > > > Sent: Thursday, January 28, 2016 1:25:12 PM
> > > > > Subject: [ceph-users] Striping feature gone after flatten with cloned
> > > > > images
> > > > >
> > > > > Hi,
> > > > >
> > > > > i have tested with the flatten:
> > > > > 1) make a snapshot of image
> > > > > 2) protect the snapshot
> > > > > 3) clone the snapshot
> > > > > 4) flatten the clone
> > > > >
> > > > > then i found an issue:
> > > > > the original image / snapshot and the clone before flatten all have
> > > > > the striping feature, BUT after flattening the clone, there is no
> > > > > more striping on the clone image... what is the issue? and how can
> > > > > i enable the striping feature?
> > > > >
> > > > > thank you!


Re: [ceph-users] Typical architecture in RDB mode - Number of servers explained ?

2016-01-29 Thread Gaetan SLONGO
Thank you for your answer ! 

What do you consider medium and large clusters? 

Best regards, 

- Original Message -

From: "Eneko Lacunza"  
To: ceph-users@lists.ceph.com 
Sent: Thursday, 28 January 2016 14:32:40 
Subject: Re: [ceph-users] Typical architecture in RDB mode - Number of servers 
explained ? 


Hi, 

On 28/01/16 at 13:53, Gaetan SLONGO wrote: 



Dear Ceph users, 

We are currently working on Ceph (RBD mode only). The technology is currently 
in a "preview" state in our lab, and we are diving into Ceph design. We 
know it requires at least 3 nodes (OSDs + monitors inside) to work properly, but 
we would like to know whether it makes sense to use 4 nodes. I've heard this is not 
a good idea because not all of the capacity of the 4 servers would be available. 
Can someone confirm? 


There's no problem using 4 servers for OSDs; just don't put a monitor on one of 
the nodes. Always keep an odd number of monitors (3 or 5). 

Monitors don't need to be on an OSD node, and in fact for medium and large 
clusters it is recommended to have a dedicated node for them. 
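
The reason for the odd-number advice can be shown with simple majority arithmetic. This sketch assumes only that the monitors need a strict majority to form a quorum; the function names are made up for illustration.

```python
def quorum_size(monitors):
    """Smallest strict majority of the given number of monitors."""
    return monitors // 2 + 1


def tolerated_failures(monitors):
    """How many monitors can be down while a majority is still reachable."""
    return monitors - quorum_size(monitors)


for n in range(1, 6):
    print(n, "monitors: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

Going from 3 to 4 monitors raises the quorum size without tolerating any additional failure, so the fourth monitor adds no availability.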

Cheers 
Eneko 
-- 
Technical Director
Binovo IT Human Project, S.L.
Telf. 943575997
  943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) 
www.binovo.es 


Re: [ceph-users] Trying to understand the contents of .rgw.buckets.index

2016-01-29 Thread Micha Krause

Hi,

> The index is stored in the omap of the object which you can list with
> the 'rados' command.
>
> So it's not data inside the RADOS object, but in the omap key/value store.

Thank you very much:

 rados -p .rgw.buckets.index listomapkeys .dir.default.55059808.22 | wc -l
 2228777

So this data is then stored in the omap directory on my osd as .sst files?

Is there a way to correlate a rados object with a specific sst (leveldb?) file?


Micha Krause


Re: [ceph-users] SSD Journal

2016-01-29 Thread Jan Schermer
>> inline

> On 29 Jan 2016, at 05:03, Somnath Roy  wrote:
> 
> <  
> From: Jan Schermer [mailto:j...@schermer.cz ] 
> Sent: Thursday, January 28, 2016 3:51 PM
> To: Somnath Roy
> Cc: Tyler Bishop; ceph-users@lists.ceph.com 
> Subject: Re: SSD Journal
>  
> Thanks for a great walkthrough explanation.
> I am not really going to comment on everything (nor am I capable of it), but.. see 
> below
>  
> On 28 Jan 2016, at 23:35, Somnath Roy  > wrote:
>  
> Hi,
> Ceph needs to maintain a journal in case of filestore as underlying 
> filesystem like XFS *doesn’t have* any transactional semantics. Ceph has to 
> do a transactional write with data and metadata in the write path. It does in 
> the following way.
>  
> "Ceph has to do a transactional write with data and metadata in the write 
> path"
> Why? Isn't that only to provide that to itself?
> 
> [Somnath] Yes, that is for Ceph..That’s 2 setattrs (for rbd) + PGLog/Info..

And why does Ceph need that? Aren't we going in circles here? No client needs 
those transactions so there's no point in needing those transactions in Ceph.

>  
> 1. It creates a transaction object having multiple metadata operations and 
> the actual payload write.
>  
> 2. It is passed to Objectstore layer.
>  
> 3. Objectstore can complete the transaction in sync or async (Filestore) way.
>  
> Depending on whether the write was flushed or not? How is that decided?
> [Somnath] It depends on how ObjectStore backend is written..Not 
> dynamic..Filestore implemented in async way , I think BlueStore is written in 
> sync way (?)..
> 
>  
> 4.  Filestore dumps the entire Transaction object to the journal. It is a 
> circular buffer and written to the disk sequentially with O_DIRECT | O_DSYNC 
> way.
>  
> Just FYI, O_DIRECT doesn't really guarantee "no buffering"; its purpose is 
> just to avoid needless caching.
> It should behave the way you want on Linux, but you must not rely on it since 
> this guarantee is not portable.
> 
> [Somnath] O_DIRECT alone is not guaranteed but With O_DSYNC it is guaranteed 
> to be reaching the disk..It may still be there in Disk cache , but, this is 
> taken care by disks..

O_DSYNC is the same as calling fdatasync() after writes. This only flushes the 
data, not the metadata. So if your "transactions" need those (and I think they 
do) then you don't get the expected consistency. In practice it could flush 
effectively everything.
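
For readers who want to see the two variants side by side, here is a minimal sketch. It is not how the Ceph journal is written (that also uses O_DIRECT and aligned buffers, which are left out here); the file names are made up, and the fallbacks are assumptions for platforms that lack O_DSYNC or fdatasync. Per POSIX, both variants must also persist the metadata needed to read the data back (such as the file size), while other metadata like timestamps may be left out.

```python
import os
import tempfile

# O_DSYNC is missing on some platforms; fall back to 0 (a plain buffered write).
O_DSYNC = getattr(os, "O_DSYNC", 0)
# fdatasync is not universal either; fsync is a stricter substitute.
flush_data = getattr(os, "fdatasync", os.fsync)


def write_with_o_dsync(path, payload):
    """Each write returns only after the data has been handed to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | O_DSYNC, 0o600)
    try:
        os.write(fd, payload)
    finally:
        os.close(fd)


def write_then_fdatasync(path, payload):
    """Buffered write followed by an explicit data flush."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, payload)
        flush_data(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "journal_a")
    path_b = os.path.join(tmp, "journal_b")
    write_with_o_dsync(path_a, b"tx-1")
    write_then_fdatasync(path_b, b"tx-1")
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        same_content = fa.read() == fb.read()

print(same_content)
```

The visible result is identical; the difference is only in when the kernel is forced to push the data out, which is why the two are treated as equivalent above.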

>  
> 5. Once journal write is successful , write is acknowledged to the client. 
> Read for this data is not allowed yet as it is still not been written to the 
> actual location in the filesystem.
>  
> Now you are providing a guarantee for something nobody really needs. There is 
> no guarantee with traditional filesystems of not returning dirty unwritten 
> data. The guarantees are on writes, not reads. It might be easier to do it 
> this way if you plan for some sort of concurrent access to the same data from 
> multiple readers (that don't share the cache) - but is that really the case 
> here if it's still the same OSD that serves the data?
> Do the journals absorb only the unbuffered IO or all IO?
>  
> And what happens currently if I need to read the written data right away? When 
> do I get it then?
> 
> [Somnath] Well, this is debatable, but currently reads are blocked till 
> entire Tx execution is completed (not after doing syncfs)..Journal absorbs 
> all the IO..

So a database doing checkpoint read/modify/write is going to suffer greatly? 
That might explain a few more things I've seen.
But it's not needed anyway; in fact, things like databases are very likely to 
write to the same place over and over again and you should accommodate 
them by caching.


>  
> 6. The actual execution of the transaction is done in parallel for the 
> filesystem that can do check pointing like BTRFS. For the filesystem like 
> XFS/ext4 the journal is write ahead i.e Tx object will be written to journal 
> first and then the Tx execution will happen.
>  
> 7. Tx execution is done in parallel by the filestore worker threads. The 
> payload write is a buffered write and a sync thread within filestore is 
> periodically calling ‘syncfs’ to persist data/metadata to the actual location.
>  
> 8. Before each ‘syncfs’ call it determines the seq number till it is 
> persisted and trim the transaction objects from journal upto that point. This 
> will make room for more writes in the journal. If journal is full, write will 
> be stuck.
>  
> 9. If OSD is crashed after write is acknowledge, the Tx will be replayed from 
> the last successful backend commit seq number (maintained in a file after 
> ‘syncfs’).
>  
>  
> You can just completely rip at least 6-9 out and mirror what the client sends 
> to the filesystem with the same effect (and without journal). Who cares how 
> the filesystem implements it then, everybody can choose the filesystem that 
> matc
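
The write path described in steps 4 to 9 above can be sketched as a toy write-ahead journal. Everything here is an illustrative assumption: `ToyFilestore` is a made-up class, transactions are plain dictionaries of key/value updates, and the journal is an in-memory list standing in for the on-disk circular buffer.

```python
class ToyFilestore:
    def __init__(self):
        self.journal = []       # (seq, tx) pairs; assumed durable once appended
        self.applied = {}       # buffered state, lost on a crash
        self.durable = {}       # state as of the last sync
        self.committed_seq = 0  # last seq covered by a sync
        self.next_seq = 1

    def submit(self, tx):
        """Steps 4-7: journal the tx (the ack point), then apply it buffered."""
        seq = self.next_seq
        self.next_seq += 1
        self.journal.append((seq, dict(tx)))
        self.applied.update(tx)
        return seq

    def sync(self):
        """Step 8: persist applied state, record the seq, trim the journal."""
        self.durable = dict(self.applied)
        self.committed_seq = self.next_seq - 1
        self.journal = [(s, t) for s, t in self.journal if s > self.committed_seq]

    def crash_and_replay(self):
        """Step 9: drop buffered state and replay the journal past committed_seq."""
        self.applied = dict(self.durable)
        for seq, tx in self.journal:
            if seq > self.committed_seq:
                self.applied.update(tx)


store = ToyFilestore()
store.submit({"a": 1})
store.sync()
store.submit({"b": 2})   # acknowledged, but not yet synced
store.crash_and_replay()
print(store.applied, len(store.journal))
```

The acknowledged but unsynced write survives the crash only because its transaction is still in the journal, which is the guarantee being debated in this thread.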

Re: [ceph-users] Trying to understand the contents of .rgw.buckets.index

2016-01-29 Thread Wido den Hollander


On 29-01-16 11:31, Micha Krause wrote:
> Hi,
> 
> I'm having problems listing the contents of an s3 bucket with ~2M objects.
> 
> I already found the new bucket index sharding feature, but I'm
> interested how these Indexes are stored.
> 
> My index pool shows no space used, and all objects have 0B.
> 
> root@mon01:~ # rados df -p .rgw.buckets.index
> pool name KB  objects   clones degraded 
> unfound   rdrd KB   wrwr KB
> .rgw.buckets.index0   620   
> 0   0 28177514336051356 228972310
> 
> Why would sharding a 0B object make any difference?
> 
The index is stored in the omap of the object which you can list with
the 'rados' command.

So it's not data inside the RADOS object, but in the omap key/value store.
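
A toy model of that distinction, to show why the pool reports 0 KB while holding a large index. The class, the object name and the key names are all made up for illustration, and the accounting function only mimics the behaviour seen in the 'rados df' output above by counting object data and ignoring omap entries.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObject:
    data: bytes = b""                          # the object's byte payload
    omap: dict = field(default_factory=dict)   # separate key/value store


def pool_kb(objects):
    """Toy pool accounting that counts only object data, not omap entries."""
    return sum(len(obj.data) for obj in objects.values()) // 1024


index_pool = {".dir.example-bucket-index": ToyObject()}
for i in range(1000):
    index_pool[".dir.example-bucket-index"].omap[f"object-{i}"] = b"index entry"

omap_entries = len(index_pool[".dir.example-bucket-index"].omap)
print(pool_kb(index_pool), omap_entries)
```

Sharding the index splits those omap entries over several such objects, so it can matter even though each object's data size stays at 0 bytes.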

Wido

> 
> 
> Micha Krause


Re: [ceph-users] ceph.conf file update

2016-01-29 Thread M Ranga Swami Reddy
Thank you... I will use Matt's suggestion to deploy the updated conf files.

Thanks
Swami

On Fri, Jan 29, 2016 at 3:10 PM, Adrien Gillard
 wrote:
> Hi,
>
> No, when an OSD or a MON service starts it fetches its local
> /etc/ceph/ceph.conf file. So, as Matt stated, you need to deploy your
> updated ceph.conf on all the nodes.
>
> Indeed, services contact the MONs, but it is mainly to retrieve the
> crushmap.
>
> Adrien
>
> On Fri, Jan 29, 2016 at 7:46 AM, M Ranga Swami Reddy 
> wrote:
>>
>> HI Matt,
>> Thank you..
>> But - what I understood is - ceph osd service restart will pickup the
>> cephconf from the monitor node.
>> Is this understanding correct?
>>
>> Thanks
>> Swami
>>
>> On Fri, Jan 29, 2016 at 11:46 AM, Matt Taylor  wrote:
>> > Hi Swami,
>> >
>> > You will need to deploy ceph.conf to all respective nodes in order for
>> > changes to be persistent.
>> >
>> > I would probably say it is better to do this from the location where you
>> > did your initial "ceph-deploy".
>> >
>> > eg: ceph-deploy --username ceph --overwrite-conf admin
>> > cephnode{1..15}.somefancyhostname.com
>> >
>> > Thanks,
>> > Matt.
>> >
>> > On 29/01/2016 17:08, M Ranga Swami Reddy wrote:
>> >>
>> >> Hi All,
>> >> I have injected a few conf file changes using the "injectargs" (there
>> >> are mon_ and osd_.). Now I want to preserve them after the ceph mons
>> >> and osds reboot also.
>> >>   I have updated the ceph.conf file to preserve the changes after
>> >> restart of ceph mon services. Do I need to update the ceph.conf file on the
>> >> osd nodes as well in order to preserve the changes?  Or will the OSD nodes
>> >> fetch the ceph.conf file from the monitor nodes?
>> >>
>> >>
>> >> Thanks
>> >> Swami


[ceph-users] Trying to understand the contents of .rgw.buckets.index

2016-01-29 Thread Micha Krause

Hi,

I'm having problems listing the contents of an s3 bucket with ~2M objects.

I already found the new bucket index sharding feature, but I'm interested how 
these Indexes are stored.

My index pool shows no space used, and all objects have 0B.

root@mon01:~ # rados df -p .rgw.buckets.index
pool name KB  objects   clones degraded  
unfound   rdrd KB   wrwr KB
.rgw.buckets.index0   6200  
 0 28177514336051356 228972310

Why would sharding a 0B object make any difference?



Micha Krause


Re: [ceph-users] ceph.conf file update

2016-01-29 Thread Adrien Gillard
Hi,

No, when an OSD or a MON service starts it fetches its local
/etc/ceph/ceph.conf file. So, as Matt stated, you need to deploy your
updated ceph.conf on all the nodes.

Indeed, services contact the MONs, but it is mainly to retrieve the
crushmap.
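
As a rough sketch of what "persisting" injected settings means in practice, the snippet below renders a ceph.conf fragment with per-daemon sections. The two option names are examples chosen for illustration, not the settings from this thread, the values are arbitrary, and `render_ceph_conf` is a hypothetical helper. The resulting text would still have to be merged into the full ceph.conf and deployed to every node as described above.

```python
import configparser
import io


def render_ceph_conf(settings):
    """Render {section: {option: value}} as INI-style text."""
    parser = configparser.ConfigParser()
    for section, options in settings.items():
        parser[section] = {key: str(value) for key, value in options.items()}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Example values only; substitute whatever was changed with injectargs.
persisted = {
    "mon": {"mon_osd_down_out_interval": 600},
    "osd": {"osd_max_backfills": 1},
}

conf_text = render_ceph_conf(persisted)
print(conf_text)
```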

Adrien

On Fri, Jan 29, 2016 at 7:46 AM, M Ranga Swami Reddy 
wrote:

> HI Matt,
> Thank you..
> But - what I understood is - ceph osd service restart will pickup the
> cephconf from the monitor node.
> Is this understanding correct?
>
> Thanks
> Swami
>
> On Fri, Jan 29, 2016 at 11:46 AM, Matt Taylor  wrote:
> > Hi Swami,
> >
> > You will need to deploy ceph.conf to all respective nodes in order for
> > changes to be persistent.
> >
> > I would probably say it is better to do this from the location where you did
> > your initial "ceph-deploy".
> >
> > eg: ceph-deploy --username ceph --overwrite-conf admin
> > cephnode{1..15}.somefancyhostname.com
> >
> > Thanks,
> > Matt.
> >
> > On 29/01/2016 17:08, M Ranga Swami Reddy wrote:
> >>
> >> Hi All,
> >> I have injected a few conf file changes using the "injectargs" (there
> >> are mon_ and osd_.). Now I want to preserve them after the ceph mons
> >> and osds reboot also.
> >>   I have updated the ceph.conf file to preserve the changes after
> >> restart of ceph mon services. Do I need to update the ceph.conf file on the
> >> osd nodes as well in order to preserve the changes?  Or will the OSD nodes
> >> fetch the ceph.conf file from the monitor nodes?
> >>
> >>
> >> Thanks
> >> Swami