Thanks Venky. I also wanted to put forward how this can help in an OpenStack/cloud environment, where we have two distinct admin roles (the virt/OpenStack admin and the storage admin):
1) Gluster volume 'health' should display the health status (OK, warn, fatal/error, etc.)
2) Based on that, the admin can query the 'health status' to know due to which component (AFR, quorum, geo-rep, etc.) the health status is other than OK
3) Based on that component, run the right gluster cmd (scrub status, afr status, split-brain status?, etc.) to go deeper into where the problem lies

1 & 2 can be done by the virt admin, who then alerts the storage admin, who then does 3 to figure out the root cause and take the necessary action. (A rough sketch of how such a pluggable, composite health check could look is appended at the end of this mail, below the quoted thread.)

thanx,
deepak

On Tue, Dec 9, 2014 at 2:52 PM, Venky Shankar <yknev.shan...@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 1:41 PM, Deepak Shetty <dpkshe...@gmail.com> wrote:
> > We can use bitrot to provide a 'health' status for gluster volumes.
> > Hence I would like to propose (from an upstream/community perspective) the
> > notion of 'health' status (as part of gluster volume info) which can derive
> > its value from:
> >
> > 1) Bitrot
> >    If any files are corrupted and bitrot is yet to repair them, and/or it's
> >    a signal to the admin to do some manual operation to repair the corrupted
> >    files (for cases where we only detect, not correct)
> >
> > 2) Brick status
> >    Depending on brick offline/online
> >
> > 3) AFR status
> >    Whether we have all copies in sync or not
>
> This makes sense. Having a notion of "volume health" helps choosing
> intelligently from a list of volumes.
>
> > This I believe is on similar lines to what Ceph does today (health status:
> > OK, WARN, ERROR)
>
> Yes, Ceph derives those notions from PGs. Gluster can do it for
> replicas and/or files marked by the bitrot scrubber.
>
> > The health status derivation can be pluggable, so that in future more
> > components can be added to query for the composite health status of the
> > gluster volume.
> >
> > In all of the above cases, as long as data can be served by the gluster
> > volume reliably, gluster volume status will be Started/Available, but
> > Health status can be 'degraded' or 'warn'
>
> WARN may be too strict, but something lenient enough yet descriptive
> should be chosen. Ceph does it pretty well:
> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/
>
> > This has many uses:
> >
> > 1) It helps provide an indication to the admin that something is amiss and
> > he can check based on:
> >    bitrot / scrub status
> >    brick status
> >    AFR status
> > and take the necessary action
> >
> > 2) It helps mgmt applications (OpenStack, for e.g.) make an intelligent
> > decision based on the health status (whether or not to pick this gluster
> > volume for this create-volume operation), so it acts as a coarse-level
> > filter
> >
> > 3) In general it gives the user an idea of the health of the volume (which
> > is different than the availability status, i.e. whether or not the volume
> > can serve data).
> >    For e.g.: if we have a pure DHT volume and bitrot detects silent file
> > corruption (and we are not auto-correcting), having the Gluster volume
> > status as available/started isn't entirely correct!
>
> +1
>
> > thanx,
> > deepak
> >
> >
> > On Fri, Dec 5, 2014 at 11:31 PM, Venky Shankar <yknev.shan...@gmail.com>
> > wrote:
> >>
> >> On Fri, Nov 28, 2014 at 10:00 PM, Vijay Bellur <vbel...@redhat.com> wrote:
> >> > On 11/28/2014 08:30 AM, Venky Shankar wrote:
> >> >>
> >> >> [snip]
> >> >>>
> >> >>> 1. Can the bitd be one per node like self-heal-daemon and other "global"
> >> >>> services? I worry about creating 2 * N processes for N bricks in a
> >> >>> node. Maybe we can consider having one thread per volume/brick etc. in
> >> >>> a single bitd process to make it perform better.
> >> >>
> >> >> Absolutely.
> >> >> There would be one bitrot daemon per node, per volume.
> >> >>
> >> > Do you foresee any problems in having one daemon per node for all
> >> > volumes?
> >>
> >> Not technically :). Probably that's a nice thing to do.
> >>
> >> >>> 3. I think the algorithm for checksum computation can vary within the
> >> >>> volume. I see a reference to "Hashtype is persisted alongside the
> >> >>> checksum and can be tuned per file type." Is this correct? If so:
> >> >>>
> >> >>> a) How will the policy be exposed to the user?
> >> >>
> >> >> The bitrot daemon would have a configuration file that can be configured
> >> >> via the Gluster CLI. Tuning hash types could be based on file types or
> >> >> file name patterns (regexes) [which is a bit tricky as bitrot would
> >> >> work on GFIDs rather than filenames, but this can be solved by a level
> >> >> of indirection].
> >> >>
> >> >>> b) It would be nice to have the algorithm for computing checksums be
> >> >>> pluggable. Are there any thoughts on pluggability?
> >> >>
> >> >> Do you mean the default hash algorithm be configurable? If yes, then
> >> >> that's planned.
> >> >
> >> > Sounds good.
> >> >
> >> >>> c) What are the steps involved in changing the hashtype/algorithm for a
> >> >>> file?
> >> >>
> >> >> Policy changes for file {types, patterns} are lazy, i.e., taken into
> >> >> effect during the next recompute. For objects that are never modified
> >> >> (after the initial checksum compute), scrubbing can recompute the
> >> >> checksum using the new hash _after_ verifying the integrity of the file
> >> >> with the old hash.
> >> >
> >> >>> 4. Is the fop on which change detection gets triggered configurable?
> >> >>
> >> >> As of now all data modification fops trigger checksum calculation.
> >> >>
> >> > Wish I was more clear on this in my OP. Is the fop on which checksum
> >> > verification/bitrot detection happens configurable? The feature page
> >> > talks about "open" being a trigger point for this. Users might want to
> >> > trigger detection on a "read" operation and not on open. It would be
> >> > good to provide this flexibility.
> >>
> >> Ah! OK. As of now it's mostly open() and read(). Inline verification
> >> would be "off" by default due to obvious reasons.
> >>
> >> >>> 6. Any thoughts on integrating the bitrot repair framework with
> >> >>> self-heal?
> >> >>
> >> >> There are some thoughts on integration with the self-heal daemon and EC.
> >> >> I'm coming up with a doc which covers those [the reason for the delay in
> >> >> replying to your questions ;)]. Expect the doc in gluster-devel@ soon.
> >> >
> >> > Will look forward to this.
> >> >
> >> >>> 7. How does detection figure out that lazy updation is still pending
> >> >>> and not raise a false positive?
> >> >>
> >> >> That's one of the things that Rachana and I discussed yesterday.
> >> >> Should scrubbing *wait* while checksum updating is still in progress, or
> >> >> is it expected that scrubbing happens when there are no active I/O
> >> >> operations on the volume (both of which imply that the bitrot daemon
> >> >> needs to know when it's done its job)?
> >> >>
> >> >> If both scrub and checksum updating go in parallel, then there needs
> >> >> to be a way to synchronize those operations. Maybe compute the checksum
> >> >> on priority, which is provided by the scrub process as a hint (that
> >> >> leaves little window for rot though)?
> >> >>
> >> >> Any thoughts?
> >> >
> >> > Waiting for no active I/O in the volume might be a difficult condition
> >> > to reach in some deployments.
> >> >
> >> > Some form of waiting is necessary to prevent false positives. One
> >> > possibility might be to mark an object as dirty till checksum updation
> >> > is complete. Verification/scrub can then be skipped for dirty objects.
> >>
> >> Makes sense. Thanks!
> >>
> >> > -Vijay
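
To make the 'pluggable health derivation' idea above a little more concrete, here is a minimal sketch (plain Python, not actual gluster code; the checker names, bodies and thresholds are placeholders I made up) of how a composite OK/WARN/ERROR value could be derived from per-component checks:

OK, WARN, ERROR = 0, 1, 2
NAMES = {OK: "OK", WARN: "WARN", ERROR: "ERROR"}

def bitrot_health(volume):
    # placeholder: would ask the scrubber for unrepaired corrupt objects
    corrupt_objects = 0
    return ERROR if corrupt_objects else OK

def brick_health(volume):
    # placeholder: would check the online/offline state of each brick
    offline_bricks = 0
    return WARN if offline_bricks else OK

def afr_health(volume):
    # placeholder: would check whether all replica copies are in sync
    out_of_sync_files = 0
    return WARN if out_of_sync_files else OK

# "pluggable": adding a new component (quorum, geo-rep, ...) is just
# another entry in this table
CHECKERS = [("bitrot", bitrot_health),
            ("bricks", brick_health),
            ("afr", afr_health)]

def volume_health(volume):
    per_component = {name: check(volume) for name, check in CHECKERS}
    overall = max(per_component.values())   # worst component status wins
    return NAMES[overall], {n: NAMES[s] for n, s in per_component.items()}

if __name__ == "__main__":
    status, detail = volume_health("myvol")
    print("health: %s %s" % (status, detail))

'gluster volume info/status' would then only need to surface the aggregated value plus the per-component breakdown, which is exactly what steps 1 and 2 of the admin workflow above need.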
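Similarly, a very rough sketch of the dirty-object marking Vijay suggests near the end of the quoted thread (again Python, Linux-only, purely illustrative: the xattr names and the choice of SHA-256 are my assumptions, not the actual bitrot design). The point is just that the scrubber never verifies an object whose checksum update is still pending, so lazy updation cannot show up as a false positive:

import os
import hashlib

DIRTY_XATTR = b"user.bitrot.dirty"     # hypothetical marker name
SUM_XATTR   = b"user.bitrot.sha256"    # hypothetical checksum xattr

def mark_dirty(path):
    # set before (or along with) any data modification
    os.setxattr(path, DIRTY_XATTR, b"1")

def recompute_checksum(path):
    # lazy checksum update; clearing the marker makes the object
    # eligible for scrubbing again
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.setxattr(path, SUM_XATTR, digest.encode())
    os.removexattr(path, DIRTY_XATTR)

def scrub_one(path):
    try:
        os.getxattr(path, DIRTY_XATTR)
        return "skipped (dirty, checksum update pending)"
    except OSError:
        pass                           # no dirty marker -> safe to verify
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    stored = os.getxattr(path, SUM_XATTR).decode()
    return "ok" if digest == stored else "CORRUPT"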