On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote: > Dave Chinner <da...@fromorbit.com> writes: > > > On Mon, Apr 25, 2016 at 11:53:13PM +0000, Verma, Vishal L wrote: > >> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote: > > You're assuming that only the DAX aware application accesses it's > > files. users, backup programs, data replicators, fileystem > > re-organisers (e.g. defragmenters) etc all may access the files and > > they may throw errors. What then? > > I'm not sure how this is any different from regular storage. If an > application gets EIO, it's up to the app to decide what to do with that.
Sure - they'll fail. But the question I'm asking is that if the application that owns the data is supposed to do error recovery, what happens when a 3rd party application hits an error? If that consumes the error, the the app that owns the data won't ever get a chance to correct the error. This is a minefield - a 3rd party app that swallows and clears DAX based IO errors is a data corruption vector. can yo imagine if *grep* did this? The model that is being promoted here effectively allows this sort of behaviour - I don't really think we should be architecting an error recovery strategy that has the capability to go this wrong.... > >> > Where does the application find the data that was lost to be able to > >> > rewrite it? > >> > >> The data that was lost is gone -- this assumes the application has some > >> ability to recover using a journal/log or other redundancy - yes, at the > >> application layer. If it doesn't have this sort of capability, the only > >> option is to restore files from a backup/mirror. > > > > So the architecture has a built in assumption that only userspace > > can handle data loss? > > Remember that the proposed programming model completely bypasses the > kernel, so yes, it is expected that user-space will have to deal with > the problem. No, it doesn't completely bypass the kernel - the kernel is the infrastructure that catches the errors in the first place, and it owns and controls all the metadata that corresponds to the physical location of that error. The only thing the kernel doesn't own is the *contents* of that location. > > What about filesytsems like NOVA, that use log structured design to > > provide DAX w/ update atomicity and can potentially also provide > > redundancy/repair through the same mechanisms? Won't pmem native > > filesystems with built in data protection features like this remove > > the need for adding all this to userspace applications? > > I don't think we'll /only/ support NOVA for pmem. So we'll have to deal > with this for existing file systems, right? Yes, but that misses my point that it seems that the design is only focussed on userspace and existing filesystems and there is no consideration of kernel side functionality that could do transparent recovery.... > > If so, shouldn't that be the focus of development rahter than > > placing the burden on userspace apps to handle storage repair > > situations? > > It really depends on the programming model. In the model Vishal is > talking about, either the applications themselves or the libraries they > link to are expected to implement the redundancies where necessary. IOWs, filesystems no longer have any control over data integrity. Yet it's the filesystem developers who will still be responsible for data integrity and when the filesystem has a data coruption event we'll get blamed and the filesystem gets a bad name, even though it's entirely the applications fault. We've seen this time and time again - application developers cannot be trusted to guarantee data integrity. yes, some apps will be fine, but do you really expect application devs that refuse to use fsync because it's too slow are going to have a different approach to integrity when it comes to DAX? > >> > There's an implicit assumption that applications will keep redundant > >> > copies of their data at the /application layer/ and be able to > >> > automatically repair it? > > That's one way to do things. It really depends on the application what > it will do for recovery. > > >> > And then there's the implicit assumption that it will unlink and > >> > free the entire file before writing a new copy > > I think Vishal was referring to restoring from backup. cp itself will > truncate the file before overwriting, iirc. Which version of cp? what happens if they use --sparse and the error is in a zeroed region? There's so many assumptions about undefined userspace environment, application and user behaviour being made here, and it's all being handwaved away. I'm asking for this to be defined, demonstrated and documented as a working model that cannot be abused and doesn't have holes the size of trucks in it, not handwaving... > >> To summarize, the two cases we want to handle are: > >> 1. Application has inbuilt recovery: > >> - hits badblock > >> - figures out it is able to recover the data > >> - handles SIGBUS or EIO > >> - does a (sector aligned) write() to restore the data > > > > The "figures out" step here is where >95% of the work we'd have to > > do is. And that's in filesystem and block layer code, not > > userspace, and userspace can't do that work in a signal handler. > > And it can still fall down to the second case when the application > > doesn't have another copy of the data somewhere. > > I read that "figures out" step as the application determining whether or > not it had a redundant copy. Another undocumented assumption, that doesn't simplify what needs to be done. Indeed, userspace can't do that until it is in SIGBUS context, which tends to imply applications need to do a major amount of work from within the signal handler.... > > FWIW, we don't have a DAX enabled filesystem that can do > > reverse block mapping, so we're a year or two away from this being a > > workable production solution from the filesystem perspective. And > > AFAICT, it's not even on the roadmap for dm/md layers. > > Do we even need that? What if we added an FIEMAP flag for determining > bad blocks. So you're assuming that the filesystem has been informed of the bad blocks and has already marked the bad regions of the file in it's extent list? How does that happen? What mechanism is used for the underlying block device to inform the filesytem that theirs a bad LBA, and how does the filesytem the map that to a path/file/offset with reverse mapping? Or is there some other magic that hasn't been explained happening here? > The file system could simply walk the list of extents for > the file and check the corresponding disk blocks. No reverse mapping > required. You're expecting the filesystem to poll the block device to find bad sectors? Ignoring the fact this is the sort of brute force scan we need reverse mapping to avoid, how does the filesystem know what file/extent list it should be searching when the block device informs it there is a bad sector somewhere? i.e. what information does the MCE convey to the block device, and what does the block device pass to the filesytem so the filesystem can do one of these scans? If the block device is only passing LBAs or a generic "new bad block has been found" message, the filesystem still has to do an full scan of it's metadata to find the owner of the LBA(s) that have gone bad.... Nobody is explaining these important little details - there seems to be an assumption that everyone "knows" how this is all going to work and that we have infrastructure that can make it work. Just because we might be able to present bad block information to userspace via FIEMAP doesn't mean that it's trivial to implement. THE FIEMAP flag is trivial - connecting the dots is the hard part and nobody is explaining to me how that is all supposed to be done. > Also note that DM/MD don't support direct_access(), either, > so I don't think they're relevant for this discussion. But they could for linear concatenation, which would be extremely useful. e.g. stitching per-node non-linear pmem into a single linear LBA range.... > >> 2. Application doesn't have any inbuilt recovery mechanism > >> - hits badblock > >> - gets SIGBUS (or EIO) and crashes > >> - Sysadmin restores file from backup > > > > Which is no different to an existing non-DAX application getting an > > EIO/sigbus from current storage technologies. > > > > Except: in the existing storage stack, redundancy and correction has > > already had to have failed for the application to see such an error. > > Hence this is normally considered a DR case as there's had to be > > cascading failures (e.g. multiple disk failures in a RAID) to get > > to this stage, not a single error in a single sector in > > non-redundant storage. > > > > We need some form of redundancy and correction in the PMEM stack to > > prevent single sector errors from taking down services until an > > administrator can correct the problem. I'm trying to understand > > where this is supposed to fit into the picture - at this point I > > really don't think userspace applications are going to be able to do > > this reliably.... > > Not all storage is configured into a RAID volume, and in some instances, > the application is better positioned to recover the data (gluster/ceph, > for example). Right, but they still rely on the filesystem to provide data integrity guarantees to work correctly. While they have "node level" redundancy, operations within the node still need to work correctly and so they'd still need all the kernel/fs side functionality to provide them with error information (like fiemap bad blocks) on top of all the new error detectiona nd correction code they'd need to support this... FWIW, the whole point of DAX on existing filesystems was to not need major changes to existing filesystems to support fast pmem operations. i.e. to get something working quickly while pmem native filesytems are developed to support pmem and all it's quirks in a clean and efficient manner. Instead, what I'm seeing now is a trend towards forcing existing filesystems to support the requirements and quirks of DAX and pmem, without any focus on pmem native solutions. i.e. I'm hearing "we need major surgery to existing filesystems and block devices to make DAX work" rather than "how do we make this efficient for a pmem native solution rather than being bound to block device semantics"? Cheers, Dave. -- Dave Chinner da...@fromorbit.com