RE: atomic write T10 standards
The proposed SCSI atomic commands - WRITE ATOMIC, READ ATOMIC, WRITE SCATTERED, and READ GATHERED - all include FUA (force unit access) bits, just like other WRITE and READ commands. Also, the SYNCHRONIZE CACHE command affects atomic writes just like non-atomic writes.

With the FUA bit set to zero (don't force), if logical block data from an atomic write is stuck in a volatile write cache (not yet written to the medium), then:
a) reads before a power loss return all of the logical block data from that atomic write; and
b) reads after a power loss return none of the logical block data from that atomic write.

Someone using a drive with a volatile write cache without setting FUA to one or using SYNCHRONIZE CACHE is accepting that any number of writes (atomic or non-atomic) may be lost on power loss. A common example use case is video editing.

Before power loss, the atomic promises are honored; reads won't return part of the logical block data from an atomic write. After power loss, some of those writes will appear to never have happened. The atomic writes that were written to medium must have completely been written to medium, though - power loss is not an excuse to break atomicity.

---
Rob Elliott, HP Server Storage

> -----Original Message-----
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Thursday, 04 July, 2013 7:35 AM
> To: Vladislav Bolkhovitin
> Cc: Chris Mason; James Bottomley; Martin K. Petersen; linux-s...@vger.kernel.org
> Subject: Re: atomic write T10 standards
>
> The worry I have is that we then have this intermediate state where we have sent the array down a scattered IO which is marked as atomic. Can we trust the array to lose all of those parts on power failure or lose none of them before we send down a queue flush of some kind? Not to mention we still end up having to persist a broader range of data than we would otherwise need.
>
> Even worse nightmare would be sending down atomic scattered write A, followed by atomic scattered write B, ..., scattered atomic write Y - all without a sync followed by a crash. What semantics or ordering promises do we have in this case if the power drops? Is there a promise that they are durable in the sequence sent to the target, or could we end up with a write B and not a write A after a crash?
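The FUA = 0 behaviour Rob describes (all of the data visible before power loss, none of it after) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyDisk` and its methods are hypothetical names, not SCSI commands or any kernel API; the model loses the whole cache on power loss; and its FUA path flushes older cached writes too, which a real device need not do.

```python
class ToyDisk:
    """Toy model of a disk with a volatile write cache (not real SCSI)."""

    def __init__(self):
        self.medium = {}  # lba -> data; survives power loss
        self.cache = []   # whole atomic writes still held in volatile cache

    def write_atomic(self, blocks, fua=False):
        """Queue all blocks as one unit; fua=True forces them to the medium."""
        self.cache.append(dict(blocks))
        if fua:
            # Simplification of this model: FUA also destages older cached
            # writes, so the medium never sees writes out of order.
            self.synchronize_cache()

    def synchronize_cache(self):
        """Move every cached atomic write, whole, onto the medium."""
        for blocks in self.cache:
            self.medium.update(blocks)
        self.cache.clear()

    def read(self, lba):
        """The newest cached copy wins; otherwise read from the medium."""
        for blocks in reversed(self.cache):
            if lba in blocks:
                return blocks[lba]
        return self.medium.get(lba)

    def power_loss(self):
        """Volatile cache contents are gone; the medium is untouched."""
        self.cache.clear()


disk = ToyDisk()
disk.write_atomic({0: "A0", 1: "A1"})      # FUA = 0, stays in cache
before = (disk.read(0), disk.read(1))      # all of the atomic write
disk.power_loss()
after = (disk.read(0), disk.read(1))       # none of the atomic write

disk.write_atomic({0: "B0", 1: "B1"}, fua=True)
disk.power_loss()
after_fua = (disk.read(0), disk.read(1))   # survived: it was on the medium
```

In this model a reader never observes half of an atomic write, which is the property Rob says power loss must not break.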
Re: atomic write T10 standards
On 07/05/2013 11:34 AM, Elliott, Robert (Server Storage) wrote:
> The atomic writes that were written to medium must have completely been written to medium, though - power loss is not an excuse to break atomicity.

Thanks for filling in the details of the specification. I think that this answers all of my questions.

Ric
Re: atomic write T10 standards
On 07/03/2013 11:18 PM, Vladislav Bolkhovitin wrote:
> Ric Wheeler, on 07/03/2013 11:31 AM wrote:
>>>> If our existing journal commit is:
>>>>
>>>> * write the data blocks for a transaction
>>>> * flush
>>>> * write the commit block for the transaction
>>>> * flush
>>>>
>>>> Which part of this does an atomic write help?
>>>>
>>>> We would still need at least:
>>>>
>>>> * atomic write of data blocks & commit blocks
>>>> * flush
>
> Not necessarily. Consider a case when you are creating many small files in a big directory. Consider that every such operation needs 3 actions: add a new directory entry, get free space and write data there. If 1 atomic write (scattered) command is used for each operation and you order them between each other, if needed, in some way, e.g. by using the ORDERED SCSI attribute or queue draining, you don't need any intermediate flushes. Only one final flush would be sufficient. In case of a crash, simply some of the new files would disappear, but everything would be fully consistent, so the only needed recovery would be to recreate them.

The worry I have is that we then have this intermediate state where we have sent the array down a scattered IO which is marked as atomic.

Can we trust the array to lose all of those parts on power failure or lose none of them before we send down a queue flush of some kind? Not to mention we still end up having to persist a broader range of data than we would otherwise need.

An even worse nightmare would be sending down atomic scattered write A, followed by atomic scattered write B, ..., scattered atomic write Y - all without a sync, followed by a crash. What semantics or ordering promises do we have in this case if the power drops? Is there a promise that they are durable in the sequence sent to the target, or could we end up with a write B and not a write A after a crash?

>> The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way.
>>
>> I still see it would be useful to have the atomic write really be atomic and durable just for that IO - no flush needed.
>>
>> Can you give a sequence for the use case for the non-durable atomic write that would not need a sync?
>
> See above.

Your above example still had a flush (or use of ORDERED SCSI commands).

>> Can we really trust all devices to make something atomic that is not durable :) ?
>
> Sure, if application allows that and the atomicity property itself is durable, why not?
>
> Vlad
>
> P.S. With atomic writes there's no need in a journal, no?

Durable and atomic are not the same - we need to make sure that the specification is clear and that the behaviours are uniform (mandated) if we are to make use of them.

We have been burnt in the past by things like the TRIM command leaving stale data, for example with some vendors and not others (leading to an update of the spec :))

I think that you would need to have durability between the atomic writes in order to do away with the journal.

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
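Ric's ordering question (could write B survive a crash when write A did not?) can be made concrete by enumerating the post-crash states each possible promise would allow. This is a sketch only: the function names are hypothetical, and it does not say which of the two behaviours the T10 proposal actually mandates.

```python
from itertools import combinations


def survivors_unordered(writes):
    """If atomicity is the only promise, any subset of whole writes may survive."""
    return [set(c) for r in range(len(writes) + 1)
            for c in combinations(writes, r)]


def survivors_ordered(writes):
    """If writes also persist in submission order, only prefixes may survive."""
    return [set(writes[:k]) for k in range(len(writes) + 1)]


def b_without_a(states):
    """Post-crash states in which write B is on the medium but write A is not."""
    return [s for s in states if "B" in s and "A" not in s]


writes = ["A", "B", "C"]  # atomic scattered writes sent without a sync
unordered_bad = b_without_a(survivors_unordered(writes))
ordered_bad = b_without_a(survivors_ordered(writes))
```

With independent file creations, as in Vlad's example, every state in either list is consistent. The states collected in `unordered_bad` only matter when B depends on A, for instance when B allocates space that A freed.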
atomic write T10 standards
On 07/03/2013 11:00 AM, James Bottomley wrote:
On Wed, 2013-07-03 at 10:56 -0400, Ric Wheeler wrote:
On 07/03/2013 10:38 AM, Chris Mason wrote:
Quoting Ric Wheeler (2013-07-03 10:34:04)

As I was out walking Skeeter this morning, I was thinking a bit about the new T10 atomic write proposal that Chris spoke about some time back.

Specifically, I think that we would see a value only if the atomic write was also durable - if not, we need to always issue a SYNCHRONIZE_CACHE command which would mean it really is not effectively more useful than a normal write?

Did I understand the proposal correctly? If I did, should we poke the usual T10 posse to nudge them (David Black, Fred Knight, etc?)...

I don't think the atomic writes should be a special case here. We've already got the cache flush and fua machinery and should just apply it on top of the atomic constructs...

-chris

I should have sent this to the linux-scsi list I suppose, but wanted clarity before embarrassing myself :)

Yes, it is better to have a wider audience

Adding in linux-scsi

If we have to use fua/flush after an atomic write, what makes it atomic? Why not just use a normal write? It does not seem to add anything that write + flush/fua does?

It adds the all or nothing that we can use to commit journal entries without having to worry about atomicity. The guarantee is that everything makes it or nothing does.

I still don't see the difference in write + SYNC_CACHE versus atomic write + SYNC_CACHE.

If the write is atomic and not durable, it is not really usable as a hard promise until after we flush it somehow.

In theory, if we got ordered tags working to ensure transaction vs data ordering, this would mean we wouldn't have to flush at all because the disk image would always be journal consistent ... a bit like the old soft update scheme.

James

Why not have the atomic write actually imply that it is atomic and durable for just that command?

Ric
Re: atomic write T10 standards
On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
> Why not have the atomic write actually imply that it is atomic and durable for just that command?

I don't understand why you think you need guaranteed durability for every journal transaction? That's what causes us performance problems because we have to pause on every transaction commit.

We require durability for explicit flushes, obviously, but we could achieve far better performance if we could just let the filesystem updates stream to the disk and rely on atomic writes making sure the journal entries were all correct. The reason we require durability for journal entries today is to ensure caching effects don't cause the journal to lie or be corrupt.

James
Re: atomic write T10 standards
On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
> Why would we use atomic writes for things that don't need to be durable?
>
> Avoiding a torn page write seems to be the only real difference here if you use the atomic operations and don't have durability...

It's not just about torn pages: Journal entries are big complex beasts. They can be megabytes big (at least on xfs). If we can guarantee all or nothing atomicity in the entire journal entry write it permits a more streaming design of the filesystem writeout path.

James
Re: atomic write T10 standards
On 07/03/2013 11:37 AM, James Bottomley wrote:
> It's not just about torn pages: Journal entries are big complex beasts. They can be megabytes big (at least on xfs). If we can guarantee all or nothing atomicity in the entire journal entry write it permits a more streaming design of the filesystem writeout path.

Journals are normally big (128MB or so?) - I don't think that this is unique to xfs.

If our existing journal commit is:

* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:

* atomic write of data blocks & commit blocks
* flush

Right?

Ric
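The two commit sequences Ric compares can be written out as command streams to make the difference countable. This is a minimal sketch: the operation names and helper functions are hypothetical labels for illustration, not real block-layer or SCSI calls.

```python
def commit_classic(log, data_blocks, commit_block):
    """Existing scheme: the flush between data and commit block keeps them ordered."""
    log.append(("write", list(data_blocks)))
    log.append(("flush",))
    log.append(("write", [commit_block]))
    log.append(("flush",))


def commit_atomic(log, data_blocks, commit_block):
    """Atomic scheme: data and commit block go down as one all-or-nothing write."""
    log.append(("write_atomic", list(data_blocks) + [commit_block]))
    log.append(("flush",))


def count(log, op):
    """Number of commands of the given kind in a command stream."""
    return sum(1 for entry in log if entry[0] == op)


classic, atomic = [], []
commit_classic(classic, ["d1", "d2"], "commit")
commit_atomic(atomic, ["d1", "d2"], "commit")

classic_flushes = count(classic, "flush")  # 2
atomic_flushes = count(atomic, "flush")    # 1
```

In this sketch the atomic write removes the ordering flush between data and commit block. The remaining flush is the durability question the rest of the thread argues about.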
Re: atomic write T10 standards
Quoting Ric Wheeler (2013-07-03 11:42:38)
> Journals are normally big (128MB or so?) - I don't think that this is unique to xfs.

We're mixing a bunch of concepts here. The filesystems have a lot of different requirements, and atomics are just one small part.

Creating a new file often uses resources freed by past files. So deleting the old must be ordered against allocating the new. They are really separate atomic units but you can't handle them completely independently.

> If our existing journal commit is:
>
> * write the data blocks for a transaction
> * flush
> * write the commit block for the transaction
> * flush
>
> Which part of this does an atomic write help?
>
> We would still need at least:
>
> * atomic write of data blocks & commit blocks
> * flush

Yes. But just because we need the flush here doesn't mean we need the flush for every single atomic write.

-chris
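Chris's point, that needing a flush somewhere does not mean flushing after every atomic write, can be sketched as a stream of atomic commits with an occasional flush. The names below are hypothetical, and the model assumes a flush makes every earlier write in the stream durable.

```python
def stream_commits(n_txns, flush_every):
    """Issue n_txns atomic commits, flushing once per flush_every commits."""
    ops = []
    for txn in range(1, n_txns + 1):
        ops.append(("write_atomic", txn))
        if txn % flush_every == 0:
            ops.append(("flush",))
    return ops


def guaranteed_durable(ops):
    """Transactions covered by a later flush; the rest may vanish on power loss."""
    durable, pending = [], []
    for op in ops:
        if op[0] == "write_atomic":
            pending.append(op[1])
        else:  # a flush makes everything issued so far durable (model assumption)
            durable.extend(pending)
            pending.clear()
    return durable


ops = stream_commits(n_txns=10, flush_every=4)
flushes = sum(1 for op in ops if op[0] == "flush")
durable = guaranteed_durable(ops)
```

Here ten commits cost two flushes instead of ten. Transactions 9 and 10 are atomic but not yet guaranteed durable, which is the trade-off Ric questions in his reply.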
Re: atomic write T10 standards
On 07/03/2013 11:54 AM, Chris Mason wrote:
> Yes. But just because we need the flush here doesn't mean we need the flush for every single atomic write.

The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way.

I still think it would be useful to have the atomic write really be atomic and durable just for that IO - no flush needed.

Can you give a sequence for the use case for the non-durable atomic write that would not need a sync?

Can we really trust all devices to make something atomic that is not durable :) ?

thanks!

ric
Re: atomic write T10 standards
Quoting Ric Wheeler (2013-07-03 14:31:59)
> The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way.

This is only partially true, since you're extending the sata drive model into atomics, and the devices implementing atomics (so far anyway) are not sata.

> I still see it would be useful to have the atomic write really be atomic and durable just for that IO - no flush needed.

In sata speak, it could go down as atomic + FUA + NCQ. In practice this is going to be in fusionio, nvme devices and big storage arrays, all of which we can expect to have proper knobs for lies about IO that isn't really done yet.

> Can you give a sequence for the use case for the non-durable atomic write that would not need a sync? Can we really trust all devices to make something atomic that is not durable :) ?

Today's usage is mostly O_DIRECT, which really should be FUA. Long term we can hope people will find more interesting uses. Either way the point is that an atomic write is a grouping mechanism, and if the standards people want to control fuaness in a separate bit, that's really fine.

-chris
Re: atomic write T10 standards
On 07/03/2013 02:54 PM, Chris Mason wrote:
Quoting Ric Wheeler (2013-07-03 14:31:59)
On 07/03/2013 11:54 AM, Chris Mason wrote:
Quoting Ric Wheeler (2013-07-03 11:42:38)
On 07/03/2013 11:37 AM, James Bottomley wrote:
On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
On 07/03/2013 11:22 AM, James Bottomley wrote:
On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
Why not have the atomic write actually imply that it is atomic and durable for just that command?
I don't understand why you think you need guaranteed durability for every journal transaction? That's what causes us performance problems because we have to pause on every transaction commit. We require durability for explicit flushes, obviously, but we could achieve far better performance if we could just let the filesystem updates stream to the disk and rely on atomic writes making sure the journal entries were all correct. The reason we require durability for journal entries today is to ensure caching effects don't cause the journal to lie or be corrupt.
Why would we use atomic writes for things that don't need to be durable? Avoiding a torn page write seems to be the only real difference here if you use the atomic operations and don't have durability...
It's not just about torn pages: Journal entries are big complex beasts. They can be megabytes big (at least on xfs). If we can guarantee all-or-nothing atomicity in the entire journal entry write it permits a more streaming design of the filesystem writeout path. James
Journals are normally big (128MB or so?) - I don't think that this is unique to xfs. We're mixing a bunch of concepts here. The filesystems have a lot of different requirements, and atomics are just one small part. Creating a new file often uses resources freed by past files. So deleting the old must be ordered against allocating the new. They are really separate atomic units but you can't handle them completely independently.
If our existing journal commit is:
* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush
Which part of this does an atomic write help? We would still need at least:
* atomic write of data blocks and commit blocks
* flush
Yes. But just because we need the flush here doesn't mean we need the flush for every single atomic write. -chris
The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way.
This is only partially true, since you're extending the sata drive model into atomics, and the devices implementing atomics (so far anyway) are not sata.
I still see it would be useful to have the atomic write really be atomic and durable just for that IO - no flush needed.
In sata speak, it could go down as atomic + FUA + NCQ. In practice this is going to be in fusionio, nvme devices and big storage arrays, all of which we can expect to have proper knobs for lies about IO that isn't really done yet.
Can you give a sequence for the use case for the non-durable atomic write that would not need a sync? Can we really trust all devices to make something atomic that is not durable :) ?
Today's usage is mostly O_DIRECT, which really should be FUA. Long term we can hope people will find more interesting uses. Either way the point is that an atomic write is a grouping mechanism, and if the standards people want to control fuaness in a separate bit, that's really fine. -chris
That makes sense to me - happy to have a bit to indicate durability in the atomic operation...
Ric
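The two commit sequences compared in this exchange can be sketched roughly as follows. This is only an illustration under stated assumptions: `os.fsync` stands in for a device cache flush, a single `os.write` stands in for an atomic write (an ordinary write gives no such guarantee), and the commit record layout (length plus CRC32) and all function names are hypothetical, not taken from any filesystem.

```python
import os
import struct
import tempfile
import zlib


def _commit_record(payload: bytes) -> bytes:
    # Hypothetical commit block: payload length plus CRC32 of the payload.
    return struct.pack("<II", len(payload), zlib.crc32(payload))


def commit_classic(fd: int, blocks: list) -> int:
    """Write data blocks, flush, write commit block, flush. Returns flush count."""
    payload = b"".join(blocks)
    os.write(fd, payload)                  # journal data blocks
    os.fsync(fd)                           # flush 1: data must land before commit
    os.write(fd, _commit_record(payload))  # commit block
    os.fsync(fd)                           # flush 2: make the commit durable
    return 2


def commit_with_atomic_write(fd: int, blocks: list) -> int:
    """Data and commit block go down as one unit, then a single flush."""
    payload = b"".join(blocks)
    # Assumption: this one write is all-or-nothing on the device.
    os.write(fd, payload + _commit_record(payload))
    os.fsync(fd)                           # only needed when durability is required
    return 1


def demo() -> list:
    """Run both sequences on temporary files; return (flushes, file size) pairs."""
    blocks = [b"a" * 512, b"b" * 512]
    results = []
    for commit in (commit_classic, commit_with_atomic_write):
        fd, path = tempfile.mkstemp()
        try:
            flushes = commit(fd, blocks)
            size = os.fstat(fd).st_size
        finally:
            os.close(fd)
            os.remove(path)
        results.append((flushes, size))
    return results
```

Both variants leave the same bytes behind; the difference is that the atomic variant no longer needs the intermediate flush whose only job was to order data ahead of the commit block.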
Re: atomic write T10 standards
Ric Wheeler, on 07/03/2013 11:31 AM wrote:
Journals are normally big (128MB or so?) - I don't think that this is unique to xfs. We're mixing a bunch of concepts here. The filesystems have a lot of different requirements, and atomics are just one small part. Creating a new file often uses resources freed by past files. So deleting the old must be ordered against allocating the new. They are really separate atomic units but you can't handle them completely independently.
If our existing journal commit is:
* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush
Which part of this does an atomic write help? We would still need at least:
* atomic write of data blocks and commit blocks
* flush
Not necessary. Consider a case where you are creating many small files in a big directory. Every such operation needs 3 actions: add a new directory entry, get free space and write data there. If 1 atomic write (scattered) command is used for each operation and you order them relative to each other, if needed, in some way, e.g. by using the ORDERED SCSI attribute or queue draining, you don't need any intermediate flushes. Only one final flush would be sufficient. In case of a crash, some of the new files would simply disappear, but everything would be fully consistent, so the only needed recovery would be to recreate them.
The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way. I still see it would be useful to have the atomic write really be atomic and durable just for that IO - no flush needed. Can you give a sequence for the use case for the non-durable atomic write that would not need a sync?
See above.
Can we really trust all devices to make something atomic that is not durable :) ?
Sure, if the application allows that and the atomicity property itself is durable, why not?
Vlad
P.S. With atomic writes there's no need for a journal, no?
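The many-small-files case described above can be modelled with a toy simulation. This is a sketch, not a model of any real device or SCSI command: the `ToyDevice` class, its methods and the key layout are all hypothetical, and it assumes that cached atomic groups reach the medium whole and in submission order (the role of the ORDERED attribute or queue draining in the message).

```python
class ToyDevice:
    """Toy device with a volatile cache of ordered, all-or-nothing write groups."""

    def __init__(self):
        self.medium = {}  # what survives power loss
        self.cache = []   # atomic groups not yet on the medium, in order

    def write_atomic(self, group):
        # One scattered atomic write: every block in the group, or none.
        self.cache.append(dict(group))

    def flush(self):
        # Single final flush: push every cached group to the medium.
        for group in self.cache:
            self.medium.update(group)
        self.cache.clear()

    def crash(self, groups_persisted):
        # Power loss: only the first `groups_persisted` groups made it, each
        # one completely; the remaining groups vanish as if never written.
        for group in self.cache[:groups_persisted]:
            self.medium.update(group)
        self.cache.clear()


def create_file(dev, name, data):
    # The three actions of one file creation travel in one atomic group.
    dev.write_atomic({
        ("dirent", name): True,  # new directory entry
        ("alloc", name): True,   # free space claimed
        ("data", name): data,    # file contents
    })


def is_consistent(medium):
    # Consistent means no file is left with only some of its three pieces.
    names = {key[1] for key in medium}
    return all(
        (kind, name) in medium
        for name in names
        for kind in ("dirent", "alloc", "data")
    )


def surviving_files(medium):
    return sorted(key[1] for key in medium if key[0] == "dirent")
```

For example, creating five files and then calling `crash(3)` leaves three complete files on the medium and nothing partial, so recovery amounts to recreating the two missing files.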