RE: atomic write & T10 standards

2013-07-05 Thread Elliott, Robert (Server Storage)
The proposed SCSI atomic commands - WRITE ATOMIC, READ ATOMIC, WRITE SCATTERED, 
and READ GATHERED - all include FUA (force unit access) bits, just like other 
WRITE and READ commands.  Also, the SYNCHRONIZE CACHE command affects atomic 
writes just like non-atomic writes.

With the FUA bit set to zero (don't force), if logical block data from an 
atomic write is stuck in a volatile write cache (not yet written to the 
medium), then:
a) reads before a power loss return all of the logical block data from that 
atomic write; and
b) reads after a power loss return none of the logical block data from that 
atomic write.

Someone using a drive with a volatile write cache without setting FUA to one or 
using SYNCHRONIZE CACHE is accepting that any number of writes (atomic or 
non-atomic) may be lost on power loss.  A common example use case is video 
editing.  Before power loss, the atomic promises are honored; reads won't 
return part of the logical block data from an atomic write.  After power loss, 
some of those writes will appear to never have happened.  The atomic writes 
that were written to medium must have completely been written to medium, though 
- power loss is not an excuse to break atomicity.
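
As a concrete illustration, here is a minimal user-space sketch of issuing a WRITE ATOMIC(16)-style command with the FUA bit set through the Linux SG_IO ioctl. The opcode value (0x9C) and CDB layout are taken from the draft proposal and should be treated as assumptions, not a tested implementation:

#include <stdint.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

static int write_atomic_fua(int fd, uint64_t lba, void *buf, uint16_t nblocks)
{
    uint8_t cdb[16] = { 0 };
    uint8_t sense[32];
    struct sg_io_hdr io = { 0 };
    int i;

    cdb[0] = 0x9C;              /* WRITE ATOMIC(16) opcode per the draft (assumed) */
    cdb[1] = 1 << 3;            /* FUA bit, same position as in WRITE(16) */
    for (i = 0; i < 8; i++)     /* big-endian 64-bit starting LBA */
        cdb[2 + i] = lba >> (8 * (7 - i));
    cdb[12] = nblocks >> 8;     /* big-endian transfer length in blocks */
    cdb[13] = nblocks & 0xff;

    io.interface_id = 'S';
    io.cmd_len = sizeof(cdb);
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_TO_DEV;
    io.dxfer_len = nblocks * 512;   /* assumes 512-byte logical blocks */
    io.dxferp = buf;
    io.mx_sb_len = sizeof(sense);
    io.sbp = sense;
    io.timeout = 10000;             /* milliseconds */

    return ioctl(fd, SG_IO, &io);
}

With FUA set to one, the command completes only after the atomic write has reached the medium; with FUA set to zero, the volatile-cache semantics described above apply.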

---
Rob Elliott
HP Server Storage



 -Original Message-
 From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
 ow...@vger.kernel.org] On Behalf Of Ric Wheeler
 Sent: Thursday, 04 July, 2013 7:35 AM
 To: Vladislav Bolkhovitin
 Cc: Chris Mason; James Bottomley; Martin K. Petersen; linux-
 s...@vger.kernel.org
 Subject: Re: atomic write & T10 standards
 
 On 07/03/2013 11:18 PM, Vladislav Bolkhovitin wrote:
  Ric Wheeler, on 07/03/2013 11:31 AM wrote:
  Journals are normally big (128MB or so?) - I don't think that this is
 unique to xfs.
  We're mixing a bunch of concepts here.  The filesystems have a lot of
  different requirements, and atomics are just one small part.
 
  Creating a new file often uses resources freed by past files.  So
  deleting the old must be ordered against allocating the new.  They are
  really separate atomic units but you can't handle them completely
  independently.
 
  If our existing journal commit is:
 
  * write the data blocks for a transaction
  * flush
  * write the commit block for the transaction
  * flush
 
  Which part of this does an atomic write help?
 
  We would still need at least:
 
  * atomic write of data blocks & commit blocks
  * flush
  Not necessary.

  Consider a case when you are creating many small files in a big directory. Consider that every such operation needs 3 actions: add a new directory entry, get free space, and write data there. If one atomic (scattered) write command is used for each operation and you order them against each other, if needed, in some way, e.g. by using the ORDERED SCSI attribute or queue draining, you don't need any intermediate flushes. Only one final flush would be sufficient. In case of a crash some of the new files would simply disappear, but everything would be fully consistent, so the only recovery needed would be to recreate them.
 
 The worry I have is that we then have this intermediate state where we have sent the array down a scattered IO which is marked as atomic. Can we trust the array to lose all of those parts on power failure or lose none of them before we send down a queue flush of some kind?

 Not to mention we still end up having to persist a broader range of data than we would otherwise need.

 An even worse nightmare would be sending down atomic scattered write A, followed by atomic scattered write B, ..., scattered atomic write Y - all without a sync followed by a crash. What semantics or ordering promises do we have in this case if the power drops? Is there a promise that they are durable in the sequence sent to the target, or could we end up with a write B and not a write A after a crash?
 
 
 The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way.

 I still see it would be useful to have the atomic write really be atomic and durable just for that IO - no flush needed.

 Can you give a sequence for the use case for the non-durable atomic write that would not need a sync?
 See above.
 
 Your above example still had a flush (or use of ORDERED SCSI commands).
 
 
 Can we really trust all devices to make something atomic that is not durable :) ?
 Sure, if the application allows that and the atomicity property itself is durable, why not?
 
  Vlad
 
 P.S. With atomic writes there's no need for a journal, no?
 
 Durable and atomic are not the same - we need to make sure that the specification is clear and that the behaviours are uniform (mandated) if we can make use of them. We have been burnt in the past by things like the TRIM command leaving stale data, for example, by some vendor and not others (leading to an update of the spec :))

Re: atomic write & T10 standards

2013-07-05 Thread Ric Wheeler

On 07/05/2013 11:34 AM, Elliott, Robert (Server Storage) wrote:

The proposed SCSI atomic commands - WRITE ATOMIC, READ ATOMIC, WRITE SCATTERED, 
and READ GATHERED - all include FUA (force unit access) bits, just like other 
WRITE and READ commands.  Also, the SYNCHRONIZE CACHE command affects atomic 
writes just like non-atomic writes.

With the FUA bit set to zero (don't force), if logical block data from an 
atomic write is stuck in a volatile write cache (not yet written to the 
medium), then:
a) reads before a power loss return all of the logical block data from that 
atomic write; and
b) reads after a power loss return none of the logical block data from that 
atomic write.

Someone using a drive with a volatile write cache without setting FUA to one or 
using SYNCHRONIZE CACHE is accepting that any number of writes (atomic or 
non-atomic) may be lost on power loss.  A common example use case is video 
editing.  Before power loss, the atomic promises are honored; reads won't 
return part of the logical block data from an atomic write.  After power loss, 
some of those writes will appear to never have happened.  The atomic writes 
that were written to medium must have completely been written to medium, though 
- power loss is not an excuse to break atomicity.

---
Rob Elliott
HP Server Storage



Thanks for filling in the details of the specification. I think that this 
answers all of my questions,


Ric





Re: atomic write & T10 standards

2013-07-04 Thread Ric Wheeler

On 07/03/2013 11:18 PM, Vladislav Bolkhovitin wrote:

Ric Wheeler, on 07/03/2013 11:31 AM wrote:

Journals are normally big (128MB or so?) - I don't think that this is unique to 
xfs.

We're mixing a bunch of concepts here.  The filesystems have a lot of
different requirements, and atomics are just one small part.

Creating a new file often uses resources freed by past files.  So
deleting the old must be ordered against allocating the new.  They are
really separate atomic units but you can't handle them completely
independently.


If our existing journal commit is:

* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:

* atomic write of data blocks & commit blocks
* flush

Not necessary.

Consider a case when you are creating many small files in a big directory. Consider that every such operation needs 3 actions: add a new directory entry, get free space, and write data there. If one atomic (scattered) write command is used for each operation and you order them against each other, if needed, in some way, e.g. by using the ORDERED SCSI attribute or queue draining, you don't need any intermediate flushes. Only one final flush would be sufficient. In case of a crash some of the new files would simply disappear, but everything would be fully consistent, so the only recovery needed would be to recreate them.


The worry I have is that we then have this intermediate state where we have sent 
the array down a scattered IO which is marked as atomic. Can we trust the array 
to lose all of those parts on power failure or lose none of them before we send 
down a queue flush of some kind?


Not to mention we still end up having to persist a broader range of data than we 
would otherwise need.


An even worse nightmare would be sending down atomic scattered write A, followed by 
atomic scattered write B, ..., scattered atomic write Y - all without a sync 
followed by a crash. What semantics or ordering promises do we have in this case 
if the power drops? Is there a promise that they are durable in the sequence 
sent to the target, or could we end up with a write B and not a write A after a 
crash?





The catch is that our current flush mechanisms are still pretty brute force and
act across either the whole device or in a temporal (everything flushed before
this is acked) way.

I still see it would be useful to have the atomic write really be atomic and
durable just for that IO - no flush needed.

Can you give a sequence for the use case for the non-durable atomic write that
would not need a sync?

See above.


Your above example still had a flush (or use of ORDERED SCSI commands).




Can we really trust all devices to make something atomic
that is not durable :) ?

Sure, if the application allows that and the atomicity property itself is durable, 
why not?

Vlad

P.S. With atomic writes there's no need for a journal, no?


Durable and atomic are not the same - we need to make sure that the 
specification is clear and that the behaviours are uniform (mandated) if we can 
make use of them. We have been burnt in the past by things like the TRIM command 
leaving stale data, for example, by some vendor and not others (leading to an 
update of the spec :))


I think that you would need to have durability between the atomic writes in 
order to do away with the journal.


Ric




atomic write & T10 standards

2013-07-03 Thread Ric Wheeler

On 07/03/2013 11:00 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 10:56 -0400, Ric Wheeler wrote:

On 07/03/2013 10:38 AM, Chris Mason wrote:

Quoting Ric Wheeler (2013-07-03 10:34:04)

As I was out walking Skeeter this morning, I was thinking a bit about the new
T10 atomic write proposal that Chris spoke about some time back.

Specifically, I think that we would see value only if the atomic write was
also durable - if not, we need to always issue a SYNCHRONIZE_CACHE command which
would mean it really is not effectively more useful than a normal write?

Did I understand the proposal correctly?  If I did, should we poke the usual T10
posse to nudge them (David Black, Fred Knight, etc?)...

I don't think the atomic writes should be a special case here.  We've
already got the cache flush and fua machinery and should just apply it
on top of the atomic constructs...

-chris


I should have sent this to the linux-scsi list I suppose, but wanted clarity
before embarrassing myself :)

Yes, it is better to have a wider audience


Adding in linux-scsi




If we have to use fua/flush after an atomic write, what makes it atomic?  Why
not just use a normal write?

It does not seem to add anything that write + flush/fua does?

It adds the all or nothing that we can use to commit journal entries
without having to worry about atomicity.  The guarantee is that
everything makes it or nothing does.


I still don't see the difference in write + SYNC_CACHE versus atomic write + 
SYNC_CACHE.


If the write is atomic and not durable, it is not really usable as a hard 
promise until after we flush it somehow.


In theory, if we got ordered tags working to ensure transaction vs data
ordering, this would mean we wouldn't have to flush at all because the
disk image would always be journal consistent ... a bit like the old
soft update scheme.

James



Why not have the atomic write actually imply that it is atomic and durable for 
just that command?


Ric



Re: atomic write & T10 standards

2013-07-03 Thread James Bottomley
On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
 On 07/03/2013 11:00 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 10:56 -0400, Ric Wheeler wrote:
  On 07/03/2013 10:38 AM, Chris Mason wrote:
  Quoting Ric Wheeler (2013-07-03 10:34:04)
  As I was out walking Skeeter this morning, I was thinking a bit about the new T10 atomic write proposal that Chris spoke about some time back.
 
  Specifically, I think that we would see value only if the atomic write was also durable - if not, we need to always issue a SYNCHRONIZE_CACHE command which would mean it really is not effectively more useful than a normal write?
 
  Did I understand the proposal correctly?  If I did, should we poke the usual T10 posse to nudge them (David Black, Fred Knight, etc?)...
  I don't think the atomic writes should be a special case here.  We've
  already got the cache flush and fua machinery and should just apply it
  on top of the atomic constructs...
 
  -chris
 
  I should have sent this to the linux-scsi list I suppose, but wanted clarity before embarrassing myself :)
  Yes, it is better to have a wider audience
 
 Adding in linux-scsi
 
 
  If we have to use fua/flush after an atomic write, what makes it atomic?  Why not just use a normal write?
 
  It does not seem to add anything that write + flush/fua does?
  It adds the all or nothing that we can use to commit journal entries
  without having to worry about atomicity.  The guarantee is that
  everything makes it or nothing does.
 
 I still don't see the difference in write + SYNC_CACHE versus atomic write + 
 SYNC_CACHE.
 
 If the write is atomic and not durable, it is not really usable as a hard 
 promise until after we flush it somehow.
 
  In theory, if we got ordered tags working to ensure transaction vs data
  ordering, this would mean we wouldn't have to flush at all because the
  disk image would always be journal consistent ... a bit like the old
  soft update scheme.
 
  James
 
 
 Why not have the atomic write actually imply that it is atomic and durable for just that command?

I don't understand why you think you need guaranteed durability for
every journal transaction?  That's what causes us performance problems
because we have to pause on every transaction commit.

We require durability for explicit flushes, obviously, but we could
achieve far better performance if we could just let the filesystem
updates stream to the disk and rely on atomic writes making sure the
journal entries were all correct.  The reason we require durability for
journal entries today is to ensure caching effects don't cause the
journal to lie or be corrupt.

James
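
A sketch of the streaming scheme this describes, with every interface name hypothetical (submit_ordered() stands in for whatever would eventually expose ordered tags, write_atomic() for the proposed atomic command); it shows where the flush would and would not appear:

#include <stddef.h>
#include <sys/types.h>

struct txn { const void *buf; size_t len; off_t off; };

/* Hypothetical: device applies writes in submission order (ordered
 * tags), but makes no promise about when they reach the medium. */
extern int submit_ordered(int fd, const struct txn *t);
/* Hypothetical: all-or-nothing write of one journal entry. */
extern int write_atomic(int fd, const struct txn *t);
/* Hypothetical: full cache flush, only for an explicit fsync. */
extern int cache_flush(int fd);

static int stream_transaction(int fd, const struct txn *data,
                              const struct txn *journal_entry)
{
    if (submit_ordered(fd, data))       /* data streams out first... */
        return -1;
    /* ...then the journal entry, atomic so it can never be half-written.
     * No flush here: the on-disk image stays journal-consistent. */
    return write_atomic(fd, journal_entry);
}

cache_flush() would be issued only where the application actually asked for durability, not on every transaction commit.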




Re: atomic write & T10 standards

2013-07-03 Thread James Bottomley
On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
 On 07/03/2013 11:22 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
  Why not have the atomic write actually imply that it is atomic and durable for just that command?
  I don't understand why you think you need guaranteed durability for
  every journal transaction?  That's what causes us performance problems
  because we have to pause on every transaction commit.
 
  We require durability for explicit flushes, obviously, but we could
  achieve far better performance if we could just let the filesystem
  updates stream to the disk and rely on atomic writes making sure the
  journal entries were all correct.  The reason we require durability for
  journal entries today is to ensure caching effects don't cause the
  journal to lie or be corrupt.
 
 Why would we use atomic writes for things that don't need to be
 durable?
 
 Avoiding a torn page write seems to be the only real difference here if you use the atomic operations and don't have durability...

It's not just about torn pages: Journal entries are big complex beasts.
They can be megabytes big (at least on xfs).  If we can guarantee all or
nothing atomicity in the entire journal entry write it permits a more
streaming design of the filesystem writeout path.

James
 



Re: atomic write & T10 standards

2013-07-03 Thread Ric Wheeler

On 07/03/2013 11:37 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:

On 07/03/2013 11:22 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:

Why not have the atomic write actually imply that it is atomic and durable for
just that command?

I don't understand why you think you need guaranteed durability for
every journal transaction?  That's what causes us performance problems
because we have to pause on every transaction commit.

We require durability for explicit flushes, obviously, but we could
achieve far better performance if we could just let the filesystem
updates stream to the disk and rely on atomic writes making sure the
journal entries were all correct.  The reason we require durability for
journal entries today is to ensure caching effects don't cause the
journal to lie or be corrupt.

Why would we use atomic writes for things that don't need to be
durable?

Avoiding a torn page write seems to be the only real difference here if
you use the atomic operations and don't have durability...

It's not just about torn pages: Journal entries are big complex beasts.
They can be megabytes big (at least on xfs).  If we can guarantee all or
nothing atomicity in the entire journal entry write it permits a more
streaming design of the filesystem writeout path.

James
  



Journals are normally big (128MB or so?) - I don't think that this is unique to 
xfs.

If our existing journal commit is:

* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:

* atomic write of data blocks & commit blocks
* flush

Right?

Ric
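
For concreteness, a userspace-analogy sketch of the two sequences being compared, with pwrite/fdatasync standing in for write/flush and write_atomic_scattered() a hypothetical stand-in for the proposed command:

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Classic commit: the first flush orders data before the commit block,
 * the second makes the commit block durable. */
static int commit_classic(int fd,
                          const void *data, size_t dlen, off_t doff,
                          const void *cblk, size_t clen, off_t coff)
{
    if (pwrite(fd, data, dlen, doff) != (ssize_t)dlen) return -1;
    if (fdatasync(fd)) return -1;       /* data blocks stable first */
    if (pwrite(fd, cblk, clen, coff) != (ssize_t)clen) return -1;
    return fdatasync(fd);               /* commit block stable */
}

/* Hypothetical: one all-or-nothing scattered write covering both the
 * data blocks and the commit block. */
extern int write_atomic_scattered(int fd, const struct iovec *iov, int cnt);

/* With the atomic command the ordering flush disappears; the one
 * remaining flush is purely for durability of the whole transaction. */
static int commit_atomic(int fd, const struct iovec *iov, int cnt)
{
    if (write_atomic_scattered(fd, iov, cnt)) return -1;
    return fdatasync(fd);
}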



Re: atomic write & T10 standards

2013-07-03 Thread Chris Mason
Quoting Ric Wheeler (2013-07-03 11:42:38)
 On 07/03/2013 11:37 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
  On 07/03/2013 11:22 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
   Why not have the atomic write actually imply that it is atomic and durable for just that command?
  I don't understand why you think you need guaranteed durability for
  every journal transaction?  That's what causes us performance problems
  because we have to pause on every transaction commit.
 
  We require durability for explicit flushes, obviously, but we could
  achieve far better performance if we could just let the filesystem
  updates stream to the disk and rely on atomic writes making sure the
  journal entries were all correct.  The reason we require durability for
  journal entries today is to ensure caching effects don't cause the
  journal to lie or be corrupt.
  Why would we use atomic writes for things that don't need to be
  durable?
 
   Avoiding a torn page write seems to be the only real difference here if you use the atomic operations and don't have durability...
  It's not just about torn pages: Journal entries are big complex beasts.
  They can be megabytes big (at least on xfs).  If we can guarantee all or
  nothing atomicity in the entire journal entry write it permits a more
  streaming design of the filesystem writeout path.
 
  James

 
 
 Journals are normally big (128MB or so?) - I don't think that this is unique 
 to xfs.

We're mixing a bunch of concepts here.  The filesystems have a lot of
different requirements, and atomics are just one small part.

Creating a new file often uses resources freed by past files.  So
deleting the old must be ordered against allocating the new.  They are
really separate atomic units but you can't handle them completely
independently.

 
 If our existing journal commit is:
 
 * write the data blocks for a transaction
 * flush
 * write the commit block for the transaction
 * flush
 
  Which part of this does an atomic write help?
 
 We would still need at least:
 
  * atomic write of data blocks & commit blocks
 * flush

Yes.  But just because we need the flush here doesn't mean we need the
flush for every single atomic write.

-chris



Re: atomic write & T10 standards

2013-07-03 Thread Ric Wheeler

On 07/03/2013 11:54 AM, Chris Mason wrote:

Quoting Ric Wheeler (2013-07-03 11:42:38)

On 07/03/2013 11:37 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:

On 07/03/2013 11:22 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:

Why not have the atomic write actually imply that it is atomic and durable for
just that command?

I don't understand why you think you need guaranteed durability for
every journal transaction?  That's what causes us performance problems
because we have to pause on every transaction commit.

We require durability for explicit flushes, obviously, but we could
achieve far better performance if we could just let the filesystem
updates stream to the disk and rely on atomic writes making sure the
journal entries were all correct.  The reason we require durability for
journal entries today is to ensure caching effects don't cause the
journal to lie or be corrupt.

Why would we use atomic writes for things that don't need to be
durable?

Avoiding a torn page write seems to be the only real difference here if
you use the atomic operations and don't have durability...

It's not just about torn pages: Journal entries are big complex beasts.
They can be megabytes big (at least on xfs).  If we can guarantee all or
nothing atomicity in the entire journal entry write it permits a more
streaming design of the filesystem writeout path.

James
   


Journals are normally big (128MB or so?) - I don't think that this is unique to 
xfs.

We're mixing a bunch of concepts here.  The filesystems have a lot of
different requirements, and atomics are just one small part.

Creating a new file often uses resources freed by past files.  So
deleting the old must be ordered against allocating the new.  They are
really separate atomic units but you can't handle them completely
independently.


If our existing journal commit is:

* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:

* atomic write of data blocks & commit blocks
* flush

Yes.  But just because we need the flush here doesn't mean we need the
flush for every single atomic write.

-chris



The catch is that our current flush mechanisms are still pretty brute force and 
act across either the whole device or in a temporal (everything flushed before 
this is acked) way.


I still see it would be useful to have the atomic write really be atomic and 
durable just for that IO - no flush needed.


Can you give a sequence for the use case for the non-durable atomic write that 
would not need a sync? Can we really trust all devices to make something atomic 
that is not durable :) ?


thanks!

ric




Re: atomic write & T10 standards

2013-07-03 Thread Chris Mason
Quoting Ric Wheeler (2013-07-03 14:31:59)
 On 07/03/2013 11:54 AM, Chris Mason wrote:
  Quoting Ric Wheeler (2013-07-03 11:42:38)
  On 07/03/2013 11:37 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
  On 07/03/2013 11:22 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
   Why not have the atomic write actually imply that it is atomic and durable for just that command?
  I don't understand why you think you need guaranteed durability for
  every journal transaction?  That's what causes us performance problems
  because we have to pause on every transaction commit.
 
  We require durability for explicit flushes, obviously, but we could
  achieve far better performance if we could just let the filesystem
  updates stream to the disk and rely on atomic writes making sure the
  journal entries were all correct.  The reason we require durability for
  journal entries today is to ensure caching effects don't cause the
  journal to lie or be corrupt.
  Why would we use atomic writes for things that don't need to be
  durable?
 
   Avoiding a torn page write seems to be the only real difference here if you use the atomic operations and don't have durability...
  It's not just about torn pages: Journal entries are big complex beasts.
  They can be megabytes big (at least on xfs).  If we can guarantee all or
  nothing atomicity in the entire journal entry write it permits a more
  streaming design of the filesystem writeout path.
 
  James
 
 
  Journals are normally big (128MB or so?) - I don't think that this is 
  unique to xfs.
  We're mixing a bunch of concepts here.  The filesystems have a lot of
  different requirements, and atomics are just one small part.
 
  Creating a new file often uses resources freed by past files.  So
  deleting the old must be ordered against allocating the new.  They are
  really separate atomic units but you can't handle them completely
  independently.
 
  If our existing journal commit is:
 
  * write the data blocks for a transaction
  * flush
  * write the commit block for the transaction
  * flush
 
  Which part of this does an atomic write help?
 
  We would still need at least:
 
  * atomic write of data blocks & commit blocks
  * flush
  Yes.  But just because we need the flush here doesn't mean we need the
  flush for every single atomic write.
 
  -chris
 
 
 The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way.

This is only partially true, since you're extending the sata drive model
into atomics, and the devices implementing atomics (so far anyway)
are not sata.

 
 I still see it would be useful to have the atomic write really be atomic and 
 durable just for that IO - no flush needed.

In sata speak, it could go down as atomic + FUA + NCQ.  In practice this
is going to be in fusionio, nvme devices and big storage arrays, all of
which we can expect to have proper knobs for lies about IO that isn't
really done yet.

 
 Can you give a sequence for the use case for the non-durable atomic write that would not need a sync? Can we really trust all devices to make something atomic that is not durable :) ?

Today's usage is mostly O_DIRECT, which really should be FUA.  Long term
we can hope people will find more interesting uses.
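
In userspace terms that maps to opening with O_DIRECT | O_DSYNC, as in this sketch (path and sizes illustrative; whether the kernel satisfies O_DSYNC with a FUA write or with a write plus cache flush is a kernel/device detail):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    /* O_DIRECT alone is not a durability promise; O_DSYNC makes each
     * write durable before it returns.  The filesystem must support
     * O_DIRECT for the open to succeed. */
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0600);

    if (fd < 0)
        return 1;
    if (posix_memalign(&buf, 4096, 4096))  /* O_DIRECT wants aligned I/O */
        return 1;
    memset(buf, 0xab, 4096);
    if (pwrite(fd, buf, 4096, 0) != 4096)  /* durable when this returns */
        return 1;
    free(buf);
    return close(fd);
}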

Either way the point is that an atomic write is a grouping mechanism,
and if the standards people want to control fuaness in a separate bit,
that's really fine.

-chris



Re: atomic write & T10 standards

2013-07-03 Thread Ric Wheeler

On 07/03/2013 02:54 PM, Chris Mason wrote:

Quoting Ric Wheeler (2013-07-03 14:31:59)

On 07/03/2013 11:54 AM, Chris Mason wrote:

Quoting Ric Wheeler (2013-07-03 11:42:38)

On 07/03/2013 11:37 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:

On 07/03/2013 11:22 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:

Why not have the atomic write actually imply that it is atomic and durable for
just that command?

I don't understand why you think you need guaranteed durability for
every journal transaction?  That's what causes us performance problems
because we have to pause on every transaction commit.

We require durability for explicit flushes, obviously, but we could
achieve far better performance if we could just let the filesystem
updates stream to the disk and rely on atomic writes making sure the
journal entries were all correct.  The reason we require durability for
journal entries today is to ensure caching effects don't cause the
journal to lie or be corrupt.

Why would we use atomic writes for things that don't need to be
durable?

Avoiding a torn page write seems to be the only real difference here if
you use the atomic operations and don't have durability...

It's not just about torn pages: Journal entries are big complex beasts.
They can be megabytes big (at least on xfs).  If we can guarantee all or
nothing atomicity in the entire journal entry write it permits a more
streaming design of the filesystem writeout path.

James



Journals are normally big (128MB or so?) - I don't think that this is unique to 
xfs.

We're mixing a bunch of concepts here.  The filesystems have a lot of
different requirements, and atomics are just one small part.

Creating a new file often uses resources freed by past files.  So
deleting the old must be ordered against allocating the new.  They are
really separate atomic units but you can't handle them completely
independently.


If our existing journal commit is:

* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:

* atomic write of data blocks & commit blocks
* flush

Yes.  But just because we need the flush here doesn't mean we need the
flush for every single atomic write.

-chris


The catch is that our current flush mechanisms are still pretty brute force and
act across either the whole device or in a temporal (everything flushed before
this is acked) way.

This is only partially true, since you're extending the sata drive model
into atomics, and the devices implementing atomics (so far anyway)
are not sata.


I still see it would be useful to have the atomic write really be atomic and
durable just for that IO - no flush needed.

In sata speak, it could go down as atomic + FUA + NCQ.  In practice this
is going to be in fusionio, nvme devices and big storage arrays, all of
which we can expect to have proper knobs for lies about IO that isn't
really done yet.


Can you give a sequence for the use case for the non-durable atomic write that
would not need a sync? Can we really trust all devices to make something atomic
that is not durable :) ?

Today's usage is mostly O_DIRECT, which really should be FUA.  Long term
we can hope people will find more interesting uses.

Either way the point is that an atomic write is a grouping mechanism,
and if the standards people want to control fuaness in a separate bit,
that's really fine.

-chris



That makes sense to me - happy to have a bit to indicate durability in the atomic operation...


Ric



Re: atomic write & T10 standards

2013-07-03 Thread Vladislav Bolkhovitin
Ric Wheeler, on 07/03/2013 11:31 AM wrote:
 Journals are normally big (128MB or so?) - I don't think that this is 
 unique to xfs.
 We're mixing a bunch of concepts here.  The filesystems have a lot of
 different requirements, and atomics are just one small part.

 Creating a new file often uses resources freed by past files.  So
 deleting the old must be ordered against allocating the new.  They are
 really separate atomic units but you can't handle them completely
 independently.

 If our existing journal commit is:

 * write the data blocks for a transaction
 * flush
 * write the commit block for the transaction
 * flush

 Which part of this does an atomic write help?

 We would still need at least:

 * atomic write of data blocks & commit blocks
 * flush

Not necessary.

Consider a case when you are creating many small files in a big directory. Consider that every such operation needs 3 actions: add a new directory entry, get free space, and write data there. If one atomic (scattered) write command is used for each operation and you order them against each other, if needed, in some way, e.g. by using the ORDERED SCSI attribute or queue draining, you don't need any intermediate flushes. Only one final flush would be sufficient. In case of a crash some of the new files would simply disappear, but everything would be fully consistent, so the only recovery needed would be to recreate them.
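
A sketch of that batching scheme, with atomic_scattered_ordered() as a hypothetical command that issues one atomic scattered write queued with the ORDERED task attribute (every name and the three-piece layout are illustrative):

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical: one WRITE SCATTERED-style command, all-or-nothing,
 * applied by the device after everything submitted before it. */
extern int atomic_scattered_ordered(int fd, const struct iovec *iov,
                                    const off_t *off, int cnt);

struct file_create {
    struct iovec vec[3];    /* [0] directory entry block,
                               [1] allocation bitmap update,
                               [2] the file data itself */
    off_t off[3];
};

static int create_many(int fd, const struct file_create *fc, int nfiles)
{
    int i;

    /* One atomic unit per file and no per-file flush: after a crash
     * each create either fully happened or never happened. */
    for (i = 0; i < nfiles; i++)
        if (atomic_scattered_ordered(fd, fc[i].vec, fc[i].off, 3))
            return -1;
    return fdatasync(fd);   /* the single final flush */
}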

 The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way.
 
 I still see it would be useful to have the atomic write really be atomic and 
 durable just for that IO - no flush needed.
 
 Can you give a sequence for the use case for the non-durable atomic write that would not need a sync?

See above.

 Can we really trust all devices to make something atomic 
 that is not durable :) ?

Sure, if the application allows that and the atomicity property itself is durable, 
why not?

Vlad

P.S. With atomic writes there's no need for a journal, no?