Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread Sage Weil
On Tue, 27 Nov 2012, Sam Lang wrote:
 Hi Noah,
 
 I was able to reproduce your issue with a similar test using the fuse client
 and the clock_offset option for the mds.  This is what I see happening:
 
 clientA's clock is a few seconds behind the mds clock
 
 clientA creates the file
 - the mds sets the mtime from its current time
 - clientA acquires the exclusive capability (cap) for the file
 
 clientA writes to the file
 - the mtime is updated locally (at clientA with its current time)
 
 clientA closes the file
 - the exclusive cap is flushed to the mds, but the mtime is less
   than the create mtime because of the clock skew, so the mds
   doesn't update it to the mtime from clientA's write
 
 clientA stats the file
 - the mtime from the write (still cached) gets returned.  I saw a
   race in my tests, where sometimes the mtime was from the cache
   (if the flush hadn't completed I assume), and sometimes it was
   from the mds.
 
 clientB stats the file
 - the exclusive cap is revoked at clientA, but the mtime returned
   to clientB is from the mds
 
 The goal of the current implementation is to provide an mtime that is
 non-decreasing, but that conflicts with using mtime as a version in this case.
 Using mtime as a version has its own set of problems, but I won't go into that
 here.  I think there are a few alternatives if we want to try to have a more
 consistent mtime value across clients.
 
 1. Let the client set the create mtime.  This avoids the issue that the mds
 and client clocks are out of sync, but in other cases where the client has a
 clock a few seconds ahead of other clients, we run into a similar problem.
 This might be reasonable considering clients that share state will more likely
 have synchronized clocks than the clients and mds.

I like this option the best.  It will clearly break when client clocks are 
out of sync and multiple clients write to the file, but I think that is 
the price you pay for client-driven writeback.  

Noah, is that sufficient to resolve the hadoop race?  Is there a single 
client writer?

 2. Provide a config option to always set the mtime on cap flush/revoke, even
if it's less than the current mtime.  This breaks the non-decreasing behavior,
 and requires the user set a config option across the cluster if they want
 this.

We could also do this... but if the above is sufficient for hadoop I'd 
rather not.  :/
 
 3. When a client acquires the cap for a file, have the mds provide its current
 time as well.  As the client updates the mtime, it uses the timestamp provided
 by the mds and the time since the cap was acquired.
 Except for the skew caused by the message latency, this approach allows the
 mtime to be based off the mds time, so it will be consistent across clients
 and the mds.  It does however, allow a client to set an mtime to the future
 (based off of its local time), which might be undesirable, but that is more
 like how  NFS behaves.  Message latency probably won't be much of an issue
 either, as the granularity of mtime is a second. Also, the client can set its
 cap acquired timestamp to the time at which the cap was requested, ensuring
 that the relative increment includes the round trip latency so that the mtime
 will always be set further ahead. Of course, this approach would be a lot more
 intrusive to implement. :-)

Yeah, I'm less excited about this one.

I think that giving consistent behavior from a single client despite clock 
skew is a good goal.  That will make things like pjd's test behave 
consistently, for example.

sage

 
 -sam
 
 
 On 11/20/2012 01:44 PM, Noah Watkins wrote:
  This is a description of the clock synchronization issue we are facing
  in Hadoop:
  
  Components of Hadoop use mtime as a versioning mechanism. Here is an
  example where Client B tests the expected 'version' of a file created
  by Client A:
  
 Client A: create file, write data into file.
Client A: expected_mtime <-- lstat(file)
 Client A: broadcast expected_mtime to client B
 ...
Client B: mtime <-- lstat(file)
 Client B: test expected_mtime == mtime
  
  Since mtime may be set in Ceph by both client and MDS, inconsistent
  mtime view is possible when clocks are not adequately synchronized.
  
  Here is a test that reproduces the problem. In the following output,
  issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
  set to an hour earlier than the MDS node.
  
nwatkins@issdm-22:~$ ssh issdm-18 date && ./test
  Tue Nov 20 11:40:28 PST 2012   // MDS TIME
  local time: Tue Nov 20 10:42:47 2012  // Client TIME
  fstat time: Tue Nov 20 11:40:28 2012  // mtime seen after file
  creation (MDS time)
  lstat time: Tue Nov 20 10:42:47 2012  // mtime seen after file write
  (client time)
  
  Here is the code used to produce that output.
  
#include <errno.h>
#include <sys/fcntl.h>
#include <sys/time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread Gregory Farnum
On Tuesday, November 27, 2012 at 8:45 AM, Sam Lang wrote:
  
 Hi Noah,
  
 I was able to reproduce your issue with a similar test using the fuse  
 client and the clock_offset option for the mds. This is what I see  
 happening:
  
 clientA's clock is a few seconds behind the mds clock
  
 clientA creates the file
 - the mds sets the mtime from its current time
 - clientA acquires the exclusive capability (cap) for the file
  
 clientA writes to the file
 - the mtime is updated locally (at clientA with its current time)
  
 clientA closes the file
 - the exclusive cap is flushed to the mds, but the mtime is less
 than the create mtime because of the clock skew, so the mds
 doesn't update it to the mtime from clientA's write
  
 clientA stats the file
 - the mtime from the write (still cached) gets returned. I saw a
 race in my tests, where sometimes the mtime was from the cache
 (if the flush hadn't completed I assume), and sometimes it was
 from the mds.
  
 clientB stats the file
 - the exclusive cap is revoked at clientA, but the mtime returned
 to clientB is from the mds

Hurray, I think we all agree about what's happening now! :)

Have you checked to see if the MDS ever sets mtime after create, or is it 
always dictated by the client following that?
  
  
 The goal of the current implementation is to provide an mtime that is  
 non-decreasing, but that conflicts with using mtime as a version in this  
 case. Using mtime as a version has its own set of problems, but I won't  
 go into that here. I think there are a few alternatives if we want to  
 try to have a more consistent mtime value across clients.
  
 1. Let the client set the create mtime. This avoids the issue that the  
 mds and client clocks are out of sync, but in other cases where the  
 client has a clock a few seconds ahead of other clients, we run into a  
 similar problem. This might be reasonable considering clients that  
 share state will more likely have synchronized clocks than the clients  
 and mds.
  
 2. Provide a config option to always set the mtime on cap flush/revoke,  
even if it's less than the current mtime. This breaks the non-decreasing
 behavior, and requires the user set a config option across the cluster  
 if they want this.
  
 3. When a client acquires the cap for a file, have the mds provide its  
 current time as well. As the client updates the mtime, it uses the  
 timestamp provided by the mds and the time since the cap was acquired.
 Except for the skew caused by the message latency, this approach allows  
 the mtime to be based off the mds time, so it will be consistent across  
 clients and the mds. It does however, allow a client to set an mtime to  
 the future (based off of its local time), which might be undesirable,  
 but that is more like how NFS behaves. Message latency probably won't  
 be much of an issue either, as the granularity of mtime is a second.  
 Also, the client can set its cap acquired timestamp to the time at which  
 the cap was requested, ensuring that the relative increment includes the  
 round trip latency so that the mtime will always be set further ahead.  
 Of course, this approach would be a lot more intrusive to implement. :-)

I actually like this third approach of letting the MDS be authoritative about 
time even if it's not directly involved. Given that, I wonder if perhaps the 
client should just have time translation functions it uses everywhere?
However, the problem with that is that the different MDS daemons might also 
disagree about time. Perhaps they could adopt the master MDS clock or something 
skanky like that… :/

The fundamental issue of time resolution is why things are the way they are 
now, of course — you usually don't want things going backwards in time, but 
clock skews are a real problem in large clusters so we just decided to not let 
things go back on the assumption it would be a temporary and little-noticed 
disparity, with easy-to-understand behavior. Obviously we were wrong about it 
being little-noticed.
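
A minimal sketch of option 3's anchoring, for concreteness (editor's
illustration: utime_t and ceph_clock_now() are real Ceph primitives, but
this helper and both of its arguments are hypothetical, not actual
Client code):

#include "include/utime.h"
#include "common/Clock.h"

// Derive "now" from the MDS clock instead of the local one.
// mds_time_at_grant: MDS time that would be sent along with the cap grant.
// local_time_at_request: local clock sampled when the cap was *requested*,
// so the round-trip latency pads the result forward, never backward.
utime_t mds_anchored_now(CephContext *cct,
                         utime_t mds_time_at_grant,
                         utime_t local_time_at_request)
{
  utime_t elapsed = ceph_clock_now(cct) - local_time_at_request;
  return mds_time_at_grant + elapsed;
}

Every mtime update computed this way stays consistent with the MDS view
across clients, at the cost of possibly sitting slightly in the future
relative to the client's own clock, as Sam notes above.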


Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread David Zafman

On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:

 On Tue, 27 Nov 2012, Sam Lang wrote:
 
 3. When a client acquires the cap for a file, have the mds provide its 
 current
 time as well.  As the client updates the mtime, it uses the timestamp 
 provided
 by the mds and the time since the cap was acquired.
 Except for the skew caused by the message latency, this approach allows the
 mtime to be based off the mds time, so it will be consistent across clients
 and the mds.  It does however, allow a client to set an mtime to the future
 (based off of its local time), which might be undesirable, but that is more
 like how  NFS behaves.  Message latency probably won't be much of an issue
 either, as the granularity of mtime is a second. Also, the client can set its
 cap acquired timestamp to the time at which the cap was requested, ensuring
 that the relative increment includes the round trip latency so that the mtime
 will always be set further ahead. Of course, this approach would be a lot 
 more
 intrusive to implement. :-)
 
 Yeah, I'm less excited about this one.
 
 I think that giving consistent behavior from a single client despite clock 
 skew is a good goal.  That will make things like pjd's test behave 
 consistently, for example.
 

My suggestion is that a client writing to a file will try to use its local 
clock unless it would cause the mtime to go backward.  In that case it will 
simply perform the minimum mtime advance possible (1 second?).  This handles 
the case in which one client created a file using his clock (per previous 
suggested change), then another client writes with a clock that is behind.

David
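
A minimal sketch of David's rule (editor's illustration; the helper is
hypothetical, and it assumes the sub-second resolution Sage raises in
his reply below):

#include "include/utime.h"
#include "common/Clock.h"

// Use the local clock unless that would move mtime backward; in that
// case make the smallest possible forward step instead.
utime_t next_mtime(CephContext *cct, const utime_t &current_mtime)
{
  utime_t now = ceph_clock_now(cct);
  if (now > current_mtime)
    return now;                          // clocks agree well enough
  return current_mtime + utime_t(0, 1);  // minimum advance: 1 ns
}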



Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread Sam Lang

On 11/27/2012 11:07 AM, Gregory Farnum wrote:

On Tuesday, November 27, 2012 at 8:45 AM, Sam Lang wrote:


Hi Noah,

I was able to reproduce your issue with a similar test using the fuse
client and the clock_offset option for the mds. This is what I see
happening:

clientA's clock is a few seconds behind the mds clock

clientA creates the file
- the mds sets the mtime from its current time
- clientA acquires the exclusive capability (cap) for the file

clientA writes to the file
- the mtime is updated locally (at clientA with its current time)

clientA closes the file
- the exclusive cap is flushed to the mds, but the mtime is less
than the create mtime because of the clock skew, so the mds
doesn't update it to the mtime from clientA's write

clientA stats the file
- the mtime from the write (still cached) gets returned. I saw a
race in my tests, where sometimes the mtime was from the cache
(if the flush hadn't completed I assume), and sometimes it was
from the mds.

clientB stats the file
- the exclusive cap is revoked at clientA, but the mtime returned
to clientB is from the mds


Hurray, I think we all agree about what's happening now! :)

Have you checked to see if the MDS ever sets mtime after create, or is it 
always dictated by the client following that?


It sets it on truncate as well.
-sam





The goal of the current implementation is to provide an mtime that is
non-decreasing, but that conflicts with using mtime as a version in this
case. Using mtime as a version has its own set of problems, but I won't
go into that here. I think there are a few alternatives if we want to
try to have a more consistent mtime value across clients.

1. Let the client set the create mtime. This avoids the issue that the
mds and client clocks are out of sync, but in other cases where the
client has a clock a few seconds ahead of other clients, we run into a
similar problem. This might be reasonable considering clients that
share state will more likely have synchronized clocks than the clients
and mds.

2. Provide a config option to always set the mtime on cap flush/revoke,
even if it's less than the current mtime. This breaks the non-decreasing
behavior, and requires the user set a config option across the cluster
if they want this.

3. When a client acquires the cap for a file, have the mds provide its
current time as well. As the client updates the mtime, it uses the
timestamp provided by the mds and the time since the cap was acquired.
Except for the skew caused by the message latency, this approach allows
the mtime to be based off the mds time, so it will be consistent across
clients and the mds. It does however, allow a client to set an mtime to
the future (based off of its local time), which might be undesirable,
but that is more like how NFS behaves. Message latency probably won't
be much of an issue either, as the granularity of mtime is a second.
Also, the client can set its cap acquired timestamp to the time at which
the cap was requested, ensuring that the relative increment includes the
round trip latency so that the mtime will always be set further ahead.
Of course, this approach would be a lot more intrusive to implement. :-)


I actually like this third approach of letting the MDS be authoritative about 
time even if it's not directly involved. Given that, I wonder if perhaps the 
client should just have time translation functions it uses everywhere?
However, the problem with that is that the different MDS daemons might also 
disagree about time. Perhaps they could adopt the master MDS clock or something 
skanky like that… :/

The fundamental issue of time resolution is why things are the way they are now, of 
course — you usually don't want things going backwards in time, but clock 
skews are a real problem in large clusters so we just decided to not let things go back 
on the assumption it would be a temporary and little-noticed disparity, with 
easy-to-understand behavior. Obviously we were wrong about it being little-noticed.





Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread Sam Lang

On 11/27/2012 11:03 AM, Sage Weil wrote:

On Tue, 27 Nov 2012, Sam Lang wrote:

Hi Noah,

I was able to reproduce your issue with a similar test using the fuse client
and the clock_offset option for the mds.  This is what I see happening:

clientA's clock is a few seconds behind the mds clock

clientA creates the file
 - the mds sets the mtime from its current time
 - clientA acquires the exclusive capability (cap) for the file

clientA writes to the file
 - the mtime is updated locally (at clientA with its current time)

clientA closes the file
 - the exclusive cap is flushed to the mds, but the mtime is less
   than the create mtime because of the clock skew, so the mds
   doesn't update it to the mtime from clientA's write

clientA stats the file
 - the mtime from the write (still cached) gets returned.  I saw a
   race in my tests, where sometimes the mtime was from the cache
   (if the flush hadn't completed I assume), and sometimes it was
   from the mds.

clientB stats the file
 - the exclusive cap is revoked at clientA, but the mtime returned
   to clientB is from the mds

The goal of the current implementation is to provide an mtime that is
non-decreasing, but that conflicts with using mtime as a version in this case.
Using mtime as a version has its own set of problems, but I won't go into that
here.  I think there are a few alternatives if we want to try to have a more
consistent mtime value across clients.

1. Let the client set the create mtime.  This avoids the issue that the mds
and client clocks are out of sync, but in other cases where the client has a
clock a few seconds ahead of other clients, we run into a similar problem.
This might be reasonable considering clients that share state will more likely
have synchronized clocks than the clients and mds.


I like this option the best.  It will clearly break when client clocks are
out of sync and multiple clients write to the file, but I think that is
the price you pay for client-driven writeback.

Noah, is that sufficient to resolve the hadoop race?  Is there a single
client writer?


If we're looking to just resolve the hadoop case, there's an even less 
intrusive option:  do a fsync before the stat to flush (and wait for) 
the caps.  The reason we can't just close is that the close flushes the 
caps, but doesn't wait for the ack.  I haven't tested this, but in my 
tests the race between stat and close indicates it should work.
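
In libcephfs terms the workaround would look roughly like this (editor's
sketch; error handling elided, and whether this actually closes the race
is exactly what would need testing):

#include <sys/stat.h>
#include <cephfs/libcephfs.h>

/* Flush dirty caps and wait for the ack before sampling mtime, so a
 * later stat from any client sees the value the MDS settled on.  The
 * final 0 asks ceph_fsync to sync metadata as well as data. */
int flush_then_lstat(struct ceph_mount_info *cmount, int fd,
                     const char *path, struct stat *st)
{
    int r = ceph_fsync(cmount, fd, 0);
    if (r < 0)
        return r;
    ceph_close(cmount, fd);
    return ceph_lstat(cmount, path, st);
}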


-sam




2. Provide a config option to always set the mtime on cap flush/revoke, even
if it's less than the current mtime.  This breaks the non-decreasing behavior,
and requires the user set a config option across the cluster if they want
this.


We could also do this... but if the above is sufficient for hadoop I'd
rather not.  :/


3. When a client acquires the cap for a file, have the mds provide its current
time as well.  As the client updates the mtime, it uses the timestamp provided
by the mds and the time since the cap was acquired.
Except for the skew caused by the message latency, this approach allows the
mtime to be based off the mds time, so it will be consistent across clients
and the mds.  It does however, allow a client to set an mtime to the future
(based off of its local time), which might be undesirable, but that is more
like how  NFS behaves.  Message latency probably won't be much of an issue
either, as the granularity of mtime is a second. Also, the client can set its
cap acquired timestamp to the time at which the cap was requested, ensuring
that the relative increment includes the round trip latency so that the mtime
will always be set further ahead. Of course, this approach would be a lot more
intrusive to implement. :-)


Yeah, I'm less excited about this one.

I think that giving consistent behavior from a single client despite clock
skew is a good goal.  That will make things like pjd's test behave
consistently, for example.

sage



-sam


On 11/20/2012 01:44 PM, Noah Watkins wrote:

This is a description of the clock synchronization issue we are facing
in Hadoop:

Components of Hadoop use mtime as a versioning mechanism. Here is an
example where Client B tests the expected 'version' of a file created
by Client A:

Client A: create file, write data into file.
Client A: expected_mtime <-- lstat(file)
Client A: broadcast expected_mtime to client B
...
Client B: mtime <-- lstat(file)
Client B: test expected_mtime == mtime

Since mtime may be set in Ceph by both client and MDS, inconsistent
mtime view is possible when clocks are not adequately synchronized.

Here is a test that reproduces the problem. In the following output,
issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
set to an hour earlier than the MDS node.

nwatkins@issdm-22:~$ ssh issdm-18 date && ./test
Tue Nov 20 11:40:28 PST 2012   // MDS TIME
local time: Tue Nov 20 10:42:47 2012  // Client TIME
fstat time: Tue Nov 20 11:40:28 2012  

Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread Sage Weil
On Tue, 27 Nov 2012, David Zafman wrote:
 
 On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:
 
  On Tue, 27 Nov 2012, Sam Lang wrote:
  
  3. When a client acquires the cap for a file, have the mds provide its 
  current
  time as well.  As the client updates the mtime, it uses the timestamp 
  provided
  by the mds and the time since the cap was acquired.
  Except for the skew caused by the message latency, this approach allows the
  mtime to be based off the mds time, so it will be consistent across clients
  and the mds.  It does however, allow a client to set an mtime to the future
  (based off of its local time), which might be undesirable, but that is more
  like how  NFS behaves.  Message latency probably won't be much of an issue
  either, as the granularity of mtime is a second. Also, the client can set 
  its
  cap acquired timestamp to the time at which the cap was requested, ensuring
  that the relative increment includes the round trip latency so that the 
  mtime
  will always be set further ahead. Of course, this approach would be a lot 
  more
  intrusive to implement. :-)
  
  Yeah, I'm less excited about this one.
  
  I think that giving consistent behavior from a single client despite clock 
  skew is a good goal.  That will make things like pjd's test behave 
  consistently, for example.
  
 
My suggestion is that a client writing to a file will try to use its 
 local clock unless it would cause the mtime to go backward.  In that 
 case it will simply perform the minimum mtime advance possible (1 
 second?).  This handles the case in which one client created a file 
 using his clock (per previous suggested change), then another client 
 writes with a clock that is behind.

That's a possibility (if it's 1ms or 1ns, at least :). We need to verify 
what POSIX says about that, though: if you utimes(2) an mtime into the 
future, what happens on write(2)?

sage
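
That question is easy to poke at on a local filesystem with plain POSIX
calls (editor's sketch; David posts an ext4 run of essentially this
experiment further down the thread):

#include <stdio.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>

/* Set mtime an hour into the future, write, and see whether write(2)
 * drags mtime back to the current clock. */
int main(void)
{
    int fd = open("testfile", O_CREAT | O_WRONLY, 0644);
    struct timeval times[2];
    gettimeofday(&times[0], NULL);   /* atime: now */
    times[1] = times[0];
    times[1].tv_sec += 3600;         /* mtime: now + 1 hour */
    utimes("testfile", times);
    write(fd, "x", 1);
    struct stat st;
    fstat(fd, &st);
    printf("mtime after write: %s", ctime(&st.st_mtime));
    close(fd);
    return 0;
}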


Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread Sam Lang

On 11/27/2012 12:01 PM, Sage Weil wrote:

On Tue, 27 Nov 2012, David Zafman wrote:


On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:


On Tue, 27 Nov 2012, Sam Lang wrote:


3. When a client acquires the cap for a file, have the mds provide its current
time as well.  As the client updates the mtime, it uses the timestamp provided
by the mds and the time since the cap was acquired.
Except for the skew caused by the message latency, this approach allows the
mtime to be based off the mds time, so it will be consistent across clients
and the mds.  It does however, allow a client to set an mtime to the future
(based off of its local time), which might be undesirable, but that is more
like how  NFS behaves.  Message latency probably won't be much of an issue
either, as the granularity of mtime is a second. Also, the client can set its
cap acquired timestamp to the time at which the cap was requested, ensuring
that the relative increment includes the round trip latency so that the mtime
will always be set further ahead. Of course, this approach would be a lot more
intrusive to implement. :-)


Yeah, I'm less excited about this one.

I think that giving consistent behavior from a single client despite clock
skew is a good goal.  That will make things like pjd's test behave
consistently, for example.



My suggestion is that a client writing to a file will try to use its
local clock unless it would cause the mtime to go backward.  In that
case it will simply perform the minimum mtime advance possible (1
second?).  This handles the case in which one client created a file
using his clock (per previous suggested change), then another client
writes with a clock that is behind.


We can choose to not decrement at the client, but because mtime is a 
time_t (seconds since epoch), we can't increment by 1 for each write. 
1000 writes each taking 0.01s would move the mtime 990 seconds into the 
future.




That's a possibility (if it's 1ms or 1ns, at least :). We need to verify
what POSIX says about that, though: if you utimes(2) an mtime into the
future, what happens on write(2)?


According to http://pubs.opengroup.org/onlinepubs/009695399/, writes 
only require an update to mtime; it doesn't specify what the update 
should be:


Upon successful completion, where nbyte is greater than 0, write() 
shall mark for update the st_ctime and st_mtime fields of the file, and 
if the file is a regular file, the S_ISUID and S_ISGID bits of the file 
mode may be cleared.


In NFS, the server sets the mtime.  It's relatively common to see 
"Warning: file 'foo' has modification time in the future" if you're 
compiling on NFS and your client and NFS server clocks are skewed.  So 
allowing the mtime to be set in the near future would at least follow 
the principle of least surprise for most folks.


-sam



sage





Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread David Zafman

On Nov 27, 2012, at 11:05 AM, Sam Lang sam.l...@inktank.com wrote:

 On 11/27/2012 12:01 PM, Sage Weil wrote:
 On Tue, 27 Nov 2012, David Zafman wrote:
 
 On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:
 
 On Tue, 27 Nov 2012, Sam Lang wrote:
 
 3. When a client acquires the cap for a file, have the mds provide its 
 current
 time as well.  As the client updates the mtime, it uses the timestamp 
 provided
 by the mds and the time since the cap was acquired.
 Except for the skew caused by the message latency, this approach allows 
 the
 mtime to be based off the mds time, so it will be consistent across 
 clients
 and the mds.  It does however, allow a client to set an mtime to the 
 future
 (based off of its local time), which might be undesirable, but that is 
 more
 like how  NFS behaves.  Message latency probably won't be much of an issue
 either, as the granularity of mtime is a second. Also, the client can set 
 its
 cap acquired timestamp to the time at which the cap was requested, 
 ensuring
 that the relative increment includes the round trip latency so that the 
 mtime
 will always be set further ahead. Of course, this approach would be a lot 
 more
 intrusive to implement. :-)
 
 Yeah, I'm less excited about this one.
 
 I think that giving consistent behavior from a single client despite clock
 skew is a good goal.  That will make things like pjd's test behave
 consistently, for example.
 
 
My suggestion is that a client writing to a file will try to use its
 local clock unless it would cause the mtime to go backward.  In that
 case it will simply perform the minimum mtime advance possible (1
 second?).  This handles the case in which one client created a file
 using his clock (per previous suggested change), then another client
 writes with a clock that is behind.
 
 We can choose to not decrement at the client, but because mtime is a time_t 
 (seconds since epoch), we can't increment by 1 for each write. 1000 writes 
 each taking 0.01s would move the mtime 990 seconds into the future.

The mtime update shouldn't work that way (see below).

 
 
 That's a possibility (if it's 1ms or 1ns, at least :). We need to verify
 what POSIX says about that, though: if you utimes(2) an mtime into the
 future, what happens on write(2)?

On ext4, a write(2) after mtime has been set into the future with utimes(2) 
does make the time go backward.  However, we can observe that if ctime == 
mtime, then the last operation on the file was a create/write/truncate.  
This means that we should not let the mtime go backward in that case.  If 
ctime != mtime, then the mtime has been set by utimes(2), so we can set 
mtime using our clock even if it goes backwards.
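
Reduced to a conditional, the heuristic is (editor's sketch with
hypothetical names; Sam questions the premise just below):

#include "include/utime.h"
#include "common/Clock.h"

// If ctime == mtime, the last change was a create/write/truncate, so
// keep mtime non-decreasing; otherwise utimes(2) set mtime explicitly,
// and the local clock may win even if that moves mtime backward.
utime_t write_mtime(CephContext *cct, const utime_t &mtime, const utime_t &ctime)
{
  utime_t now = ceph_clock_now(cct);
  if (ctime == mtime && now < mtime)
    return mtime;  // or mtime plus the minimum advance, per the earlier mail
  return now;
}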

 
According to http://pubs.opengroup.org/onlinepubs/009695399/, writes only 
require an update to mtime; it doesn't specify what the update should be:
 
 Upon successful completion, where nbyte is greater than 0, write() shall 
 mark for update the st_ctime and st_mtime fields of the file, and if the file 
 is a regular file, the S_ISUID and S_ISGID bits of the file mode may be 
 cleared.

What this really means is that every write marks mtime for update without 
yet setting a specific time in the inode.  All writes/truncates will be 
rolled into a single mtime bump.  So even if we only have 1 second 
granularity (though hopefully it is 1 ms or 1 us), a new mtime only needs 
to be set when a stat occurs (or, in our case, when sending info to the 
MDS or returning capabilities), and it will be at most 1 second ahead.
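
That "mark for update" reading might look like this in code (editor's
sketch; the struct is hypothetical, not Ceph's Inode):

#include <algorithm>
#include "include/utime.h"
#include "common/Clock.h"

// Writes only set a dirty flag; the mtime is materialized once, when
// someone looks (stat, cap flush), so N writes cost at most one bump
// rather than N increments marching into the future.
struct LazyMtime {
  bool dirty;
  utime_t mtime;
  LazyMtime() : dirty(false) {}

  void on_write() { dirty = true; }

  utime_t current(CephContext *cct) {
    if (dirty) {
      mtime = std::max(ceph_clock_now(cct), mtime);  // never backward
      dirty = false;
    }
    return mtime;
  }
};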

 
In NFS, the server sets the mtime.  It's relatively common to see "Warning: 
file 'foo' has modification time in the future" if you're compiling on NFS 
and your client and NFS server clocks are skewed.  So allowing the mtime to 
be set in the near future would at least follow the principle of least 
surprise for most folks.

So Ceph can see this warning too if different skewed clocks are setting mtime 
and it appears in the future to some clients.

 
 -sam
 
 
 sage
 
 



Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread Sage Weil
 On 11/27/2012 12:01 PM, Sage Weil wrote:
  On Tue, 27 Nov 2012, David Zafman wrote:
   
   On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:
   
On Tue, 27 Nov 2012, Sam Lang wrote:

 3. When a client acquires the cap for a file, have the mds provide its
 current
 time as well.  As the client updates the mtime, it uses the timestamp
 provided
 by the mds and the time since the cap was acquired.
 Except for the skew caused by the message latency, this approach
 allows the
 mtime to be based off the mds time, so it will be consistent across
 clients
 and the mds.  It does however, allow a client to set an mtime to the
 future
 (based off of its local time), which might be undesirable, but that is
 more
 like how  NFS behaves.  Message latency probably won't be much of an
 issue
 either, as the granularity of mtime is a second. Also, the client can
 set its
 cap acquired timestamp to the time at which the cap was requested,
 ensuring
 that the relative increment includes the round trip latency so that
 the mtime
 will always be set further ahead. Of course, this approach would be a
 lot more
 intrusive to implement. :-)

Yeah, I'm less excited about this one.

I think that giving consistent behavior from a single client despite
clock
skew is a good goal.  That will make things like pjd's test behave
consistently, for example.

   
My suggestion is that a client writing to a file will try to use its
   local clock unless it would cause the mtime to go backward.  In that
   case it will simply perform the minimum mtime advance possible (1
   second?).  This handles the case in which one client created a file
   using his clock (per previous suggested change), then another client
   writes with a clock that is behind.
 
 We can choose to not decrement at the client, but because mtime is a time_t
 (seconds since epoch), we can't increment by 1 for each write. 1000 writes
 each taking 0.01s would move the mtime 990 seconds into the future.

Time resolution is nanoseconds, so this shouldn't be a problem.

  
  That's a possibility (if it's 1ms or 1ns, at least :). We need to verify
  what POSIX says about that, though: if you utimes(2) an mtime into the
  future, what happens on write(2)?
 
According to http://pubs.opengroup.org/onlinepubs/009695399/, writes only
require an update to mtime; it doesn't specify what the update should be:
 
 Upon successful completion, where nbyte is greater than 0, write() shall mark
 for update the st_ctime and st_mtime fields of the file, and if the file is a
 regular file, the S_ISUID and S_ISGID bits of the file mode may be cleared.
 
In NFS, the server sets the mtime.  It's relatively common to see "Warning:
file 'foo' has modification time in the future" if you're compiling on NFS and
your client and NFS server clocks are skewed.  So allowing the mtime to be set
in the near future would at least follow the principle of least surprise for
most folks.

We can make this a client config option (set to current time vs 
add epsilon).

I also like the idea of providing the timestamp on file creation.  We 
could do both.

sage

 
 -sam
 
  
  sage
  
 


Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread Sam Lang

On 11/27/2012 01:38 PM, David Zafman wrote:


On Nov 27, 2012, at 11:05 AM, Sam Lang sam.l...@inktank.com wrote:


On 11/27/2012 12:01 PM, Sage Weil wrote:

On Tue, 27 Nov 2012, David Zafman wrote:


On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:


On Tue, 27 Nov 2012, Sam Lang wrote:


3. When a client acquires the cap for a file, have the mds provide its current
time as well.  As the client updates the mtime, it uses the timestamp provided
by the mds and the time since the cap was acquired.
Except for the skew caused by the message latency, this approach allows the
mtime to be based off the mds time, so it will be consistent across clients
and the mds.  It does however, allow a client to set an mtime to the future
(based off of its local time), which might be undesirable, but that is more
like how  NFS behaves.  Message latency probably won't be much of an issue
either, as the granularity of mtime is a second. Also, the client can set its
cap acquired timestamp to the time at which the cap was requested, ensuring
that the relative increment includes the round trip latency so that the mtime
will always be set further ahead. Of course, this approach would be a lot more
intrusive to implement. :-)


Yeah, I'm less excited about this one.

I think that giving consistent behavior from a single client despite clock
skew is a good goal.  That will make things like pjd's test behave
consistently, for example.



My suggestion is that a client writing to a file will try to use its
local clock unless it would cause the mtime to go backward.  In that
case it will simply perform the minimum mtime advance possible (1
second?).  This handles the case in which one client created a file
using his clock (per previous suggested change), then another client
writes with a clock that is behind.


We can choose to not decrement at the client, but because mtime is a time_t 
(seconds since epoch), we can't increment by 1 for each write. 1000 writes each 
taking 0.01s would move the mtime 990 seconds into the future.


The mtime update shouldn't work that way (see below).





That's a possibility (if it's 1ms or 1ns, at least :). We need to verify
what POSIX says about that, though: if you utimes(2) an mtime into the
future, what happens on write(2)?


On ext4, a write(2) after mtime has been set into the future with utimes(2) 
does make the time go backward.  However, we can observe that if ctime == 
mtime, then the last operation on the file was a create/write/truncate.  
This means that we should not let the mtime go backward in that case.  If 
ctime != mtime, then the mtime has been set by utimes(2), so we can set 
mtime using our clock even if it goes backwards.


I'm not sure I follow you here.  utimes(2) can set mtime and ctime to the 
same value or to different values, or set mtime and/or ctime to the current 
time.  That makes it hard to rely on the mtime != ctime conditional.


We might be able to use the time_warp_seq field, similar to how it's used 
for client and mds mtime updates.  If the time_warp_seq has been 
incremented since we acquired caps, we skip the mtime increment and 
just use the current time.


I've pushed wip-mtime-incr with the mtime increment check.  There's also 
a separate commit that updates the ctime on write.
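
The shape of that check, as an editor's sketch (time_warp_seq is a real
inode field; this helper and the cached sequence number are hypothetical):

#include <stdint.h>
#include <algorithm>
#include "include/utime.h"
#include "common/Clock.h"

struct InodeTimes {
  uint32_t time_warp_seq;  // bumped when mtime is set explicitly
  utime_t mtime;
};

// If utimes(2) warped time since we acquired our caps, skip the
// non-decreasing rule and just take the current clock.
utime_t update_mtime(CephContext *cct, const InodeTimes &in, uint32_t seq_at_cap)
{
  utime_t now = ceph_clock_now(cct);
  if (in.time_warp_seq != seq_at_cap)
    return now;
  return std::max(now, in.mtime);
}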


-sam





According to http://pubs.opengroup.org/onlinepubs/009695399/, writes only 
require an update to mtime; it doesn't specify what the update should be:

Upon successful completion, where nbyte is greater than 0, write() shall mark for 
update the st_ctime and st_mtime fields of the file, and if the file is a regular file, 
the S_ISUID and S_ISGID bits of the file mode may be cleared.


What this really means is that all writes mark mtime for update but not setting 
a specific time in the inode yet.  All writes/truncates will be rolled into a 
single mtime bump.  So even if we only have 1 second granularity (but hopefully 
it is 1 ms or 1 us), when a stat occurs (or in our case sending info to MDS or 
returning capabilities) only then does a new mtime need to be set and it will 
be at most 1 second ahead.



In NFS, the server sets the mtime.  It's relatively common to see "Warning: file 
'foo' has modification time in the future" if you're compiling on NFS and your 
client and NFS server clocks are skewed.  So allowing the mtime to be set in the 
near future would at least follow the principle of least surprise for most folks.


So Ceph can see this warning too if different skewed clocks are setting mtime 
and it appears in the future to some clients.



-sam



sage









Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread David Zafman

On Nov 27, 2012, at 1:14 PM, Sam Lang sam.l...@inktank.com wrote:

 On 11/27/2012 01:38 PM, David Zafman wrote:
 
 On Nov 27, 2012, at 11:05 AM, Sam Lang sam.l...@inktank.com wrote:
 
 On 11/27/2012 12:01 PM, Sage Weil wrote:
 On Tue, 27 Nov 2012, David Zafman wrote:
 
 On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:
 
 On Tue, 27 Nov 2012, Sam Lang wrote:
 
 3. When a client acquires the cap for a file, have the mds provide its 
 current
 time as well.  As the client updates the mtime, it uses the timestamp 
 provided
 by the mds and the time since the cap was acquired.
 Except for the skew caused by the message latency, this approach allows 
 the
 mtime to be based off the mds time, so it will be consistent across 
 clients
 and the mds.  It does however, allow a client to set an mtime to the 
 future
 (based off of its local time), which might be undesirable, but that is 
 more
 like how  NFS behaves.  Message latency probably won't be much of an 
 issue
 either, as the granularity of mtime is a second. Also, the client can 
 set its
 cap acquired timestamp to the time at which the cap was requested, 
 ensuring
 that the relative increment includes the round trip latency so that the 
 mtime
 will always be set further ahead. Of course, this approach would be a 
 lot more
 intrusive to implement. :-)
 
 Yeah, I'm less excited about this one.
 
 I think that giving consistent behavior from a single client despite 
 clock
 skew is a good goal.  That will make things like pjd's test behave
 consistently, for example.
 
 
My suggestion is that a client writing to a file will try to use its
 local clock unless it would cause the mtime to go backward.  In that
 case it will simply perform the minimum mtime advance possible (1
 second?).  This handles the case in which one client created a file
 using his clock (per previous suggested change), then another client
 writes with a clock that is behind.
 
 We can choose to not decrement at the client, but because mtime is a time_t 
 (seconds since epoch), we can't increment by 1 for each write. 1000 writes 
 each taking 0.01s would move the mtime 990 seconds into the future.
 
 The mtime update shouldn't work that way (see below).
 
 
 
 That's a possibility (if it's 1ms or 1ns, at least :). We need to verify
 what POSIX says about that, though: if you utimes(2) an mtime into the
 future, what happens on write(2)?
 
On ext4, a write(2) after mtime has been set into the future with utimes(2) 
does make the time go backward.  However, we can observe that if ctime == 
mtime, then the last operation on the file was a create/write/truncate.  
This means that we should not let the mtime go backward in that case.  If 
ctime != mtime, then the mtime has been set by utimes(2), so we can set 
mtime using our clock even if it goes backwards.
 
I'm not sure I follow you here.  utimes(2) can set mtime and ctime to the 
same value or to different values, or set mtime and/or ctime to the current 
time.  That makes it hard to rely on the mtime != ctime conditional.

utimes(2) does not allow you to modify ctime.  As a matter of fact, if you set 
mtime, ctime will always be set to the local time.  On a single system with only 
a forward-moving clock, ctime can never go backwards, nor will it ever look like 
it is in the future.  Also, when setting mtime to now, it is the case that ctime 
== mtime.  Ceph should ensure that ctime only moves forward.  Unfortunately, it 
can't do that and also prevent ctime from looking like it is in the future, but 
NFS doesn't either, because the ctime is always set by the NFS server clock.

ubuntu@client:~$ date ; touch file ; stat file ; sleep 60 ; date ; echo foo >> 
file ; stat file ; sleep 15 ; date ; touch -m -t 11281200 file ; stat file ; 
sleep 15 ; date ; touch -m file ; stat file
###CREATE (atime ==  mtime == ctime)
Tue Nov 27 13:44:23 PST 2012
  File: `file'
  Size: 4   Blocks: 8  IO Block: 4096   regular file
Device: 805h/2053d  Inode: 145126  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  ubuntu)   Gid: ( 1000/  ubuntu)
Access: 2012-11-27 13:44:23.685986005 -0800
Modify: 2012-11-27 13:44:23.685986005 -0800
Change: 2012-11-27 13:44:23.685986005 -0800
 Birth: -
###WRITE (mtime == ctime)  advanced
Tue Nov 27 13:45:23 PST 2012
  File: `file'
  Size: 8   Blocks: 8  IO Block: 4096   regular file
Device: 805h/2053d  Inode: 145126  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  ubuntu)   Gid: ( 1000/  ubuntu)
Access: 2012-11-27 13:44:23.685986005 -0800
Modify: 2012-11-27 13:45:23.701986456 -0800
Change: 2012-11-27 13:45:23.701986456 -0800
 Birth: -
###UTIMES(2): mtime in the future, ctime set to local clock
Tue Nov 27 13:45:38 PST 2012
  File: `file'
  Size: 8   Blocks: 8  IO Block: 4096   regular file
Device: 805h/2053d  Inode: 145126  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  ubuntu)   Gid: ( 1000/  ubuntu)
Access: 2012-11-27 13:44:23.685986005 -0800
Modify: 2012-11-28 12:00:00.0 

Re: Hadoop and Ceph client/mds view of modification time

2012-11-21 Thread Noah Watkins
(Sorry for the dupe message. vger rejected due to HTML).

Thanks, I'll try this patch this morning.

Client B should perform a single stat after a notification from Client
A. But, won't Sage's patch still be required, since Client A needs the
MDS time to pass to Client B?

On Tue, Nov 20, 2012 at 12:20 PM, Sam Lang sam.l...@inktank.com wrote:
 On 11/20/2012 01:44 PM, Noah Watkins wrote:

 This is a description of the clock synchronization issue we are facing
 in Hadoop:

 Components of Hadoop use mtime as a versioning mechanism. Here is an
 example where Client B tests the expected 'version' of a file created
 by Client A:

Client A: create file, write data into file.
Client A: expected_mtime <-- lstat(file)
Client A: broadcast expected_mtime to client B
...
Client B: mtime <-- lstat(file)
Client B: test expected_mtime == mtime


 Here's a patch that might work to push the setattr out to the mds every time
 (the same as Sage's patch for getattr).  This isn't quite writeback, as it
 waits for the setattr at the server to complete before returning, but I
 think that's actually what you want in this case.  It needs to be enabled by
 setting client setattr writethru = true in the config.  Also, I haven't
 tested that it sends the setattr, just a basic test of functionality.

BTW, if it's always client B's first stat of the file, you won't need Sage's
 patch.

 -sam

diff --git a/src/client/Client.cc b/src/client/Client.cc
index 8d4a5ac..a7dd8f7 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -4165,6 +4165,7 @@ int Client::_getattr(Inode *in, int mask, int uid, int gid)

 int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int gid)
 {
+  int orig_mask = mask;
   int issued = in->caps_issued();

   ldout(cct, 10) << "_setattr mask " << mask << " issued " << ccap_string(issued) << dendl;
@@ -4219,7 +4220,7 @@ int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int gid)
       mask &= ~(CEPH_SETATTR_MTIME|CEPH_SETATTR_ATIME);
     }
   }
-  if (!mask)
+  if (!cct->_conf->client_setattr_writethru && !mask)
     return 0;

   MetaRequest *req = new MetaRequest(CEPH_MDS_OP_SETATTR);
@@ -4229,6 +4230,10 @@ int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int gid)
   req->set_filepath(path);
   req->inode = in;

+  // reset mask back to original if we're meant to do writethru
+  if (cct->_conf->client_setattr_writethru)
+    mask = orig_mask;
+
   if (mask & CEPH_SETATTR_MODE) {
     req->head.args.setattr.mode = attr->st_mode;
     req->inode_drop |= CEPH_CAP_AUTH_SHARED;
diff --git a/src/common/config_opts.h b/src/common/config_opts.h
index cc05095..51a2769 100644
--- a/src/common/config_opts.h
+++ b/src/common/config_opts.h
@@ -178,6 +178,7 @@ OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 100)  // MB * n  (dirty OR tx.
 OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep this smallish)
 OPTION(client_oc_max_dirty_age, OPT_DOUBLE, 5.0)  // max age in cache before writeback
 OPTION(client_oc_max_objects, OPT_INT, 1000)  // max objects in cache
+OPTION(client_setattr_writethru, OPT_BOOL, false)  // send the attributes to the mds server
 // note: the max amount of in flight dirty data is roughly (max - target)
 OPTION(fuse_use_invalidate_cb, OPT_BOOL, false) // use fuse 2.8+ invalidate callback to keep page cache consistent
 OPTION(fuse_big_writes, OPT_BOOL, true)



 Since mtime may be set in Ceph by both client and MDS, inconsistent
 mtime view is possible when clocks are not adequately synchronized.

 Here is a test that reproduces the problem. In the following output,
 issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
 set to an hour earlier than the MDS node.

nwatkins@issdm-22:~$ ssh issdm-18 date && ./test
 Tue Nov 20 11:40:28 PST 2012   // MDS TIME
 local time: Tue Nov 20 10:42:47 2012  // Client TIME
 fstat time: Tue Nov 20 11:40:28 2012  // mtime seen after file
 creation (MDS time)
 lstat time: Tue Nov 20 10:42:47 2012  // mtime seen after file write
 (client time)

 Here is the code used to produce that output.

#include <errno.h>
#include <sys/fcntl.h>
#include <sys/time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <sys/xattr.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <cephfs/libcephfs.h>
#include <time.h>

int main(int argc, char **argv)
{
  struct stat st;
  struct ceph_mount_info *cmount;
  struct timeval tv;

  /* setup */
  ceph_create(&cmount, "admin");
  ceph_conf_read_file(cmount, "/users/nwatkins/Projects/ceph.conf");
  ceph_mount(cmount, "/");

  /* print local time for reference */
  gettimeofday(&tv, NULL);
  printf("local time: %s", ctime(&tv.tv_sec));

  /* create a file */
  char buf[256];
  sprintf(buf, "/somefile.%d", getpid());
  int fd = 

Hadoop and Ceph client/mds view of modification time

2012-11-20 Thread Noah Watkins
This is a description of the clock synchronization issue we are facing
in Hadoop:

Components of Hadoop use mtime as a versioning mechanism. Here is an
example where Client B tests the expected 'version' of a file created
by Client A (a short sketch of B's check follows the steps below):

  Client A: create file, write data into file.
  Client A: expected_mtime <-- lstat(file)
  Client A: broadcast expected_mtime to client B
  ...
  Client B: mtime <-- lstat(file)
  Client B: test expected_mtime == mtime
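
A compilable sketch of client B's side of the check (editor's
illustration; expected_mtime is assumed to arrive out of band from
client A):

#include <time.h>
#include <sys/stat.h>
#include <cephfs/libcephfs.h>

int version_matches(struct ceph_mount_info *cmount,
                    const char *path, time_t expected_mtime)
{
    struct stat st;
    if (ceph_lstat(cmount, path, &st) != 0)
        return 0;
    return st.st_mtime == expected_mtime;
}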

Since mtime may be set in Ceph by both client and MDS, inconsistent
mtime view is possible when clocks are not adequately synchronized.

Here is a test that reproduces the problem. In the following output,
issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
set to an hour earlier than the MDS node.

nwatkins@issdm-22:~$ ssh issdm-18 date && ./test
Tue Nov 20 11:40:28 PST 2012   // MDS TIME
local time: Tue Nov 20 10:42:47 2012  // Client TIME
fstat time: Tue Nov 20 11:40:28 2012  // mtime seen after file
creation (MDS time)
lstat time: Tue Nov 20 10:42:47 2012  // mtime seen after file write
(client time)

Here is the code used to produce that output.

#include <errno.h>
#include <sys/fcntl.h>
#include <sys/time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <sys/xattr.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <cephfs/libcephfs.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct stat st;
    struct ceph_mount_info *cmount;
    struct timeval tv;

    /* setup */
    ceph_create(&cmount, "admin");
    ceph_conf_read_file(cmount, "/users/nwatkins/Projects/ceph.conf");
    ceph_mount(cmount, "/");

    /* print local time for reference */
    gettimeofday(&tv, NULL);
    printf("local time: %s", ctime(&tv.tv_sec));

    /* create a file */
    char buf[256];
    sprintf(buf, "/somefile.%d", getpid());
    int fd = ceph_open(cmount, buf, O_WRONLY|O_CREAT, 0);
    assert(fd > 0);

    /* get mtime for this new file */
    memset(&st, 0, sizeof(st));
    int ret = ceph_fstat(cmount, fd, &st);
    assert(ret == 0);
    printf("fstat time: %s", ctime(&st.st_mtime));

    /* write some data into the file */
    ret = ceph_write(cmount, fd, buf, sizeof(buf), -1);
    assert(ret == sizeof(buf));
    ceph_close(cmount, fd);

    memset(&st, 0, sizeof(st));
    ret = ceph_lstat(cmount, buf, &st);
    assert(ret == 0);
    printf("lstat time: %s", ctime(&st.st_mtime));

    ceph_shutdown(cmount);
    return 0;
}

Note that this output is currently using the short patch from
http://marc.info/?l=ceph-devel&m=133178637520337&w=2 which forces
getattr to always go to the MDS.

diff --git a/src/client/Client.cc b/src/client/Client.cc
index 4a9ae3c..2bb24b7 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -3858,7 +3858,7 @@ int Client::readlink(const char *relpath, char *buf, loff_t size)
 int Client::_getattr(Inode *in, int mask, int uid, int gid)
 {
-  bool yes = in->caps_issued_mask(mask);
+  bool yes = false; //in->caps_issued_mask(mask);

   ldout(cct, 10) << "_getattr mask " << ccap_string(mask) << " issued=" << yes << dendl;
   if (yes)


Re: Hadoop and Ceph client/mds view of modification time

2012-11-20 Thread Sam Lang

On 11/20/2012 01:44 PM, Noah Watkins wrote:

This is a description of the clock synchronization issue we are facing
in Hadoop:

Components of Hadoop use mtime as a versioning mechanism. Here is an
example where Client B tests the expected 'version' of a file created
by Client A:

   Client A: create file, write data into file.
   Client A: expected_mtime <-- lstat(file)
   Client A: broadcast expected_mtime to client B
   ...
   Client B: mtime <-- lstat(file)
   Client B: test expected_mtime == mtime


Here's a patch that might work to push the setattr out to the mds every 
time (the same as Sage's patch for getattr).  This isn't quite 
writeback, as it waits for the setattr at the server to complete before 
returning, but I think that's actually what you want in this case.  It 
needs to be enabled by setting client setattr writethru = true in the 
config.  Also, I haven't tested that it sends the setattr, just a basic 
test of functionality.
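
For reference, enabling it would look like this in ceph.conf (editor's
note; the option only exists with this patch applied):

[client]
    client setattr writethru = true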


BTW, if it's always client B's first stat of the file, you won't need 
Sage's patch.


-sam

diff --git a/src/client/Client.cc b/src/client/Client.cc
index 8d4a5ac..a7dd8f7 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -4165,6 +4165,7 @@ int Client::_getattr(Inode *in, int mask, int uid, int gid)

 int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int gid)
 {
+  int orig_mask = mask;
   int issued = in->caps_issued();

   ldout(cct, 10) << "_setattr mask " << mask << " issued " << ccap_string(issued) << dendl;
@@ -4219,7 +4220,7 @@ int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int gid)
       mask &= ~(CEPH_SETATTR_MTIME|CEPH_SETATTR_ATIME);
     }
   }
-  if (!mask)
+  if (!cct->_conf->client_setattr_writethru && !mask)
     return 0;

   MetaRequest *req = new MetaRequest(CEPH_MDS_OP_SETATTR);
@@ -4229,6 +4230,10 @@ int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int gid)
   req->set_filepath(path);
   req->inode = in;

+  // reset mask back to original if we're meant to do writethru
+  if (cct->_conf->client_setattr_writethru)
+    mask = orig_mask;
+
   if (mask & CEPH_SETATTR_MODE) {
     req->head.args.setattr.mode = attr->st_mode;
     req->inode_drop |= CEPH_CAP_AUTH_SHARED;
diff --git a/src/common/config_opts.h b/src/common/config_opts.h
index cc05095..51a2769 100644
--- a/src/common/config_opts.h
+++ b/src/common/config_opts.h
@@ -178,6 +178,7 @@ OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 100)  // MB * n  (dirty OR tx.
 OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep this smallish)
 OPTION(client_oc_max_dirty_age, OPT_DOUBLE, 5.0)  // max age in cache before writeback
 OPTION(client_oc_max_objects, OPT_INT, 1000)  // max objects in cache
+OPTION(client_setattr_writethru, OPT_BOOL, false)  // send the attributes to the mds server
 // note: the max amount of in flight dirty data is roughly (max - target)
 OPTION(fuse_use_invalidate_cb, OPT_BOOL, false) // use fuse 2.8+ invalidate callback to keep page cache consistent
 OPTION(fuse_big_writes, OPT_BOOL, true)




Since mtime may be set in Ceph by both client and MDS, inconsistent
mtime view is possible when clocks are not adequately synchronized.

Here is a test that reproduces the problem. In the following output,
issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
set to an hour earlier than the MDS node.

nwatkins@issdm-22:~$ ssh issdm-18 date && ./test
Tue Nov 20 11:40:28 PST 2012   // MDS TIME
local time: Tue Nov 20 10:42:47 2012  // Client TIME
fstat time: Tue Nov 20 11:40:28 2012  // mtime seen after file
creation (MDS time)
lstat time: Tue Nov 20 10:42:47 2012  // mtime seen after file write
(client time)

Here is the code used to produce that output.

#include <errno.h>
#include <sys/fcntl.h>
#include <sys/time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <sys/xattr.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <cephfs/libcephfs.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct stat st;
    struct ceph_mount_info *cmount;
    struct timeval tv;

    /* setup */
    ceph_create(&cmount, "admin");
    ceph_conf_read_file(cmount, "/users/nwatkins/Projects/ceph.conf");
    ceph_mount(cmount, "/");

    /* print local time for reference */
    gettimeofday(&tv, NULL);
    printf("local time: %s", ctime(&tv.tv_sec));

    /* create a file */
    char buf[256];
    sprintf(buf, "/somefile.%d", getpid());
    int fd = ceph_open(cmount, buf, O_WRONLY|O_CREAT, 0);
    assert(fd > 0);

    /* get mtime for this new file */
    memset(&st, 0, sizeof(st));
    int ret = ceph_fstat(cmount, fd, &st);
    assert(ret == 0);
    printf("fstat time: %s", ctime(&st.st_mtime));

    /* write some data into the file */
    ret = ceph_write(cmount, fd, buf, sizeof(buf), -1);
    assert(ret == sizeof(buf));