Re: Hadoop and Ceph client/mds view of modification time

David Zafman Tue, 27 Nov 2012 11:38:29 -0800

On Nov 27, 2012, at 11:05 AM, Sam Lang <sam.l...@inktank.com> wrote:


> On 11/27/2012 12:01 PM, Sage Weil wrote:
>> On Tue, 27 Nov 2012, David Zafman wrote:
>>> 
>>> On Nov 27, 2012, at 9:03 AM, Sage Weil <s...@inktank.com> wrote:
>>> 
>>>> On Tue, 27 Nov 2012, Sam Lang wrote:
>>>> 
>>>>> 3. When a client acquires the cap for a file, have the mds provide its
>>>>> current time as well.  As the client updates the mtime, it uses the
>>>>> timestamp provided by the mds and the time since the cap was acquired.
>>>>> Except for the skew caused by the message latency, this approach allows
>>>>> the mtime to be based off the mds time, so it will be consistent across
>>>>> clients and the mds.  It does, however, allow a client to set an mtime
>>>>> to the future (based off of its local time), which might be undesirable,
>>>>> but that is more like how NFS behaves.  Message latency probably won't
>>>>> be much of an issue either, as the granularity of mtime is a second.
>>>>> Also, the client can set its cap acquired timestamp to the time at which
>>>>> the cap was requested, ensuring that the relative increment includes the
>>>>> round trip latency so that the mtime will always be set further ahead.
>>>>> Of course, this approach would be a lot more intrusive to implement. :-)
>>>> 
>>>> Yeah, I'm less excited about this one.
>>>> 
>>>> I think that giving consistent behavior from a single client despite clock
>>>> skew is a good goal.  That will make things like pjd's test behave
>>>> consistently, for example.
>>>> 
>>> 
>>> My suggestion is that a client writing to a file will try to use its
>>> local clock unless it would cause the mtime to go backward.  In that
>>> case it will simply perform the minimum mtime advance possible (1
>>> second?).  This handles the case in which one client created a file
>>> using his clock (per previous suggested change), then another client
>>> writes with a clock that is behind.
> 
> We can choose to not decrement at the client, but because mtime is a time_t 
> (seconds since epoch), we can't increment by 1 for each write. 1000 writes 
> each taking 0.01s would move the mtime 990 seconds into the future.

The mtime update shouldn't work that way (see below).

> 
>> 
>> That's a possibility (if it's 1ms or 1ns, at least :). We need to verify
>> what POSIX says about that, though: if you utimes(2) an mtime into the
>> future, what happens on write(2)?

On ext4, a write(2) after the mtime has been set into the future with utimes(2) 
does make the mtime go backward.  However, we can notice that if ctime == mtime, 
then only a create/write/truncate was last done to the file.  This means that 
we should not let the mtime go backward in that case.  If ctime != mtime, then 
the mtime has been set by utimes(2), so we can set the mtime using our clock 
even if it goes backward.
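
To make the rule above concrete, here is a minimal sketch in Python. It is not
Ceph code: the function name next_mtime, its parameters, and the 1 second
min_step are assumptions made up for illustration only.

```python
def next_mtime(mtime, ctime, now, min_step=1):
    """Pick the mtime to record for a write (hypothetical helper).

    mtime, ctime: current inode times, in seconds since the epoch
    now:          the writing client's local clock
    min_step:     assumed minimum advance (1 second here)
    """
    if ctime != mtime:
        # mtime was last set explicitly (e.g. by utimes(2)), so follow
        # the local clock even if that moves the mtime backward.
        return now
    if now >= mtime:
        # The local clock does not move the mtime backward; use it as-is.
        return now
    # Last change was a create/write/truncate and the local clock is
    # behind: never go backward, take the minimum advance instead.
    return mtime + min_step
```

For example, with mtime == ctime == 100 and a local clock reading 90, the
sketch returns 101 rather than 90.  With ctime == 100 and an mtime of 500 set
by utimes(2), the same clock reading of 90 is used directly.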

> 
> According to http://pubs.opengroup.org/onlinepubs/009695399/, writes only 
> require an update to mtime; it doesn't specify what the update should be:
> 
> "Upon successful completion, where nbyte is greater than 0, write() shall 
> mark for update the st_ctime and st_mtime fields of the file, and if the file 
> is a regular file, the S_ISUID and S_ISGID bits of the file mode may be 
> cleared."

What this really means is that all writes mark the mtime for update without 
setting a specific time in the inode yet.  All writes/truncates will be rolled 
into a single mtime bump.  So even if we only have 1 second granularity (but 
hopefully it is 1 ms or 1 us), a new mtime only needs to be set when a stat 
occurs (or, in our case, when sending info to the MDS or returning 
capabilities), and it will be at most 1 second ahead.
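
A rough sketch of that "mark for update, bump once" behavior, again in Python
and again not Ceph code.  The class InodeTimes, its method names, and the
1 second min_step are hypothetical.

```python
class InodeTimes:
    """Toy model of deferred mtime updates (illustrative only)."""

    def __init__(self, mtime):
        self.mtime = mtime        # seconds since the epoch
        self.mtime_dirty = False  # set by writes, cleared on flush

    def write(self):
        # A write only marks the mtime for update; no time is chosen yet.
        self.mtime_dirty = True

    def flush(self, now, min_step=1):
        # Called on stat, or when sending state to the MDS / returning caps.
        # All pending writes collapse into a single mtime bump.
        if self.mtime_dirty:
            if now >= self.mtime:
                self.mtime = now
            else:
                # Local clock is behind: advance by the minimum step only.
                self.mtime += min_step
            self.mtime_dirty = False
        return self.mtime
```

With this model, 1000 writes followed by one flush on a client whose clock is
behind move the mtime forward by one step in total, not by one step per write.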

> 
> In NFS, the server sets the mtime.  It's relatively common to see "Warning: 
> file 'foo' has modification time in the future" if you're compiling on nfs 
> and your client and nfs server clocks are skewed.  So allowing the mtime to 
> be set in the near future would at least follow the principle of least 
> surprise for most folks.

So Ceph can see this warning too if different skewed clocks are setting mtime 
and it appears in the future to some clients.

> 
> -sam
> 
>> 
>> sage
>> 
> 
