subject:"Re\: \[RFC 00\/20\] ns\: Introduce Time Namespace"

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-31 Thread Andrei Vagin

On Mon, Oct 29, 2018 at 09:33:14PM +0100, Thomas Gleixner wrote:
> Andrei,
> 
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
> > When a container is migrated to another host, we have to restore its
> > monotonic and boottime clocks, but we still expect that the container
> > will continue using the host real-time clock.
> > 
> > Before stating this series, I was thinking about this, I decided that
> > these cases can be solved independently. Probably, the full isolation of
> > the time sub-system will have much higher overhead than just offsets for
> > a few clocks. And the idea that isolation of the real-time clock should
> > be optional gives us another hint that offsets for monotonic and
> > boot-time clocks can be implemented independently.
> > 
> > Eric and Tomas, what do you think about this? If you agree that these
> > two cases can be implemented separately, what should we do with this
> > series to make it ready to be merged?
> > 
> > I know that we need to:
> > 
> > * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> 
> and CLOCK_BOOTTIME and that's quite a few.
> 
> > * forbid changing offsets after creating timers
> 
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?

We didn't find any proc files where boot or monotonic time is reported,
but we will double check this.

> 
> Aside of that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.

It is a good point. We will work on it.

> 
> I can see the urge for this, but please let us think it through properly
> before rushing anything in which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried, that without a clear plan about the overall
> picture, we end up with duct tape which is hard to distangle after the
> fact.

Thomas, there is no rush at all. This functionality is critical for
CRUI, but we have enough time to solve it properly.

The only thing what I want is that this functionality continues moving
forward and will not be put in the back burner.

> 
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
> 
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing what we can do now.
> 
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
> 
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?

Dmitry and I are going to be there.

Thanks!
Andrei

> 
> Thanks,
> 
>   tglx
> 
>

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-31 Thread Andrei Vagin

On Mon, Oct 29, 2018 at 09:33:14PM +0100, Thomas Gleixner wrote:
> Andrei,
> 
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
> > When a container is migrated to another host, we have to restore its
> > monotonic and boottime clocks, but we still expect that the container
> > will continue using the host real-time clock.
> > 
> > Before stating this series, I was thinking about this, I decided that
> > these cases can be solved independently. Probably, the full isolation of
> > the time sub-system will have much higher overhead than just offsets for
> > a few clocks. And the idea that isolation of the real-time clock should
> > be optional gives us another hint that offsets for monotonic and
> > boot-time clocks can be implemented independently.
> > 
> > Eric and Tomas, what do you think about this? If you agree that these
> > two cases can be implemented separately, what should we do with this
> > series to make it ready to be merged?
> > 
> > I know that we need to:
> > 
> > * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> 
> and CLOCK_BOOTTIME and that's quite a few.
> 
> > * forbid changing offsets after creating timers
> 
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?

We didn't find any proc files where boot or monotonic time is reported,
but we will double check this.

> 
> Aside of that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.

It is a good point. We will work on it.

> 
> I can see the urge for this, but please let us think it through properly
> before rushing anything in which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried, that without a clear plan about the overall
> picture, we end up with duct tape which is hard to distangle after the
> fact.

Thomas, there is no rush at all. This functionality is critical for
CRUI, but we have enough time to solve it properly.

The only thing what I want is that this functionality continues moving
forward and will not be put in the back burner.

> 
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
> 
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing what we can do now.
> 
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
> 
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?

Dmitry and I are going to be there.

Thanks!
Andrei

> 
> Thanks,
> 
>   tglx
> 
>

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-29 Thread Thomas Gleixner

Eric,

On Mon, 29 Oct 2018, Eric W. Biederman wrote:
> Thomas Gleixner  writes:
> >
> > I'll try to find some time in the next weeks to look deeper into that, but
> > I can't promise anything before returning from LPC. Btw, LPC would be a
> > great opportunity to discuss that. Are you and the other name space wizards
> > there by any chance?
> 
> I will be and there are going to be both container and CRIU
> mini-conferences.  So there should at least some of us around.

So let's try to find a slot for a BOF or similar (there might be still
slots for the kernel summit available, i'll ask).

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-29 Thread Thomas Gleixner

Eric,

On Mon, 29 Oct 2018, Eric W. Biederman wrote:
> Thomas Gleixner  writes:
> >
> > I'll try to find some time in the next weeks to look deeper into that, but
> > I can't promise anything before returning from LPC. Btw, LPC would be a
> > great opportunity to discuss that. Are you and the other name space wizards
> > there by any chance?
> 
> I will be and there are going to be both container and CRIU
> mini-conferences.  So there should at least some of us around.

So let's try to find a slot for a BOF or similar (there might be still
slots for the kernel summit available, i'll ask).

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-29 Thread Eric W. Biederman

Thomas Gleixner  writes:

> Andrei,
>
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
>> When a container is migrated to another host, we have to restore its
>> monotonic and boottime clocks, but we still expect that the container
>> will continue using the host real-time clock.
>> 
>> Before stating this series, I was thinking about this, I decided that
>> these cases can be solved independently. Probably, the full isolation of
>> the time sub-system will have much higher overhead than just offsets for
>> a few clocks. And the idea that isolation of the real-time clock should
>> be optional gives us another hint that offsets for monotonic and
>> boot-time clocks can be implemented independently.
>> 
>> Eric and Tomas, what do you think about this? If you agree that these
>> two cases can be implemented separately, what should we do with this
>> series to make it ready to be merged?
>> 
>> I know that we need to:
>> 
>> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
>
> and CLOCK_BOOTTIME and that's quite a few.
>
>> * forbid changing offsets after creating timers
>
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?
>
> Aside of that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.
>
> I can see the urge for this, but please let us think it through properly
> before rushing anything in which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried, that without a clear plan about the overall
> picture, we end up with duct tape which is hard to distangle after the
> fact.
>
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
>
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing what we can do now.
>
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
>
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?

I will be and there are going to be both container and CRIU
mini-conferences.  So there should at least some of us around.

Eric

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-29 Thread Eric W. Biederman

Thomas Gleixner  writes:

> Andrei,
>
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
>> When a container is migrated to another host, we have to restore its
>> monotonic and boottime clocks, but we still expect that the container
>> will continue using the host real-time clock.
>> 
>> Before stating this series, I was thinking about this, I decided that
>> these cases can be solved independently. Probably, the full isolation of
>> the time sub-system will have much higher overhead than just offsets for
>> a few clocks. And the idea that isolation of the real-time clock should
>> be optional gives us another hint that offsets for monotonic and
>> boot-time clocks can be implemented independently.
>> 
>> Eric and Tomas, what do you think about this? If you agree that these
>> two cases can be implemented separately, what should we do with this
>> series to make it ready to be merged?
>> 
>> I know that we need to:
>> 
>> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
>
> and CLOCK_BOOTTIME and that's quite a few.
>
>> * forbid changing offsets after creating timers
>
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?
>
> Aside of that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.
>
> I can see the urge for this, but please let us think it through properly
> before rushing anything in which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried, that without a clear plan about the overall
> picture, we end up with duct tape which is hard to distangle after the
> fact.
>
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
>
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing what we can do now.
>
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
>
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?

I will be and there are going to be both container and CRIU
mini-conferences.  So there should at least some of us around.

Eric

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-29 Thread Thomas Gleixner

Andrei,

On Sat, 20 Oct 2018, Andrei Vagin wrote:
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
> 
> Eric and Tomas, what do you think about this? If you agree that these
> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
> 
> I know that we need to:
> 
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.

and CLOCK_BOOTTIME and that's quite a few.

> * forbid changing offsets after creating timers

There are more things to think about. What about interfaces which expose
boot time or monotonic time in /proc?

Aside of that (I finally came around to look at the series in more detail)
I'm really unhappy about the unconditional overhead once the Time namespace
config switch is enabled. This applies especially to the VDSO. We spent
quite some time recently to squeeze a few cycles out of those functions and
it would be a pity to pointlessly waste cycles for the !namespace case.

I can see the urge for this, but please let us think it through properly
before rushing anything in which we are going to regret once we want to do
more sophisticated time domain management, e.g. support for isolated clock
real time. I'm worried, that without a clear plan about the overall
picture, we end up with duct tape which is hard to distangle after the
fact.

There have been a few other things brought up versus time management in
general, like the TSN folks utilizing grand clock masters which expose
random time instead of proper TAI. Plus some requirements for exposing some
sort of 'monotonic' clocks which are derived from external synchronization
mechanisms, but should not affect the regular time keeping clocks.

While different issues, these all fall into the category of separate time
domains, so taking a step back to the drawing board is probably the best
thing what we can do now.

There are certainly a few things which can be looked at independently,
e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
kernel with these name space functions applying offsets left and right. I
rather have dedicated core functionality which replaces/amends existing
timer functions to become time namespace aware.

I'll try to find some time in the next weeks to look deeper into that, but
I can't promise anything before returning from LPC. Btw, LPC would be a
great opportunity to discuss that. Are you and the other name space wizards
there by any chance?

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-29 Thread Thomas Gleixner

Andrei,

On Sat, 20 Oct 2018, Andrei Vagin wrote:
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
> 
> Eric and Tomas, what do you think about this? If you agree that these
> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
> 
> I know that we need to:
> 
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.

and CLOCK_BOOTTIME and that's quite a few.

> * forbid changing offsets after creating timers

There are more things to think about. What about interfaces which expose
boot time or monotonic time in /proc?

Aside of that (I finally came around to look at the series in more detail)
I'm really unhappy about the unconditional overhead once the Time namespace
config switch is enabled. This applies especially to the VDSO. We spent
quite some time recently to squeeze a few cycles out of those functions and
it would be a pity to pointlessly waste cycles for the !namespace case.

I can see the urge for this, but please let us think it through properly
before rushing anything in which we are going to regret once we want to do
more sophisticated time domain management, e.g. support for isolated clock
real time. I'm worried, that without a clear plan about the overall
picture, we end up with duct tape which is hard to distangle after the
fact.

There have been a few other things brought up versus time management in
general, like the TSN folks utilizing grand clock masters which expose
random time instead of proper TAI. Plus some requirements for exposing some
sort of 'monotonic' clocks which are derived from external synchronization
mechanisms, but should not affect the regular time keeping clocks.

While different issues, these all fall into the category of separate time
domains, so taking a step back to the drawing board is probably the best
thing what we can do now.

There are certainly a few things which can be looked at independently,
e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
kernel with these name space functions applying offsets left and right. I
rather have dedicated core functionality which replaces/amends existing
timer functions to become time namespace aware.

I'll try to find some time in the next weeks to look deeper into that, but
I can't promise anything before returning from LPC. Btw, LPC would be a
great opportunity to discuss that. Are you and the other name space wizards
there by any chance?

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-20 Thread Andrei Vagin

On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner  writes:
> > 
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>tick_do_update_jiffies64
> > >>   update_wall_time
> > >>   timekeeping_advance
> > >>  timekeepging_update
> > >> 
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >> 
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, 
> > > but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> > 
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around?  I have not dug
> > deeply enough into the code to see that yet.
> > 
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from 
> > > a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock 
> > > caused
> > > a full walk of all armed timers, disarming, adjusting and requeing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > Then it sounds like this will take some more digging.
> > 
> > Please pardon me for thinking out load.
> > 
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> > 
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't loose time when looking
> > at that time source.
> > 
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> > 
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >real time.
> > 
> >The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >has not relation ship to each other (except a synchronized length of
> >the second).  So applications that migrate can see CLOCK_MONOTONIC
> >and CLOCK_BOOTTIME go backwards.
> > 
> >This is the truly pressing problem and adding some kind of offset
> >sounds like it would be the solution.  Possibly by allowing a boot
> >time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> > 
> > 2) Dealing with two separate time management domains.  Say a machine
> >that needes to deal with both something inside of google where they
> >slew time to avoid leap time seconds and something in the outside
> >world proper UTC time is kept as an offset from TAI with the
> >occasional leap seconds.
> > 
> >In the later case it would fundamentally require having seconds of
> >different length.
> > 
> 
> I want to add that the second case should be optional.
> 
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-20 Thread Andrei Vagin

On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner  writes:
> > 
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>tick_do_update_jiffies64
> > >>   update_wall_time
> > >>   timekeeping_advance
> > >>  timekeepging_update
> > >> 
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >> 
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, 
> > > but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> > 
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around?  I have not dug
> > deeply enough into the code to see that yet.
> > 
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from 
> > > a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock 
> > > caused
> > > a full walk of all armed timers, disarming, adjusting and requeing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > Then it sounds like this will take some more digging.
> > 
> > Please pardon me for thinking out load.
> > 
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> > 
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't loose time when looking
> > at that time source.
> > 
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> > 
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >real time.
> > 
> >The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >has not relation ship to each other (except a synchronized length of
> >the second).  So applications that migrate can see CLOCK_MONOTONIC
> >and CLOCK_BOOTTIME go backwards.
> > 
> >This is the truly pressing problem and adding some kind of offset
> >sounds like it would be the solution.  Possibly by allowing a boot
> >time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> > 
> > 2) Dealing with two separate time management domains.  Say a machine
> >that needes to deal with both something inside of google where they
> >slew time to avoid leap time seconds and something in the outside
> >world proper UTC time is kept as an offset from TAI with the
> >occasional leap seconds.
> > 
> >In the later case it would fundamentally require having seconds of
> >different length.
> > 
> 
> I want to add that the second case should be optional.
> 
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-20 Thread Andrei Vagin

On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> Thomas Gleixner  writes:
> 
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> Reading the code the calling sequence there is:
> >> tick_sched_do_timer
> >>tick_do_update_jiffies64
> >>   update_wall_time
> >>   timekeeping_advance
> >>  timekeepging_update
> >> 
> >> If I read that properly under the right nohz circumstances that update
> >> can be delayed indefinitely.
> >> 
> >> So I think we could prototype a time namespace that was per
> >> timekeeping_update and just had update_wall_time iterate through
> >> all of the time namespaces.
> >
> > Please don't go there. timekeeping_update() is already heavy and walking
> > through a gazillion of namespaces will just make it horrible,
> >
> >> I don't think the naive version would scale to very many time
> >> namespaces.
> >
> > :)
> >
> >> At the same time using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all name spaces
> > when they are all idle long enough.
> 
> The wrap around constraints being how long before the time sources wrap
> around so you have to read them once per wrap around?  I have not dug
> deeply enough into the code to see that yet.
> 
> > From there it becomes hairy, because it's not only timekeeping,
> > i.e. reading time, this is also affecting all timers which are armed from a
> > namespace.
> >
> > That gets really ugly because when you do settimeofday() or adjtimex() for
> > a particular namespace, then you have to search for all armed timers of
> > that namespace and adjust them.
> >
> > The original posix timer code had the same issue because it mapped the
> > clock realtime timers to the timer wheel so any setting of the clock caused
> > a full walk of all armed timers, disarming, adjusting and requeing
> > them. That's horrible not only performance wise, it's also a locking
> > nightmare of all sorts.
> >
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> >
> > I haven't looked through Dimitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
> 
> Then it sounds like this will take some more digging.
> 
> Please pardon me for thinking out load.
> 
> There are one or more time sources that we use to compute the time
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
> 
> Each time source needs to be sampled at least once per wrap-around
> and something incremented so that we don't loose time when looking
> at that time source.
> 
> There are several clocks presented to userspace and they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.
> 
> I see two fundamental driving cases for a time namespace.
> 1) Migration from one node to another node in a cluster in almost
>real time.
> 
>The problem is that CLOCK_MONOTONIC between nodes in the cluster
>has not relation ship to each other (except a synchronized length of
>the second).  So applications that migrate can see CLOCK_MONOTONIC
>and CLOCK_BOOTTIME go backwards.
> 
>This is the truly pressing problem and adding some kind of offset
>sounds like it would be the solution.  Possibly by allowing a boot
>time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> 
> 2) Dealing with two separate time management domains.  Say a machine
>that needes to deal with both something inside of google where they
>slew time to avoid leap time seconds and something in the outside
>world proper UTC time is kept as an offset from TAI with the
>occasional leap seconds.
> 
>In the later case it would fundamentally require having seconds of
>different length.
> 

I want to add that the second case should be optional.

When a container is migrated to another host, we have to restore its
monotonic and boottime clocks, but we still expect that the container
will continue using the host real-time clock.

Before stating this series, I was thinking about this, I decided that
these cases can be solved independently. Probably, the full isolation of
the time sub-system will have much higher overhead than just offsets for
a few clocks. And the idea that isolation of the real-time clock should
be optional gives us another hint that offsets for monotonic and
boot-time clocks can be implemented independently.

Eric and Tomas, what do you think about this? If you agree that these
two cases can be implemented separately, what should we do with this

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-20 Thread Andrei Vagin

On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> Thomas Gleixner  writes:
> 
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> Reading the code the calling sequence there is:
> >> tick_sched_do_timer
> >>tick_do_update_jiffies64
> >>   update_wall_time
> >>   timekeeping_advance
> >>  timekeepging_update
> >> 
> >> If I read that properly under the right nohz circumstances that update
> >> can be delayed indefinitely.
> >> 
> >> So I think we could prototype a time namespace that was per
> >> timekeeping_update and just had update_wall_time iterate through
> >> all of the time namespaces.
> >
> > Please don't go there. timekeeping_update() is already heavy and walking
> > through a gazillion of namespaces will just make it horrible,
> >
> >> I don't think the naive version would scale to very many time
> >> namespaces.
> >
> > :)
> >
> >> At the same time using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all name spaces
> > when they are all idle long enough.
> 
> The wrap around constraints being how long before the time sources wrap
> around so you have to read them once per wrap around?  I have not dug
> deeply enough into the code to see that yet.
> 
> > From there it becomes hairy, because it's not only timekeeping,
> > i.e. reading time, this is also affecting all timers which are armed from a
> > namespace.
> >
> > That gets really ugly because when you do settimeofday() or adjtimex() for
> > a particular namespace, then you have to search for all armed timers of
> > that namespace and adjust them.
> >
> > The original posix timer code had the same issue because it mapped the
> > clock realtime timers to the timer wheel so any setting of the clock caused
> > a full walk of all armed timers, disarming, adjusting and requeing
> > them. That's horrible not only performance wise, it's also a locking
> > nightmare of all sorts.
> >
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> >
> > I haven't looked through Dimitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
> 
> Then it sounds like this will take some more digging.
> 
> Please pardon me for thinking out load.
> 
> There are one or more time sources that we use to compute the time
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
> 
> Each time source needs to be sampled at least once per wrap-around
> and something incremented so that we don't loose time when looking
> at that time source.
> 
> There are several clocks presented to userspace and they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.
> 
> I see two fundamental driving cases for a time namespace.
> 1) Migration from one node to another node in a cluster in almost
>real time.
> 
>The problem is that CLOCK_MONOTONIC between nodes in the cluster
>has not relation ship to each other (except a synchronized length of
>the second).  So applications that migrate can see CLOCK_MONOTONIC
>and CLOCK_BOOTTIME go backwards.
> 
>This is the truly pressing problem and adding some kind of offset
>sounds like it would be the solution.  Possibly by allowing a boot
>time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> 
> 2) Dealing with two separate time management domains.  Say a machine
>that needes to deal with both something inside of google where they
>slew time to avoid leap time seconds and something in the outside
>world proper UTC time is kept as an offset from TAI with the
>occasional leap seconds.
> 
>In the later case it would fundamentally require having seconds of
>different length.
> 

I want to add that the second case should be optional.

When a container is migrated to another host, we have to restore its
monotonic and boottime clocks, but we still expect that the container
will continue using the host real-time clock.

Before stating this series, I was thinking about this, I decided that
these cases can be solved independently. Probably, the full isolation of
the time sub-system will have much higher overhead than just offsets for
a few clocks. And the idea that isolation of the real-time clock should
be optional gives us another hint that offsets for monotonic and
boot-time clocks can be implemented independently.

Eric and Tomas, what do you think about this? If you agree that these
two cases can be implemented separately, what should we do with this

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-02 Thread Thomas Gleixner

Dmitry,

On Tue, 2 Oct 2018, Dmitry Safonov wrote:
> On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner  wrote:
> > I explained that in detail in this thread, but it's not about the initial
> > setting of clock mono/boot before any timers have been armed.
> >
> > It's about setting the offset or clock realtime (via settimeofday) when
> > timers are already armed. Also having a entirely different time domain,
> > e.g. separate NTP adjustments, makes that necessary.
> 
> It looks like, there is a bit of misunderstanding each other:
> Andrei was talking about the current RFC version, where we haven't
> introduced offsets for clock realtime. While Thomas IIUC, is looking
> how-to expand time namespace over realtime.
>
> As CLOCK_REALTIME virtualization raises so many complex questions
> like a different length of the second or list of realtime timers in ns we
> haven't added any realization for it.
> 
> It seems like an initial introduction for timens can be expanded after to 
> cover
> realtime clocks too. While it may seem incomplete, it solves issues for
> restoring/migration of real-world applications like nodejs, Oracle DB server
> which fails after being restored if there is a leap in monotonic time.

Well, yes. But you really have to think about the full picture. Just adding
part of the overall solution right now, just because it can be glued into
the code easily, is not the best approach IMO as it might result in
substantial rework of the whole thing sooner than later. I really don't
want to end up with something which is not extensible and has to be
supported forever.

Just for the record, the current approach with name space offsets for
monotonic is also prone to malfunction vs. timers, unless you can prevent
changing the offset _after_ the namespace has been set up and timers have
been armed. I admit, that I did not look close enough to verify that.

> While solving the mentioned issues, it doesn't bring overhead.
> (well, Andy noted that cmp for zero-offsets on vdso can be optimized too,
> which will be done in v1).
> 
> Thomas, thanks much for your input - now we know that we'll need to
> introduce list for timers in namespace when we'll add realtime clocks.
> Do you believe that CLOCK_MONOTONIC_SYNC would be an easier
> concept than offsets per-namespace?

Haven't thought it through. This was just an idea in reaction to Eric's
question whether setting clock monotonic might be feasible. But yes, it
might be worth to think about it.

I think you should really define the long term requirements for time
namespaces and perhaps set some limitations in functionality upfront.

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-02 Thread Thomas Gleixner

Dmitry,

On Tue, 2 Oct 2018, Dmitry Safonov wrote:
> On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner  wrote:
> > I explained that in detail in this thread, but it's not about the initial
> > setting of clock mono/boot before any timers have been armed.
> >
> > It's about setting the offset or clock realtime (via settimeofday) when
> > timers are already armed. Also having a entirely different time domain,
> > e.g. separate NTP adjustments, makes that necessary.
> 
> It looks like, there is a bit of misunderstanding each other:
> Andrei was talking about the current RFC version, where we haven't
> introduced offsets for clock realtime. While Thomas IIUC, is looking
> how-to expand time namespace over realtime.
>
> As CLOCK_REALTIME virtualization raises so many complex questions
> like a different length of the second or list of realtime timers in ns we
> haven't added any realization for it.
> 
> It seems like an initial introduction for timens can be expanded after to 
> cover
> realtime clocks too. While it may seem incomplete, it solves issues for
> restoring/migration of real-world applications like nodejs, Oracle DB server
> which fails after being restored if there is a leap in monotonic time.

Well, yes. But you really have to think about the full picture. Just adding
part of the overall solution right now, just because it can be glued into
the code easily, is not the best approach IMO as it might result in
substantial rework of the whole thing sooner than later. I really don't
want to end up with something which is not extensible and has to be
supported forever.

Just for the record, the current approach with name space offsets for
monotonic is also prone to malfunction vs. timers, unless you can prevent
changing the offset _after_ the namespace has been set up and timers have
been armed. I admit, that I did not look close enough to verify that.

> While solving the mentioned issues, it doesn't bring overhead.
> (well, Andy noted that cmp for zero-offsets on vdso can be optimized too,
> which will be done in v1).
> 
> Thomas, thanks much for your input - now we know that we'll need to
> introduce list for timers in namespace when we'll add realtime clocks.
> Do you believe that CLOCK_MONOTONIC_SYNC would be an easier
> concept than offsets per-namespace?

Haven't thought it through. This was just an idea in reaction to Eric's
question whether setting clock monotonic might be feasible. But yes, it
might be worth to think about it.

I think you should really define the long term requirements for time
namespaces and perhaps set some limitations in functionality upfront.

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-02 Thread Dmitry Safonov

Hi Thomas, Andrei, Eric,

On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner  wrote:
>
> On Mon, 1 Oct 2018, Andrey Vagin wrote:
>
> > On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> > > On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > > timers as well, because you need to guarantee that they are not expiring
> > > > early.
> > > >
> > > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > > can work at all without introducing subtle issues all over the place.
> > >
> > > And just a quick scan tells me that this is broken. Timers will expire
> > > early or late. The latter is acceptible to some extent, but larger delays
> > > might come with surprise. Expiring early is an absolute nono.
> >
> > Do you mean that we have to adjust all timers after changing offset for
> > CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
> > monotonic and boot times will be set immediately after creating a time
> > namespace before using any timers.
>
> I explained that in detail in this thread, but it's not about the initial
> setting of clock mono/boot before any timers have been armed.
>
> It's about setting the offset or clock realtime (via settimeofday) when
> timers are already armed. Also having a entirely different time domain,
> e.g. separate NTP adjustments, makes that necessary.

It looks like, there is a bit of misunderstanding each other:
Andrei was talking about the current RFC version, where we haven't
introduced offsets for clock realtime. While Thomas IIUC, is looking
how-to expand time namespace over realtime.

As CLOCK_REALTIME virtualization raises so many complex questions
like a different length of the second or list of realtime timers in ns we
haven't added any realization for it.

It seems like an initial introduction for timens can be expanded after to cover
realtime clocks too. While it may seem incomplete, it solves issues for
restoring/migration of real-world applications like nodejs, Oracle DB server
which fails after being restored if there is a leap in monotonic time.

While solving the mentioned issues, it doesn't bring overhead.
(well, Andy noted that cmp for zero-offsets on vdso can be optimized too,
which will be done in v1).

Thomas, thanks much for your input - now we know that we'll need to
introduce list for timers in namespace when we'll add realtime clocks.
Do you believe that CLOCK_MONOTONIC_SYNC would be an easier
concept than offsets per-namespace?

Thanks,
 Dmitry

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-02 Thread Dmitry Safonov

Hi Thomas, Andrei, Eric,

On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner  wrote:
>
> On Mon, 1 Oct 2018, Andrey Vagin wrote:
>
> > On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> > > On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > > timers as well, because you need to guarantee that they are not expiring
> > > > early.
> > > >
> > > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > > can work at all without introducing subtle issues all over the place.
> > >
> > > And just a quick scan tells me that this is broken. Timers will expire
> > > early or late. The latter is acceptible to some extent, but larger delays
> > > might come with surprise. Expiring early is an absolute nono.
> >
> > Do you mean that we have to adjust all timers after changing offset for
> > CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
> > monotonic and boot times will be set immediately after creating a time
> > namespace before using any timers.
>
> I explained that in detail in this thread, but it's not about the initial
> setting of clock mono/boot before any timers have been armed.
>
> It's about setting the offset or clock realtime (via settimeofday) when
> timers are already armed. Also having a entirely different time domain,
> e.g. separate NTP adjustments, makes that necessary.

It looks like, there is a bit of misunderstanding each other:
Andrei was talking about the current RFC version, where we haven't
introduced offsets for clock realtime. While Thomas IIUC, is looking
how-to expand time namespace over realtime.

As CLOCK_REALTIME virtualization raises so many complex questions
like a different length of the second or list of realtime timers in ns we
haven't added any realization for it.

It seems like an initial introduction for timens can be expanded after to cover
realtime clocks too. While it may seem incomplete, it solves issues for
restoring/migration of real-world applications like nodejs, Oracle DB server
which fails after being restored if there is a leap in monotonic time.

While solving the mentioned issues, it doesn't bring overhead.
(well, Andy noted that cmp for zero-offsets on vdso can be optimized too,
which will be done in v1).

Thomas, thanks much for your input - now we know that we'll need to
introduce list for timers in namespace when we'll add realtime clocks.
Do you believe that CLOCK_MONOTONIC_SYNC would be an easier
concept than offsets per-namespace?

Thanks,
 Dmitry

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-02 Thread Thomas Gleixner

On Mon, 1 Oct 2018, Andrey Vagin wrote:

> On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> > On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > > 
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > And just a quick scan tells me that this is broken. Timers will expire
> > early or late. The latter is acceptible to some extent, but larger delays
> > might come with surprise. Expiring early is an absolute nono.
> 
> Do you mean that we have to adjust all timers after changing offset for
> CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
> monotonic and boot times will be set immediately after creating a time
> namespace before using any timers.

I explained that in detail in this thread, but it's not about the initial
setting of clock mono/boot before any timers have been armed.

It's about setting the offset or clock realtime (via settimeofday) when
timers are already armed. Also having a entirely different time domain,
e.g. separate NTP adjustments, makes that necessary.

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-02 Thread Thomas Gleixner

On Mon, 1 Oct 2018, Andrey Vagin wrote:

> On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> > On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > > 
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > And just a quick scan tells me that this is broken. Timers will expire
> > early or late. The latter is acceptible to some extent, but larger delays
> > might come with surprise. Expiring early is an absolute nono.
> 
> Do you mean that we have to adjust all timers after changing offset for
> CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
> monotonic and boot times will be set immediately after creating a time
> namespace before using any timers.

I explained that in detail in this thread, but it's not about the initial
setting of clock mono/boot before any timers have been armed.

It's about setting the offset or clock realtime (via settimeofday) when
timers are already armed. Also having a entirely different time domain,
e.g. separate NTP adjustments, makes that necessary.

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-01 Thread Andrey Vagin

On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> > 
> > I haven't looked through Dimitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
> 
> And just a quick scan tells me that this is broken. Timers will expire
> early or late. The latter is acceptible to some extent, but larger delays
> might come with surprise. Expiring early is an absolute nono.

Do you mean that we have to adjust all timers after changing offset for
CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
monotonic and boot times will be set immediately after creating a time
namespace before using any timers.

It is interesting to think what a use-case for changing these offsets
after creating timers. It may be useful for testing needs. A user sets a
timer in an hour and then change a clock offset forward and check that a
test application handles the timer properly.

> 
> Thanks,
> 
>   tglx
>

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-01 Thread Andrey Vagin

On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> > 
> > I haven't looked through Dimitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
> 
> And just a quick scan tells me that this is broken. Timers will expire
> early or late. The latter is acceptible to some extent, but larger delays
> might come with surprise. Expiring early is an absolute nono.

Do you mean that we have to adjust all timers after changing offset for
CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
monotonic and boot times will be set immediately after creating a time
namespace before using any timers.

It is interesting to think what a use-case for changing these offsets
after creating timers. It may be useful for testing needs. A user sets a
timer in an hour and then change a clock offset forward and check that a
test application handles the timer properly.

> 
> Thanks,
> 
>   tglx
>

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-01 Thread Eric W. Biederman

Thomas Gleixner  writes:

> Eric,
>
> On Fri, 28 Sep 2018, Eric W. Biederman wrote:
>> Thomas Gleixner  writes:
>> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> >> At the same time using the techniques from the nohz work and a little
>> >> smarts I expect we could get the code to scale.
>> >
>> > You'd need to invoke the update when the namespace is switched in and
>> > hasn't been updated since the last tick happened. That might be doable, but
>> > you also need to take the wraparound constraints of the underlying
>> > clocksources into account, which again can cause walking all name spaces
>> > when they are all idle long enough.
>> 
>> The wrap around constraints being how long before the time sources wrap
>> around so you have to read them once per wrap around?  I have not dug
>> deeply enough into the code to see that yet.
>
> It's done by limiting the NOHZ idle time when all CPUs are going into deep
> sleep for a long time, i.e. we make sure that at least one CPU comes back
> sufficiently _before_ the wraparound happens and invokes the update
> function.
>
> It's not so much a problem for TSC, but not every clocksource the kernel
> supports has wraparound times in the range of hundreds of years.
>
> But yes, your idea of keeping track of wraparounds might work. Tricky, but
> looks feasible on first sight, but we should be aware of the dragons.

Oh.  Yes.  Definitely.  A key enabler of any namespace implementation is
figuring out how to tame the dragons.

>> Please pardon me for thinking out load.
>> 
>> There are one or more time sources that we use to compute the time
>> and for each time source we have a conversion from ticks of the
>> time source to nanoseconds.
>> 
>> Each time source needs to be sampled at least once per wrap-around
>> and something incremented so that we don't loose time when looking
>> at that time source.
>> 
>> There are several clocks presented to userspace and they all share the
>> same length of second and are all fundamentally offsets from
>> CLOCK_MONOTONIC.
>
> Yes. That's the readout side. This one is doable. But now look at timers.
>
> If you arm the timer from a name space, then it needs to be converted to
> host time in order to sort it into the hrtimer queue and at some point arm
> the clockevent device for it. This works as long as host and name space
> time have a constant offset and the same skew.
>
> Once the name space time has a different skew this falls apart because the
> armed timer will either expire late or early.
>
> Late might be acceptable, early violates the spec. You could do an extra
> check for rescheduling it, if it's early, but that requires to store the
> name space time accessor in the hrtimer itself because not every timer
> expiry happens so that it can be checked in the name space context (think
> signal based timers). We need to add this extra magic right into
> __hrtimer_run_queues() which is called from the hard and soft interrupt. We
> really don't want to touch all relevant callbacks or syscalls. The latter
> is not sufficient anyway for signal based timer delivery.
>
> That's going to be interesting in terms of synchronization and might also
> cause substantial overhead at least for the timers which belong to name
> spaces.
>
> But that also means that anything which is early can and probably will
> cause rearming of the timer hardware possibly for a very short delta. We
> need to think about whether this can be abused to create interrupt storms.
>
> Now if you accept a bit late, which I'm not really happy about, then you
> surely won't accept very late, i.e. hours, days. But that can happen when
> settimeofday() comes into play. Right now with a single time domain, this
> is easy. When settimeofday() or adjtimex() makes time jump, we just go and
> reprogramm the hardware timers accordingly, which might also result in
> immediate expiry of timers.
>
> But this does not help for time jumps in name spaces because the timer is
> enqueued on the host time base.
>
> And no, we should not think about creating per name space hrtimer queues
> and then have to walk through all of them for finding the first expiring
> timer in order to arm the hardware. That cannot scale.
>
> Walking all hrtimer bases on all CPUs and check all queued timers whether
> they belong to the affected name space does not scale either.
>
> So we'd need to keep track of queued timers belonging to a name space and
> then just handle them. Interesting locking problem and also a scalability
> issue because this might need to be done on all online CPUs. Haven't
> thought it through, but it makes me shudder.

Yes.  I can see how this is a dragon that we need to figure out how to
tame.  It already exist somewhat for CLOCK_MONOTONIC vs CLOCK_REALTIME
but still.

>> I see two fundamental driving cases for a time namespace.
>
> 
>
> I completely understand the problem you are trying to solve and yes, the
> read out of time should be a solvable problem.

There is

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-10-01 Thread Eric W. Biederman

Thomas Gleixner  writes:

> Eric,
>
> On Fri, 28 Sep 2018, Eric W. Biederman wrote:
>> Thomas Gleixner  writes:
>> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> >> At the same time using the techniques from the nohz work and a little
>> >> smarts I expect we could get the code to scale.
>> >
>> > You'd need to invoke the update when the namespace is switched in and
>> > hasn't been updated since the last tick happened. That might be doable, but
>> > you also need to take the wraparound constraints of the underlying
>> > clocksources into account, which again can cause walking all name spaces
>> > when they are all idle long enough.
>> 
>> The wrap around constraints being how long before the time sources wrap
>> around so you have to read them once per wrap around?  I have not dug
>> deeply enough into the code to see that yet.
>
> It's done by limiting the NOHZ idle time when all CPUs are going into deep
> sleep for a long time, i.e. we make sure that at least one CPU comes back
> sufficiently _before_ the wraparound happens and invokes the update
> function.
>
> It's not so much a problem for TSC, but not every clocksource the kernel
> supports has wraparound times in the range of hundreds of years.
>
> But yes, your idea of keeping track of wraparounds might work. Tricky, but
> looks feasible on first sight, but we should be aware of the dragons.

Oh.  Yes.  Definitely.  A key enabler of any namespace implementation is
figuring out how to tame the dragons.

>> Please pardon me for thinking out load.
>> 
>> There are one or more time sources that we use to compute the time
>> and for each time source we have a conversion from ticks of the
>> time source to nanoseconds.
>> 
>> Each time source needs to be sampled at least once per wrap-around
>> and something incremented so that we don't loose time when looking
>> at that time source.
>> 
>> There are several clocks presented to userspace and they all share the
>> same length of second and are all fundamentally offsets from
>> CLOCK_MONOTONIC.
>
> Yes. That's the readout side. This one is doable. But now look at timers.
>
> If you arm the timer from a name space, then it needs to be converted to
> host time in order to sort it into the hrtimer queue and at some point arm
> the clockevent device for it. This works as long as host and name space
> time have a constant offset and the same skew.
>
> Once the name space time has a different skew this falls apart because the
> armed timer will either expire late or early.
>
> Late might be acceptable, early violates the spec. You could do an extra
> check for rescheduling it, if it's early, but that requires to store the
> name space time accessor in the hrtimer itself because not every timer
> expiry happens so that it can be checked in the name space context (think
> signal based timers). We need to add this extra magic right into
> __hrtimer_run_queues() which is called from the hard and soft interrupt. We
> really don't want to touch all relevant callbacks or syscalls. The latter
> is not sufficient anyway for signal based timer delivery.
>
> That's going to be interesting in terms of synchronization and might also
> cause substantial overhead at least for the timers which belong to name
> spaces.
>
> But that also means that anything which is early can and probably will
> cause rearming of the timer hardware possibly for a very short delta. We
> need to think about whether this can be abused to create interrupt storms.
>
> Now if you accept a bit late, which I'm not really happy about, then you
> surely won't accept very late, i.e. hours, days. But that can happen when
> settimeofday() comes into play. Right now with a single time domain, this
> is easy. When settimeofday() or adjtimex() makes time jump, we just go and
> reprogramm the hardware timers accordingly, which might also result in
> immediate expiry of timers.
>
> But this does not help for time jumps in name spaces because the timer is
> enqueued on the host time base.
>
> And no, we should not think about creating per name space hrtimer queues
> and then have to walk through all of them for finding the first expiring
> timer in order to arm the hardware. That cannot scale.
>
> Walking all hrtimer bases on all CPUs and check all queued timers whether
> they belong to the affected name space does not scale either.
>
> So we'd need to keep track of queued timers belonging to a name space and
> then just handle them. Interesting locking problem and also a scalability
> issue because this might need to be done on all online CPUs. Haven't
> thought it through, but it makes me shudder.

Yes.  I can see how this is a dragon that we need to figure out how to
tame.  It already exist somewhat for CLOCK_MONOTONIC vs CLOCK_REALTIME
but still.

>> I see two fundamental driving cases for a time namespace.
>
> 
>
> I completely understand the problem you are trying to solve and yes, the
> read out of time should be a solvable problem.

There is

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-28 Thread Thomas Gleixner

Eric,

On Fri, 28 Sep 2018, Eric W. Biederman wrote:
> Thomas Gleixner  writes:
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> At the same time using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all name spaces
> > when they are all idle long enough.
> 
> The wrap around constraints being how long before the time sources wrap
> around so you have to read them once per wrap around?  I have not dug
> deeply enough into the code to see that yet.

It's done by limiting the NOHZ idle time when all CPUs are going into deep
sleep for a long time, i.e. we make sure that at least one CPU comes back
sufficiently _before_ the wraparound happens and invokes the update
function.

It's not so much a problem for TSC, but not every clocksource the kernel
supports has wraparound times in the range of hundreds of years.

But yes, your idea of keeping track of wraparounds might work. Tricky, but
looks feasible on first sight, but we should be aware of the dragons.

> Please pardon me for thinking out load.
> 
> There are one or more time sources that we use to compute the time
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
> 
> Each time source needs to be sampled at least once per wrap-around
> and something incremented so that we don't loose time when looking
> at that time source.
> 
> There are several clocks presented to userspace and they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.

Yes. That's the readout side. This one is doable. But now look at timers.

If you arm the timer from a name space, then it needs to be converted to
host time in order to sort it into the hrtimer queue and at some point arm
the clockevent device for it. This works as long as host and name space
time have a constant offset and the same skew.

Once the name space time has a different skew this falls apart because the
armed timer will either expire late or early.

Late might be acceptable, early violates the spec. You could do an extra
check for rescheduling it, if it's early, but that requires to store the
name space time accessor in the hrtimer itself because not every timer
expiry happens so that it can be checked in the name space context (think
signal based timers). We need to add this extra magic right into
__hrtimer_run_queues() which is called from the hard and soft interrupt. We
really don't want to touch all relevant callbacks or syscalls. The latter
is not sufficient anyway for signal based timer delivery.

That's going to be interesting in terms of synchronization and might also
cause substantial overhead at least for the timers which belong to name
spaces.

But that also means that anything which is early can and probably will
cause rearming of the timer hardware possibly for a very short delta. We
need to think about whether this can be abused to create interrupt storms.

Now if you accept a bit late, which I'm not really happy about, then you
surely won't accept very late, i.e. hours, days. But that can happen when
settimeofday() comes into play. Right now with a single time domain, this
is easy. When settimeofday() or adjtimex() makes time jump, we just go and
reprogramm the hardware timers accordingly, which might also result in
immediate expiry of timers.

But this does not help for time jumps in name spaces because the timer is
enqueued on the host time base.

And no, we should not think about creating per name space hrtimer queues
and then have to walk through all of them for finding the first expiring
timer in order to arm the hardware. That cannot scale.

Walking all hrtimer bases on all CPUs and check all queued timers whether
they belong to the affected name space does not scale either.

So we'd need to keep track of queued timers belonging to a name space and
then just handle them. Interesting locking problem and also a scalability
issue because this might need to be done on all online CPUs. Haven't
thought it through, but it makes me shudder.

> I see two fundamental driving cases for a time namespace.

I completely understand the problem you are trying to solve and yes, the
read out of time should be a solvable problem.

> For timers my inclination would be to assume no adjustments to the
> current time parameters and set the timer to go off then.   If the time
> on the appropriate clock has been changed since the timer was set and
> the timer is going off early reschedule so the timer fires at the
> appropriate time.

See above.

> Not that I think a final implementation would necessary look like what I
> have described. I just think it is possible with extreme care to

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-28 Thread Thomas Gleixner

Eric,

On Fri, 28 Sep 2018, Eric W. Biederman wrote:
> Thomas Gleixner  writes:
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> At the same time using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all name spaces
> > when they are all idle long enough.
> 
> The wrap around constraints being how long before the time sources wrap
> around so you have to read them once per wrap around?  I have not dug
> deeply enough into the code to see that yet.

It's done by limiting the NOHZ idle time when all CPUs are going into deep
sleep for a long time, i.e. we make sure that at least one CPU comes back
sufficiently _before_ the wraparound happens and invokes the update
function.

It's not so much a problem for TSC, but not every clocksource the kernel
supports has wraparound times in the range of hundreds of years.

But yes, your idea of keeping track of wraparounds might work. Tricky, but
looks feasible on first sight, but we should be aware of the dragons.

> Please pardon me for thinking out load.
> 
> There are one or more time sources that we use to compute the time
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
> 
> Each time source needs to be sampled at least once per wrap-around
> and something incremented so that we don't loose time when looking
> at that time source.
> 
> There are several clocks presented to userspace and they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.

Yes. That's the readout side. This one is doable. But now look at timers.

If you arm the timer from a name space, then it needs to be converted to
host time in order to sort it into the hrtimer queue and at some point arm
the clockevent device for it. This works as long as host and name space
time have a constant offset and the same skew.

Once the name space time has a different skew this falls apart because the
armed timer will either expire late or early.

Late might be acceptable, early violates the spec. You could do an extra
check for rescheduling it, if it's early, but that requires to store the
name space time accessor in the hrtimer itself because not every timer
expiry happens so that it can be checked in the name space context (think
signal based timers). We need to add this extra magic right into
__hrtimer_run_queues() which is called from the hard and soft interrupt. We
really don't want to touch all relevant callbacks or syscalls. The latter
is not sufficient anyway for signal based timer delivery.

That's going to be interesting in terms of synchronization and might also
cause substantial overhead at least for the timers which belong to name
spaces.

But that also means that anything which is early can and probably will
cause rearming of the timer hardware possibly for a very short delta. We
need to think about whether this can be abused to create interrupt storms.

Now if you accept a bit late, which I'm not really happy about, then you
surely won't accept very late, i.e. hours, days. But that can happen when
settimeofday() comes into play. Right now with a single time domain, this
is easy. When settimeofday() or adjtimex() makes time jump, we just go and
reprogramm the hardware timers accordingly, which might also result in
immediate expiry of timers.

But this does not help for time jumps in name spaces because the timer is
enqueued on the host time base.

And no, we should not think about creating per name space hrtimer queues
and then have to walk through all of them for finding the first expiring
timer in order to arm the hardware. That cannot scale.

Walking all hrtimer bases on all CPUs and check all queued timers whether
they belong to the affected name space does not scale either.

So we'd need to keep track of queued timers belonging to a name space and
then just handle them. Interesting locking problem and also a scalability
issue because this might need to be done on all online CPUs. Haven't
thought it through, but it makes me shudder.

> I see two fundamental driving cases for a time namespace.

I completely understand the problem you are trying to solve and yes, the
read out of time should be a solvable problem.

> For timers my inclination would be to assume no adjustments to the
> current time parameters and set the timer to go off then.   If the time
> on the appropriate clock has been changed since the timer was set and
> the timer is going off early reschedule so the timer fires at the
> appropriate time.

See above.

> Not that I think a final implementation would necessary look like what I
> have described. I just think it is possible with extreme care to

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-28 Thread Eric W. Biederman

Thomas Gleixner  writes:

> On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> Reading the code the calling sequence there is:
>> tick_sched_do_timer
>>tick_do_update_jiffies64
>>   update_wall_time
>>   timekeeping_advance
>>  timekeepging_update
>> 
>> If I read that properly under the right nohz circumstances that update
>> can be delayed indefinitely.
>> 
>> So I think we could prototype a time namespace that was per
>> timekeeping_update and just had update_wall_time iterate through
>> all of the time namespaces.
>
> Please don't go there. timekeeping_update() is already heavy and walking
> through a gazillion of namespaces will just make it horrible,
>
>> I don't think the naive version would scale to very many time
>> namespaces.
>
> :)
>
>> At the same time using the techniques from the nohz work and a little
>> smarts I expect we could get the code to scale.
>
> You'd need to invoke the update when the namespace is switched in and
> hasn't been updated since the last tick happened. That might be doable, but
> you also need to take the wraparound constraints of the underlying
> clocksources into account, which again can cause walking all name spaces
> when they are all idle long enough.

The wrap around constraints being how long before the time sources wrap
around so you have to read them once per wrap around?  I have not dug
deeply enough into the code to see that yet.

> From there it becomes hairy, because it's not only timekeeping,
> i.e. reading time, this is also affecting all timers which are armed from a
> namespace.
>
> That gets really ugly because when you do settimeofday() or adjtimex() for
> a particular namespace, then you have to search for all armed timers of
> that namespace and adjust them.
>
> The original posix timer code had the same issue because it mapped the
> clock realtime timers to the timer wheel so any setting of the clock caused
> a full walk of all armed timers, disarming, adjusting and requeing
> them. That's horrible not only performance wise, it's also a locking
> nightmare of all sorts.
>
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
>
> I haven't looked through Dimitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.

Then it sounds like this will take some more digging.

Please pardon me for thinking out load.

There are one or more time sources that we use to compute the time
and for each time source we have a conversion from ticks of the
time source to nanoseconds.

Each time source needs to be sampled at least once per wrap-around
and something incremented so that we don't loose time when looking
at that time source.

There are several clocks presented to userspace and they all share the
same length of second and are all fundamentally offsets from
CLOCK_MONOTONIC.

I see two fundamental driving cases for a time namespace.
1) Migration from one node to another node in a cluster in almost
   real time.

   The problem is that CLOCK_MONOTONIC between nodes in the cluster
   has not relation ship to each other (except a synchronized length of
   the second).  So applications that migrate can see CLOCK_MONOTONIC
   and CLOCK_BOOTTIME go backwards.

   This is the truly pressing problem and adding some kind of offset
   sounds like it would be the solution.  Possibly by allowing a boot
   time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.

2) Dealing with two separate time management domains.  Say a machine
   that needes to deal with both something inside of google where they
   slew time to avoid leap time seconds and something in the outside
   world proper UTC time is kept as an offset from TAI with the
   occasional leap seconds.

   In the later case it would fundamentally require having seconds of
   different length.

A pure 64bit nanoseond counter is good for 500 years.  So 64bit
variables can be used to hold time, and everything can be converted from
there.

This suggests we can for ticks have two values.
- The number of ticks from the time source.
- The number of times the ticks would have rolled over.

That sounds like it may be a little simplistic as it would require being
very diligent about firing a timer exactly at rollover and not losing
that, but for a handwaving argument is probably enough to generate
a 64bit tick counter.

If the focus is on a 64bit tick counter then what update_wall_time
has to do is very limited.  Just deal the accounting needed to cope with
tick rollover.

Getting the actual time looks like it would be as simple as now, with
perhaps an extra addition to account for the number of times the tick
counter has rolled over.  With limited precision arithmetic and various
optimizations I don't think it is that simple to implement but it feels
like it should be very little extra work.

For timers my inclination would be to

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-28 Thread Eric W. Biederman

Thomas Gleixner  writes:

> On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> Reading the code the calling sequence there is:
>> tick_sched_do_timer
>>tick_do_update_jiffies64
>>   update_wall_time
>>   timekeeping_advance
>>  timekeepging_update
>> 
>> If I read that properly under the right nohz circumstances that update
>> can be delayed indefinitely.
>> 
>> So I think we could prototype a time namespace that was per
>> timekeeping_update and just had update_wall_time iterate through
>> all of the time namespaces.
>
> Please don't go there. timekeeping_update() is already heavy and walking
> through a gazillion of namespaces will just make it horrible,
>
>> I don't think the naive version would scale to very many time
>> namespaces.
>
> :)
>
>> At the same time using the techniques from the nohz work and a little
>> smarts I expect we could get the code to scale.
>
> You'd need to invoke the update when the namespace is switched in and
> hasn't been updated since the last tick happened. That might be doable, but
> you also need to take the wraparound constraints of the underlying
> clocksources into account, which again can cause walking all name spaces
> when they are all idle long enough.

The wrap around constraints being how long before the time sources wrap
around so you have to read them once per wrap around?  I have not dug
deeply enough into the code to see that yet.

> From there it becomes hairy, because it's not only timekeeping,
> i.e. reading time, this is also affecting all timers which are armed from a
> namespace.
>
> That gets really ugly because when you do settimeofday() or adjtimex() for
> a particular namespace, then you have to search for all armed timers of
> that namespace and adjust them.
>
> The original posix timer code had the same issue because it mapped the
> clock realtime timers to the timer wheel so any setting of the clock caused
> a full walk of all armed timers, disarming, adjusting and requeing
> them. That's horrible not only performance wise, it's also a locking
> nightmare of all sorts.
>
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
>
> I haven't looked through Dimitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.

Then it sounds like this will take some more digging.

Please pardon me for thinking out load.

There are one or more time sources that we use to compute the time
and for each time source we have a conversion from ticks of the
time source to nanoseconds.

Each time source needs to be sampled at least once per wrap-around
and something incremented so that we don't loose time when looking
at that time source.

There are several clocks presented to userspace and they all share the
same length of second and are all fundamentally offsets from
CLOCK_MONOTONIC.

I see two fundamental driving cases for a time namespace.
1) Migration from one node to another node in a cluster in almost
   real time.

   The problem is that CLOCK_MONOTONIC between nodes in the cluster
   has not relation ship to each other (except a synchronized length of
   the second).  So applications that migrate can see CLOCK_MONOTONIC
   and CLOCK_BOOTTIME go backwards.

   This is the truly pressing problem and adding some kind of offset
   sounds like it would be the solution.  Possibly by allowing a boot
   time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.

2) Dealing with two separate time management domains.  Say a machine
   that needes to deal with both something inside of google where they
   slew time to avoid leap time seconds and something in the outside
   world proper UTC time is kept as an offset from TAI with the
   occasional leap seconds.

   In the later case it would fundamentally require having seconds of
   different length.

A pure 64bit nanoseond counter is good for 500 years.  So 64bit
variables can be used to hold time, and everything can be converted from
there.

This suggests we can for ticks have two values.
- The number of ticks from the time source.
- The number of times the ticks would have rolled over.

That sounds like it may be a little simplistic as it would require being
very diligent about firing a timer exactly at rollover and not losing
that, but for a handwaving argument is probably enough to generate
a 64bit tick counter.

If the focus is on a 64bit tick counter then what update_wall_time
has to do is very limited.  Just deal the accounting needed to cope with
tick rollover.

Getting the actual time looks like it would be as simple as now, with
perhaps an extra addition to account for the number of times the tick
counter has rolled over.  With limited precision arithmetic and various
optimizations I don't think it is that simple to implement but it feels
like it should be very little extra work.

For timers my inclination would be to

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-27 Thread Thomas Gleixner

On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
> 
> I haven't looked through Dimitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.

And just a quick scan tells me that this is broken. Timers will expire
early or late. The latter is acceptible to some extent, but larger delays
might come with surprise. Expiring early is an absolute nono.

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-27 Thread Thomas Gleixner

On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
> 
> I haven't looked through Dimitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.

And just a quick scan tells me that this is broken. Timers will expire
early or late. The latter is acceptible to some extent, but larger delays
might come with surprise. Expiring early is an absolute nono.

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-27 Thread Thomas Gleixner

On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> Reading the code the calling sequence there is:
> tick_sched_do_timer
>tick_do_update_jiffies64
>   update_wall_time
>   timekeeping_advance
>  timekeepging_update
> 
> If I read that properly under the right nohz circumstances that update
> can be delayed indefinitely.
> 
> So I think we could prototype a time namespace that was per
> timekeeping_update and just had update_wall_time iterate through
> all of the time namespaces.

Please don't go there. timekeeping_update() is already heavy and walking
through a gazillion of namespaces will just make it horrible,

> I don't think the naive version would scale to very many time
> namespaces.

:)

> At the same time using the techniques from the nohz work and a little
> smarts I expect we could get the code to scale.

You'd need to invoke the update when the namespace is switched in and
hasn't been updated since the last tick happened. That might be doable, but
you also need to take the wraparound constraints of the underlying
clocksources into account, which again can cause walking all name spaces
when they are all idle long enough.

>From there it becomes hairy, because it's not only timekeeping,
i.e. reading time, this is also affecting all timers which are armed from a
namespace.

That gets really ugly because when you do settimeofday() or adjtimex() for
a particular namespace, then you have to search for all armed timers of
that namespace and adjust them.

The original posix timer code had the same issue because it mapped the
clock realtime timers to the timer wheel so any setting of the clock caused
a full walk of all armed timers, disarming, adjusting and requeing
them. That's horrible not only performance wise, it's also a locking
nightmare of all sorts.

Add time skew via NTP/PTP into the picture and you might have to adjust
timers as well, because you need to guarantee that they are not expiring
early.

I haven't looked through Dimitry's patches yet, but I don't see how this
can work at all without introducing subtle issues all over the place.

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-27 Thread Thomas Gleixner

On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> Reading the code the calling sequence there is:
> tick_sched_do_timer
>tick_do_update_jiffies64
>   update_wall_time
>   timekeeping_advance
>  timekeepging_update
> 
> If I read that properly under the right nohz circumstances that update
> can be delayed indefinitely.
> 
> So I think we could prototype a time namespace that was per
> timekeeping_update and just had update_wall_time iterate through
> all of the time namespaces.

Please don't go there. timekeeping_update() is already heavy and walking
through a gazillion of namespaces will just make it horrible,

> I don't think the naive version would scale to very many time
> namespaces.

:)

> At the same time using the techniques from the nohz work and a little
> smarts I expect we could get the code to scale.

You'd need to invoke the update when the namespace is switched in and
hasn't been updated since the last tick happened. That might be doable, but
you also need to take the wraparound constraints of the underlying
clocksources into account, which again can cause walking all name spaces
when they are all idle long enough.

>From there it becomes hairy, because it's not only timekeeping,
i.e. reading time, this is also affecting all timers which are armed from a
namespace.

That gets really ugly because when you do settimeofday() or adjtimex() for
a particular namespace, then you have to search for all armed timers of
that namespace and adjust them.

The original posix timer code had the same issue because it mapped the
clock realtime timers to the timer wheel so any setting of the clock caused
a full walk of all armed timers, disarming, adjusting and requeing
them. That's horrible not only performance wise, it's also a locking
nightmare of all sorts.

Add time skew via NTP/PTP into the picture and you might have to adjust
timers as well, because you need to guarantee that they are not expiring
early.

I haven't looked through Dimitry's patches yet, but I don't see how this
can work at all without introducing subtle issues all over the place.

Thanks,

tglx

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-26 Thread Dmitry Safonov

2018-09-26 18:36 GMT+01:00 Eric W. Biederman :
> The advantage of timekeeping_update per time namespace is that it allows
> different lengths of seconds per time namespace.  Which allows testing
> ntp and the kernel in interesting ways while still having a working
> production configuration on the same system.

Just a quick note: the different length of second per namespace sounds
very interesting in my POV, I remember I've seen this article:
http://publish.illinois.edu/science-of-security-lablet/files/2014/05/DSSnet-A-Smart-Grid-Modeling-Platform-Combining-Electrical-Power-Distributtion-System-Simulation-and-Software-Defined-Networking-Emulation.pdf

And their realisation with a simulation of time going with different speed
per-pid  (with vdso disabled):
https://github.com/littlepretty/VirtualTimeKernel

Thanks,
 Dmitry

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-26 Thread Dmitry Safonov

2018-09-26 18:36 GMT+01:00 Eric W. Biederman :
> The advantage of timekeeping_update per time namespace is that it allows
> different lengths of seconds per time namespace.  Which allows testing
> ntp and the kernel in interesting ways while still having a working
> production configuration on the same system.

Just a quick note: the different length of second per namespace sounds
very interesting in my POV, I remember I've seen this article:
http://publish.illinois.edu/science-of-security-lablet/files/2014/05/DSSnet-A-Smart-Grid-Modeling-Platform-Combining-Electrical-Power-Distributtion-System-Simulation-and-Software-Defined-Networking-Emulation.pdf

And their realisation with a simulation of time going with different speed
per-pid  (with vdso disabled):
https://github.com/littlepretty/VirtualTimeKernel

Thanks,
 Dmitry

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-26 Thread Eric W. Biederman

Andrey Vagin  writes:

> On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
>> Andrey Vagin  writes:
>> 
>> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
>> >> Dmitry Safonov  writes:
>> >> 
>> >> > Discussions around time virtualization are there for a long time.
>> >> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
>> >> > From that time, the topic appears on and off in various discussions.
>> >> >
>> >> > There are two main use cases for time namespaces:
>> >> > 1. change date and time inside a container;
>> >> > 2. adjust clocks for a container restored from a checkpoint.
>> >> >
>> >> > “It seems like this might be one of the last major obstacles keeping
>> >> > migration from being used in production systems, given that not all
>> >> > containers and connections can be migrated as long as a time dependency
>> >> > is capable of messing it up.” (by github.com/dav-ell)
>> >> >
>> >> > The kernel provides access to several clocks: CLOCK_REALTIME,
>> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
>> >> > start points for them are not defined and are different for each running
>> >> > system. When a container is migrated from one node to another, all
>> >> > clocks have to be restored into consistent states; in other words, they
>> >> > have to continue running from the same points where they have been
>> >> > dumped.
>> >> >
>> >> > The main idea behind this patch set is adding per-namespace offsets for
>> >> > system clocks. When a process in a non-root time namespace requests
>> >> > time of a clock, a namespace offset is added to the current value of
>> >> > this clock on a host and the sum is returned.
>> >> >
>> >> > All offsets are placed on a separate page, this allows up to map it as 
>> >> > part of vvar into user processes and use offsets from vdso calls.
>> >> >
>> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
>> >> > clocks.
>> >> >
>> >> > Questions to discuss:
>> >> >
>> >> > * Clone flags exhaustion. Currently there is only one unused clone flag
>> >> > bit left, and it may be worth to use it to extend arguments of the clone
>> >> > system call.
>> >> >
>> >> > * Realtime clock implementation details:
>> >> >   Is having a simple offset enough?
>> >> >   What to do when date and time is changed on the host?
>> >> >   Is there a need to adjust vfs modification and creation times? 
>> >> >   Implementation for adjtime() syscall.
>> >> 
>> >> Overall I support this effort.  In my quick skim this code looked good.
>> >
>> > Hi Eric,
>> >
>> > Thank you for the feedback.
>> >
>> >> 
>> >> My feeling is that we need to be able to support running ntpd and
>> >> support one namespace doing googles smoothing of leap seconds while
>> >> another namespace takes the leap second.
>> >> 
>> >> What I was imagining when I was last thinking about this was one
>> >> instance of struct timekeeper aka tk_core per time namespace.  That
>> >> structure already keeps offsets for all of the various clocks from
>> >> the kerne internal time sources.  What would be needed would be to
>> >> pass in an appropriate time namespace pointer.
>> >> 
>> >> I could be completely wrong as I have not take the time to completely
>> >> trace through the code.  Have you looked at pushing the time namespace
>> >> down as far as tk_core?
>> >> 
>> >> What I think would be the big advantage (besides ntp working) is that
>> >> the bulk of the code could be reused.  Allowing testing of the kernel's
>> >> time code by setting up a new time namespace.  So a person in production
>> >> could setup a time namespace with the time set ahead a little  bit and
>> >> be able to verify that the kernel handles the upcoming leap second
>> >> properly.
>> >>
>> >
>> > It is an interesting idea, but I have a few questions:
>> >
>> > 1. Does it mean that timekeeping_update() will be called for each
>> > namespace? This functions is called periodically, it updates times on the
>> > timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
>> > overhead of this?
>> 
>> I don't know if periodically is a proper characterization.  There may be
>> a code path that does that.  But from what I can see timekeeping_update
>> is the guts of settimeofday (and a few related functions).
>> 
>> So it appears to make sense for timekeeping_update to be per namespace.
>> 
>> Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
>> look like you would have to periodically update things, but I don't know
>> big that period would be.  As long as the period is reasonably large,
>> or the time namespaces were sufficiently deschronized it should not
>> be a problem.  But that is the class of problem that could make
>> my ideal impractical if there is measuarable overhead.
>> 
>> Where were you seeing timekeeping_update being called periodically?
>
> timekeeping_update() is called HZ times per-second:
>
> [   67.912858]

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-26 Thread Eric W. Biederman

Andrey Vagin  writes:

> On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
>> Andrey Vagin  writes:
>> 
>> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
>> >> Dmitry Safonov  writes:
>> >> 
>> >> > Discussions around time virtualization are there for a long time.
>> >> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
>> >> > From that time, the topic appears on and off in various discussions.
>> >> >
>> >> > There are two main use cases for time namespaces:
>> >> > 1. change date and time inside a container;
>> >> > 2. adjust clocks for a container restored from a checkpoint.
>> >> >
>> >> > “It seems like this might be one of the last major obstacles keeping
>> >> > migration from being used in production systems, given that not all
>> >> > containers and connections can be migrated as long as a time dependency
>> >> > is capable of messing it up.” (by github.com/dav-ell)
>> >> >
>> >> > The kernel provides access to several clocks: CLOCK_REALTIME,
>> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
>> >> > start points for them are not defined and are different for each running
>> >> > system. When a container is migrated from one node to another, all
>> >> > clocks have to be restored into consistent states; in other words, they
>> >> > have to continue running from the same points where they have been
>> >> > dumped.
>> >> >
>> >> > The main idea behind this patch set is adding per-namespace offsets for
>> >> > system clocks. When a process in a non-root time namespace requests
>> >> > time of a clock, a namespace offset is added to the current value of
>> >> > this clock on a host and the sum is returned.
>> >> >
>> >> > All offsets are placed on a separate page, this allows up to map it as 
>> >> > part of vvar into user processes and use offsets from vdso calls.
>> >> >
>> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
>> >> > clocks.
>> >> >
>> >> > Questions to discuss:
>> >> >
>> >> > * Clone flags exhaustion. Currently there is only one unused clone flag
>> >> > bit left, and it may be worth to use it to extend arguments of the clone
>> >> > system call.
>> >> >
>> >> > * Realtime clock implementation details:
>> >> >   Is having a simple offset enough?
>> >> >   What to do when date and time is changed on the host?
>> >> >   Is there a need to adjust vfs modification and creation times? 
>> >> >   Implementation for adjtime() syscall.
>> >> 
>> >> Overall I support this effort.  In my quick skim this code looked good.
>> >
>> > Hi Eric,
>> >
>> > Thank you for the feedback.
>> >
>> >> 
>> >> My feeling is that we need to be able to support running ntpd and
>> >> support one namespace doing googles smoothing of leap seconds while
>> >> another namespace takes the leap second.
>> >> 
>> >> What I was imagining when I was last thinking about this was one
>> >> instance of struct timekeeper aka tk_core per time namespace.  That
>> >> structure already keeps offsets for all of the various clocks from
>> >> the kerne internal time sources.  What would be needed would be to
>> >> pass in an appropriate time namespace pointer.
>> >> 
>> >> I could be completely wrong as I have not take the time to completely
>> >> trace through the code.  Have you looked at pushing the time namespace
>> >> down as far as tk_core?
>> >> 
>> >> What I think would be the big advantage (besides ntp working) is that
>> >> the bulk of the code could be reused.  Allowing testing of the kernel's
>> >> time code by setting up a new time namespace.  So a person in production
>> >> could setup a time namespace with the time set ahead a little  bit and
>> >> be able to verify that the kernel handles the upcoming leap second
>> >> properly.
>> >>
>> >
>> > It is an interesting idea, but I have a few questions:
>> >
>> > 1. Does it mean that timekeeping_update() will be called for each
>> > namespace? This functions is called periodically, it updates times on the
>> > timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
>> > overhead of this?
>> 
>> I don't know if periodically is a proper characterization.  There may be
>> a code path that does that.  But from what I can see timekeeping_update
>> is the guts of settimeofday (and a few related functions).
>> 
>> So it appears to make sense for timekeeping_update to be per namespace.
>> 
>> Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
>> look like you would have to periodically update things, but I don't know
>> big that period would be.  As long as the period is reasonably large,
>> or the time namespaces were sufficiently deschronized it should not
>> be a problem.  But that is the class of problem that could make
>> my ideal impractical if there is measuarable overhead.
>> 
>> Where were you seeing timekeeping_update being called periodically?
>
> timekeeping_update() is called HZ times per-second:
>
> [   67.912858]

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-24 Thread Andrey Vagin

On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
> Andrey Vagin  writes:
> 
> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> >> Dmitry Safonov  writes:
> >> 
> >> > Discussions around time virtualization are there for a long time.
> >> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
> >> > From that time, the topic appears on and off in various discussions.
> >> >
> >> > There are two main use cases for time namespaces:
> >> > 1. change date and time inside a container;
> >> > 2. adjust clocks for a container restored from a checkpoint.
> >> >
> >> > “It seems like this might be one of the last major obstacles keeping
> >> > migration from being used in production systems, given that not all
> >> > containers and connections can be migrated as long as a time dependency
> >> > is capable of messing it up.” (by github.com/dav-ell)
> >> >
> >> > The kernel provides access to several clocks: CLOCK_REALTIME,
> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> >> > start points for them are not defined and are different for each running
> >> > system. When a container is migrated from one node to another, all
> >> > clocks have to be restored into consistent states; in other words, they
> >> > have to continue running from the same points where they have been
> >> > dumped.
> >> >
> >> > The main idea behind this patch set is adding per-namespace offsets for
> >> > system clocks. When a process in a non-root time namespace requests
> >> > time of a clock, a namespace offset is added to the current value of
> >> > this clock on a host and the sum is returned.
> >> >
> >> > All offsets are placed on a separate page, this allows up to map it as 
> >> > part of vvar into user processes and use offsets from vdso calls.
> >> >
> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> >> > clocks.
> >> >
> >> > Questions to discuss:
> >> >
> >> > * Clone flags exhaustion. Currently there is only one unused clone flag
> >> > bit left, and it may be worth to use it to extend arguments of the clone
> >> > system call.
> >> >
> >> > * Realtime clock implementation details:
> >> >   Is having a simple offset enough?
> >> >   What to do when date and time is changed on the host?
> >> >   Is there a need to adjust vfs modification and creation times? 
> >> >   Implementation for adjtime() syscall.
> >> 
> >> Overall I support this effort.  In my quick skim this code looked good.
> >
> > Hi Eric,
> >
> > Thank you for the feedback.
> >
> >> 
> >> My feeling is that we need to be able to support running ntpd and
> >> support one namespace doing googles smoothing of leap seconds while
> >> another namespace takes the leap second.
> >> 
> >> What I was imagining when I was last thinking about this was one
> >> instance of struct timekeeper aka tk_core per time namespace.  That
> >> structure already keeps offsets for all of the various clocks from
> >> the kerne internal time sources.  What would be needed would be to
> >> pass in an appropriate time namespace pointer.
> >> 
> >> I could be completely wrong as I have not take the time to completely
> >> trace through the code.  Have you looked at pushing the time namespace
> >> down as far as tk_core?
> >> 
> >> What I think would be the big advantage (besides ntp working) is that
> >> the bulk of the code could be reused.  Allowing testing of the kernel's
> >> time code by setting up a new time namespace.  So a person in production
> >> could setup a time namespace with the time set ahead a little  bit and
> >> be able to verify that the kernel handles the upcoming leap second
> >> properly.
> >>
> >
> > It is an interesting idea, but I have a few questions:
> >
> > 1. Does it mean that timekeeping_update() will be called for each
> > namespace? This functions is called periodically, it updates times on the
> > timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
> > overhead of this?
> 
> I don't know if periodically is a proper characterization.  There may be
> a code path that does that.  But from what I can see timekeeping_update
> is the guts of settimeofday (and a few related functions).
> 
> So it appears to make sense for timekeeping_update to be per namespace.
> 
> Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
> look like you would have to periodically update things, but I don't know
> big that period would be.  As long as the period is reasonably large,
> or the time namespaces were sufficiently deschronized it should not
> be a problem.  But that is the class of problem that could make
> my ideal impractical if there is measuarable overhead.
> 
> Where were you seeing timekeeping_update being called periodically?

timekeeping_update() is called HZ times per-second:

[   67.912858]  timekeeping_update.cold.26+0x5/0xa
[   67.913332]  timekeeping_advance+0x361/0x5c0
[   67.913857]  ? tick_sched_do_timer+0x55/0x70
[

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-24 Thread Andrey Vagin

On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
> Andrey Vagin  writes:
> 
> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> >> Dmitry Safonov  writes:
> >> 
> >> > Discussions around time virtualization are there for a long time.
> >> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
> >> > From that time, the topic appears on and off in various discussions.
> >> >
> >> > There are two main use cases for time namespaces:
> >> > 1. change date and time inside a container;
> >> > 2. adjust clocks for a container restored from a checkpoint.
> >> >
> >> > “It seems like this might be one of the last major obstacles keeping
> >> > migration from being used in production systems, given that not all
> >> > containers and connections can be migrated as long as a time dependency
> >> > is capable of messing it up.” (by github.com/dav-ell)
> >> >
> >> > The kernel provides access to several clocks: CLOCK_REALTIME,
> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> >> > start points for them are not defined and are different for each running
> >> > system. When a container is migrated from one node to another, all
> >> > clocks have to be restored into consistent states; in other words, they
> >> > have to continue running from the same points where they have been
> >> > dumped.
> >> >
> >> > The main idea behind this patch set is adding per-namespace offsets for
> >> > system clocks. When a process in a non-root time namespace requests
> >> > time of a clock, a namespace offset is added to the current value of
> >> > this clock on a host and the sum is returned.
> >> >
> >> > All offsets are placed on a separate page, this allows up to map it as 
> >> > part of vvar into user processes and use offsets from vdso calls.
> >> >
> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> >> > clocks.
> >> >
> >> > Questions to discuss:
> >> >
> >> > * Clone flags exhaustion. Currently there is only one unused clone flag
> >> > bit left, and it may be worth to use it to extend arguments of the clone
> >> > system call.
> >> >
> >> > * Realtime clock implementation details:
> >> >   Is having a simple offset enough?
> >> >   What to do when date and time is changed on the host?
> >> >   Is there a need to adjust vfs modification and creation times? 
> >> >   Implementation for adjtime() syscall.
> >> 
> >> Overall I support this effort.  In my quick skim this code looked good.
> >
> > Hi Eric,
> >
> > Thank you for the feedback.
> >
> >> 
> >> My feeling is that we need to be able to support running ntpd and
> >> support one namespace doing googles smoothing of leap seconds while
> >> another namespace takes the leap second.
> >> 
> >> What I was imagining when I was last thinking about this was one
> >> instance of struct timekeeper aka tk_core per time namespace.  That
> >> structure already keeps offsets for all of the various clocks from
> >> the kerne internal time sources.  What would be needed would be to
> >> pass in an appropriate time namespace pointer.
> >> 
> >> I could be completely wrong as I have not take the time to completely
> >> trace through the code.  Have you looked at pushing the time namespace
> >> down as far as tk_core?
> >> 
> >> What I think would be the big advantage (besides ntp working) is that
> >> the bulk of the code could be reused.  Allowing testing of the kernel's
> >> time code by setting up a new time namespace.  So a person in production
> >> could setup a time namespace with the time set ahead a little  bit and
> >> be able to verify that the kernel handles the upcoming leap second
> >> properly.
> >>
> >
> > It is an interesting idea, but I have a few questions:
> >
> > 1. Does it mean that timekeeping_update() will be called for each
> > namespace? This functions is called periodically, it updates times on the
> > timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
> > overhead of this?
> 
> I don't know if periodically is a proper characterization.  There may be
> a code path that does that.  But from what I can see timekeeping_update
> is the guts of settimeofday (and a few related functions).
> 
> So it appears to make sense for timekeeping_update to be per namespace.
> 
> Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
> look like you would have to periodically update things, but I don't know
> big that period would be.  As long as the period is reasonably large,
> or the time namespaces were sufficiently deschronized it should not
> be a problem.  But that is the class of problem that could make
> my ideal impractical if there is measuarable overhead.
> 
> Where were you seeing timekeeping_update being called periodically?

timekeeping_update() is called HZ times per-second:

[   67.912858]  timekeeping_update.cold.26+0x5/0xa
[   67.913332]  timekeeping_advance+0x361/0x5c0
[   67.913857]  ? tick_sched_do_timer+0x55/0x70
[

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-24 Thread Eric W. Biederman

Andrey Vagin  writes:

> On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
>> Dmitry Safonov  writes:
>> 
>> > Discussions around time virtualization are there for a long time.
>> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
>> > From that time, the topic appears on and off in various discussions.
>> >
>> > There are two main use cases for time namespaces:
>> > 1. change date and time inside a container;
>> > 2. adjust clocks for a container restored from a checkpoint.
>> >
>> > “It seems like this might be one of the last major obstacles keeping
>> > migration from being used in production systems, given that not all
>> > containers and connections can be migrated as long as a time dependency
>> > is capable of messing it up.” (by github.com/dav-ell)
>> >
>> > The kernel provides access to several clocks: CLOCK_REALTIME,
>> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
>> > start points for them are not defined and are different for each running
>> > system. When a container is migrated from one node to another, all
>> > clocks have to be restored into consistent states; in other words, they
>> > have to continue running from the same points where they have been
>> > dumped.
>> >
>> > The main idea behind this patch set is adding per-namespace offsets for
>> > system clocks. When a process in a non-root time namespace requests
>> > time of a clock, a namespace offset is added to the current value of
>> > this clock on a host and the sum is returned.
>> >
>> > All offsets are placed on a separate page, this allows up to map it as 
>> > part of vvar into user processes and use offsets from vdso calls.
>> >
>> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
>> > clocks.
>> >
>> > Questions to discuss:
>> >
>> > * Clone flags exhaustion. Currently there is only one unused clone flag
>> > bit left, and it may be worth to use it to extend arguments of the clone
>> > system call.
>> >
>> > * Realtime clock implementation details:
>> >   Is having a simple offset enough?
>> >   What to do when date and time is changed on the host?
>> >   Is there a need to adjust vfs modification and creation times? 
>> >   Implementation for adjtime() syscall.
>> 
>> Overall I support this effort.  In my quick skim this code looked good.
>
> Hi Eric,
>
> Thank you for the feedback.
>
>> 
>> My feeling is that we need to be able to support running ntpd and
>> support one namespace doing googles smoothing of leap seconds while
>> another namespace takes the leap second.
>> 
>> What I was imagining when I was last thinking about this was one
>> instance of struct timekeeper aka tk_core per time namespace.  That
>> structure already keeps offsets for all of the various clocks from
>> the kerne internal time sources.  What would be needed would be to
>> pass in an appropriate time namespace pointer.
>> 
>> I could be completely wrong as I have not take the time to completely
>> trace through the code.  Have you looked at pushing the time namespace
>> down as far as tk_core?
>> 
>> What I think would be the big advantage (besides ntp working) is that
>> the bulk of the code could be reused.  Allowing testing of the kernel's
>> time code by setting up a new time namespace.  So a person in production
>> could setup a time namespace with the time set ahead a little  bit and
>> be able to verify that the kernel handles the upcoming leap second
>> properly.
>>
>
> It is an interesting idea, but I have a few questions:
>
> 1. Does it mean that timekeeping_update() will be called for each
> namespace? This functions is called periodically, it updates times on the
> timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
> overhead of this?

I don't know if periodically is a proper characterization.  There may be
a code path that does that.  But from what I can see timekeeping_update
is the guts of settimeofday (and a few related functions).

So it appears to make sense for timekeeping_update to be per namespace.

Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
look like you would have to periodically update things, but I don't know
big that period would be.  As long as the period is reasonably large,
or the time namespaces were sufficiently deschronized it should not
be a problem.  But that is the class of problem that could make
my ideal impractical if there is measuarable overhead.

Where were you seeing timekeeping_update being called periodically?

> 2. What will we do with vdso? It looks like we will have to have a
> separate vsyscall_gtod_data for each ns and update each of them
> separately.

Yes.  But you don't have to have introduce another variable just make
certain vsyscall_gtod_data is a page aligned thing per time namespace.

If I read the summary of the existing patchset something very similiar
is already going on.

Each process would only map one.  And unshare of the time namespace
would need to

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-24 Thread Eric W. Biederman

Andrey Vagin  writes:

> On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
>> Dmitry Safonov  writes:
>> 
>> > Discussions around time virtualization are there for a long time.
>> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
>> > From that time, the topic appears on and off in various discussions.
>> >
>> > There are two main use cases for time namespaces:
>> > 1. change date and time inside a container;
>> > 2. adjust clocks for a container restored from a checkpoint.
>> >
>> > “It seems like this might be one of the last major obstacles keeping
>> > migration from being used in production systems, given that not all
>> > containers and connections can be migrated as long as a time dependency
>> > is capable of messing it up.” (by github.com/dav-ell)
>> >
>> > The kernel provides access to several clocks: CLOCK_REALTIME,
>> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
>> > start points for them are not defined and are different for each running
>> > system. When a container is migrated from one node to another, all
>> > clocks have to be restored into consistent states; in other words, they
>> > have to continue running from the same points where they have been
>> > dumped.
>> >
>> > The main idea behind this patch set is adding per-namespace offsets for
>> > system clocks. When a process in a non-root time namespace requests
>> > time of a clock, a namespace offset is added to the current value of
>> > this clock on a host and the sum is returned.
>> >
>> > All offsets are placed on a separate page, this allows up to map it as 
>> > part of vvar into user processes and use offsets from vdso calls.
>> >
>> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
>> > clocks.
>> >
>> > Questions to discuss:
>> >
>> > * Clone flags exhaustion. Currently there is only one unused clone flag
>> > bit left, and it may be worth to use it to extend arguments of the clone
>> > system call.
>> >
>> > * Realtime clock implementation details:
>> >   Is having a simple offset enough?
>> >   What to do when date and time is changed on the host?
>> >   Is there a need to adjust vfs modification and creation times? 
>> >   Implementation for adjtime() syscall.
>> 
>> Overall I support this effort.  In my quick skim this code looked good.
>
> Hi Eric,
>
> Thank you for the feedback.
>
>> 
>> My feeling is that we need to be able to support running ntpd and
>> support one namespace doing googles smoothing of leap seconds while
>> another namespace takes the leap second.
>> 
>> What I was imagining when I was last thinking about this was one
>> instance of struct timekeeper aka tk_core per time namespace.  That
>> structure already keeps offsets for all of the various clocks from
>> the kerne internal time sources.  What would be needed would be to
>> pass in an appropriate time namespace pointer.
>> 
>> I could be completely wrong as I have not take the time to completely
>> trace through the code.  Have you looked at pushing the time namespace
>> down as far as tk_core?
>> 
>> What I think would be the big advantage (besides ntp working) is that
>> the bulk of the code could be reused.  Allowing testing of the kernel's
>> time code by setting up a new time namespace.  So a person in production
>> could setup a time namespace with the time set ahead a little  bit and
>> be able to verify that the kernel handles the upcoming leap second
>> properly.
>>
>
> It is an interesting idea, but I have a few questions:
>
> 1. Does it mean that timekeeping_update() will be called for each
> namespace? This functions is called periodically, it updates times on the
> timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
> overhead of this?

I don't know if periodically is a proper characterization.  There may be
a code path that does that.  But from what I can see timekeeping_update
is the guts of settimeofday (and a few related functions).

So it appears to make sense for timekeeping_update to be per namespace.

Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
look like you would have to periodically update things, but I don't know
big that period would be.  As long as the period is reasonably large,
or the time namespaces were sufficiently deschronized it should not
be a problem.  But that is the class of problem that could make
my ideal impractical if there is measuarable overhead.

Where were you seeing timekeeping_update being called periodically?

> 2. What will we do with vdso? It looks like we will have to have a
> separate vsyscall_gtod_data for each ns and update each of them
> separately.

Yes.  But you don't have to have introduce another variable just make
certain vsyscall_gtod_data is a page aligned thing per time namespace.

If I read the summary of the existing patchset something very similiar
is already going on.

Each process would only map one.  And unshare of the time namespace
would need to

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-24 Thread Andrey Vagin

On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> Dmitry Safonov  writes:
> 
> > Discussions around time virtualization are there for a long time.
> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
> > From that time, the topic appears on and off in various discussions.
> >
> > There are two main use cases for time namespaces:
> > 1. change date and time inside a container;
> > 2. adjust clocks for a container restored from a checkpoint.
> >
> > “It seems like this might be one of the last major obstacles keeping
> > migration from being used in production systems, given that not all
> > containers and connections can be migrated as long as a time dependency
> > is capable of messing it up.” (by github.com/dav-ell)
> >
> > The kernel provides access to several clocks: CLOCK_REALTIME,
> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> > start points for them are not defined and are different for each running
> > system. When a container is migrated from one node to another, all
> > clocks have to be restored into consistent states; in other words, they
> > have to continue running from the same points where they have been
> > dumped.
> >
> > The main idea behind this patch set is adding per-namespace offsets for
> > system clocks. When a process in a non-root time namespace requests
> > time of a clock, a namespace offset is added to the current value of
> > this clock on a host and the sum is returned.
> >
> > All offsets are placed on a separate page, this allows up to map it as 
> > part of vvar into user processes and use offsets from vdso calls.
> >
> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> > clocks.
> >
> > Questions to discuss:
> >
> > * Clone flags exhaustion. Currently there is only one unused clone flag
> > bit left, and it may be worth to use it to extend arguments of the clone
> > system call.
> >
> > * Realtime clock implementation details:
> >   Is having a simple offset enough?
> >   What to do when date and time is changed on the host?
> >   Is there a need to adjust vfs modification and creation times? 
> >   Implementation for adjtime() syscall.
> 
> Overall I support this effort.  In my quick skim this code looked good.

Hi Eric,

Thank you for the feedback.

> 
> My feeling is that we need to be able to support running ntpd and
> support one namespace doing googles smoothing of leap seconds while
> another namespace takes the leap second.
> 
> What I was imagining when I was last thinking about this was one
> instance of struct timekeeper aka tk_core per time namespace.  That
> structure already keeps offsets for all of the various clocks from
> the kerne internal time sources.  What would be needed would be to
> pass in an appropriate time namespace pointer.
> 
> I could be completely wrong as I have not take the time to completely
> trace through the code.  Have you looked at pushing the time namespace
> down as far as tk_core?
> 
> What I think would be the big advantage (besides ntp working) is that
> the bulk of the code could be reused.  Allowing testing of the kernel's
> time code by setting up a new time namespace.  So a person in production
> could setup a time namespace with the time set ahead a little  bit and
> be able to verify that the kernel handles the upcoming leap second
> properly.
>

It is an interesting idea, but I have a few questions:

1. Does it mean that timekeeping_update() will be called for each
namespace? This functions is called periodically, it updates times on the
timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
overhead of this?

2. What will we do with vdso? It looks like we will have to have a
separate vsyscall_gtod_data for each ns and update each of them
separately.

> 
> 
> I don't know about the vfs.  I think the danger is being able to write
> dates in the future or in the past.  It appears that utimes(2) and
> utimesnat(2) already allow this except for status change.  So it is
> possible we simply don't care.  I seem to remember that what nfs does
> is take the time stamp from the host writing to the file.
> 
> I think the guide for filesystem timestamps should be to first ensure
> we don't introduce security issues, and then do what distributed
> filesystems do when dealing with hosts with different clocks.
> 
> Given those those two guidlines above I don't think there is a need to
> change timestamsp the way the user namespace changes uid when displayed.
> 
> 
> 
> As for the hardware like the real time clock we definitely should not
> let a root in a time namespace change it.  We might even be able to get
> away with leaving the real time clock out of the time namespace.  If not
> we need to be very careful how the real time clock is abstracted.  I
> would start by leaving the real time clock hardware out of the time
> namespace and see if there is any part of userspace that cares.
> 
> Eric
> 
> > Cc: Dmitry Safonov

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-24 Thread Andrey Vagin

On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> Dmitry Safonov  writes:
> 
> > Discussions around time virtualization are there for a long time.
> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
> > From that time, the topic appears on and off in various discussions.
> >
> > There are two main use cases for time namespaces:
> > 1. change date and time inside a container;
> > 2. adjust clocks for a container restored from a checkpoint.
> >
> > “It seems like this might be one of the last major obstacles keeping
> > migration from being used in production systems, given that not all
> > containers and connections can be migrated as long as a time dependency
> > is capable of messing it up.” (by github.com/dav-ell)
> >
> > The kernel provides access to several clocks: CLOCK_REALTIME,
> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> > start points for them are not defined and are different for each running
> > system. When a container is migrated from one node to another, all
> > clocks have to be restored into consistent states; in other words, they
> > have to continue running from the same points where they have been
> > dumped.
> >
> > The main idea behind this patch set is adding per-namespace offsets for
> > system clocks. When a process in a non-root time namespace requests
> > time of a clock, a namespace offset is added to the current value of
> > this clock on a host and the sum is returned.
> >
> > All offsets are placed on a separate page, this allows up to map it as 
> > part of vvar into user processes and use offsets from vdso calls.
> >
> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> > clocks.
> >
> > Questions to discuss:
> >
> > * Clone flags exhaustion. Currently there is only one unused clone flag
> > bit left, and it may be worth to use it to extend arguments of the clone
> > system call.
> >
> > * Realtime clock implementation details:
> >   Is having a simple offset enough?
> >   What to do when date and time is changed on the host?
> >   Is there a need to adjust vfs modification and creation times? 
> >   Implementation for adjtime() syscall.
> 
> Overall I support this effort.  In my quick skim this code looked good.

Hi Eric,

Thank you for the feedback.

> 
> My feeling is that we need to be able to support running ntpd and
> support one namespace doing googles smoothing of leap seconds while
> another namespace takes the leap second.
> 
> What I was imagining when I was last thinking about this was one
> instance of struct timekeeper aka tk_core per time namespace.  That
> structure already keeps offsets for all of the various clocks from
> the kerne internal time sources.  What would be needed would be to
> pass in an appropriate time namespace pointer.
> 
> I could be completely wrong as I have not take the time to completely
> trace through the code.  Have you looked at pushing the time namespace
> down as far as tk_core?
> 
> What I think would be the big advantage (besides ntp working) is that
> the bulk of the code could be reused.  Allowing testing of the kernel's
> time code by setting up a new time namespace.  So a person in production
> could setup a time namespace with the time set ahead a little  bit and
> be able to verify that the kernel handles the upcoming leap second
> properly.
>

It is an interesting idea, but I have a few questions:

1. Does it mean that timekeeping_update() will be called for each
namespace? This functions is called periodically, it updates times on the
timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
overhead of this?

2. What will we do with vdso? It looks like we will have to have a
separate vsyscall_gtod_data for each ns and update each of them
separately.

> 
> 
> I don't know about the vfs.  I think the danger is being able to write
> dates in the future or in the past.  It appears that utimes(2) and
> utimesnat(2) already allow this except for status change.  So it is
> possible we simply don't care.  I seem to remember that what nfs does
> is take the time stamp from the host writing to the file.
> 
> I think the guide for filesystem timestamps should be to first ensure
> we don't introduce security issues, and then do what distributed
> filesystems do when dealing with hosts with different clocks.
> 
> Given those those two guidlines above I don't think there is a need to
> change timestamsp the way the user namespace changes uid when displayed.
> 
> 
> 
> As for the hardware like the real time clock we definitely should not
> let a root in a time namespace change it.  We might even be able to get
> away with leaving the real time clock out of the time namespace.  If not
> we need to be very careful how the real time clock is abstracted.  I
> would start by leaving the real time clock hardware out of the time
> namespace and see if there is any part of userspace that cares.
> 
> Eric
> 
> > Cc: Dmitry Safonov

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-21 Thread Eric W. Biederman

Dmitry Safonov  writes:

> Discussions around time virtualization are there for a long time.
> The first attempt to implement time namespace was in 2006 by Jeff Dike.
> From that time, the topic appears on and off in various discussions.
>
> There are two main use cases for time namespaces:
> 1. change date and time inside a container;
> 2. adjust clocks for a container restored from a checkpoint.
>
> “It seems like this might be one of the last major obstacles keeping
> migration from being used in production systems, given that not all
> containers and connections can be migrated as long as a time dependency
> is capable of messing it up.” (by github.com/dav-ell)
>
> The kernel provides access to several clocks: CLOCK_REALTIME,
> CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> start points for them are not defined and are different for each running
> system. When a container is migrated from one node to another, all
> clocks have to be restored into consistent states; in other words, they
> have to continue running from the same points where they have been
> dumped.
>
> The main idea behind this patch set is adding per-namespace offsets for
> system clocks. When a process in a non-root time namespace requests
> time of a clock, a namespace offset is added to the current value of
> this clock on a host and the sum is returned.
>
> All offsets are placed on a separate page, this allows up to map it as 
> part of vvar into user processes and use offsets from vdso calls.
>
> Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> clocks.
>
> Questions to discuss:
>
> * Clone flags exhaustion. Currently there is only one unused clone flag
> bit left, and it may be worth to use it to extend arguments of the clone
> system call.
>
> * Realtime clock implementation details:
>   Is having a simple offset enough?
>   What to do when date and time is changed on the host?
>   Is there a need to adjust vfs modification and creation times? 
>   Implementation for adjtime() syscall.

Overall I support this effort.  In my quick skim this code looked good.

My feeling is that we need to be able to support running ntpd and
support one namespace doing googles smoothing of leap seconds while
another namespace takes the leap second.

What I was imagining when I was last thinking about this was one
instance of struct timekeeper aka tk_core per time namespace.  That
structure already keeps offsets for all of the various clocks from
the kerne internal time sources.  What would be needed would be to
pass in an appropriate time namespace pointer.

I could be completely wrong as I have not take the time to completely
trace through the code.  Have you looked at pushing the time namespace
down as far as tk_core?

What I think would be the big advantage (besides ntp working) is that
the bulk of the code could be reused.  Allowing testing of the kernel's
time code by setting up a new time namespace.  So a person in production
could setup a time namespace with the time set ahead a little  bit and
be able to verify that the kernel handles the upcoming leap second
properly.

I don't know about the vfs.  I think the danger is being able to write
dates in the future or in the past.  It appears that utimes(2) and
utimesnat(2) already allow this except for status change.  So it is
possible we simply don't care.  I seem to remember that what nfs does
is take the time stamp from the host writing to the file.

I think the guide for filesystem timestamps should be to first ensure
we don't introduce security issues, and then do what distributed
filesystems do when dealing with hosts with different clocks.

Given those those two guidlines above I don't think there is a need to
change timestamsp the way the user namespace changes uid when displayed.

As for the hardware like the real time clock we definitely should not
let a root in a time namespace change it.  We might even be able to get
away with leaving the real time clock out of the time namespace.  If not
we need to be very careful how the real time clock is abstracted.  I
would start by leaving the real time clock hardware out of the time
namespace and see if there is any part of userspace that cares.

Eric

> Cc: Dmitry Safonov <0x7f454...@gmail.com>
> Cc: Adrian Reber 
> Cc: Andrei Vagin 
> Cc: Andy Lutomirski 
> Cc: Christian Brauner 
> Cc: Cyrill Gorcunov 
> Cc: "Eric W. Biederman" 
> Cc: "H. Peter Anvin"  
> Cc: Ingo Molnar 
> Cc: Jeff Dike 
> Cc: Oleg Nesterov 
> Cc: Pavel Emelyanov 
> Cc: Shuah Khan 
> Cc: Thomas Gleixner 
> Cc: contain...@lists.linux-foundation.org
> Cc: c...@openvz.org
> Cc: linux-...@vger.kernel.org
> Cc: x...@kernel.org
>
> Andrei Vagin (12):
>   ns: Introduce Time Namespace
>   timens: Add timens_offsets
>   timens: Introduce CLOCK_MONOTONIC offsets
>   timens: Introduce CLOCK_BOOTTIME offset
>   timerfd/timens: Take into account ns clock offsets
>   kernel: Take into account timens clock offsets in clock_nanosleep
>

Re: [RFC 00/20] ns: Introduce Time Namespace

2018-09-21 Thread Eric W. Biederman

Dmitry Safonov  writes:

> Discussions around time virtualization are there for a long time.
> The first attempt to implement time namespace was in 2006 by Jeff Dike.
> From that time, the topic appears on and off in various discussions.
>
> There are two main use cases for time namespaces:
> 1. change date and time inside a container;
> 2. adjust clocks for a container restored from a checkpoint.
>
> “It seems like this might be one of the last major obstacles keeping
> migration from being used in production systems, given that not all
> containers and connections can be migrated as long as a time dependency
> is capable of messing it up.” (by github.com/dav-ell)
>
> The kernel provides access to several clocks: CLOCK_REALTIME,
> CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> start points for them are not defined and are different for each running
> system. When a container is migrated from one node to another, all
> clocks have to be restored into consistent states; in other words, they
> have to continue running from the same points where they have been
> dumped.
>
> The main idea behind this patch set is adding per-namespace offsets for
> system clocks. When a process in a non-root time namespace requests
> time of a clock, a namespace offset is added to the current value of
> this clock on a host and the sum is returned.
>
> All offsets are placed on a separate page, this allows up to map it as 
> part of vvar into user processes and use offsets from vdso calls.
>
> Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> clocks.
>
> Questions to discuss:
>
> * Clone flags exhaustion. Currently there is only one unused clone flag
> bit left, and it may be worth to use it to extend arguments of the clone
> system call.
>
> * Realtime clock implementation details:
>   Is having a simple offset enough?
>   What to do when date and time is changed on the host?
>   Is there a need to adjust vfs modification and creation times? 
>   Implementation for adjtime() syscall.

Overall I support this effort.  In my quick skim this code looked good.

My feeling is that we need to be able to support running ntpd and
support one namespace doing googles smoothing of leap seconds while
another namespace takes the leap second.

What I was imagining when I was last thinking about this was one
instance of struct timekeeper aka tk_core per time namespace.  That
structure already keeps offsets for all of the various clocks from
the kerne internal time sources.  What would be needed would be to
pass in an appropriate time namespace pointer.

I could be completely wrong as I have not take the time to completely
trace through the code.  Have you looked at pushing the time namespace
down as far as tk_core?

What I think would be the big advantage (besides ntp working) is that
the bulk of the code could be reused.  Allowing testing of the kernel's
time code by setting up a new time namespace.  So a person in production
could setup a time namespace with the time set ahead a little  bit and
be able to verify that the kernel handles the upcoming leap second
properly.

I don't know about the vfs.  I think the danger is being able to write
dates in the future or in the past.  It appears that utimes(2) and
utimesnat(2) already allow this except for status change.  So it is
possible we simply don't care.  I seem to remember that what nfs does
is take the time stamp from the host writing to the file.

I think the guide for filesystem timestamps should be to first ensure
we don't introduce security issues, and then do what distributed
filesystems do when dealing with hosts with different clocks.

Given those those two guidlines above I don't think there is a need to
change timestamsp the way the user namespace changes uid when displayed.

As for the hardware like the real time clock we definitely should not
let a root in a time namespace change it.  We might even be able to get
away with leaving the real time clock out of the time namespace.  If not
we need to be very careful how the real time clock is abstracted.  I
would start by leaving the real time clock hardware out of the time
namespace and see if there is any part of userspace that cares.

Eric

> Cc: Dmitry Safonov <0x7f454...@gmail.com>
> Cc: Adrian Reber 
> Cc: Andrei Vagin 
> Cc: Andy Lutomirski 
> Cc: Christian Brauner 
> Cc: Cyrill Gorcunov 
> Cc: "Eric W. Biederman" 
> Cc: "H. Peter Anvin"  
> Cc: Ingo Molnar 
> Cc: Jeff Dike 
> Cc: Oleg Nesterov 
> Cc: Pavel Emelyanov 
> Cc: Shuah Khan 
> Cc: Thomas Gleixner 
> Cc: contain...@lists.linux-foundation.org
> Cc: c...@openvz.org
> Cc: linux-...@vger.kernel.org
> Cc: x...@kernel.org
>
> Andrei Vagin (12):
>   ns: Introduce Time Namespace
>   timens: Add timens_offsets
>   timens: Introduce CLOCK_MONOTONIC offsets
>   timens: Introduce CLOCK_BOOTTIME offset
>   timerfd/timens: Take into account ns clock offsets
>   kernel: Take into account timens clock offsets in clock_nanosleep
>

42 matches

Mail list logo