Re: [RFC 00/20] ns: Introduce Time Namespace
On Mon, Oct 29, 2018 at 09:33:14PM +0100, Thomas Gleixner wrote:
> Andrei,
>
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
> > When a container is migrated to another host, we have to restore its
> > monotonic and boottime clocks, but we still expect that the container
> > will continue using the host real-time clock.
> >
> > Before starting this series, I was thinking about this, I decided that
> > these cases can be solved independently. Probably, the full isolation of
> > the time sub-system will have much higher overhead than just offsets for
> > a few clocks. And the idea that isolation of the real-time clock should
> > be optional gives us another hint that offsets for monotonic and
> > boot-time clocks can be implemented independently.
> >
> > Eric and Thomas, what do you think about this? If you agree that these
> > two cases can be implemented separately, what should we do with this
> > series to make it ready to be merged?
> >
> > I know that we need to:
> >
> > * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
>
> and CLOCK_BOOTTIME and that's quite a few.
>
> > * forbid changing offsets after creating timers
>
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?

We didn't find any proc files where boot or monotonic time is reported,
but we will double check this.

>
> Aside from that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.

It is a good point. We will work on it.

>
> I can see the urge for this, but please let us think it through properly
> before rushing in anything which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried that without a clear plan about the overall
> picture, we end up with duct tape which is hard to disentangle after the
> fact.

Thomas, there is no rush at all. This functionality is critical for CRIU,
but we have enough time to solve it properly. The only thing I want is
that this functionality continues moving forward and will not be put on
the back burner.

>
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
>
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing we can do now.
>
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I'd
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
>
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?

Dmitry and I are going to be there.

Thanks!
Andrei

>
> Thanks,
>
> tglx
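The offset idea discussed in this message (restoring a migrated container's monotonic and boottime clocks while leaving the real-time clock alone) can be illustrated with a small userspace sketch in C. It is not taken from the patch series. The helper names `restore_offset_ns` and `container_time` are hypothetical, and the sketch assumes the namespace stores a single signed nanosecond offset per clock, chosen once at restore time.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Convert a normalised timespec to signed nanoseconds. */
static int64_t ts_to_ns(struct timespec ts)
{
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < NSEC_PER_SEC);
    return (int64_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

/* Convert signed nanoseconds back to a timespec with 0 <= tv_nsec < 1e9. */
static struct timespec ns_to_ts(int64_t ns)
{
    struct timespec ts;

    ts.tv_sec = (time_t)(ns / NSEC_PER_SEC);
    ts.tv_nsec = (long)(ns % NSEC_PER_SEC);
    if (ts.tv_nsec < 0) {          /* C division truncates toward zero */
        ts.tv_sec -= 1;
        ts.tv_nsec += NSEC_PER_SEC;
    }
    return ts;
}

/*
 * Hypothetical helper: pick the offset at restore time so that the
 * container's clock continues from the value saved at checkpoint,
 * whatever the destination host's clock currently reads.
 */
static int64_t restore_offset_ns(struct timespec saved, struct timespec host_now)
{
    return ts_to_ns(saved) - ts_to_ns(host_now);
}

/* Hypothetical helper: the time a task inside the container would observe. */
static struct timespec container_time(struct timespec host_now, int64_t offset_ns)
{
    return ns_to_ts(ts_to_ns(host_now) + offset_ns);
}
```

In a restore tool, `host_now` would come from `clock_gettime(CLOCK_MONOTONIC, ...)` or `clock_gettime(CLOCK_BOOTTIME, ...)` on the destination host. Because the offset is fixed after restore, the container's clock advances at the host's rate and never goes backwards relative to the checkpointed value.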
Re: [RFC 00/20] ns: Introduce Time Namespace
Eric,

On Mon, 29 Oct 2018, Eric W. Biederman wrote:
> Thomas Gleixner writes:
> >
> > I'll try to find some time in the next weeks to look deeper into that, but
> > I can't promise anything before returning from LPC. Btw, LPC would be a
> > great opportunity to discuss that. Are you and the other name space wizards
> > there by any chance?
>
> I will be and there are going to be both container and CRIU
> mini-conferences. So there should be at least some of us around.

So let's try to find a slot for a BOF or similar (there might still be
slots for the kernel summit available, I'll ask).

Thanks,

tglx
Re: [RFC 00/20] ns: Introduce Time Namespace
Thomas Gleixner writes:

> Andrei,
>
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
>> When a container is migrated to another host, we have to restore its
>> monotonic and boottime clocks, but we still expect that the container
>> will continue using the host real-time clock.
>>
>> Before starting this series, I was thinking about this, I decided that
>> these cases can be solved independently. Probably, the full isolation of
>> the time sub-system will have much higher overhead than just offsets for
>> a few clocks. And the idea that isolation of the real-time clock should
>> be optional gives us another hint that offsets for monotonic and
>> boot-time clocks can be implemented independently.
>>
>> Eric and Thomas, what do you think about this? If you agree that these
>> two cases can be implemented separately, what should we do with this
>> series to make it ready to be merged?
>>
>> I know that we need to:
>>
>> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
>
> and CLOCK_BOOTTIME and that's quite a few.
>
>> * forbid changing offsets after creating timers
>
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?
>
> Aside from that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.
>
> I can see the urge for this, but please let us think it through properly
> before rushing in anything which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried that without a clear plan about the overall
> picture, we end up with duct tape which is hard to disentangle after the
> fact.
>
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
>
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing we can do now.
>
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I'd
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
>
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?

I will be and there are going to be both container and CRIU
mini-conferences. So there should be at least some of us around.

Eric
Re: [RFC 00/20] ns: Introduce Time Namespace
Andrei,

On Sat, 20 Oct 2018, Andrei Vagin wrote:
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
>
> Before starting this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
>
> Eric and Thomas, what do you think about this? If you agree that these
> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
>
> I know that we need to:
>
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.

and CLOCK_BOOTTIME and that's quite a few.

> * forbid changing offsets after creating timers

There are more things to think about. What about interfaces which expose
boot time or monotonic time in /proc?

Aside from that (I finally came around to look at the series in more detail)
I'm really unhappy about the unconditional overhead once the Time namespace
config switch is enabled. This applies especially to the VDSO. We spent
quite some time recently to squeeze a few cycles out of those functions and
it would be a pity to pointlessly waste cycles for the !namespace case.

I can see the urge for this, but please let us think it through properly
before rushing in anything which we are going to regret once we want to do
more sophisticated time domain management, e.g. support for isolated clock
real time. I'm worried that without a clear plan about the overall
picture, we end up with duct tape which is hard to disentangle after the
fact.

There have been a few other things brought up versus time management in
general, like the TSN folks utilizing grand clock masters which expose
random time instead of proper TAI. Plus some requirements for exposing some
sort of 'monotonic' clocks which are derived from external synchronization
mechanisms, but should not affect the regular time keeping clocks.

While different issues, these all fall into the category of separate time
domains, so taking a step back to the drawing board is probably the best
thing we can do now.

There are certainly a few things which can be looked at independently,
e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
kernel with these name space functions applying offsets left and right. I'd
rather have dedicated core functionality which replaces/amends existing
timer functions to become time namespace aware.

I'll try to find some time in the next weeks to look deeper into that, but
I can't promise anything before returning from LPC. Btw, LPC would be a
great opportunity to discuss that. Are you and the other name space wizards
there by any chance?

Thanks,

tglx
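As one concrete instance of the /proc question raised in this message: on Linux, the first field of /proc/uptime reports the time since boot in seconds, which is the same quantity CLOCK_BOOTTIME measures. The C sketch below reads both so they can be compared. The helper names `parse_uptime` and `read_uptime_and_boottime` are hypothetical, and the sketch says nothing about how the series under discussion handles this file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <time.h>

/*
 * Parse the first field of a /proc/uptime line, which has the form
 * "<uptime seconds> <idle seconds>". Returns 0 on success, -1 on failure.
 */
static int parse_uptime(const char *line, double *uptime_s)
{
    assert(line && uptime_s);
    return sscanf(line, "%lf", uptime_s) == 1 ? 0 : -1;
}

/*
 * Read /proc/uptime and CLOCK_BOOTTIME back to back.
 * Returns 0 on success, -1 if either source is unavailable.
 */
static int read_uptime_and_boottime(double *uptime_s, double *boottime_s)
{
    char buf[128];
    struct timespec ts;
    FILE *f = fopen("/proc/uptime", "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    if (parse_uptime(buf, uptime_s) != 0)
        return -1;
    if (clock_gettime(CLOCK_BOOTTIME, &ts) != 0)
        return -1;

    *boottime_s = (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
    return 0;
}
```

If a namespace applied an offset to CLOCK_BOOTTIME but not to this file, the two values returned by `read_uptime_and_boottime` would disagree by that offset inside the container, which is the kind of inconsistency the question points at.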
Re: [RFC 00/20] ns: Introduce Time Namespace
On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> Thomas Gleixner writes:
>
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> Reading the code the calling sequence there is:
> >> tick_sched_do_timer
> >>    tick_do_update_jiffies64
> >>       update_wall_time
> >>          timekeeping_advance
> >>             timekeeping_update
> >>
> >> If I read that properly under the right nohz circumstances that update
> >> can be delayed indefinitely.
> >>
> >> So I think we could prototype a time namespace that was per
> >> timekeeping_update and just had update_wall_time iterate through
> >> all of the time namespaces.
> >
> > Please don't go there. timekeeping_update() is already heavy and walking
> > through a gazillion of namespaces will just make it horrible.
> >
> >> I don't think the naive version would scale to very many time
> >> namespaces.
> >
> > :)
> >
> >> At the same time using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all name spaces
> > when they are all idle long enough.
>
> The wrap around constraints being how long before the time sources wrap
> around so you have to read them once per wrap around? I have not dug
> deeply enough into the code to see that yet.
>
> > From there it becomes hairy, because it's not only timekeeping,
> > i.e. reading time, this is also affecting all timers which are armed from a
> > namespace.
> >
> > That gets really ugly because when you do settimeofday() or adjtimex() for
> > a particular namespace, then you have to search for all armed timers of
> > that namespace and adjust them.
> >
> > The original posix timer code had the same issue because it mapped the
> > clock realtime timers to the timer wheel so any setting of the clock caused
> > a full walk of all armed timers, disarming, adjusting and requeueing
> > them. That's horrible not only performance wise, it's also a locking
> > nightmare of all sorts.
> >
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> >
> > I haven't looked through Dmitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
>
> Then it sounds like this will take some more digging.
>
> Please pardon me for thinking out loud.
>
> There are one or more time sources that we use to compute the time
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
>
> Each time source needs to be sampled at least once per wrap-around
> and something incremented so that we don't lose time when looking
> at that time source.
>
> There are several clocks presented to userspace and they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.
>
> I see two fundamental driving cases for a time namespace.
> 1) Migration from one node to another node in a cluster in almost
>    real time.
>
>    The problem is that CLOCK_MONOTONIC between nodes in the cluster
>    has no relationship to each other (except a synchronized length of
>    the second). So applications that migrate can see CLOCK_MONOTONIC
>    and CLOCK_BOOTTIME go backwards.
>
>    This is the truly pressing problem and adding some kind of offset
>    sounds like it would be the solution. Possibly by allowing a boot
>    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
>
> 2) Dealing with two separate time management domains. Say a machine
>    that needs to deal with both something inside of google where they
>    slew time to avoid leap time seconds and something in the outside
>    world proper UTC time is kept as an offset from TAI with the
>    occasional leap seconds.
>
>    In the latter case it would fundamentally require having seconds of
>    different length.
>

I want to add that the second case should be optional.

When a container is migrated to another host, we have to restore its
monotonic and boottime clocks, but we still expect that the container
will continue using the host real-time clock.

Before starting this series, I was thinking about this and decided that
these cases can be solved independently. Probably, the full isolation of
the time sub-system will have much higher overhead than just offsets for
a few clocks. And the idea that isolation of the real-time clock should
be optional gives us another hint that offsets for monotonic and
boot-time clocks can be implemented independently.

Eric and Thomas, what do you think about this? If you agree that these
two cases can be implemented separately, what should we do with this
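
For illustration only, the migration case (case 1 above) can be modelled in a few lines of Python. This is a toy sketch, not code from the patch series: the TimeNamespace class, its monotonic_offset_ns field and offset_for_restore() are hypothetical names, and the only assumption is that the namespace clock is the host clock plus a constant offset.

```python
from dataclasses import dataclass

NSEC_PER_SEC = 1_000_000_000


@dataclass
class TimeNamespace:
    """Toy model (hypothetical): namespace clock = host clock + constant offset."""
    monotonic_offset_ns: int = 0

    def clock_monotonic(self, host_monotonic_ns: int) -> int:
        return host_monotonic_ns + self.monotonic_offset_ns


def offset_for_restore(checkpointed_ns: int, dest_host_ns: int) -> int:
    """Offset that lets the container clock resume at its checkpointed value."""
    return checkpointed_ns - dest_host_ns


# The container saw ~10 days of monotonic time on the source host.
checkpointed = 10 * 24 * 3600 * NSEC_PER_SEC
# The destination host has only been up for one hour, so without an
# offset the container would see CLOCK_MONOTONIC jump backwards.
dest_host_now = 3600 * NSEC_PER_SEC

ns = TimeNamespace(offset_for_restore(checkpointed, dest_host_now))
restored = ns.clock_monotonic(dest_host_now)
later = ns.clock_monotonic(dest_host_now + 5 * NSEC_PER_SEC)
```

In this model the real-time clock is untouched, which matches the expectation that the container keeps using the host real-time clock.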
Re: [RFC 00/20] ns: Introduce Time Namespace
Dmitry,

On Tue, 2 Oct 2018, Dmitry Safonov wrote:
> On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner wrote:
> > I explained that in detail in this thread, but it's not about the initial
> > setting of clock mono/boot before any timers have been armed.
> >
> > It's about setting the offset or clock realtime (via settimeofday) when
> > timers are already armed. Also having an entirely different time domain,
> > e.g. separate NTP adjustments, makes that necessary.
>
> It looks like there is a bit of a misunderstanding here:
> Andrei was talking about the current RFC version, where we haven't
> introduced offsets for clock realtime. While Thomas, IIUC, is looking at
> how to extend the time namespace to realtime.
>
> As CLOCK_REALTIME virtualization raises so many complex questions,
> like a different length of the second or a list of realtime timers in the
> ns, we haven't added any implementation for it.
>
> It seems like an initial introduction of timens can be extended later to
> cover realtime clocks too. While it may seem incomplete, it solves issues
> with restoring/migrating real-world applications such as nodejs and Oracle
> DB server, which fail after being restored if there is a leap in monotonic
> time.

Well, yes. But you really have to think about the full picture.

Just adding part of the overall solution right now, just because it can be
glued into the code easily, is not the best approach IMO as it might result
in substantial rework of the whole thing sooner than later.

I really don't want to end up with something which is not extensible and
has to be supported forever.

Just for the record, the current approach with name space offsets for
monotonic is also prone to malfunction vs. timers, unless you can prevent
changing the offset _after_ the namespace has been set up and timers have
been armed. I admit that I did not look closely enough to verify that.

> While solving the mentioned issues, it doesn't bring overhead.
> (well, Andy noted that cmp for zero-offsets on vdso can be optimized too,
> which will be done in v1).
>
> Thomas, thanks much for your input - now we know that we'll need to
> introduce a list for timers in the namespace when we add realtime clocks.
> Do you believe that CLOCK_MONOTONIC_SYNC would be an easier
> concept than offsets per-namespace?

Haven't thought it through. This was just an idea in reaction to Eric's
question whether setting clock monotonic might be feasible. But yes, it
might be worth thinking about.

I think you should really define the long term requirements for time
namespaces and perhaps set some limitations in functionality upfront.

Thanks,

	tglx
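
The malfunction vs. timers mentioned above can be shown with a small arithmetic sketch. It is a toy model under stated assumptions, not kernel code: to_host_time() is a hypothetical helper, namespace time is assumed to be host time plus an offset, and the timer is assumed to be queued once on the host time base and never requeued.

```python
def to_host_time(ns_expiry_ns: int, offset_ns: int) -> int:
    """Convert an absolute expiry in namespace time to host time (hypothetical helper)."""
    return ns_expiry_ns - offset_ns


offset = 1_000       # namespace time = host time + offset
ns_expiry = 5_000    # absolute expiry as requested inside the namespace

# The timer is queued on the host time base using the offset valid at arm time.
host_expiry = to_host_time(ns_expiry, offset)

# The offset is lowered after the timer is already queued.
new_offset = 400

# When the host clock reaches host_expiry the timer fires, but the
# namespace clock has not reached the requested expiry yet.
ns_time_at_fire = host_expiry + new_offset
fired_early_by = ns_expiry - ns_time_at_fire
```

Lowering the offset makes the queued timer fire early in namespace time; raising it makes the timer fire late by the same amount. Avoiding both requires either requeueing the timer or forbidding the offset change.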
Re: [RFC 00/20] ns: Introduce Time Namespace
Hi Thomas, Andrei, Eric,

On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner wrote:
>
> On Mon, 1 Oct 2018, Andrey Vagin wrote:
>
> > On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> > > On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > > timers as well, because you need to guarantee that they are not expiring
> > > > early.
> > > >
> > > > I haven't looked through Dmitry's patches yet, but I don't see how this
> > > > can work at all without introducing subtle issues all over the place.
> > >
> > > And just a quick scan tells me that this is broken. Timers will expire
> > > early or late. The latter is acceptable to some extent, but larger delays
> > > might come with surprise. Expiring early is an absolute nono.
> >
> > Do you mean that we have to adjust all timers after changing offset for
> > CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
> > monotonic and boot times will be set immediately after creating a time
> > namespace before using any timers.
>
> I explained that in detail in this thread, but it's not about the initial
> setting of clock mono/boot before any timers have been armed.
>
> It's about setting the offset or clock realtime (via settimeofday) when
> timers are already armed. Also having an entirely different time domain,
> e.g. separate NTP adjustments, makes that necessary.

It looks like there is a bit of a misunderstanding here:
Andrei was talking about the current RFC version, where we haven't
introduced offsets for clock realtime. While Thomas, IIUC, is looking at
how to extend the time namespace to realtime.

As CLOCK_REALTIME virtualization raises so many complex questions,
like a different length of the second or a list of realtime timers in the
ns, we haven't added any implementation for it.

It seems like an initial introduction of timens can be extended later to
cover realtime clocks too. While it may seem incomplete, it solves issues
with restoring/migrating real-world applications such as nodejs and Oracle
DB server, which fail after being restored if there is a leap in monotonic
time.

While solving the mentioned issues, it doesn't bring overhead.
(well, Andy noted that cmp for zero-offsets on vdso can be optimized too,
which will be done in v1).

Thomas, thanks much for your input - now we know that we'll need to
introduce a list for timers in the namespace when we add realtime clocks.
Do you believe that CLOCK_MONOTONIC_SYNC would be an easier
concept than offsets per-namespace?

Thanks,
Dmitry
Re: [RFC 00/20] ns: Introduce Time Namespace
On Mon, 1 Oct 2018, Andrey Vagin wrote:

> On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> > On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dmitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> >
> > And just a quick scan tells me that this is broken. Timers will expire
> > early or late. The latter is acceptable to some extent, but larger delays
> > might come with surprise. Expiring early is an absolute nono.
>
> Do you mean that we have to adjust all timers after changing offset for
> CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
> monotonic and boot times will be set immediately after creating a time
> namespace before using any timers.

I explained that in detail in this thread, but it's not about the initial
setting of clock mono/boot before any timers have been armed.

It's about setting the offset or clock realtime (via settimeofday) when
timers are already armed. Also having an entirely different time domain,
e.g. separate NTP adjustments, makes that necessary.

Thanks,

	tglx
Re: [RFC 00/20] ns: Introduce Time Namespace
On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> >
> > I haven't looked through Dmitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
>
> And just a quick scan tells me that this is broken. Timers will expire
> early or late. The latter is acceptable to some extent, but larger delays
> might come with surprise. Expiring early is an absolute nono.

Do you mean that we have to adjust all timers after changing offset for
CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
monotonic and boot times will be set immediately after creating a time
namespace before using any timers.

It is interesting to think about what a use case for changing these offsets
after creating timers would be. It may be useful for testing. A user sets a
timer for an hour ahead, then moves a clock offset forward and checks that
a test application handles the timer properly.

>
> Thanks,
>
> 	tglx
>
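
The restriction described above (offsets are set right after the namespace is created, before any timer is used) could be enforced with a simple guard. The sketch below is a toy model only: FrozenOffsetNamespace, its methods and the choice of PermissionError are all hypothetical and do not reflect what the patches actually do.

```python
class FrozenOffsetNamespace:
    """Toy model (hypothetical): the offset may only change while no timer is armed."""

    def __init__(self) -> None:
        self.monotonic_offset_ns = 0
        self.armed_timers = 0

    def set_monotonic_offset(self, offset_ns: int) -> None:
        if self.armed_timers:
            # Changing the offset now would make queued timers fire early or late.
            raise PermissionError("offset is frozen once a timer has been armed")
        self.monotonic_offset_ns = offset_ns

    def arm_timer(self, ns_expiry_ns: int) -> int:
        """Arm an absolute timer and return its expiry on the host time base."""
        self.armed_timers += 1
        return ns_expiry_ns - self.monotonic_offset_ns


ns = FrozenOffsetNamespace()
ns.set_monotonic_offset(1_000)      # allowed: no timers armed yet
host_expiry = ns.arm_timer(5_000)   # queued at host time 4000

try:
    ns.set_monotonic_offset(400)    # rejected: a timer is already armed
    rejected = False
except PermissionError:
    rejected = True
```

The testing use case mentioned above would not work under such a guard, which is the trade-off of freezing the offset instead of requeueing timers.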
Re: [RFC 00/20] ns: Introduce Time Namespace
Thomas Gleixner writes:

> Eric,
>
> On Fri, 28 Sep 2018, Eric W. Biederman wrote:
>> Thomas Gleixner writes:
>> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> >> At the same time using the techniques from the nohz work and a little
>> >> smarts I expect we could get the code to scale.
>> >
>> > You'd need to invoke the update when the namespace is switched in and
>> > hasn't been updated since the last tick happened. That might be doable, but
>> > you also need to take the wraparound constraints of the underlying
>> > clocksources into account, which again can cause walking all name spaces
>> > when they are all idle long enough.
>>
>> The wrap around constraints being how long before the time sources wrap
>> around so you have to read them once per wrap around? I have not dug
>> deeply enough into the code to see that yet.
>
> It's done by limiting the NOHZ idle time when all CPUs are going into deep
> sleep for a long time, i.e. we make sure that at least one CPU comes back
> sufficiently _before_ the wraparound happens and invokes the update
> function.
>
> It's not so much a problem for TSC, but not every clocksource the kernel
> supports has wraparound times in the range of hundreds of years.
>
> But yes, your idea of keeping track of wraparounds might work. Tricky, but
> looks feasible at first sight, but we should be aware of the dragons.

Oh. Yes. Definitely. A key enabler of any namespace implementation is
figuring out how to tame the dragons.

>> Please pardon me for thinking out loud.
>>
>> There are one or more time sources that we use to compute the time
>> and for each time source we have a conversion from ticks of the
>> time source to nanoseconds.
>>
>> Each time source needs to be sampled at least once per wrap-around
>> and something incremented so that we don't lose time when looking
>> at that time source.
>>
>> There are several clocks presented to userspace and they all share the
>> same length of second and are all fundamentally offsets from
>> CLOCK_MONOTONIC.
>
> Yes. That's the readout side. This one is doable. But now look at timers.
>
> If you arm the timer from a name space, then it needs to be converted to
> host time in order to sort it into the hrtimer queue and at some point arm
> the clockevent device for it. This works as long as host and name space
> time have a constant offset and the same skew.
>
> Once the name space time has a different skew this falls apart because the
> armed timer will either expire late or early.
>
> Late might be acceptable, early violates the spec. You could do an extra
> check for rescheduling it, if it's early, but that requires to store the
> name space time accessor in the hrtimer itself because not every timer
> expiry happens so that it can be checked in the name space context (think
> signal based timers). We need to add this extra magic right into
> __hrtimer_run_queues() which is called from the hard and soft interrupt. We
> really don't want to touch all relevant callbacks or syscalls. The latter
> is not sufficient anyway for signal based timer delivery.
>
> That's going to be interesting in terms of synchronization and might also
> cause substantial overhead at least for the timers which belong to name
> spaces.
>
> But that also means that anything which is early can and probably will
> cause rearming of the timer hardware possibly for a very short delta. We
> need to think about whether this can be abused to create interrupt storms.
>
> Now if you accept a bit late, which I'm not really happy about, then you
> surely won't accept very late, i.e. hours, days. But that can happen when
> settimeofday() comes into play. Right now with a single time domain, this
> is easy. When settimeofday() or adjtimex() makes time jump, we just go and
> reprogram the hardware timers accordingly, which might also result in
> immediate expiry of timers.
>
> But this does not help for time jumps in name spaces because the timer is
> enqueued on the host time base.
>
> And no, we should not think about creating per name space hrtimer queues
> and then have to walk through all of them for finding the first expiring
> timer in order to arm the hardware. That cannot scale.
>
> Walking all hrtimer bases on all CPUs and check all queued timers whether
> they belong to the affected name space does not scale either.
>
> So we'd need to keep track of queued timers belonging to a name space and
> then just handle them. Interesting locking problem and also a scalability
> issue because this might need to be done on all online CPUs. Haven't
> thought it through, but it makes me shudder.

Yes. I can see how this is a dragon that we need to figure out how to
tame. It already exists somewhat for CLOCK_MONOTONIC vs CLOCK_REALTIME
but still.

>> I see two fundamental driving cases for a time namespace.
>
>
>
> I completely understand the problem you are trying to solve and yes, the
> read out of time should be a solvable problem. There is
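
The skew problem and the "extra check for rescheduling" from the quoted text can be sketched with integer arithmetic. This is a toy model under stated assumptions, not hrtimer code: the namespace clock is assumed to be a linear function of host time with a rate in parts per million, and ns_clock() and host_expiry_for() are hypothetical helpers.

```python
PPM = 1_000_000


def ns_clock(host_ns: int, rate_ppm: int) -> int:
    """Namespace time as a linear function of host time (toy model, zero base)."""
    return host_ns * rate_ppm // PPM


def host_expiry_for(ns_expiry: int, rate_ppm: int) -> int:
    """Earliest host time at which the namespace clock has reached ns_expiry."""
    return -(-ns_expiry * PPM // rate_ppm)   # ceiling division on integers


ns_expiry = 1_000_000

# Armed while host and namespace run at the same rate.
host_expiry = host_expiry_for(ns_expiry, rate_ppm=1_000_000)

# The namespace clock is then slewed to run 1000 ppm slower than the host.
slewed_rate = 999_000

# Check at expiry: read the namespace clock when the host timer fires.
early_by = ns_expiry - ns_clock(host_expiry, slewed_rate)

# Early expiry is not allowed, so the timer is requeued for the remainder.
if early_by > 0:
    rearmed_host_expiry = host_expiry_for(ns_expiry, slewed_rate)
else:
    rearmed_host_expiry = host_expiry
```

The requeue here is only 1002 ns after the first expiry, which is the kind of very short rearm delta the quoted text worries about.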
Re: [RFC 00/20] ns: Introduce Time Namespace
Thomas Gleixner writes: > Eric, > > On Fri, 28 Sep 2018, Eric W. Biederman wrote: >> Thomas Gleixner writes: >> > On Wed, 26 Sep 2018, Eric W. Biederman wrote: >> >> At the same time using the techniques from the nohz work and a little >> >> smarts I expect we could get the code to scale. >> > >> > You'd need to invoke the update when the namespace is switched in and >> > hasn't been updated since the last tick happened. That might be doable, but >> > you also need to take the wraparound constraints of the underlying >> > clocksources into account, which again can cause walking all name spaces >> > when they are all idle long enough. >> >> The wrap around constraints being how long before the time sources wrap >> around so you have to read them once per wrap around? I have not dug >> deeply enough into the code to see that yet. > > It's done by limiting the NOHZ idle time when all CPUs are going into deep > sleep for a long time, i.e. we make sure that at least one CPU comes back > sufficiently _before_ the wraparound happens and invokes the update > function. > > It's not so much a problem for TSC, but not every clocksource the kernel > supports has wraparound times in the range of hundreds of years. > > But yes, your idea of keeping track of wraparounds might work. Tricky, but > looks feasible on first sight, but we should be aware of the dragons. Oh. Yes. Definitely. A key enabler of any namespace implementation is figuring out how to tame the dragons. >> Please pardon me for thinking out load. >> >> There are one or more time sources that we use to compute the time >> and for each time source we have a conversion from ticks of the >> time source to nanoseconds. >> >> Each time source needs to be sampled at least once per wrap-around >> and something incremented so that we don't loose time when looking >> at that time source. 
>> >> There are several clocks presented to userspace and they all share the >> same length of second and are all fundamentally offsets from >> CLOCK_MONOTONIC. > > Yes. That's the readout side. This one is doable. But now look at timers. > > If you arm the timer from a name space, then it needs to be converted to > host time in order to sort it into the hrtimer queue and at some point arm > the clockevent device for it. This works as long as host and name space > time have a constant offset and the same skew. > > Once the name space time has a different skew this falls apart because the > armed timer will either expire late or early. > > Late might be acceptable, early violates the spec. You could do an extra > check for rescheduling it, if it's early, but that requires to store the > name space time accessor in the hrtimer itself because not every timer > expiry happens so that it can be checked in the name space context (think > signal based timers). We need to add this extra magic right into > __hrtimer_run_queues() which is called from the hard and soft interrupt. We > really don't want to touch all relevant callbacks or syscalls. The latter > is not sufficient anyway for signal based timer delivery. > > That's going to be interesting in terms of synchronization and might also > cause substantial overhead at least for the timers which belong to name > spaces. > > But that also means that anything which is early can and probably will > cause rearming of the timer hardware possibly for a very short delta. We > need to think about whether this can be abused to create interrupt storms. > > Now if you accept a bit late, which I'm not really happy about, then you > surely won't accept very late, i.e. hours, days. But that can happen when > settimeofday() comes into play. Right now with a single time domain, this > is easy. 
When settimeofday() or adjtimex() makes time jump, we just go and > reprogram the hardware timers accordingly, which might also result in > immediate expiry of timers. > > But this does not help for time jumps in name spaces because the timer is > enqueued on the host time base. > > And no, we should not think about creating per name space hrtimer queues > and then have to walk through all of them for finding the first expiring > timer in order to arm the hardware. That cannot scale. > > Walking all hrtimer bases on all CPUs and checking all queued timers whether > they belong to the affected name space does not scale either. > > So we'd need to keep track of queued timers belonging to a name space and > then just handle them. Interesting locking problem and also a scalability > issue because this might need to be done on all online CPUs. Haven't > thought it through, but it makes me shudder. Yes. I can see how this is a dragon that we need to figure out how to tame. It already exists somewhat for CLOCK_MONOTONIC vs CLOCK_REALTIME but still. >> I see two fundamental driving cases for a time namespace. > > > > I completely understand the problem you are trying to solve and yes, the > read out of time should be a solvable problem. There is
Re: [RFC 00/20] ns: Introduce Time Namespace
Eric, On Fri, 28 Sep 2018, Eric W. Biederman wrote: > Thomas Gleixner writes: > > On Wed, 26 Sep 2018, Eric W. Biederman wrote: > >> At the same time using the techniques from the nohz work and a little > >> smarts I expect we could get the code to scale. > > > > You'd need to invoke the update when the namespace is switched in and > > hasn't been updated since the last tick happened. That might be doable, but > > you also need to take the wraparound constraints of the underlying > > clocksources into account, which again can cause walking all name spaces > > when they are all idle long enough. > > The wrap around constraints being how long before the time sources wrap > around so you have to read them once per wrap around? I have not dug > deeply enough into the code to see that yet. It's done by limiting the NOHZ idle time when all CPUs are going into deep sleep for a long time, i.e. we make sure that at least one CPU comes back sufficiently _before_ the wraparound happens and invokes the update function. It's not so much a problem for TSC, but not every clocksource the kernel supports has wraparound times in the range of hundreds of years. But yes, your idea of keeping track of wraparounds might work. Tricky, but looks feasible on first sight, but we should be aware of the dragons. > Please pardon me for thinking out load. > > There are one or more time sources that we use to compute the time > and for each time source we have a conversion from ticks of the > time source to nanoseconds. > > Each time source needs to be sampled at least once per wrap-around > and something incremented so that we don't loose time when looking > at that time source. > > There are several clocks presented to userspace and they all share the > same length of second and are all fundamentally offsets from > CLOCK_MONOTONIC. Yes. That's the readout side. This one is doable. But now look at timers. 
If you arm the timer from a name space, then it needs to be converted to host time in order to sort it into the hrtimer queue and at some point arm the clockevent device for it. This works as long as host and name space time have a constant offset and the same skew.

Once the name space time has a different skew this falls apart because the armed timer will either expire late or early.

Late might be acceptable, early violates the spec. You could do an extra check for rescheduling it, if it's early, but that requires storing the name space time accessor in the hrtimer itself because not every timer expiry happens so that it can be checked in the name space context (think signal based timers). We need to add this extra magic right into __hrtimer_run_queues() which is called from the hard and soft interrupt. We really don't want to touch all relevant callbacks or syscalls. The latter is not sufficient anyway for signal based timer delivery.

That's going to be interesting in terms of synchronization and might also cause substantial overhead at least for the timers which belong to name spaces.

But that also means that anything which is early can and probably will cause rearming of the timer hardware possibly for a very short delta. We need to think about whether this can be abused to create interrupt storms.

Now if you accept a bit late, which I'm not really happy about, then you surely won't accept very late, i.e. hours, days. But that can happen when settimeofday() comes into play. Right now with a single time domain, this is easy. When settimeofday() or adjtimex() makes time jump, we just go and reprogram the hardware timers accordingly, which might also result in immediate expiry of timers.

But this does not help for time jumps in name spaces because the timer is enqueued on the host time base.

And no, we should not think about creating per name space hrtimer queues and then have to walk through all of them for finding the first expiring timer in order to arm the hardware. That cannot scale. Walking all hrtimer bases on all CPUs and check all queued timers whether they belong to the affected name space does not scale either. So we'd need to keep track of queued timers belonging to a name space and then just handle them. Interesting locking problem and also a scalability issue because this might need to be done on all online CPUs. Haven't thought it through, but it makes me shudder. > I see two fundamental driving cases for a time namespace. I completely understand the problem you are trying to solve and yes, the read out of time should be a solvable problem. > For timers my inclination would be to assume no adjustments to the > current time parameters and set the timer to go off then. If the time > on the appropriate clock has been changed since the timer was set and > the timer is going off early reschedule so the timer fires at the > appropriate time. See above. > Not that I think a final implementation would necessary look like what I > have described. I just think it is possible with extreme care to
Re: [RFC 00/20] ns: Introduce Time Namespace
Thomas Gleixner writes: > On Wed, 26 Sep 2018, Eric W. Biederman wrote: >> Reading the code the calling sequence there is: >> tick_sched_do_timer >>tick_do_update_jiffies64 >> update_wall_time >> timekeeping_advance >> timekeepging_update >> >> If I read that properly under the right nohz circumstances that update >> can be delayed indefinitely. >> >> So I think we could prototype a time namespace that was per >> timekeeping_update and just had update_wall_time iterate through >> all of the time namespaces. > > Please don't go there. timekeeping_update() is already heavy and walking > through a gazillion of namespaces will just make it horrible, > >> I don't think the naive version would scale to very many time >> namespaces. > > :) > >> At the same time using the techniques from the nohz work and a little >> smarts I expect we could get the code to scale. > > You'd need to invoke the update when the namespace is switched in and > hasn't been updated since the last tick happened. That might be doable, but > you also need to take the wraparound constraints of the underlying > clocksources into account, which again can cause walking all name spaces > when they are all idle long enough. The wrap around constraints being how long before the time sources wrap around so you have to read them once per wrap around? I have not dug deeply enough into the code to see that yet. > From there it becomes hairy, because it's not only timekeeping, > i.e. reading time, this is also affecting all timers which are armed from a > namespace. > > That gets really ugly because when you do settimeofday() or adjtimex() for > a particular namespace, then you have to search for all armed timers of > that namespace and adjust them. > > The original posix timer code had the same issue because it mapped the > clock realtime timers to the timer wheel so any setting of the clock caused > a full walk of all armed timers, disarming, adjusting and requeing > them. 
That's horrible not only performance wise, it's also a locking > nightmare of all sorts. > > Add time skew via NTP/PTP into the picture and you might have to adjust > timers as well, because you need to guarantee that they are not expiring > early. > > I haven't looked through Dmitry's patches yet, but I don't see how this > can work at all without introducing subtle issues all over the place. Then it sounds like this will take some more digging. Please pardon me for thinking out loud. There are one or more time sources that we use to compute the time and for each time source we have a conversion from ticks of the time source to nanoseconds. Each time source needs to be sampled at least once per wrap-around and something incremented so that we don't lose time when looking at that time source. There are several clocks presented to userspace and they all share the same length of second and are all fundamentally offsets from CLOCK_MONOTONIC. I see two fundamental driving cases for a time namespace. 1) Migration from one node to another node in a cluster in almost real time. The problem is that CLOCK_MONOTONIC on different nodes in the cluster has no relationship to each other (except a synchronized length of the second). So applications that migrate can see CLOCK_MONOTONIC and CLOCK_BOOTTIME go backwards. This is the truly pressing problem and adding some kind of offset sounds like it would be the solution. Possibly by allowing a boot time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC. 2) Dealing with two separate time management domains. Say a machine that needs to deal with both something inside of Google, where they slew time to avoid leap seconds, and something in the outside world where proper UTC time is kept as an offset from TAI with the occasional leap second. The latter case would fundamentally require having seconds of different length. A pure 64bit nanosecond counter is good for 500 years.
So 64bit variables can be used to hold time, and everything can be converted from there. This suggests we can for ticks have two values. - The number of ticks from the time source. - The number of times the ticks would have rolled over. That sounds like it may be a little simplistic as it would require being very diligent about firing a timer exactly at rollover and not losing that, but for a handwaving argument is probably enough to generate a 64bit tick counter. If the focus is on a 64bit tick counter then what update_wall_time has to do is very limited. Just deal with the accounting needed to cope with tick rollover. Getting the actual time looks like it would be as simple as now, with perhaps an extra addition to account for the number of times the tick counter has rolled over. With limited precision arithmetic and various optimizations I don't think it is that simple to implement but it feels like it should be very little extra work. For timers my inclination would be to
Re: [RFC 00/20] ns: Introduce Time Namespace
On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
>
> I haven't looked through Dmitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.

And just a quick scan tells me that this is broken. Timers will expire early or late. The latter is acceptable to some extent, but larger delays might come as a surprise. Expiring early is an absolute no-no.

Thanks,

tglx
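
To make the early/late distinction concrete, here is a minimal userspace sketch of the expiry check, assuming the simplest possible model where namespace time is host time plus a constant offset. The names `ns_clock`, `ns_to_host()` and `remaining_at_expiry()` are hypothetical and are not taken from the patch series or from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: namespace time = host time + offset_ns. */
struct ns_clock {
    int64_t offset_ns;
};

/* Convert a namespace expiry into the host time base used for queueing. */
static int64_t ns_to_host(const struct ns_clock *c, int64_t ns_time)
{
    assert(c != 0);
    return ns_time - c->offset_ns;
}

/*
 * Called when the host-side timer fires at host_now. Returns 0 if the
 * timer is due (or late) in namespace time, otherwise the number of
 * nanoseconds still missing, i.e. the timer would be early and has to
 * be re-armed for that delta.
 */
static int64_t remaining_at_expiry(const struct ns_clock *c,
                                   int64_t ns_expiry, int64_t host_now)
{
    assert(c != 0);
    int64_t ns_now = host_now + c->offset_ns;

    return ns_now >= ns_expiry ? 0 : ns_expiry - ns_now;
}
```

With an offset of 5000 ns, a namespace expiry of 10000 is queued at host time 5000. If the offset is lowered to 2000 after arming, the check at host time 5000 reports 3000 ns remaining, so firing at that point would be early. If the offset is raised instead, the timer is merely late. This only models a constant offset; it says nothing about different skew or about the locking involved.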
Re: [RFC 00/20] ns: Introduce Time Namespace
On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> Reading the code the calling sequence there is:
> tick_sched_do_timer
>   tick_do_update_jiffies64
>     update_wall_time
>       timekeeping_advance
>         timekeeping_update
>
> If I read that properly under the right nohz circumstances that update
> can be delayed indefinitely.
>
> So I think we could prototype a time namespace that was per
> timekeeping_update and just had update_wall_time iterate through
> all of the time namespaces.

Please don't go there. timekeeping_update() is already heavy and walking through a gazillion namespaces will just make it horrible.

> I don't think the naive version would scale to very many time
> namespaces.

:)

> At the same time using the techniques from the nohz work and a little
> smarts I expect we could get the code to scale.

You'd need to invoke the update when the namespace is switched in and hasn't been updated since the last tick happened. That might be doable, but you also need to take the wraparound constraints of the underlying clocksources into account, which again can cause walking all name spaces when they are all idle long enough.

From there it becomes hairy, because it's not only timekeeping, i.e. reading time, this is also affecting all timers which are armed from a namespace.

That gets really ugly because when you do settimeofday() or adjtimex() for a particular namespace, then you have to search for all armed timers of that namespace and adjust them.

The original posix timer code had the same issue because it mapped the clock realtime timers to the timer wheel so any setting of the clock caused a full walk of all armed timers, disarming, adjusting and requeueing them. That's horrible not only performance wise, it's also a locking nightmare of all sorts.

Add time skew via NTP/PTP into the picture and you might have to adjust timers as well, because you need to guarantee that they are not expiring early.

I haven't looked through Dmitry's patches yet, but I don't see how this can work at all without introducing subtle issues all over the place.

Thanks,

tglx
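
The wraparound constraint mentioned above can be illustrated with a small userspace sketch. It assumes a free-running counter of a given bit width and accumulates a masked delta on every sample; `wrap_counter` and its helpers are hypothetical names for illustration, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical free-running hardware counter that is `bits` wide. */
struct wrap_counter {
    uint64_t mask;     /* all ones in the low `bits` bits */
    uint64_t last_raw; /* raw value seen at the previous sample */
    uint64_t total;    /* accumulated 64-bit tick count */
};

static uint64_t width_mask(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

static void wrap_counter_init(struct wrap_counter *wc, unsigned bits,
                              uint64_t raw)
{
    wc->mask = width_mask(bits);
    wc->last_raw = raw & wc->mask;
    wc->total = 0;
}

/*
 * Fold a new raw reading into the 64-bit total. The masked subtraction
 * is only correct if less than one full wrap happened since the
 * previous sample, which is why somebody has to sample in time.
 */
static uint64_t wrap_counter_sample(struct wrap_counter *wc, uint64_t raw)
{
    uint64_t delta = (raw - wc->last_raw) & wc->mask;

    wc->last_raw = raw & wc->mask;
    wc->total += delta;
    return wc->total;
}

/* Whole seconds until a counter of this width wraps at freq_hz. */
static uint64_t wrap_period_seconds(unsigned bits, uint64_t freq_hz)
{
    assert(freq_hz > 0);
    return width_mask(bits) / freq_hz;
}
```

For an 8-bit counter that goes from 250 to 5, the masked delta is 11 ticks, which is right. If a full 256 ticks pass without a sample, the next reading looks identical and the time is silently lost. A 32-bit counter at 1 MHz wraps after 4294 seconds, while a 64-bit counter at 1 GHz lasts about 18446744073 seconds, which is in the range of several hundred years.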
Re: [RFC 00/20] ns: Introduce Time Namespace
2018-09-26 18:36 GMT+01:00 Eric W. Biederman:
> The advantage of timekeeping_update per time namespace is that it allows
> different lengths of seconds per time namespace. Which allows testing
> ntp and the kernel in interesting ways while still having a working
> production configuration on the same system.

Just a quick note: the different length of second per namespace sounds very interesting from my POV. I remember seeing this article:
http://publish.illinois.edu/science-of-security-lablet/files/2014/05/DSSnet-A-Smart-Grid-Modeling-Platform-Combining-Electrical-Power-Distributtion-System-Simulation-and-Software-Defined-Networking-Emulation.pdf

And their implementation, which simulates time running at a different speed per PID (with the vDSO disabled):
https://github.com/littlepretty/VirtualTimeKernel

Thanks,
Dmitry
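
As a rough illustration of what "a different length of second" means for reading time, here is a minimal sketch of a clock that runs at a rational rate relative to the host. It is an assumption for discussion only: `scaled_clock` and its helpers are made-up names, and this is not how the linked project or the patch series implements it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical clock running at num/den times the host rate:
 *   virt = virt_base_ns + (host_now - host_base_ns) * num / den
 */
struct scaled_clock {
    int64_t host_base_ns; /* host time when the current rate was set */
    int64_t virt_base_ns; /* virtual time at that moment */
    uint32_t num;
    uint32_t den;
};

static int64_t scaled_clock_read(const struct scaled_clock *c,
                                 int64_t host_now_ns)
{
    assert(c->den != 0);
    /* Note: elapsed * num can overflow for very long intervals. */
    int64_t elapsed = host_now_ns - c->host_base_ns;

    return c->virt_base_ns + elapsed * (int64_t)c->num / (int64_t)c->den;
}

/*
 * Change the rate without making virtual time jump: rebase both time
 * bases to "now" first, then switch to the new ratio.
 */
static void scaled_clock_set_rate(struct scaled_clock *c,
                                  int64_t host_now_ns,
                                  uint32_t num, uint32_t den)
{
    assert(den != 0);
    c->virt_base_ns = scaled_clock_read(c, host_now_ns);
    c->host_base_ns = host_now_ns;
    c->num = num;
    c->den = den;
}
```

A clock started at half speed reads 500 after 1000 host nanoseconds. Switching it to double speed at that point and reading 500 host nanoseconds later gives 1500. The readout side stays this simple; the hard part raised in this thread is that a timer queued on the host time base before the rate change now has the wrong host expiry.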
Re: [RFC 00/20] ns: Introduce Time Namespace
Andrey Vagin writes: > On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote: >> Andrey Vagin writes: >> >> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote: >> >> Dmitry Safonov writes: >> >> >> >> > Discussions around time virtualization are there for a long time. >> >> > The first attempt to implement time namespace was in 2006 by Jeff Dike. >> >> > From that time, the topic appears on and off in various discussions. >> >> > >> >> > There are two main use cases for time namespaces: >> >> > 1. change date and time inside a container; >> >> > 2. adjust clocks for a container restored from a checkpoint. >> >> > >> >> > “It seems like this might be one of the last major obstacles keeping >> >> > migration from being used in production systems, given that not all >> >> > containers and connections can be migrated as long as a time dependency >> >> > is capable of messing it up.” (by github.com/dav-ell) >> >> > >> >> > The kernel provides access to several clocks: CLOCK_REALTIME, >> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the >> >> > start points for them are not defined and are different for each running >> >> > system. When a container is migrated from one node to another, all >> >> > clocks have to be restored into consistent states; in other words, they >> >> > have to continue running from the same points where they have been >> >> > dumped. >> >> > >> >> > The main idea behind this patch set is adding per-namespace offsets for >> >> > system clocks. When a process in a non-root time namespace requests >> >> > time of a clock, a namespace offset is added to the current value of >> >> > this clock on a host and the sum is returned. >> >> > >> >> > All offsets are placed on a separate page, this allows up to map it as >> >> > part of vvar into user processes and use offsets from vdso calls. >> >> > >> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME >> >> > clocks. 
>> >> > >> >> > Questions to discuss: >> >> > >> >> > * Clone flags exhaustion. Currently there is only one unused clone flag >> >> > bit left, and it may be worth to use it to extend arguments of the clone >> >> > system call. >> >> > >> >> > * Realtime clock implementation details: >> >> > Is having a simple offset enough? >> >> > What to do when date and time is changed on the host? >> >> > Is there a need to adjust vfs modification and creation times? >> >> > Implementation for adjtime() syscall. >> >> >> >> Overall I support this effort. In my quick skim this code looked good. >> > >> > Hi Eric, >> > >> > Thank you for the feedback. >> > >> >> >> >> My feeling is that we need to be able to support running ntpd and >> >> support one namespace doing googles smoothing of leap seconds while >> >> another namespace takes the leap second. >> >> >> >> What I was imagining when I was last thinking about this was one >> >> instance of struct timekeeper aka tk_core per time namespace. That >> >> structure already keeps offsets for all of the various clocks from >> >> the kerne internal time sources. What would be needed would be to >> >> pass in an appropriate time namespace pointer. >> >> >> >> I could be completely wrong as I have not take the time to completely >> >> trace through the code. Have you looked at pushing the time namespace >> >> down as far as tk_core? >> >> >> >> What I think would be the big advantage (besides ntp working) is that >> >> the bulk of the code could be reused. Allowing testing of the kernel's >> >> time code by setting up a new time namespace. So a person in production >> >> could setup a time namespace with the time set ahead a little bit and >> >> be able to verify that the kernel handles the upcoming leap second >> >> properly. >> >> >> > >> > It is an interesting idea, but I have a few questions: >> > >> > 1. Does it mean that timekeeping_update() will be called for each >> > namespace? 
This function is called periodically; it updates times in the >> > timekeeper structure, updates vsyscall_gtod_data, etc. What will be the >> > overhead of this? >> >> I don't know if periodically is a proper characterization. There may be >> a code path that does that. But from what I can see timekeeping_update >> is the guts of settimeofday (and a few related functions). >> >> So it appears to make sense for timekeeping_update to be per namespace. >> >> Hmm. Looking at what is updated in the vsyscall_gtod_data it does >> look like you would have to periodically update things, but I don't know >> how big that period would be. As long as the period is reasonably large, >> or the time namespaces were sufficiently desynchronized it should not >> be a problem. But that is the class of problem that could make >> my ideal impractical if there is measurable overhead. >> >> Where were you seeing timekeeping_update being called periodically? > > timekeeping_update() is called HZ times per second: > > [ 67.912858]
Re: [RFC 00/20] ns: Introduce Time Namespace
On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote: > Andrey Vagin writes: > > > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote: > >> Dmitry Safonov writes: > >> > >> > Discussions around time virtualization are there for a long time. > >> > The first attempt to implement time namespace was in 2006 by Jeff Dike. > >> > From that time, the topic appears on and off in various discussions. > >> > > >> > There are two main use cases for time namespaces: > >> > 1. change date and time inside a container; > >> > 2. adjust clocks for a container restored from a checkpoint. > >> > > >> > “It seems like this might be one of the last major obstacles keeping > >> > migration from being used in production systems, given that not all > >> > containers and connections can be migrated as long as a time dependency > >> > is capable of messing it up.” (by github.com/dav-ell) > >> > > >> > The kernel provides access to several clocks: CLOCK_REALTIME, > >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the > >> > start points for them are not defined and are different for each running > >> > system. When a container is migrated from one node to another, all > >> > clocks have to be restored into consistent states; in other words, they > >> > have to continue running from the same points where they have been > >> > dumped. > >> > > >> > The main idea behind this patch set is adding per-namespace offsets for > >> > system clocks. When a process in a non-root time namespace requests > >> > time of a clock, a namespace offset is added to the current value of > >> > this clock on a host and the sum is returned. > >> > > >> > All offsets are placed on a separate page, this allows up to map it as > >> > part of vvar into user processes and use offsets from vdso calls. > >> > > >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME > >> > clocks. 
> >> > > >> > Questions to discuss: > >> > > >> > * Clone flags exhaustion. Currently there is only one unused clone flag > >> > bit left, and it may be worth to use it to extend arguments of the clone > >> > system call. > >> > > >> > * Realtime clock implementation details: > >> > Is having a simple offset enough? > >> > What to do when date and time is changed on the host? > >> > Is there a need to adjust vfs modification and creation times? > >> > Implementation for adjtime() syscall. > >> > >> Overall I support this effort. In my quick skim this code looked good. > > > > Hi Eric, > > > > Thank you for the feedback. > > > >> > >> My feeling is that we need to be able to support running ntpd and > >> support one namespace doing googles smoothing of leap seconds while > >> another namespace takes the leap second. > >> > >> What I was imagining when I was last thinking about this was one > >> instance of struct timekeeper aka tk_core per time namespace. That > >> structure already keeps offsets for all of the various clocks from > >> the kerne internal time sources. What would be needed would be to > >> pass in an appropriate time namespace pointer. > >> > >> I could be completely wrong as I have not take the time to completely > >> trace through the code. Have you looked at pushing the time namespace > >> down as far as tk_core? > >> > >> What I think would be the big advantage (besides ntp working) is that > >> the bulk of the code could be reused. Allowing testing of the kernel's > >> time code by setting up a new time namespace. So a person in production > >> could setup a time namespace with the time set ahead a little bit and > >> be able to verify that the kernel handles the upcoming leap second > >> properly. > >> > > > > It is an interesting idea, but I have a few questions: > > > > 1. Does it mean that timekeeping_update() will be called for each > > namespace? 
This function is called periodically; it updates times in the > > timekeeper structure, updates vsyscall_gtod_data, etc. What will be the > > overhead of this? > > I don't know if periodically is a proper characterization. There may be > a code path that does that. But from what I can see timekeeping_update > is the guts of settimeofday (and a few related functions). > > So it appears to make sense for timekeeping_update to be per namespace. > > Hmm. Looking at what is updated in the vsyscall_gtod_data it does > look like you would have to periodically update things, but I don't know > how big that period would be. As long as the period is reasonably large, > or the time namespaces were sufficiently desynchronized it should not > be a problem. But that is the class of problem that could make > my ideal impractical if there is measurable overhead. > > Where were you seeing timekeeping_update being called periodically? timekeeping_update() is called HZ times per second: [ 67.912858] timekeeping_update.cold.26+0x5/0xa [ 67.913332] timekeeping_advance+0x361/0x5c0 [ 67.913857] ? tick_sched_do_timer+0x55/0x70 [
Re: [RFC 00/20] ns: Introduce Time Namespace
Andrey Vagin writes: > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote: >> Dmitry Safonov writes: >> >> > Discussions around time virtualization are there for a long time. >> > The first attempt to implement time namespace was in 2006 by Jeff Dike. >> > From that time, the topic appears on and off in various discussions. >> > >> > There are two main use cases for time namespaces: >> > 1. change date and time inside a container; >> > 2. adjust clocks for a container restored from a checkpoint. >> > >> > “It seems like this might be one of the last major obstacles keeping >> > migration from being used in production systems, given that not all >> > containers and connections can be migrated as long as a time dependency >> > is capable of messing it up.” (by github.com/dav-ell) >> > >> > The kernel provides access to several clocks: CLOCK_REALTIME, >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the >> > start points for them are not defined and are different for each running >> > system. When a container is migrated from one node to another, all >> > clocks have to be restored into consistent states; in other words, they >> > have to continue running from the same points where they have been >> > dumped. >> > >> > The main idea behind this patch set is adding per-namespace offsets for >> > system clocks. When a process in a non-root time namespace requests >> > time of a clock, a namespace offset is added to the current value of >> > this clock on a host and the sum is returned. >> > >> > All offsets are placed on a separate page, this allows up to map it as >> > part of vvar into user processes and use offsets from vdso calls. >> > >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME >> > clocks. >> > >> > Questions to discuss: >> > >> > * Clone flags exhaustion. Currently there is only one unused clone flag >> > bit left, and it may be worth to use it to extend arguments of the clone >> > system call. 
>> > >> > * Realtime clock implementation details: >> > Is having a simple offset enough? >> > What to do when date and time is changed on the host? >> > Is there a need to adjust vfs modification and creation times? >> > Implementation for adjtime() syscall. >> >> Overall I support this effort. In my quick skim this code looked good. > > Hi Eric, > > Thank you for the feedback. > >> >> My feeling is that we need to be able to support running ntpd and >> support one namespace doing googles smoothing of leap seconds while >> another namespace takes the leap second. >> >> What I was imagining when I was last thinking about this was one >> instance of struct timekeeper aka tk_core per time namespace. That >> structure already keeps offsets for all of the various clocks from >> the kerne internal time sources. What would be needed would be to >> pass in an appropriate time namespace pointer. >> >> I could be completely wrong as I have not take the time to completely >> trace through the code. Have you looked at pushing the time namespace >> down as far as tk_core? >> >> What I think would be the big advantage (besides ntp working) is that >> the bulk of the code could be reused. Allowing testing of the kernel's >> time code by setting up a new time namespace. So a person in production >> could setup a time namespace with the time set ahead a little bit and >> be able to verify that the kernel handles the upcoming leap second >> properly. >> > > It is an interesting idea, but I have a few questions: > > 1. Does it mean that timekeeping_update() will be called for each > namespace? This functions is called periodically, it updates times on the > timekeeper structure, updates vsyscall_gtod_data, etc. What will be an > overhead of this? I don't know if periodically is a proper characterization. There may be a code path that does that. But from what I can see timekeeping_update is the guts of settimeofday (and a few related functions). 
So it appears to make sense for timekeeping_update to be per namespace. Hmm. Looking at what is updated in the vsyscall_gtod_data it does look like you would have to periodically update things, but I don't know how big that period would be. As long as the period is reasonably large, or the time namespaces were sufficiently desynchronized it should not be a problem. But that is the class of problem that could make my ideal impractical if there is measurable overhead. Where were you seeing timekeeping_update being called periodically? > 2. What will we do with vdso? It looks like we will have to have a > separate vsyscall_gtod_data for each ns and update each of them > separately. Yes. But you don't have to introduce another variable; just make certain vsyscall_gtod_data is a page-aligned thing per time namespace. If I read the summary of the existing patchset, something very similar is already going on. Each process would only map one. And unshare of the time namespace would need to
Re: [RFC 00/20] ns: Introduce Time Namespace
On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote: > Dmitry Safonov writes: > > > Discussions around time virtualization are there for a long time. > > The first attempt to implement time namespace was in 2006 by Jeff Dike. > > From that time, the topic appears on and off in various discussions. > > > > There are two main use cases for time namespaces: > > 1. change date and time inside a container; > > 2. adjust clocks for a container restored from a checkpoint. > > > > “It seems like this might be one of the last major obstacles keeping > > migration from being used in production systems, given that not all > > containers and connections can be migrated as long as a time dependency > > is capable of messing it up.” (by github.com/dav-ell) > > > > The kernel provides access to several clocks: CLOCK_REALTIME, > > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the > > start points for them are not defined and are different for each running > > system. When a container is migrated from one node to another, all > > clocks have to be restored into consistent states; in other words, they > > have to continue running from the same points where they have been > > dumped. > > > > The main idea behind this patch set is adding per-namespace offsets for > > system clocks. When a process in a non-root time namespace requests > > time of a clock, a namespace offset is added to the current value of > > this clock on a host and the sum is returned. > > > > All offsets are placed on a separate page, this allows up to map it as > > part of vvar into user processes and use offsets from vdso calls. > > > > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME > > clocks. > > > > Questions to discuss: > > > > * Clone flags exhaustion. Currently there is only one unused clone flag > > bit left, and it may be worth to use it to extend arguments of the clone > > system call. 
> > > > * Realtime clock implementation details: > > Is having a simple offset enough? > > What to do when date and time is changed on the host? > > Is there a need to adjust vfs modification and creation times? > > Implementation for adjtime() syscall. > > Overall I support this effort. In my quick skim this code looked good. Hi Eric, Thank you for the feedback. > > My feeling is that we need to be able to support running ntpd and > support one namespace doing Google's smoothing of leap seconds while > another namespace takes the leap second. > > What I was imagining when I was last thinking about this was one > instance of struct timekeeper aka tk_core per time namespace. That > structure already keeps offsets for all of the various clocks from > the kernel internal time sources. What would be needed would be to > pass in an appropriate time namespace pointer. > > I could be completely wrong as I have not taken the time to completely > trace through the code. Have you looked at pushing the time namespace > down as far as tk_core? > > What I think would be the big advantage (besides ntp working) is that > the bulk of the code could be reused. Allowing testing of the kernel's > time code by setting up a new time namespace. So a person in production > could set up a time namespace with the time set ahead a little bit and > be able to verify that the kernel handles the upcoming leap second > properly. > It is an interesting idea, but I have a few questions: 1. Does it mean that timekeeping_update() will be called for each namespace? This function is called periodically; it updates times in the timekeeper structure, updates vsyscall_gtod_data, etc. What will be the overhead of this? 2. What will we do with vdso? It looks like we will have to have a separate vsyscall_gtod_data for each ns and update each of them separately. > > > I don't know about the vfs. I think the danger is being able to write > dates in the future or in the past.
It appears that utimes(2) and > utimensat(2) already allow this except for status change. So it is > possible we simply don't care. I seem to remember that what nfs does > is take the time stamp from the host writing to the file. > > I think the guide for filesystem timestamps should be to first ensure > we don't introduce security issues, and then do what distributed > filesystems do when dealing with hosts with different clocks. > > Given those two guidelines above I don't think there is a need to > change timestamps the way the user namespace changes uid when displayed. > > > > As for the hardware like the real time clock we definitely should not > let a root in a time namespace change it. We might even be able to get > away with leaving the real time clock out of the time namespace. If not > we need to be very careful how the real time clock is abstracted. I > would start by leaving the real time clock hardware out of the time > namespace and see if there is any part of userspace that cares. > > Eric > > > Cc: Dmitry Safonov
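For readers following the thread, the core idea in the quoted cover letter (a per-namespace offset is added to the host clock value and the sum is returned) can be sketched in userspace C roughly as below. This is only an illustration under stated assumptions, not code from the patch set: `struct ns_clock_offset`, `timespec_add_norm()` and `ns_clock_gettime_monotonic()` are hypothetical names, and the sketch assumes both operands are already normalized (tv_nsec in [0, 1e9)). The only real API it relies on is clock_gettime(2).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical per-namespace offset; NOT the structure used by the patch set. */
struct ns_clock_offset {
	struct timespec monotonic;
};

/*
 * Add two normalized timespec values (tv_nsec in [0, NSEC_PER_SEC)) and
 * carry any nanosecond overflow into tv_sec.
 */
static struct timespec timespec_add_norm(struct timespec a, struct timespec b)
{
	struct timespec r;

	r.tv_sec = a.tv_sec + b.tv_sec;
	r.tv_nsec = a.tv_nsec + b.tv_nsec;
	if (r.tv_nsec >= NSEC_PER_SEC) {
		r.tv_sec++;
		r.tv_nsec -= NSEC_PER_SEC;
	}
	/* Holds as long as both inputs were normalized. */
	assert(r.tv_nsec >= 0 && r.tv_nsec < NSEC_PER_SEC);
	return r;
}

/*
 * Read the host CLOCK_MONOTONIC and return host value + namespace offset.
 * Returns 0 on success, -1 if clock_gettime() fails.
 */
static int ns_clock_gettime_monotonic(const struct ns_clock_offset *off,
				      struct timespec *out)
{
	struct timespec host;

	if (clock_gettime(CLOCK_MONOTONIC, &host) != 0)
		return -1;
	*out = timespec_add_norm(host, off->monotonic);
	return 0;
}
```

With an offset of 100 seconds, a caller would see a monotonic clock that runs at the host rate but reads 100 seconds ahead. On restore after migration, the offset would presumably be chosen so that the clock continues from the value recorded at dump time.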
Re: [RFC 00/20] ns: Introduce Time Namespace
On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> Dmitry Safonov writes:
>
> > Discussions around time virtualization have been going on for a long
> > time. The first attempt to implement a time namespace was in 2006 by
> > Jeff Dike. Since then, the topic has appeared on and off in various
> > discussions.
> >
> > There are two main use cases for time namespaces:
> > 1. change date and time inside a container;
> > 2. adjust clocks for a container restored from a checkpoint.
> >
> > “It seems like this might be one of the last major obstacles keeping
> > migration from being used in production systems, given that not all
> > containers and connections can be migrated as long as a time dependency
> > is capable of messing it up.” (by github.com/dav-ell)
> >
> > The kernel provides access to several clocks: CLOCK_REALTIME,
> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but
> > the start points for them are not defined and are different for each
> > running system. When a container is migrated from one node to another,
> > all clocks have to be restored into consistent states; in other words,
> > they have to continue running from the same points where they have been
> > dumped.
> >
> > The main idea behind this patch set is adding per-namespace offsets for
> > system clocks. When a process in a non-root time namespace requests
> > time of a clock, a namespace offset is added to the current value of
> > this clock on a host and the sum is returned.
> >
> > All offsets are placed on a separate page, this allows us to map it as
> > part of vvar into user processes and use offsets from vdso calls.
> >
> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> > clocks.
> >
> > Questions to discuss:
> >
> > * Clone flags exhaustion. Currently there is only one unused clone flag
> >   bit left, and it may be worth using it to extend the arguments of the
> >   clone system call.
> >
> > * Realtime clock implementation details:
> >   Is having a simple offset enough?
> >   What to do when date and time is changed on the host?
> >   Is there a need to adjust vfs modification and creation times?
> >   Implementation for adjtime() syscall.
>
> Overall I support this effort. In my quick skim this code looked good.

Hi Eric,

Thank you for the feedback.

>
> My feeling is that we need to be able to support running ntpd and
> support one namespace doing Google's smoothing of leap seconds while
> another namespace takes the leap second.
>
> What I was imagining when I was last thinking about this was one
> instance of struct timekeeper aka tk_core per time namespace. That
> structure already keeps offsets for all of the various clocks from
> the kernel internal time sources. What would be needed would be to
> pass in an appropriate time namespace pointer.
>
> I could be completely wrong as I have not taken the time to completely
> trace through the code. Have you looked at pushing the time namespace
> down as far as tk_core?
>
> What I think would be the big advantage (besides ntp working) is that
> the bulk of the code could be reused, allowing testing of the kernel's
> time code by setting up a new time namespace. So a person in production
> could set up a time namespace with the time set ahead a little bit and
> be able to verify that the kernel handles the upcoming leap second
> properly.
>

It is an interesting idea, but I have a few questions:

1. Does it mean that timekeeping_update() will be called for each
   namespace? This function is called periodically; it updates times in
   the timekeeper structure, updates vsyscall_gtod_data, etc. What will
   be the overhead of this?

2. What will we do with the vdso? It looks like we will have to have a
   separate vsyscall_gtod_data for each ns and update each of them
   separately.

>
> I don't know about the vfs. I think the danger is being able to write
> dates in the future or in the past. It appears that utimes(2) and
> utimensat(2) already allow this except for status change. So it is
> possible we simply don't care. I seem to remember that what nfs does
> is take the time stamp from the host writing to the file.
>
> I think the guide for filesystem timestamps should be to first ensure
> we don't introduce security issues, and then do what distributed
> filesystems do when dealing with hosts with different clocks.
>
> Given those two guidelines above I don't think there is a need to
> change timestamps the way the user namespace changes uid when displayed.
>
> As for the hardware like the real time clock we definitely should not
> let a root in a time namespace change it. We might even be able to get
> away with leaving the real time clock out of the time namespace. If not
> we need to be very careful how the real time clock is abstracted. I
> would start by leaving the real time clock hardware out of the time
> namespace and see if there is any part of userspace that cares.
>
> Eric
>
> > Cc: Dmitry Safonov
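The overhead question raised in this reply can be made concrete with a toy model. The sketch below is not kernel code and does not use any kernel API: Timekeeper, PerNamespaceTimekeepers and SharedTimekeeperWithOffsets are hypothetical names, and update() merely stands in for timekeeping_update(), which in reality does far more work (including refreshing the vsyscall data). The model only counts how often the periodic update runs under each design.

```python
from dataclasses import dataclass


@dataclass
class Timekeeper:
    """Toy stand-in for a timekeeper: a nanosecond counter plus a call count."""
    now_ns: int = 0
    updates: int = 0  # number of times update() has run

    def update(self, delta_ns):
        """Advance the clock; models one periodic update."""
        self.now_ns += delta_ns
        self.updates += 1


class PerNamespaceTimekeepers:
    """Design 1: one timekeeper per time namespace, all updated on every tick."""

    def __init__(self, start_values_ns):
        self.keepers = [Timekeeper(now_ns=v) for v in start_values_ns]

    def tick(self, delta_ns):
        for tk in self.keepers:  # work grows with the number of namespaces
            tk.update(delta_ns)

    def read(self, ns_index):
        return self.keepers[ns_index].now_ns

    def update_count(self):
        return sum(tk.updates for tk in self.keepers)


class SharedTimekeeperWithOffsets:
    """Design 2: one host timekeeper plus a constant offset per namespace."""

    def __init__(self, host_start_ns, start_values_ns):
        self.host = Timekeeper(now_ns=host_start_ns)
        self.offsets_ns = [v - host_start_ns for v in start_values_ns]

    def tick(self, delta_ns):
        self.host.update(delta_ns)  # one update regardless of namespace count

    def read(self, ns_index):
        # The per-namespace cost moves to the read path: a single addition.
        return self.host.now_ns + self.offsets_ns[ns_index]

    def update_count(self):
        return self.host.updates
```

With three namespaces and ten ticks, the first design performs thirty updates and the second ten, while both return the same readings. What the model leaves out is the reason for the first design: a constant offset cannot express per-namespace NTP adjustment or different leap second handling, which is what a per-namespace timekeeper would provide.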
Re: [RFC 00/20] ns: Introduce Time Namespace
Dmitry Safonov writes:

> Discussions around time virtualization have been going on for a long
> time. The first attempt to implement a time namespace was in 2006 by
> Jeff Dike. Since then, the topic has appeared on and off in various
> discussions.
>
> There are two main use cases for time namespaces:
> 1. change date and time inside a container;
> 2. adjust clocks for a container restored from a checkpoint.
>
> “It seems like this might be one of the last major obstacles keeping
> migration from being used in production systems, given that not all
> containers and connections can be migrated as long as a time dependency
> is capable of messing it up.” (by github.com/dav-ell)
>
> The kernel provides access to several clocks: CLOCK_REALTIME,
> CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but
> the start points for them are not defined and are different for each
> running system. When a container is migrated from one node to another,
> all clocks have to be restored into consistent states; in other words,
> they have to continue running from the same points where they have been
> dumped.
>
> The main idea behind this patch set is adding per-namespace offsets for
> system clocks. When a process in a non-root time namespace requests
> time of a clock, a namespace offset is added to the current value of
> this clock on a host and the sum is returned.
>
> All offsets are placed on a separate page, this allows us to map it as
> part of vvar into user processes and use offsets from vdso calls.
>
> Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> clocks.
>
> Questions to discuss:
>
> * Clone flags exhaustion. Currently there is only one unused clone flag
>   bit left, and it may be worth using it to extend the arguments of the
>   clone system call.
>
> * Realtime clock implementation details:
>   Is having a simple offset enough?
>   What to do when date and time is changed on the host?
>   Is there a need to adjust vfs modification and creation times?
>   Implementation for adjtime() syscall.

Overall I support this effort. In my quick skim this code looked good.

My feeling is that we need to be able to support running ntpd and
support one namespace doing Google's smoothing of leap seconds while
another namespace takes the leap second.

What I was imagining when I was last thinking about this was one
instance of struct timekeeper aka tk_core per time namespace. That
structure already keeps offsets for all of the various clocks from
the kernel internal time sources. What would be needed would be to
pass in an appropriate time namespace pointer.

I could be completely wrong as I have not taken the time to completely
trace through the code. Have you looked at pushing the time namespace
down as far as tk_core?

What I think would be the big advantage (besides ntp working) is that
the bulk of the code could be reused, allowing testing of the kernel's
time code by setting up a new time namespace. So a person in production
could set up a time namespace with the time set ahead a little bit and
be able to verify that the kernel handles the upcoming leap second
properly.

I don't know about the vfs. I think the danger is being able to write
dates in the future or in the past. It appears that utimes(2) and
utimensat(2) already allow this except for status change. So it is
possible we simply don't care. I seem to remember that what nfs does
is take the time stamp from the host writing to the file.

I think the guide for filesystem timestamps should be to first ensure
we don't introduce security issues, and then do what distributed
filesystems do when dealing with hosts with different clocks.

Given those two guidelines above I don't think there is a need to
change timestamps the way the user namespace changes uid when displayed.

As for the hardware like the real time clock we definitely should not
let a root in a time namespace change it. We might even be able to get
away with leaving the real time clock out of the time namespace. If not
we need to be very careful how the real time clock is abstracted. I
would start by leaving the real time clock hardware out of the time
namespace and see if there is any part of userspace that cares.

Eric

> Cc: Dmitry Safonov <0x7f454...@gmail.com>
> Cc: Adrian Reber
> Cc: Andrei Vagin
> Cc: Andy Lutomirski
> Cc: Christian Brauner
> Cc: Cyrill Gorcunov
> Cc: "Eric W. Biederman"
> Cc: "H. Peter Anvin"
> Cc: Ingo Molnar
> Cc: Jeff Dike
> Cc: Oleg Nesterov
> Cc: Pavel Emelyanov
> Cc: Shuah Khan
> Cc: Thomas Gleixner
> Cc: contain...@lists.linux-foundation.org
> Cc: c...@openvz.org
> Cc: linux-...@vger.kernel.org
> Cc: x...@kernel.org
>
> Andrei Vagin (12):
>   ns: Introduce Time Namespace
>   timens: Add timens_offsets
>   timens: Introduce CLOCK_MONOTONIC offsets
>   timens: Introduce CLOCK_BOOTTIME offset
>   timerfd/timens: Take into account ns clock offsets
>   kernel: Take into account timens clock offsets in clock_nanosleep
>
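The per-namespace offset scheme from the quoted cover letter (host clock value plus a namespace offset) and the checkpoint/restore use case can be sketched in userspace as follows. This is a minimal model under stated assumptions, not the patch set's implementation: TimeNamespaceModel and offset_for_restore are hypothetical names, and the host clock value can be passed in explicitly so the arithmetic is reproducible.

```python
import time
from dataclasses import dataclass, field

NSEC_PER_SEC = 1_000_000_000
CLOCK_MONOTONIC = time.CLOCK_MONOTONIC


@dataclass
class TimeNamespaceModel:
    """Hypothetical model of a time namespace: one offset (in ns) per clock."""
    offsets_ns: dict = field(default_factory=dict)

    def clock_gettime_ns(self, clock_id, host_now_ns=None):
        """Return the host clock value plus this namespace's offset.

        A clock without an entry in offsets_ns behaves as in the root
        namespace (offset 0).
        """
        if host_now_ns is None:
            host_now_ns = time.clock_gettime_ns(clock_id)
        return host_now_ns + self.offsets_ns.get(clock_id, 0)


def offset_for_restore(dumped_ns, host_now_ns):
    """Offset that lets a restored clock continue from its dumped value."""
    return dumped_ns - host_now_ns


# Example: the container was dumped with CLOCK_MONOTONIC at 5000 s, and the
# destination host has been up for only 120 s at restore time.
dumped = 5000 * NSEC_PER_SEC
dest_host_now = 120 * NSEC_PER_SEC
ns = TimeNamespaceModel(
    {CLOCK_MONOTONIC: offset_for_restore(dumped, dest_host_now)}
)

# Right after restore the container reads the dumped value again ...
at_restore = ns.clock_gettime_ns(CLOCK_MONOTONIC, host_now_ns=dest_host_now)
# ... and three seconds of host time later it reads the dumped value plus 3 s.
later = ns.clock_gettime_ns(
    CLOCK_MONOTONIC, host_now_ns=dest_host_now + 3 * NSEC_PER_SEC
)
print(at_restore, later)
```

The offset is computed once at restore time and never touched afterwards, so the clock inside the namespace stays monotonic as long as the host clock is.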