Re: [PATCH 1/3] Add TX sending hardware timestamp.

Willem de Bruijn Thu, 10 Dec 2020 15:35:21 -0800

> > If I understand correctly, you are trying to achieve a single delivery time.
> > The need for two separate timestamps passed along is only because the
> > kernel is unable to do the time base conversion.
>
> Yes, a correct point.
>
> >
> > Else, ETF could program the qdisc watchdog in system time and later,
> > on dequeue, convert skb->tstamp to the h/w time base before
> > passing it to the device.
>
> Or the skb->tstamp is HW time-stamp and the ETF convert it to system clock 
> based.
>
> >
> > It's still not entirely clear to me why the packet has to be held by
> > ETF initially first, if it is held until delivery time by hardware
> > later. But more on that below.
>
> Let plot a simple scenario.
> App A send a packet with time-stamp 100.
> After arrive a second packet from App B with time-stamp 90.
> Without ETF, the second packet will have to wait till the interface hardware 
> send the first packet on 100.
> Making the second packet late by 10 + first packet send time.
> Obviously other "normal" packets are send to the non-ETF queue, though they 
> do not block ETF packets
> The ETF delta is a barrier that the application have to send the packet 
> before to ensure the packet do not tossed.


Got it. The assumption here is that devices are FIFO. That is not
necessarily the case, but I do not know whether it is in practice,
e.g., on the i210.

>
> >
> > So far, the use case sounds a bit narrow and the use of two timestamp
> > fields for a single delivery event a bit of a hack.
>
> The definition of a hack is up to you

Fair enough :) That wasn't very constructive feedback on my part.

> > And one that does impose a cost in the hot path of many workloads
> > by adding a field the ip cookie, cork and writing to (possibly cold)
> > skb_shinfo for every packet.
>
> Most packets do not use skb->tstamp either, probably the cost of testing is 
> higher then just copying.
> But perhaps if we copy 2 time-stamp we can add a condition for both.
> What do you think?

I'd need to take a closer look at the skb_hwtstamps, which unlike
skb->tstamp lie in the skb_shared_data. If that is an otherwise cold
cacheline, then access would be expensive.

The ipcm and cork are admittedly cheap and not worth a branch. But
still it is good to understand that this situation of unsynchronized
clocks is a common operation condition for the foreseeable future, not
an unfortunate constraint of a single piece of hardware.

An extreme option would be moving everything behind a static_branch as
most hot paths will not have the feature enabled. But I'm not
seriously suggesting that for a few assignments.

> The cookie and the cork are just intermediate from application to SKB, I do 
> not think they cost much.
> Both writes of time stamp to the cookie and the cork are conditioned.
>
> >
> >>>>> Indeed, we want pacing offload to work for existing applications.
> >>>>>
> >>>> As the conversion of the PHC and the system clock is dynamic over time.
> >>>> How do you propse to achive it?
> >>>
> >>> Can you elaborate on this concern?
> >>
> >> Using single time stamp have 3 possible solutions:
> >>
> >> 1. Current solution, synchronize the system clock and the PHC.
> >>      Application uses the system clock.
> >>      The ETF can use the system clock for ordering and pass the packet to 
> >> the driver on time
> >>      The network interface hardware compare the time-stamp to the PHC.
> >>
> >> 2. The application convert the PHC time-stamp to system clock based.
> >>       The ETF works as solution 1
> >>       The network driver convert the system clock time-stamp back to PHC 
> >> time-stamp.
> >>       This solution need a new Net-Link flag and modify the relevant 
> >> network drivers.
> >>       Yet this solution have 2 problems:
> >>       * As applications today are not aware that system clock and PHC are 
> >> not synchronized and
> >>          therefore do not perform any conversion, most of them only use 
> >> the system clock.
> >>       * As the conversion in the network driver happens ~300 - 600 
> >> microseconds after
> >>          the application send the packet.
> >>          And as the PHC and system clock frequencies and offset can change 
> >> during this period.
> >>          The conversion will produce a different PHC time-stamp from the 
> >> application original time-stamp.
> >>          We require a precession of 1 nanoseconds of the PHC time-stamp.
> >>
> >> 3. The application uses PHC time-stamp for skb->tstamp
> >>      The ETF convert the  PHC time-stamp to system clock time-stamp.
> >>      This solution require implementations on supporting reading PHC clocks
> >>      from IRQ/kernel thread context in kernel space.
> >
> > ETF has to release the packet well in advance of the hardware
> > timestamp for the packet to arrive at the device on time. In practice
> > I would expect this delta parameter to be at least at usec timescale.
> > That gives some wiggle room with regard to s/w tstamp, at least.
>
> Yes, the author of the ETF uses a delta of 300 usec.
> The interface I use for testing, Intel I210 need around 100 usec to 150 usec.
> I believe it is related to PCIe speed of transferring the data on time and 
> probably similar to other interfaces using PCIe.
> If you overflow the interface hardware with high traffic the delta is much 
> higher.
> The clocks conversion error of the application is characteristic around ~1 
> usec to 5 usec for up to 10 ms sending a head.
>
> >
> > If changes in clock distance are relatively infrequent, could this
> > clock diff be a qdisc parameter, updated infrequently outside the
> > packet path?
>
> As the clocks are updating of both frequency and offset dynamically make it 
> very hard to perform.
> The rate of the update of the PHC depends on PTP setting (usually around 1 
> second).
> The rate of the update of the system clock depends how you synchronize it ( I 
> assume it is similar or slower).
> But user may require and use higher rates. It is only penalty by more traffic 
> and CPU.
> Bare in mind the 2 clocks synchronization are independent, the cross can be 
> unpredictable.
>
> The ETF would have to "know" on which packets we use the previous update and 
> which are the last update.
> And hope we do not "miss" updates.
>
> And we would need a "service" to update these values with proper 
> configuration, sound like overhead to me.

Ack. Thanks for the operating context. I didn't know these constraints
well enough. Agreed that this is not a very feasible approach then.

> Another point.
> The delta includes the PCIe DMA transfer speed, this is a hardware limitation.
> The idea of TSN is that the application send the packet as closer as possible 
> to the hardware send.
> Increase the error of the clocks conversion defy the purpose of TSN.
>
> A more reasonable is to track the clocks inside the kernel.
> As we mention on solution 3.
>
> >
> > It would even be preferable if the qdisc and core stack could be
> > ignorant of such hardware clocks and the time base is converted by the
> > device driver when encoding skb->tstamp into the tx descriptor. Is the
> > device hardware clock readable by the driver?
>
> All drivers that support PTP (IEEE 1558) have to read the PHC.
> PTP is mandatory for TSN.
> But some drivers might be limited on which context they can read the PHC.
> This is a question to the vendors.
> For example Intel I210 allow reading the PHC.
>
> However the kernel POSIX layer uses application context lockings.
>
> >
> >  From the above, it sounds like this is not trivial.
>
> I am not sure if it so complicated.
> But as the Linux maintainers want to keep the Linux kernel with a single 
> system clock.
> It might be more of a political question, or perhaps a better planning then I 
> did.

This would seem the preferable option to me: use a kernel time base
throughout the stack and limit knowledge of the hardware clock to the
relevant hardware driver.

If that is infeasible, then I don't immediately see an alternative to
the current dual timestamp field variant, either.

> >
> > I don't know which exact device you're targeting. Is it an in-tree driver?
>
> ETF uses ethernet interfaces with IEEE 1558 and 802.1Qbv or 802.1Qbu.
> Interfaces that support TSN, 
> https://en.wikipedia.org/wiki/Time-Sensitive_Networking
> I use Intel I210 at the moment.
> As of 5.10-rc6, there are 4 drivers
> kernel-etf/drivers/net/ethernet (etf-5.10-rc6)$ find -name '*.c' | xargs grep 
> -r TC_SETUP_QDISC_ETF
> ./freescale/enetc/enetc.c:              case TC_SETUP_QDISC_ETF:
> ./stmicro/stmmac/stmmac_main.c: case TC_SETUP_QDISC_ETF:
> ./intel/igc/igc_main.c:                 case TC_SETUP_QDISC_ETF:
> ./intel/igb/igb_main.c:                 case TC_SETUP_QDISC_ETF:
> There are more vendors like
>   Renesas that have a driver for the RZ-G SOC.
>   Broadcom have chips that support TSN, but they do not publish the code.
> I believe that other vendors will add TSN support as it becomes more popular.

Very clear. Thanks.

> >
> >> Just for clarification:
> >> ETF as all Net-Link, only uses system clock (the TAI)
> >> The network interface hardware only uses the PHC.
> >> Nor Net-Link neither the driver perform any conversions.
> >> The Kernel does not provide and clock conversion beside system clock.
> >> Linux kernel is a single clock system.
> >>
> >>>
> >>> The simplest solution for offloading pacing would be to interpret
> >>> skb->tstamp either for software pacing, or skip software pacing if the
> >>> device advertises a NETIF_F hardware pacing feature.
> >>
> >> That will defy the purpose of ETF.
> >> ETF exist for ordering packets.
> >> Why should the device driver defer it?
> >> Simply do not use the QDISC for this interface.
> >
> > ETF queues packets until their delivery time is reached. It does not
> > order for any other reason than to calculate the next qdisc watchdog
> > event, really.
>
> No, ETF also order the packets on .enqueue = etf_enqueue_timesortedlist()
> Our patch suggest to order them by hardware time stamp.
> And leave the watchdog setting using skb->tstamp that hold system clock TAI 
> time-stamp.
>
> >
> > If h/w can do the same and the driver can convert skb->tstamp to the
> > right timebase, indeed no qdisc is needed for pacing. But there may be
> > a need for selective h/w offload if h/w has additional constraints,
> > such as on the number of packets outstanding or time horizon.
>
> The driver do not order the packets, it send packets in the order of arrival.
> We can add ETF component to each relevant driver, but do we want to add 
> Net-Link features to drivers?
> I think the purpose is to make the drivers as small as possible and leave 
> common intelligence in the Net-Link layer.

I was thinking of devices that implement ETF in hardware for full
pacing hardware offload. Not in the driver.

> >
> >>>
> >>> Clockbase is an issue. The device driver may have to convert to
> >>> whatever format the device expects when copying skb->tstamp in the
> >>> device tx descriptor.
> >>
> >> We do hope our definition is clear.
> >> In the current kernel skb->tstamp uses system clock.
> >> The hardware time-stamp is PHC based, as it is used today for PTP two 
> >> steps.
> >> We only propose to use the same hardware time-stamp.
> >>
> >> Passing the hardware time-stamp to the skb->tstamp might seems a bit tricky
> >> The gaol is the leave the driver unaware to whether we
> >> * Synchronizing the PHC and system clock
> >> * The ETF pass the hardware time-stamp to skb->tstamp
> >> Only the applications and the ETF are aware.
> >> The application can detect by checking the ETF flag.
> >> The ETF flags are part of the network administration.
> >> That also configure the PTP and the system clock synchronization.
> >>
> >>>
> >>>>
> >>>>> It only requires that pacing qdiscs, both sch_etf and sch_fq,
> >>>>> optionally skip queuing in their .enqueue callback and instead allow
> >>>>> the skb to pass to the device driver as is, with skb->tstamp set. Only
> >>>>> to devices that advertise support for h/w pacing offload.
> >>>>>
> >>>> I did not use "Fair Queue traffic policing".
> >>>> As for ETF, it is all about ordering packets from different applications.
> >>>> How can we achive it with skiping queuing?
> >>>> Could you elaborate on this point?
> >>>
> >>> The qdisc can only defer pacing to hardware if hardware can ensure the
> >>> same invariants on ordering, of course.
> >>
> >> Yes, this is why we suggest ETF order packets using the hardware 
> >> time-stamp.
> >> And pass the packet based on system time.
> >> So ETF query the system clock only and not the PHC.
> >
> > On which note: with this patch set all applications have to agree to
> > use h/w time base in etf_enqueue_timesortedlist. In practice that
> > makes this h/w mode a qdisc used by a single process?
>
> A single process theoretically does not need ETF, just set the skb-> tstamp 
> and use a pass through queue.
> However the only way now to set TC_SETUP_QDISC_ETF in the driver is using ETF.

Yes, and I'd like to eventually get rid of this constraint.


> The ETF QDISC is per network interface.
> So all application that uses a single network interface have to comply to the 
> QDISC configuration.
> Sound like any other new feature in the NetLink.
>
> Theoretically a single network interface could participate in 2 TSN/PTP 
> domains.
> In that case you can create one QDISC without "use hardware time-stamp" for 
> first TSN/PTP domain and synchronize the PHC to system clock.
> And use the second one with a QDISC that use hardware time-stamp.
> You will need a driver/hardware that support multiple PHCs.
> The separation of the domains could be using VLANs.
>
> Note: A TSN domain is bound to a PTP domain.
>
> >
> >>>
> >>> Btw: this is quite a long list of CC:s
> >>>
> >> I need to update my company colleagues as well as the Linux group.
> >
> > Of course. But even ignoring that this is still quite a large list (> 40).
> >
> > My response yesterday was actually blocked as a result ;) Retrying now.
> >
>
> I left 5 people from Siemens, I hope it improves.
>
>
> Thanks for your comments and enlightenments, they are very useful
>    Erez

Re: [PATCH 1/3] Add TX sending hardware timestamp.

Reply via email to