On Tue, Aug 05, 2025 at 03:54:04PM -0700, Greenwalt, Paul wrote:
> 
> 
> On 7/29/2025 4:53 AM, Maciej Fijalkowski wrote:
> > On Mon, Jul 21, 2025 at 10:48:58AM -0400, Paul Greenwalt wrote:
> >> E830 supports Earliest TxTime First (ETF) hardware offload, which is
> >> configured via the ETF Qdisc on a per-queue basis (see tc-etf(8)). ETF
> >> introduces a new Tx flow mechanism that utilizes a timestamp ring
> >> (tstamp_ring) alongside the standard Tx ring. This timestamp ring is
> >> used to indicate when hardware will transmit a packet. Tx Time is
> >> supported on the first 2048 Tx queues of the device, and the NVM image
> >> limits the maximum number of Tx queues to 2048 for the device.
> >>
> >> The allocation and initialization of the timestamp ring occur when the
> >> feature is enabled on a specific Tx queue via tc-etf. The requested Tx
> >> Time queue index cannot be greater than the number of Tx queues
> >> (vsi->num_txq).
> >>
> >> To support ETF, the following flags and bitmap are introduced:
> >>
> >>  - ICE_F_TXTIME: Device feature flag set for E830 NICs, indicating ETF
> >>    support.
> >>  - txtime_txqs: PF-level bitmap set when ETF is enabled and cleared
> >>    when disabled for a specific Tx queue. It is used by
> >>    ice_is_txtime_ena() to check if ETF is allocated and configured on
> >>    any Tx queue, which is checked during Tx ring allocation.
> >>  - ICE_TX_FLAGS_TXTIME: Per Tx ring flag set when ETF is allocated and
> >>    configured for a specific Tx queue. It determines ETF status during
> >>    packet transmission and is checked by ice_is_txtime_ena() to verify
> >>    if ETF is enabled on any Tx queue.
> >>
> >> Due to a hardware issue that can result in a malicious driver detection
> >> event, additional timestamp descriptors are required when wrapping
> >> around the timestamp ring. Up to 64 additional timestamp descriptors
> >> are reserved, reducing the available Tx descriptors.
> >>
> >> To accommodate this, ICE_MAX_NUM_DESC_BY_MAC is introduced, defining:
> >>
> >>  - E830: Maximum Tx descriptor count of 8096 (8K - 32 - 64 for timestamp
> >>    fetch descriptors).
> >>  - E810 and E82X: Maximum Tx descriptor count of 8160 (8K - 32).
> >>
> >> Reviewed-by: Aleksandr Loktionov <[email protected]>
> >> Co-developed-by: Alice Michael <[email protected]>
> >> Signed-off-by: Alice Michael <[email protected]>
> >> Signed-off-by: Paul Greenwalt <[email protected]>

(...)

> >> +/**
> >> + * ice_cfg_txtime - configure Tx Time for the Tx ring
> >> + * @tx_ring: pointer to the Tx ring structure
> >> + *
> >> + * Return: 0 on success, negative value on failure.
> >> + */
> >> +static int ice_cfg_txtime(struct ice_tx_ring *tx_ring)
> >> +{
> >> +  struct iidc_rdma_core_dev_info *cdev;
> >> +  struct ice_vsi *vsi = tx_ring->vsi;
> >> +  struct ice_pf *pf = vsi->back;
> >> +  u16 queue = tx_ring->q_index;
> >> +  int err, timeout = 50;
> >> +  bool locked = false;
> >> +  struct device *dev;
> >> +
> >> +  while (test_and_set_bit(ICE_CFG_BUSY, pf->state)) {
> >> +          timeout--;
> >> +          if (!timeout)
> >> +                  return -EBUSY;
> >> +          usleep_range(1000, 2000);
> >> +  }
> >> +
> >> +  dev = ice_pf_to_dev(pf);
> >> +  cdev = pf->cdev_info;
> >> +  if (cdev && cdev->adev) {
> >> +          mutex_lock(&pf->adev_mutex);
> >> +          device_lock(&cdev->adev->dev);
> >> +          locked = true;
> >> +          if (cdev->adev->dev.driver) {
> >> +          dev_err(dev, "Cannot change TxTime when RDMA is active\n");
> > 
> > huh...all the locks just to find out rdma presence. noob question but
> > couldn't we have this info stored on the pf side? out of scope for this
> > series of course.
> > 
> 
> After reviewing the hardware specifications, this check should be removed.
> The requirement is that Tx Time cannot be enabled on RDMA queues, but
> RDMA queues can be enabled alongside LAN queues with Tx Time enabled.
> 
> > one additional question, why this can't co-exist with rdma? plus can this
> > co-exist with AF_XDP?
> > 
> 
> I'm not aware of any AF_XDP co-existence requirement.

I had a brain fart and thought you might enable zero-copy on the tx ring,
which would imply you would never be able to execute the code you're
adding on the tx hotpath... but then I remembered we allocate a dedicated
tx ring for AF_XDP. So disregard that comment.

> 
> >> +                  err = -EBUSY;
> >> +                  goto adev_unlock;
> >> +          }
> >> +  }
> >> +
> >> +  err = ice_qp_dis(vsi, queue);
> >> +  if (err) {
> >> +          dev_err(dev, "Failed to disable Tx queue %d for TxTime configuration\n",
> >> +                  tx_ring->q_index);
> >> +          goto adev_unlock;
> >> +  }
> >> +
> >> +  err = ice_qp_ena(vsi, queue);
> >> +  if (err) {
> >> +          dev_err(dev, "Failed to enable Tx queue %d for TxTime configuration\n",
> >> +                  queue);
> >> +          goto adev_unlock;
> >> +  }
> >> +
> >> +adev_unlock:
> >> +  if (locked) {
> >> +          device_unlock(&cdev->adev->dev);
> >> +          mutex_unlock(&pf->adev_mutex);
> >> +  }
> >> +  clear_bit(ICE_CFG_BUSY, vsi->back->state);
> >> +  return err;
> >> +}
> >> +
> >> +/**
> >> + * ice_offload_txtime - set earliest TxTime first
> >> + * @netdev: network interface device structure
> >> + * @qopt_off: etf queue option offload from the skb to set
> >> + *
> >> + * Return: 0 on success, negative value on failure.
> >> + */
> >> +static int ice_offload_txtime(struct net_device *netdev,
> >> +                        void *qopt_off)
> >> +{
> >> +  struct ice_netdev_priv *np = netdev_priv(netdev);
> >> +  struct ice_pf *pf = np->vsi->back;
> >> +  struct tc_etf_qopt_offload *qopt;
> >> +  struct ice_vsi *vsi = np->vsi;
> >> +  struct ice_tx_ring *tx_ring;
> >> +  int ret = 0;
> >> +
> >> +  if (!ice_is_feature_supported(pf, ICE_F_TXTIME))
> >> +          return -EOPNOTSUPP;
> >> +
> >> +  qopt = qopt_off;
> >> +  if (!qopt_off || qopt->queue < 0 || qopt->queue >= vsi->num_txq)
> >> +          return -EINVAL;
> >> +
> >> +  if (qopt->enable)
> >> +          set_bit(qopt->queue,  pf->txtime_txqs);
> >> +  else
> >> +          clear_bit(qopt->queue, pf->txtime_txqs);
> >> +
> >> +  if (netif_running(vsi->netdev)) {
> >> +          tx_ring = vsi->tx_rings[qopt->queue];
> >> +          ret = ice_cfg_txtime(tx_ring);
> >> +          if (ret)
> >> +                  goto err;
> >> +  }
> >> +
> >> +  netdev_info(netdev, "%s TxTime on queue: %i\n",
> >> +              str_enable_disable(qopt->enable), qopt->queue);
> >> +  return 0;
> >> +
> >> +err:
> >> +  netdev_err(netdev, "Failed to %s TxTime on queue: %i\n",
> >> +             str_enable_disable(qopt->enable), qopt->queue);
> >> +
> >> +  if (qopt->enable)
> >> +          clear_bit(qopt->queue,  pf->txtime_txqs);
> >> +  else
> >> +          set_bit(qopt->queue,  pf->txtime_txqs);
> > 
> > why would you want to set this bit if you have failed with configuration?
> > 
> 
> The intent of the error handling flow is to return the PF txtime_txqs
> bitmap to its original state. However, since the hardware state becomes
> undefined on disable failure, it's safer to treat the queue as disabled
> rather than optimistically restoring it to the enabled state.
> 
> I'll fix this.
> 
> >> +  return ret;
> >> +}
> >> +
> >>  static LIST_HEAD(ice_block_cb_list);
> >>  
> >>  static int
> >> @@ -9365,6 +9490,8 @@ ice_setup_tc(struct net_device *netdev, enum tc_setup_type type,
> >>                    mutex_unlock(&pf->adev_mutex);
> >>            }
> >>            return err;
> >> +  case TC_SETUP_QDISC_ETF:
> >> +          return ice_offload_txtime(netdev, type_data);
> >>    default:
> >>            return -EOPNOTSUPP;
> >>    }
> >> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> >> index 29e0088ab6b2..d433233a5fa1 100644
> >> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> >> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> >> @@ -143,6 +143,55 @@ static struct netdev_queue *txring_txq(const struct ice_tx_ring *ring)
> >>    return netdev_get_tx_queue(ring->netdev, ring->q_index);
> >>  }
> >>  
> >> +/**
> >> + * ice_clean_tstamp_ring - clean time stamp ring
> >> + * @tx_ring: Tx ring to clean the Time Stamp ring for
> >> + */
> >> +static void ice_clean_tstamp_ring(struct ice_tx_ring *tx_ring)
> >> +{
> >> +  u32 size;
> >> +
> >> +  if (!tx_ring->tstamp_desc)
> >> +          return;
> >> +
> >> +  size = ALIGN(tx_ring->tstamp_count * sizeof(struct ice_ts_desc),
> >> +               PAGE_SIZE);
> >> +  memset(tx_ring->tstamp_desc, 0, size);
> >> +  tx_ring->tstamp_next_to_use = 0;
> >> +}
> >> +
> >> +/**
> >> + * ice_free_tstamp_ring - free time stamp resources per queue
> >> + * @tx_ring: Tx ring to free the Time Stamp ring for
> >> + */
> >> +static void ice_free_tstamp_ring(struct ice_tx_ring *tx_ring)
> >> +{
> >> +  struct ice_tstamp_ring *tstamp_ring = tx_ring->tstamp_ring;
> >> +  u32 size;
> >> +
> >> +  if (!tx_ring->tstamp_desc)
> >> +          return;
> >> +
> >> +  ice_clean_tstamp_ring(tx_ring);
> >> +  size = ALIGN(tx_ring->tstamp_count * sizeof(struct ice_ts_desc),
> >> +               PAGE_SIZE);
> >> +  dmam_free_coherent(tx_ring->dev, size, tx_ring->tstamp_desc,
> >> +                     tstamp_ring->dma);
> >> +  tx_ring->tstamp_desc = NULL;
> >> +}
> >> +
> >> +/**
> >> + * ice_free_tx_tstamp_ring - free time stamp resources per Tx ring
> >> + * @tx_ring: Tx ring to free the Time Stamp ring for
> >> + */
> >> +static void ice_free_tx_tstamp_ring(struct ice_tx_ring *tx_ring)
> >> +{
> >> +  ice_free_tstamp_ring(tx_ring);
> >> +  kfree_rcu(tx_ring->tstamp_ring, rcu);
> >> +  tx_ring->tstamp_ring = NULL;
> >> +  tx_ring->flags &= ~ICE_TX_FLAGS_TXTIME;
> >> +}
> >> +
> >>  /**
> >>   * ice_clean_tx_ring - Free any empty Tx buffers
> >>   * @tx_ring: ring to be cleaned
> >> @@ -181,6 +230,9 @@ void ice_clean_tx_ring(struct ice_tx_ring *tx_ring)
> >>  
> >>    /* cleanup Tx queue statistics */
> >>    netdev_tx_reset_queue(txring_txq(tx_ring));
> >> +
> >> +  if (ice_is_txtime_cfg(tx_ring))
> >> +          ice_free_tx_tstamp_ring(tx_ring);
> >>  }
> >>  
> >>  /**
> >> @@ -331,6 +383,85 @@ static bool ice_clean_tx_irq(struct ice_tx_ring *tx_ring, int napi_budget)
> >>    return !!budget;
> >>  }
> >>  
> >> +/**
> >> + * ice_alloc_tstamp_ring - allocate the Time Stamp ring
> >> + * @tx_ring: Tx ring to allocate the Time Stamp ring for
> >> + *
> >> + * Return: 0 on success, negative on error
> >> + */
> >> +static int ice_alloc_tstamp_ring(struct ice_tx_ring *tx_ring)
> >> +{
> >> +  struct ice_tstamp_ring *tstamp_ring;
> >> +
> >> +  /* allocate with kzalloc(), free with kfree_rcu() */
> >> +  tstamp_ring = kzalloc(sizeof(*tstamp_ring), GFP_KERNEL);
> >> +  if (!tstamp_ring)
> >> +          return -ENOMEM;
> >> +
> >> +  tstamp_ring->tx_ring = tx_ring;
> >> +  tx_ring->tstamp_ring = tstamp_ring;
> >> +  tx_ring->tstamp_desc = NULL;
> >> +  tx_ring->tstamp_count = ice_calc_ts_ring_count(tx_ring);
> >> +  tx_ring->flags |= ICE_TX_FLAGS_TXTIME;
> >> +  return 0;
> >> +}
> >> +
> >> +/**
> >> + * ice_setup_tstamp_ring - set up the Time Stamp ring
> >> + * @tx_ring: Tx ring to set up the Time Stamp ring for
> >> + *
> >> + * Return: 0 on success, negative on error
> >> + */
> >> +static int ice_setup_tstamp_ring(struct ice_tx_ring *tx_ring)
> >> +{
> >> +  struct ice_tstamp_ring *tstamp_ring = tx_ring->tstamp_ring;
> >> +  struct device *dev = tx_ring->dev;
> >> +  u32 size;
> >> +
> >> +  /* round up to nearest page */
> >> +  size = ALIGN(tx_ring->tstamp_count * sizeof(struct ice_ts_desc),
> >> +               PAGE_SIZE);
> >> +  tx_ring->tstamp_desc = dmam_alloc_coherent(dev, size, &tstamp_ring->dma,
> >> +                                             GFP_KERNEL);
> >> +  if (!tx_ring->tstamp_desc) {
> >> +          dev_err(dev, "Unable to allocate memory for Time stamp Ring, size=%d\n",
> >> +                  size);
> >> +          return -ENOMEM;
> >> +  }
> >> +
> >> +  tx_ring->tstamp_next_to_use = 0;
> >> +  return 0;
> >> +}
> >> +
> >> +/**
> >> + * ice_alloc_setup_tstamp_ring - Allocate and setup the Time Stamp ring
> >> + * @tx_ring: Tx ring to allocate and setup the Time Stamp ring for
> >> + *
> >> + * Return: 0 on success, negative on error
> >> + */
> >> +int ice_alloc_setup_tstamp_ring(struct ice_tx_ring *tx_ring)
> >> +{
> >> +  struct device *dev = tx_ring->dev;
> >> +  int err;
> >> +
> >> +  err = ice_alloc_tstamp_ring(tx_ring);
> >> +  if (err) {
> >> +          dev_err(dev, "Unable to allocate Time stamp ring for Tx ring %d\n",
> >> +                  tx_ring->q_index);
> >> +          return err;
> >> +  }
> >> +
> >> +  err = ice_setup_tstamp_ring(tx_ring);
> >> +  if (err) {
> >> +          dev_err(dev, "Unable to setup Time stamp ring for Tx ring %d\n",
> >> +                  tx_ring->q_index);
> >> +          ice_free_tstamp_ring(tx_ring);
> >> +          tx_ring->tstamp_ring = NULL;
> >> +          return err;
> >> +  }
> >> +  return 0;
> >> +}
> >> +
> >>  /**
> >>   * ice_setup_tx_ring - Allocate the Tx descriptors
> >>   * @tx_ring: the Tx ring to set up
> >> @@ -1835,10 +1966,47 @@ ice_tx_map(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first,
> >>    /* notify HW of packet */
> >>    kick = __netdev_tx_sent_queue(txring_txq(tx_ring), first->bytecount,
> >>                                  netdev_xmit_more());
> >> -  if (kick)
> >> -          /* notify HW of packet */
> >> -          writel(i, tx_ring->tail);
> >> -
> >> +  if (kick) {
> > 
> > flatten the code by reversing the condition being checked:
> >     if (!kick)
> >             return;
> >     blah blah
> > 
> 
> I will make that change.
> 
> >> +          if (ice_is_txtime_cfg(tx_ring)) {
> >> +                  u16 tstamp_count = tx_ring->tstamp_count;
> >> +                  u16 j = tx_ring->tstamp_next_to_use;
> > 
> > nit: at some point i think we were insisting on not having stack
> > variables smaller than u32
> > 
> 
> I will make that change.
> 
> >> +                  struct ice_ts_desc *ts_desc;
> >> +                  struct timespec64 ts;
> >> +                  u32 tstamp;
> >> +
> >> +                  ts = ktime_to_timespec64(first->skb->tstamp);
> >> +                  tstamp = ts.tv_nsec >> ICE_TXTIME_CTX_RESOLUTION_128NS;
> >> +
> >> +                  ts_desc = ICE_TS_DESC(tx_ring, j);
> >> +                  ts_desc->tx_desc_idx_tstamp =
> >> +                                  ice_build_tstamp_desc(i, tstamp);
> >> +
> >> +                  j++;
> >> +                  if (j == tstamp_count) {
> >> +                          int fetch = tstamp_count - tx_ring->count;
> > 
> > care about negatives here?
> > 
> 
> No, I will change this to u32.
> 
> >> +
> >> +                          j = 0;
> >> +
> >> +                          /* To prevent an MDD, when wrapping the tstamp
> >> +                           * ring create additional TS descriptors equal
> >> +                           * to the number of the fetch TS descriptors
> >> +                           * value. HW will merge the TS descriptors with
> >> +                           * the same timestamp value into a single
> >> +                           * descriptor.
> >> +                           */
> >> +                          for (; j < fetch; j++) {
> >> +                                  ts_desc = ICE_TS_DESC(tx_ring, j);
> >> +                                  ts_desc->tx_desc_idx_tstamp =
> >> +                                         ice_build_tstamp_desc(i, tstamp);
> >> +                          }
> >> +                  }
> >> +                  tx_ring->tstamp_next_to_use = j;
> >> +                  writel_relaxed(tx_ring->tstamp_next_to_use,
> >> +                                 tx_ring->tstamp_tail);
> >> +          } else {
> >> +                  writel_relaxed(i, tx_ring->tail);
> >> +          }
> >> +  }
> >>    return;
> >>  
> >>  dma_error:
> >> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> >> index fef750c5f288..93089f6147cd 100644
> >> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> >> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> >> @@ -310,6 +310,12 @@ enum ice_dynamic_itr {
> >>  #define ICE_TX_LEGACY     1
> >>  
> >>  /* descriptor ring, associated with a VSI */
> >> +struct ice_tstamp_ring {
> >> +  struct ice_tx_ring *tx_ring;    /* Backreference to associated Tx ring */
> >> +  dma_addr_t dma;                 /* physical address of ring */
> >> +  struct rcu_head rcu;            /* to avoid race on free */
> >> +} ____cacheline_internodealigned_in_smp;
> >> +
> >>  struct ice_rx_ring {
> >>    /* CL1 - 1st cacheline starts here */
> >>    void *desc;                     /* Descriptor ring memory */
> >> @@ -388,11 +394,22 @@ struct ice_tx_ring {
> >>    struct xsk_buff_pool *xsk_pool;
> >>    u16 next_to_use;
> >>    u16 next_to_clean;
> >> +  u16 tstamp_next_to_use;         /* Time stamp ring next to use */
> >> +  u16 tstamp_count;               /* Time stamp ring descriptors count */
> >> +  u8 __iomem *tstamp_tail;        /* Time stamp ring tail pointer */
> >> +  void *tstamp_desc;              /* Time stamp descriptor ring memory */
> > 
> > couldn't these members be within ice_tstamp_ring? otherwise explain what
> > made you to put them here, please.
> > 
> 
> Yes, they can be members of ice_tstamp_ring. However, since these
> variables are accessed in the fast path, they were placed in the second
> cacheline of ice_tx_ring rather than in ice_tstamp_ring to reduce
> possible cacheline misses.
> 
> Summary of changes for v6:
> - Fix resource leak in error path
> - Update error handling logic for disable failures
> - Move ice_qp_* functions to preparatory patch
> - Address code style issues (u32 vs u16, code flattening)
> - Propagate error status from ice_set_txq_ctx_vmvf
> 
> Thanks,
> Paul
