Hi All,

I have now switched to a Linux-based email client (SquirrelMail). If required, I'll resend the emails I just sent. Sorry about the repetition; some of our team members are planning travel and I was in a rush.
Regards,
-Kanaka

> On Mon, 2015-09-07 at 20:22 +0100, Juvva, Kanaka D wrote:
>> Hi Thomas,
>>
>> I'm sending the updated patches. I have given details for each of
>> these items below.
>>
>
> Kanaka, this email is HTML formatted and so has been blocked by
> vger.kernel.org where the linux-kernel mailing list is hosted.
>
> Please configure outlook not to send html email, or use a different mail
> agent for working with upstream.
>
>> Regards,
>> -Kanaka
>>
>> > -----Original Message-----
>> > From: Thomas Gleixner [mailto:t...@linutronix.de]
>> > Sent: Wednesday, August 19, 2015 1:50 PM
>> > To: Kanaka Juvva
>> > Cc: Juvva, Kanaka D; Williamson, Glenn P; Fleming, Matt; Auld, Will;
>> > Andi Kleen; LKML; Luck, Tony; Peter Zijlstra; Tejun Heo;
>> > x...@kernel.org; Ingo Molnar; H. Peter Anvin; Shivappa, Vikas
>> > Subject: Re: [PATCH v3 1/2] perf,x86: add Intel Memory Bandwidth
>> > Monitoring (MBM) PMU
>> >
>> > On Fri, 7 Aug 2015, Kanaka Juvva wrote:
>> > > +#define MBM_CNTR_MAX		0xffffff
>> > > +#define MBM_SOCKET_MAX		8
>> > > +#define MBM_TIME_DELTA_MAX	1000
>> > > +#define MBM_TIME_DELTA_MIN	100
>> >
>> > What are these constants for and how are they determined? Pulled out
>> > of thin air?
>> >
>>
>> /*
>>  * MBM counter is 24 bits wide. MBM_CNTR_MAX defines the max counter
>>  * value.
>>  */
>> #define MBM_CNTR_MAX		0xffffff
>> /*
>>  * Max #sockets supported
>>  */
>> #define MBM_SOCKET_MAX		8
>
> This seems like a constant we could get by without. Do we really need to
> know this at compile time?
>
>> /*
>>  * Expected time interval between consecutive MSR reads for a given rmid
>>  */
>> #define MBM_TIME_DELTA_MAX	1000
>
> "max" and "expected" are not the same thing.
>
>
>> > >  #define QOS_L3_OCCUP_EVENT_ID	(1 << 0)
>> > > +#define QOS_MBM_TOTAL_EVENT_ID	(1 << 1)
>> > > +#define QOS_MBM_LOCAL_EVENT_ID_HW	0x3
>> > > +#define QOS_MBM_LOCAL_EVENT_ID	(1 << 2)
>> >
>> > So we have ID values which are built with (1 << X) and then this HW
>> > variant in the middle with 0x3. Of course without any explanation what
>> > the heck this stuff is.
>> >
>> > Last review:
>> >
>> >   "So this wants a descriptive ID name and a comment."
>> >
>> >
>>
>> /*
>>  * MBM Event IDs as defined in SDM section 17.14.6
>>  * Event IDs used to program the MSRs for reading the counters
>>  */
>> #define QOS_MBM_TOTAL_EVENT_ID		(1 << 1)
>> #define QOS_MBM_LOCAL_EVENT_ID_HW	0x3
>> /*
>>  * Perf needs the event id to be of the form 1 << x, hence we can't use
>>  * 0x3 (the HW event ID) for MBM_LOCAL_EVENT; we use the next 1 << x for
>>  * MBM_LOCAL_EVENT_ID.
>>  */
>> #define QOS_MBM_LOCAL_EVENT_ID		(1 << 2)
>
> No, perf events do not need to be of the form (1 << X), that was just a
> convention we used in the cqm code before we knew what values the MBM
> events would take - you can change these to be whatever format you want,
> but be sure to make it consistent.
>
> The constants are very much supposed to be programmed into the MSRs,
> take a look at __rmid_read().
>
> I would suggest (as I already did privately) that you change the format
> to be 0x0x for all of these event IDs.
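For context on how these event IDs actually reach the hardware, below is a minimal kernel-style sketch modelled on the __rmid_read() helper Matt points to. The MSR numbers and error bits are the ones the existing cqm driver uses; the function name mbm_read_counter() is purely illustrative and not part of the patch.

#include <linux/types.h>
#include <asm/msr.h>

/* MSR numbers and error bits as already defined in the cqm driver. */
#define MSR_IA32_QM_EVTSEL	0x0c8d
#define MSR_IA32_QM_CTR		0x0c8e
#define RMID_VAL_ERROR		(1ULL << 63)
#define RMID_VAL_UNAVAIL	(1ULL << 62)

/*
 * Program the hardware event ID (e.g. 0x3 for local MBM) together with
 * the RMID into IA32_QM_EVTSEL, then read the 24-bit counter from
 * IA32_QM_CTR, just as __rmid_read() does for the occupancy event.
 * The caller checks RMID_VAL_ERROR/RMID_VAL_UNAVAIL in the returned value.
 */
static u64 mbm_read_counter(u32 rmid, u32 evtid)
{
	u64 val;

	wrmsr(MSR_IA32_QM_EVTSEL, evtid, rmid);
	rdmsrl(MSR_IA32_QM_CTR, val);

	return val;
}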
>
>
>> > > @@ static bool intel_cqm_sched_in_event(u32 rmid)
>> > >  	return false;
>> > >  }
>> > >
>> > > +
>> > > +static u32 bw_sum_calc(struct sample *bw_stat, int rmid) {
>> > > +	u32 val = 0, i, j, index;
>> > > +
>> > > +	if (++bw_stat->fifoout >= mbm_window_size)
>> > > +		bw_stat->fifoout = 0;
>> > > +	index = bw_stat->fifoout;
>> > > +	for (i = 0; i < mbm_window_size - 1; i++) {
>> > > +		if (index + i >= mbm_window_size)
>> > > +			j = index + i - mbm_window_size;
>> > > +		else
>> > > +			j = index + i;
>> > > +		val += bw_stat->mbmfifo[j];
>> > > +	}
>> >
>> > This math wants an explanatory comment.
>> >
>> /*
>>  * Slide the window by 1 and calculate the sum of the last
>>  * mbm_window_size-1 bandwidth values.
>>  * fifoout is the current position of the window.
>>  * Increment fifoout by 1 to slide the window by 1.
>>  *
>>  * Calculate the bandwidth from ++fifoout to (++fifoout + mbm_window_size - 1),
>>  * e.g. fifoout = 1; Bandwidth1 Bandwidth2 ... Bandwidthn are the
>>  * sliding window values, where n is the size of the sliding window.
>>  * bandwidth sum: val = Bandwidth2 + Bandwidth3 + ... Bandwidthn
>>  */
>
> Instead of these large comment blocks please comment smaller,
> logically-connected chunks of code, e.g.
>
>	/* Slide the window by one */
>	if (++bw_stat->fifoout >= mbm_window_size)
>		bw_stat->fifoout = 0;
>
>	/*
>	 * Calculate the sum of last mbm_window_size-1 values.
>	 */
>	for (i = 0; i < mbm_window_size - 1; i++) {
>		/* Handle wraparound at end of window */
>		if (index + i >= mbm_window_size)
>			j = index + i - mbm_window_size;
>		else
>			j = index + i;
>
>		val += bw_stat->mbminfo[j];
>	}
>
>>
>>
>> > > +	return val;
>> > > +}
>> > > +
>> > > +static u32 __mbm_fifo_sum_lastn_out(int rmid, bool is_localbw) {
>> > > +	if (is_localbw)
>> > > +		return bw_sum_calc(&mbm_local[rmid], rmid);
>> > > +	else
>> > > +		return bw_sum_calc(&mbm_total[rmid], rmid); }
>> > > +
>> > > +static void __mbm_fifo_in(struct sample *bw_stat, u32 val) {
>> > > +	bw_stat->mbmfifo[bw_stat->fifoin] = val;
>> > > +	if (++bw_stat->fifoin >= mbm_window_size)
>> >
>> > How does that become greater than mbm_window_size?
>> >
>>
>> This is fixed by changing >= to ==.
>> Added a comment:
>>
>> /*
>>  * Store the current sample's bw value in the sliding window at the
>>  * index fifoin. Increment fifoin. Check if fifoin has reached
>>  * mbm_window_size. If yes, reset it to the beginning, i.e. zero,
>>  * e.g.
>>  * mbm_window_size = 10
>>  * mbmfifo is a circular fifo  0 1 2 3 4 5 6 7 8 9 10
>>  *                             ^                    |
>>  *                             |                    |
>>  *                             |_ _ _ _ _ _ _ _ _ _ |
>>  *
>>  * So when fifoin becomes 10, it is reset to zero.
>>  *
>>  */
>
> I'm not sure that this comment adds anything of value that isn't already
> understood by reading the code. I don't think you need a comment for
> this function, it seems pretty straight forward and Thomas' question was
> about the boundary limits of ->fifoin.
>
>> > > +	bw_stat->fifoin = 0;
>> > > +}
>> >
>> > > +/*
>> > > + * __rmid_read_mbm checks whether it is LOCAL or GLOBAL MBM event and
>> > > +reads
>> > > + * its MSR counter. Check whether overflow occurred and handles it.
>> > > +Calculates
>> > > + * currenet BW and updates running average.
>> >
>> > currenet? And please get rid of the double spaces
>> >
>> This is fixed now. Here is the updated comment:
>>
>> /*
>>  * rmid_read_mbm checks whether it is LOCAL or Total MBM event and reads
>>  * its MSR counter. Check whether overflow occured and handles it. Calculates
>>  * currenet BW and updates running average.
>>  *
>
> ^^^ You've still misspelled current.
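To make the sliding-window arithmetic above easier to follow, here is a small self-contained userspace sketch of the same circular-buffer logic. The names mirror the patch (mbmfifo, fifoin, fifoout, mbm_window_size), but the code is illustrative only and expresses the wraparound with a modulo instead of the if/else.

#include <stdio.h>

#define MBM_FIFO_SIZE_MAX 10

struct sample {
	unsigned int mbmfifo[MBM_FIFO_SIZE_MAX];	/* circular buffer  */
	unsigned int fifoin;				/* next write slot  */
	unsigned int fifoout;				/* oldest entry     */
};

static unsigned int mbm_window_size = MBM_FIFO_SIZE_MAX;

/* Store one bandwidth sample; wrap fifoin when it reaches the window size. */
static void mbm_fifo_in(struct sample *s, unsigned int val)
{
	s->mbmfifo[s->fifoin] = val;
	if (++s->fifoin == mbm_window_size)
		s->fifoin = 0;
}

/* Slide the window by one and sum the remaining mbm_window_size - 1 values. */
static unsigned int bw_sum_calc(struct sample *s)
{
	unsigned int val = 0, i;

	if (++s->fifoout >= mbm_window_size)
		s->fifoout = 0;

	for (i = 0; i < mbm_window_size - 1; i++)
		val += s->mbmfifo[(s->fifoout + i) % mbm_window_size];

	return val;
}

int main(void)
{
	struct sample s = { { 0 }, 0, 0 };
	unsigned int i;

	for (i = 1; i <= mbm_window_size; i++)
		mbm_fifo_in(&s, i * 100);

	printf("windowed sum: %u\n", bw_sum_calc(&s));
	return 0;
}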
>>
>>  * Overflow Handling:
>>  * if (MSR current value < MSR previous value) it is an
>>  * overflow. MSR values are increasing when bandwidth consumption for the thread
>>  * is non-zero; when the MSR value reaches MAX_COUNTER_VALUE it overflows. After an
>>  * overflow, the MSR current value goes back to zero and starts increasing again at
>>  * the rate of bandwidth.
>>  *
>
> You don't need to provide a definition of "overflow", most people will
> be familiar with it. What is more important to document is how the
> overflow is handled...
>
>>
>>  * Overflow handling:
>>  * Detect an overflow : current read value > last read value
>
> Isn't this inverted?
>
> Overflow occurred if current < previous.
>
>>
>>  * Overflow correction: if (overflow)
>>  *                          Current value = (MAX_COUNTER_VALUE - prev read value) + current read value
>>  *                      else
>>  *                          Current value = current read value
>>  *
>
> Please don't write pseudocode in the comments. Use English prose to
> describe the important parts of the code.
>
>>
>>  * Calculation of Current Bandwidth value:
>>  * If the MSR was read within the last 100ms, then the sample is ignored;
>>  * if the MSR was read within the last 100ms, why incur the extra overhead
>>  * of doing the MSR reads again? Anyway, there'll be a negligible or zero
>>  * change in the MSR readings in 100ms.
>>  *
>>  * Bandwidth is calculated as:
>>  * memory bandwidth = difference of the last two msr counter values / time difference.
>>  *
>>  * cum_avg = running average bandwidth of the last 'n' bandwidth values for
>>  * the samples that are processed
>>  *
>
> Where 'n' is 'mbm_window_size' ? If so, please use 'mbm_window_size',
> not 'n'.
>
>>  * A sliding window is used to save the last 'n' samples, where
>>  * n = sliding_window_size and results in a sliding window duration of 'n' secs.
>
> Hmm... this confuses me a lot. Is 'n' a size or a duration? The two are
> not the same thing.
>
>>
>>  * The sliding window size is by default set to
>>  * MBM_FIFO_SIZE_MIN. The user can configure it to values in the range
>>  * (MBM_FIFO_SIZE_MIN, MBM_FIFO_SIZE_MAX). The range for the sliding window
>>  * is chosen based on general criteria for the monitoring duration. For example,
>>  * for a short-lived application a 10 sec monitoring period gives a
>>  * good characterization of its bandwidth consumption. For an application
>>  * that runs for a longer duration, a 300 sec monitoring period gives a better
>>  * characterization of its bandwidth consumption. Since the running average
>>  * is calculated for the total monitoring period, the user gets the most accurate
>>  * average bandwidth for each monitoring period.
>>  *
>>  * Scaling:
>>  * cum_avg is the raw bandwidth in Bytes/sec.
>>  * cum_avg is converted to MB/sec by applying MBM_CONVERSION_FACTOR and
>>  * rounded to the nearest integer. The user interface gets the bandwidth
>>  * values in MB/sec.
>>  *
>>  */
>> > > + *
>> > > + * Overflow Handling:
>> > > + * if (MSR current value < MSR previous value) it is an
>> > > + * overflow. and overflow is handled.
>> >
>> > Wow. That's informative as hell!
>> >
>> Please look at the modified comment above.
>> > > + *
>> > > + * Calculation of Current BW value:
>> >
>> > BW == Body Weight?
>> >
>>
>> It is fixed now.
>>
>> > > + * If MSR is read within last 100ms, then the value is ignored;
>> > > + * this will suppress small deltas. We don't process MBM samples that
>> > > + are
>> > > + * within 100ms.
>> >
>> > WHY?
>> >
>> Explained in the comment. If mbm_read is called within 100ms for the
>> same rmid, we don't have to process the sample.
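Since the overflow correction and the bandwidth formula are the heart of that comment, a standalone sketch of the arithmetic may help. It reuses the patch's constants; the helper name mbm_bandwidth() and the choice to return 0 for a skipped sample are assumptions made only for illustration.

#include <stdint.h>
#include <stdio.h>

#define MBM_CNTR_MAX		0xffffff	/* 24-bit counter */
#define MBM_TIME_DELTA_MAX	1000		/* ms */
#define MBM_TIME_DELTA_MIN	100		/* ms */

/*
 * Return bytes/sec given two successive 24-bit counter reads and the time
 * between them in milliseconds.  If the counter wrapped (cur < prev), add
 * back the full counter range before dividing.
 */
static uint64_t mbm_bandwidth(uint64_t prev, uint64_t cur, uint64_t delta_ms)
{
	uint64_t bytes;

	if (delta_ms < MBM_TIME_DELTA_MIN)
		return 0;	/* sample too close to the previous one: skip */

	if (cur < prev)
		bytes = MBM_CNTR_MAX - prev + cur;	/* one overflow */
	else
		bytes = cur - prev;

	return (bytes * MBM_TIME_DELTA_MAX) / delta_ms;
}

int main(void)
{
	/* Example: the counter wrapped between two reads taken 500 ms apart. */
	printf("%llu bytes/sec\n",
	       (unsigned long long)mbm_bandwidth(0xfffff0, 0x10, 500));
	return 0;
}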
>
> The key piece of information you're missing here is that skipping these
> small deltas is an optimization, because we avoid performing costly
> operations for what would likely be a very minor change in the MBM data,
> right?
>
>
>> > > +{
>> > > +	u64 val, tmp, diff_time, cma, bytes, index;
>> > > +	bool overflow = false, first = false;
>> > > +	ktime_t cur_time;
>> > > +	u32 tmp32 = rmid;
>> > > +	struct sample *mbm_current;
>> > > +	u32 vrmid = topology_physical_package_id(smp_processor_id()) *
>> > > +		    cqm_max_rmid + rmid;
>> > > +
>> > > +	rmid = vrmid;
>> >
>> > From my previous review:
>> >
>> >  "This is completely backwards.
>> >
>> >   tmp32 = rmid;
>> >   rmid = vrmid;
>> >   do_stuff(rmid);
>> >   rmid = tmp32;
>> >   do_other_stuff(rmid);
>> >
>> >   Why can't you use vrmid for do_stuff() and leave rmid alone? Just
>> >   because it would make the code simpler to read?"
>> >
>> > Still applies.
>> >
>>
>> This is now changed to:
>>
>>	u64 val, currentmsr, currentbw, diff_time, cma, bytes, index;
>>	bool overflow = false, first = false;
>>	ktime_t cur_time;
>>	u32 tmp32 = rmid, eventid;
>>	struct sample *mbm_current;
>>	u32 vrmid = rmid_2_index(rmid);
>>
>>	rmid = vrmid;
>>	cur_time = ktime_get();
>>	if (read_mbm_local) {
>>		mbm_current = &mbm_local[vrmid];
>>		eventid = QOS_MBM_LOCAL_EVENT_ID_HW;
>>		wrmsr(MSR_IA32_QM_EVTSEL, QOS_MBM_LOCAL_EVENT_ID_HW, rmid);
>
> You don't need to perform this wrmsr() here because it's taken care of
> in the common code below.
>
>>	} else {
>>		mbm_current = &mbm_total[vrmid];
>>		eventid = QOS_MBM_TOTAL_EVENT_ID;
>>	}
>>	rmid = tmp32;
>
> Why did you assign rmid to vrmid if you reassign it before it was used?
>
>
>> > > +	/* if current msr value < previous msr value , it means overflow */
>> > > +	if (val < bytes) {
>> > > +		val = MBM_CNTR_MAX - bytes + val;
>> > > +		overflow = true;
>> > > +	} else
>> > > +		val = val - bytes;
>> > > +
>> > > +	val = (val * MBM_TIME_DELTA_MAX) / diff_time;
>> > > +
>> > > +	if ((diff_time > MBM_TIME_DELTA_MAX) && (!cma))
>> > > +		/* First sample */
>> > > +		first = true;
>> > > +
>> > > +	rmid = vrmid;
>> >
>> > And another time:
>> >
>> >  "More obfuscation"
>> >
>>
>> /*
>>  * MBM_TIME_DELTA_MAX is picked as per the MBM specs. As specified in the Intel
>>  * Platform Quality of Service Monitoring Implementer's Guide V1, Section 2.7.2,
>>  * page 21, an overflow can occur at most once per second, so the latest we want
>>  * to read the MSR counters is every 1000ms. If the interval is less than 1000ms
>>  * we can ignore the sample. Then we decide since when we should ignore: if the
>>  * MSR was read within the last 100ms, why process the MSR reads again? Anyway
>>  * there'll be a small or zero change, so ignoring MSR reads within 100ms or less
>>  * is efficient. MBM_TIME_DELTA_MIN is specified as 100ms as per this guideline.
>>  */
>
> I suspect the document you're referring to above is only available under
> NDA, which makes it unsuitable for mention in the kernel source since a
> large number of people won't have access to it.
>
> Just explain that the way the hardware is designed puts an upper limit
> on how quickly the counter can overflow, which is once per second.
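For what it's worth, the 100ms gate itself reduces to a single ktime comparison. A kernel-style sketch follows; the structure layout and the helper name mbm_sample_due() are assumptions, not the patch's actual code.

#include <linux/ktime.h>
#include <linux/types.h>

#define MBM_TIME_DELTA_MIN	100	/* ms */

struct mbm_sample_state {
	ktime_t	prev_time;	/* when the MSR was last read for this rmid */
	u64	cum_avg;	/* running average bandwidth */
};

/*
 * Skip a read if less than MBM_TIME_DELTA_MIN ms have passed since the
 * previous one for this rmid.  This is purely an optimization: the counter
 * will barely have moved, so the extra MSR access and averaging work buy
 * almost nothing.
 */
static bool mbm_sample_due(struct mbm_sample_state *s, ktime_t now)
{
	return ktime_ms_delta(now, s->prev_time) >= MBM_TIME_DELTA_MIN;
}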
>
>
>> > > +static void __intel_cqm_event_total_bw_count(void *info) {
>> > > +	struct rmid_read *rr = info;
>> > > +	u64 val;
>> > > +
>> > > +	val = __rmid_read_mbm(rr->rmid, false);
>> > > +	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
>> > > +		return;
>> > > +	atomic64_add(val, &rr->value);
>> > > +}
>> > > +
>> > > +static void __intel_cqm_event_local_bw_count(void *info) {
>> > > +	struct rmid_read *rr = info;
>> > > +	u64 val;
>> > > +
>> > > +	val = __rmid_read_mbm(rr->rmid, true);
>> > > +	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
>> > > +		return;
>> > > +	atomic64_add(val, &rr->value);
>> > > +}
>> >
>> > And once more:
>> >
>> >  "You're really a fan of copy and paste."
>> >
>>
>> These functions are invoked indirectly. They were written keeping
>> intel_cqm_event_count in mind.
>> I'll change the arg to:
>>
>>	struct mbm_read {
>>		struct rmid_read *rr;
>>		u32 eventid;
>>	};
>>
>> intel_cqm_event_*_bw_count(...) needs the eventid for decoding.
>
> No, please do not duplicate the rmid_read structure, that is not an
> improvement, we don't need two different structs for reading the read
> data.
>
> Please add the event field to the existing struct rmid_read.
>
>
>> > > @@ -1023,6 +1437,17 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
>> > >  	} else {
>> > >  		WARN_ON_ONCE(!state->rmid);
>> > >  	}
>> > > +
>> > > +	if (pmu) {
>> > > +		if (pmu->n_active > 0) {
>> >
>> > What's the purpose of this check? In the previous version there was a
>> > WARN_ON(), which made sense. Did it trigger and you decided to "work"
>> > around it?
>> >
>>
>> We actually meant to check if there are active events.
>
> I don't follow this answer. Are you saying that the WARN_ON() doesn't
> make sense here?
>
>
>>
>> > > +EVENT_ATTR_STR(llc_total_bw.unit, intel_cqm_llc_total_bw_unit, "KB/sec");
>> > > +EVENT_ATTR_STR(llc_local_bw.unit, intel_cqm_llc_local_bw_unit, "KB/sec");
>> > > +#endif
>> >
>> > > +static ssize_t
>> > > +sliding_window_size_store(struct device *dev,
>> > > +			  struct device_attribute *attr,
>> > > +			  const char *buf, size_t count)
>> > > +{
>> > > +	unsigned int bytes;
>> > > +	int ret;
>> > > +
>> > > +	ret = kstrtouint(buf, 0, &bytes);
>> > > +	if (ret)
>> > > +		return ret;
>> > > +
>> > > +	mutex_lock(&cache_mutex);
>> > > +	if (bytes > 0 && bytes <= MBM_FIFO_SIZE_MAX)
>> > > +		mbm_window_size = bytes;
>> >
>> > So, it's valid to set the window to X where 0 < X < MBM_FIFO_SIZE_MIN.
>> > What's the actual purpose of MBM_FIFO_SIZE_MIN?
>> >
>> This is changed to:
>>
>>	if (bytes >= MBM_FIFO_SIZE_MIN && bytes <= MBM_FIFO_SIZE_MAX)
>>		mbm_window_size = bytes;
>
> Note that if the user passes a value outside of this range you should be
> returning -EINVAL to indicate that.
>
>
>> > > +		pmu->timer_interval = ms_to_ktime(MBM_TIME_DELTA_MAX);
>> > > +		per_cpu(mbm_pmu, cpu) = pmu;
>> > > +		per_cpu(mbm_pmu_to_free, cpu) = NULL;
>> >
>> > What's the point of this? If there is still something to be free'd it's leaked.
>> > Otherwise that's redundant.
>>
>> per_cpu(mbm_pmu_to_free, cpu) = NULL; is removed.
>>
>> > > +		mbm_hrtimer_init(pmu);
>> > > +	}
>> > > +	return 0;
>> >
>> > s/0/NOTIFY_OK/ because you return that value directly.
>> >
>> You mean I return the "return code"?
>
> ?
>
> You should be using NOTIFY_OK here so that you follow the notifier API
> convention.
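Combining the range check with Matt's -EINVAL point, the store routine could look roughly like the sketch below, which is based on the quoted code rather than the final patch.

static ssize_t
sliding_window_size_store(struct device *dev, struct device_attribute *attr,
			  const char *buf, size_t count)
{
	unsigned int size;
	int ret;

	ret = kstrtouint(buf, 0, &size);
	if (ret)
		return ret;

	/* Reject values outside the supported window range. */
	if (size < MBM_FIFO_SIZE_MIN || size > MBM_FIFO_SIZE_MAX)
		return -EINVAL;

	mutex_lock(&cache_mutex);
	mbm_window_size = size;
	mutex_unlock(&cache_mutex);

	return count;
}

Likewise, the CPU notifier callback should finish with return NOTIFY_OK; instead of return 0;, since that value is handed straight back to the notifier core.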