from:"Ganapatrao Kulkarni"

Re: [PATCH v6 2/2] drivers/perf: Add CCPI2 PMU support in ThunderX2 UNCORE driver.

2019-10-18 Thread Ganapatrao Kulkarni

On Fri, Oct 18, 2019 at 2:08 PM John Garry  wrote:
>
> On 18/10/2019 05:21, Ganapatrao Kulkarni wrote:
> > Hi Will,
> >
> > On Thu, Oct 17, 2019 at 9:17 PM Will Deacon  wrote:
> >>
> >> On Thu, Oct 17, 2019 at 12:38:51PM +0530, Ganapatrao Kulkarni wrote:
> >>> On Wed, Oct 16, 2019 at 7:01 PM John Garry  wrote:
> >>>>> +TX2_EVENT_ATTR(req_pktsent, CCPI2_EVENT_REQ_PKT_SENT);
> >>>>> +TX2_EVENT_ATTR(snoop_pktsent, CCPI2_EVENT_SNOOP_PKT_SENT);
> >>>>> +TX2_EVENT_ATTR(data_pktsent, CCPI2_EVENT_DATA_PKT_SENT);
> >>>>> +TX2_EVENT_ATTR(gic_pktsent, CCPI2_EVENT_GIC_PKT_SENT);
> >>>>> +
> >>>>> +static struct attribute *ccpi2_pmu_events_attrs[] = {
> >>>>> + &tx2_pmu_event_attr_req_pktsent.attr.attr,
> >>>>> + &tx2_pmu_event_attr_snoop_pktsent.attr.attr,
> >>>>> + &tx2_pmu_event_attr_data_pktsent.attr.attr,
> >>>>> + &tx2_pmu_event_attr_gic_pktsent.attr.attr,
> >>>>> + NULL,
> >>>>> +};
> >>>>
> >>>> Hi Ganapatrao,
> >>>>
> >>>> Have you considered adding these as uncore pmu-events in the perf tool?
> >>>>
> >>> At the moment no, since the number of events exposed/listed are very few.
> >>
> >> Then sounds like a perfect time to nip it in the bud before the list grows
> >> ;)
> >
> > I had an internal discussion with the architecture team; they have
> > confirmed that these are the only published events and that there is
> > no plan to add new ones. However, if any such request comes from the
> > HW team in future, I will add them to the JSON files.
>
> Don't you find perf list is swamped with all the uncore events?
>
> For Huawei platform, I find this:
> ./perf list pmu | grep "Kernel PMU event" | grep hisi | wc -l
> 648
>

We don't have such an issue at the moment. As I said earlier, the events
exposed are limited.
There are 16 events in total (DMC, L3C and CCPI2) per socket.

root@SBR-26>~>> perf list | grep uncore | wc -l
32

> That's because we have so many instances of the same PMUs, not because
> there are many events per PMU.
>
> TBH, I would like to delete all the events from the hisi uncore kernel
> drivers, now that they're supported in the perf tool, but I think that
> would constitute an ABI breakage.
>
> Maybe there is a way to hide them, but I couldn't find it.
>
> John
>
> >
> > I have incorporated all your previous comments. Can you please Ack and
> > queue it for 5.5?
> >
> >>
> >> If you can manage with these things in userspace, then I agree with John
> >> that it would be preferential to do it that way. It also offers more
> >> flexibility if we get the metricgroup stuff working properly (I think it's
> >> buggered for big/little atm).
> >>
> >> Will
> >
> > Thanks,
> > Ganapat
> >
> > .
> >
>
>

Thanks,
Ganapat

Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

2019-10-14 Thread Ganapatrao Kulkarni

Hi Peter,

On Wed, Sep 18, 2019 at 12:51 PM Ganapatrao Kulkarni  wrote:
>
> On Fri, Aug 23, 2019 at 6:33 PM Peter Zijlstra  wrote:
> >
> > On Fri, Aug 23, 2019 at 06:26:34PM +0530, Ganapatrao Kulkarni wrote:
> > > > On Fri, Aug 23, 2019 at 5:29 PM Peter Zijlstra wrote:
> > > > On Fri, Aug 23, 2019 at 04:13:46PM +0530, Ganapatrao Kulkarni wrote:
> > > >
> > > > > > We are seeing a regression with our uncore perf driver (Marvell's
> > > > > > ThunderX2, ARM64 server platform) on 5.3-rc1.
> > > > > > After bisecting, it turned out that this patch causes the issue.
> > > >
> > > > Funnily enough; the email you replied to didn't contain a patch.
> > >
> > > Hmm sorry, not sure why the patch is clipped-off, I see it in my inbox.
> >
> > Your email is in a random spot of the discussion for me. At least it was
> > fairly easy to find the related patch.
> >
> > > > > Test case:
> > > > > > Load the module and run perf with more than 4 events (we have 4
> > > > > > counters, so event multiplexing takes place for more than 4
> > > > > > events), then unload the module.
> > > > > > With this test sequence, the system hangs (soft lockup) after 2
> > > > > > or 3 iterations. The same test runs for hours on 5.2.
> > > > >
> > > > > while [ 1 ]
> > > > > do
> > > > > rmmod thunderx2_pmu
> > > > > modprobe thunderx2_pmu
> > > > > perf stat -a -e \
> > > > > uncore_dmc_0/cnt_cycles/,\
> > > > > uncore_dmc_0/data_transfers/,\
> > > > > uncore_dmc_0/read_txns/,\
> > > > > uncore_dmc_0/config=0xE/,\
> > > > > uncore_dmc_0/write_txns/ sleep 1
> > > > > sleep 2
> > > > > done
> > > >
> > > > Can you reproduce without the module load+unload? I don't think people
> > > > routinely unload modules.
> > >
> > > > The issue won't happen if the module is not unloaded/reloaded.
> > > > IMHO, this could be a potential bug!
> >
> > Does the softlockup give a useful stacktrace? I don't have a thunderx2
> > so I cannot reproduce.
>
> Sorry for the late reply. Below is the dump that I get when I hit the
> soft lockup.
> Any suggestions on how to debug this further?
>
> Sequence of commands that leads to this lockup:
> insmod thunderx2_pmu.ko
> perf stat -e \
> uncore_dmc_0/cnt_cycles/,\
> uncore_dmc_0/data_transfers/,\
> uncore_dmc_0/read_txns/,\
> uncore_dmc_0/config=0xE/,\
> uncore_dmc_0/write_txns/\
> rmmod thunderx2_pmu
> insmod thunderx2_pmu.ko
>
> root@SBR-26>~>> [ 1065.946772] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [perf:5206]
> [ 1065.953722] Modules linked in: thunderx2_pmu(OE) nls_iso8859_1
> joydev input_leds bridge ipmi_ssif ipmi_devintf stp llc
> ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core
> iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev lp parport
> ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor
> xor_neon raid6_pq raid1 raid0 multipath linear aes_ce_blk hid_generic
> aes_ce_cipher usbhid uas usb_storage hid ast i2c_algo_bit
> drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt
> fb_sys_fops drm i40e i2c_smbus bnx2x crct10dif_ce ghash_ce e1000e
> sha2_ce mpt3sas nvme sha256_arm64 ptp ahci raid_class sha1_ce
> nvme_core scsi_transport_sas mdio libahci pps_core libcrc32c gpio_xlp
> i2c_xlp9xx aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
> [last unloaded: thunderx2_pmu]
> [ 1066.029640] CPU: 117 PID: 5206 Comm: perf Tainted: G   OE   5.3.0+ #160
> [ 1066.037109] Hardware name: Cavium Inc. Saber/Saber, BIOS TX2-FW-Release-7.2-build_08-0-g14f8c5bf8a 12/18/2018
> [ 1066.047009] pstate: 2049 (nzCv daif +PAN -UAO)
> [ 1066.051799] pc : smp_call_function_single+0x198/0x1b0
> [ 1066.056838] lr : smp_call_function_single+0x16c/0x1b0
> [ 1066.061875] sp : fc00434cfc50
> [ 1066.065177] x29: fc00434cfc50 x28: 
> [ 1066.070475] x27: febed4d2d952 x26: d4d2d800
> [ 1066.075774] x25: fddf5da1e240 x24: 0001
> [ 1066.081073] x23: fc00434cfd38 x22: fc001026adb8
> [ 1066.086371] x21:  x20: fc0011843000
> [ 1066.091669] x19: fc00434cfca0 x18: 
> [ 1066.096968] x17:  x16: 
> [ 1066.102266] x15: 000

Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

2019-09-18 Thread Ganapatrao Kulkarni

On Fri, Aug 23, 2019 at 6:33 PM Peter Zijlstra  wrote:
>
> On Fri, Aug 23, 2019 at 06:26:34PM +0530, Ganapatrao Kulkarni wrote:
> > On Fri, Aug 23, 2019 at 5:29 PM Peter Zijlstra  wrote:
> > > On Fri, Aug 23, 2019 at 04:13:46PM +0530, Ganapatrao Kulkarni wrote:
> > >
> > > > > We are seeing a regression with our uncore perf driver (Marvell's
> > > > > ThunderX2, ARM64 server platform) on 5.3-rc1.
> > > > > After bisecting, it turned out that this patch causes the issue.
> > >
> > > Funnily enough; the email you replied to didn't contain a patch.
> >
> > Hmm sorry, not sure why the patch is clipped-off, I see it in my inbox.
>
> Your email is in a random spot of the discussion for me. At least it was
> fairly easy to find the related patch.
>
> > > > Test case:
> > > > > Load the module and run perf with more than 4 events (we have 4
> > > > > counters, so event multiplexing takes place for more than 4
> > > > > events), then unload the module.
> > > > > With this test sequence, the system hangs (soft lockup) after 2
> > > > > or 3 iterations. The same test runs for hours on 5.2.
> > > >
> > > > while [ 1 ]
> > > > do
> > > > rmmod thunderx2_pmu
> > > > modprobe thunderx2_pmu
> > > > perf stat -a -e \
> > > > uncore_dmc_0/cnt_cycles/,\
> > > > uncore_dmc_0/data_transfers/,\
> > > > uncore_dmc_0/read_txns/,\
> > > > uncore_dmc_0/config=0xE/,\
> > > > uncore_dmc_0/write_txns/ sleep 1
> > > > sleep 2
> > > > done
> > >
> > > Can you reproduce without the module load+unload? I don't think people
> > > routinely unload modules.
> >
> > > The issue won't happen if the module is not unloaded/reloaded.
> > > IMHO, this could be a potential bug!
>
> Does the softlockup give a useful stacktrace? I don't have a thunderx2
> so I cannot reproduce.

Sorry for the late reply. Below is the dump that I get when I hit the
soft lockup.
Any suggestions on how to debug this further?

Sequence of commands that leads to this lockup:
insmod thunderx2_pmu.ko
perf stat -e \
uncore_dmc_0/cnt_cycles/,\
uncore_dmc_0/data_transfers/,\
uncore_dmc_0/read_txns/,\
uncore_dmc_0/config=0xE/,\
uncore_dmc_0/write_txns/\
rmmod thunderx2_pmu
insmod thunderx2_pmu.ko

root@SBR-26>~>> [ 1065.946772] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [perf:5206]
[ 1065.953722] Modules linked in: thunderx2_pmu(OE) nls_iso8859_1
joydev input_leds bridge ipmi_ssif ipmi_devintf stp llc
ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev lp parport
ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx xor
xor_neon raid6_pq raid1 raid0 multipath linear aes_ce_blk hid_generic
aes_ce_cipher usbhid uas usb_storage hid ast i2c_algo_bit
drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops drm i40e i2c_smbus bnx2x crct10dif_ce ghash_ce e1000e
sha2_ce mpt3sas nvme sha256_arm64 ptp ahci raid_class sha1_ce
nvme_core scsi_transport_sas mdio libahci pps_core libcrc32c gpio_xlp
i2c_xlp9xx aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[last unloaded: thunderx2_pmu]
[ 1066.029640] CPU: 117 PID: 5206 Comm: perf Tainted: G   OE   5.3.0+ #160
[ 1066.037109] Hardware name: Cavium Inc. Saber/Saber, BIOS TX2-FW-Release-7.2-build_08-0-g14f8c5bf8a 12/18/2018
[ 1066.047009] pstate: 2049 (nzCv daif +PAN -UAO)
[ 1066.051799] pc : smp_call_function_single+0x198/0x1b0
[ 1066.056838] lr : smp_call_function_single+0x16c/0x1b0
[ 1066.061875] sp : fc00434cfc50
[ 1066.065177] x29: fc00434cfc50 x28: 
[ 1066.070475] x27: febed4d2d952 x26: d4d2d800
[ 1066.075774] x25: fddf5da1e240 x24: 0001
[ 1066.081073] x23: fc00434cfd38 x22: fc001026adb8
[ 1066.086371] x21:  x20: fc0011843000
[ 1066.091669] x19: fc00434cfca0 x18: 
[ 1066.096968] x17:  x16: 
[ 1066.102266] x15:  x14: 
[ 1066.107564] x13:  x12: 0020
[ 1066.112862] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
[ 1066.118161] x9 :  x8 : febed44ec5e8
[ 1066.123459] x7 :  x6 : fc00434cfca0
[ 1066.128757] x5 : fc00434cfca0 x4 : 0001
[ 1066.134055] x3 : fc00434cfcb8 x2 : 
[ 1066.139353] x1 : 0003 x0 : 
[ 1066.144652] Call trace:
[ 1066.147088]  smp_call_function_single+0x198/0x1b0
[ 1066.151784]  perf_install_in_context+0x1b4/0x1d8
[ 1066.1

Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

2019-08-23 Thread Ganapatrao Kulkarni

Hi Peter,

On Fri, Aug 23, 2019 at 5:29 PM Peter Zijlstra  wrote:
>
>
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?

Sorry for the top-posting.
>
> On Fri, Aug 23, 2019 at 04:13:46PM +0530, Ganapatrao Kulkarni wrote:
>
> > We are seeing a regression with our uncore perf driver (Marvell's
> > ThunderX2, ARM64 server platform) on 5.3-rc1.
> > After bisecting, it turned out that this patch causes the issue.
>
> Funnily enough; the email you replied to didn't contain a patch.

Hmm sorry, not sure why the patch is clipped-off, I see it in my inbox.
>
> > Test case:
> > Load the module and run perf with more than 4 events (we have 4
> > counters, so event multiplexing takes place for more than 4 events),
> > then unload the module.
> > With this test sequence, the system hangs (soft lockup) after 2
> > or 3 iterations. The same test runs for hours on 5.2.
> >
> > while [ 1 ]
> > do
> > rmmod thunderx2_pmu
> > modprobe thunderx2_pmu
> > perf stat -a -e \
> > uncore_dmc_0/cnt_cycles/,\
> > uncore_dmc_0/data_transfers/,\
> > uncore_dmc_0/read_txns/,\
> > uncore_dmc_0/config=0xE/,\
> > uncore_dmc_0/write_txns/ sleep 1
> > sleep 2
> > done
>
> Can you reproduce without the module load+unload? I don't think people
> routinely unload modules.

The issue won't happen if the module is not unloaded/reloaded.
IMHO, this could be a potential bug!

>
>

Thanks,
Ganapat
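
For readers following the test case: with five events sharing four counters, each event only holds a counter for part of the run, and the raw count is normally extrapolated from the enabled and running times. The helper below is a minimal sketch of that scaling, assuming the usual count * time_enabled / time_running rule used by perf tooling. The name scale_count and the nanosecond units are illustrative assumptions and are not taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extrapolate a raw count from a multiplexed event (hypothetical helper).
 * enabled_ns: time the event was enabled.
 * running_ns: time the event actually held a hardware counter.
 */
static uint64_t scale_count(uint64_t raw, uint64_t enabled_ns,
			    uint64_t running_ns)
{
	/* An event cannot run for longer than it was enabled. */
	assert(running_ns <= enabled_ns);

	if (running_ns == 0)
		return 0;	/* never scheduled: nothing to extrapolate */

	/* Floating point avoids 64-bit overflow in raw * enabled_ns. */
	return (uint64_t)((double)raw * (double)enabled_ns /
			  (double)running_ns + 0.5);
}
```

An event that ran for 80% of its enabled time has its raw count scaled up by 1/0.8; an event that was never multiplexed is returned unchanged.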

Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

2019-08-23 Thread Ganapatrao Kulkarni

Hi,

We are seeing a regression with our uncore perf driver (Marvell's
ThunderX2, ARM64 server platform) on 5.3-rc1.
After bisecting, it turned out that this patch causes the issue.

Test case:
Load the module and run perf with more than 4 events (we have 4
counters, so event multiplexing takes place for more than 4 events),
then unload the module.
With this test sequence, the system hangs (soft lockup) after 2
or 3 iterations. The same test runs for hours on 5.2.

while [ 1 ]
do
rmmod thunderx2_pmu
modprobe thunderx2_pmu
perf stat -a -e \
uncore_dmc_0/cnt_cycles/,\
uncore_dmc_0/data_transfers/,\
uncore_dmc_0/read_txns/,\
uncore_dmc_0/config=0xE/,\
uncore_dmc_0/write_txns/ sleep 1
sleep 2
done

Is this patch tested adequately?

On Fri, Jun 28, 2019 at 3:18 AM Ian Rogers  wrote:
>
> On Mon, Jun 24, 2019 at 12:55 AM Peter Zijlstra wrote:
> >
> > On Fri, Jun 21, 2019 at 11:01:29AM -0700, Ian Rogers wrote:
> > > > On Fri, Jun 21, 2019 at 1:24 AM Peter Zijlstra wrote:
> > > >
> > > > On Sat, Jun 01, 2019 at 01:27:22AM -0700, Ian Rogers wrote:
> > > > > @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event 
> > > > > *event, void *data)
> > > > >   sid->can_add_hw = 0;
> > > > >   }
> > > > >
> > > > > + /*
> > > > > +  * If the group wasn't scheduled then set that multiplexing is 
> > > > > necessary
> > > > > +  * for the context. Note, this won't be set if the event wasn't
> > > > > +  * scheduled due to event_filter_match failing due to the 
> > > > > earlier
> > > > > +  * return.
> > > > > +  */
> > > > > + if (event->state == PERF_EVENT_STATE_INACTIVE)
> > > > > + sid->ctx->rotate_necessary = 1;
> > > > > +
> > > > >   return 0;
> > > > >  }
> > > >
> > > > That looked odd; which had me look harder at this function which
> > > > > resulted in the below. Should we not terminate the context iteration
> > > > the moment one flexible thingy fails to schedule?
> > >
> > > If we knew all the events were hardware events then this would be
> > > true, as there may be software events that always schedule then the
> > > continued iteration is necessary.
> >
> > But this is the 'old' code, where this is guaranteed by the context.
> > > That is, if this is a hardware context; there will only be software
> > events due to them being in a group with hardware events.
> >
> > If this is a software group, then we'll never fail to schedule and we'll
> > not get in this branch to begin with.
> >
> > Or am I now confused for having been staring at two different code-bases
> > at the same time?
>
> I believe you're right and I'd overlooked this. I think there is a
> more efficient version of this code possible that I can follow up
> with. There are 3 perf_event_contexts, potentially a sw and hw context
> within the task_struct and one per-CPU in perf_cpu_context. With this
> change I'm focussed on improving rotation of cgroup events that appear
> system wide within the per-CPU context. Without cgroup events the
> system wide events don't need to be iterated over during scheduling
> in. The branch that can set rotate_necessary will only be necessary
> for the task hardware events. For system wide events, considered with
> cgroup mode scheduling in, the branch is necessary as rotation may be
> necessary. It'd be possible to have variants of flexible_sched_in that
> behave differently in the task software event context, and the system
> wide and task hardware contexts.
>
> I have a series of patches related to Kan Liang's cgroup
> perf_event_groups improvements. I'll mail these out and see if I can
> avoid the branch in the task software event context case.
>
> Thanks,
> Ian

Thanks,
Ganapat
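
To make the rotate_necessary discussion above concrete, here is a minimal userspace model of the idea: walk the flexible events, place each one while counters remain, and flag the context for rotation if any event is left inactive. This is a sketch under simplifying assumptions, not the kernel's flexible_sched_in(); the toy_* types and names are hypothetical, and groups, filters and software events are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ev_state { EV_INACTIVE, EV_ACTIVE };

struct toy_event {
	enum ev_state state;
};

struct toy_ctx {
	int free_counters;	/* hardware counters still available */
	bool rotate_necessary;	/* set when some event could not be placed */
};

/* Try to place every event; flag rotation if any event stays inactive. */
static void toy_sched_in(struct toy_ctx *ctx, struct toy_event *evs, size_t n)
{
	assert(ctx && evs);

	ctx->rotate_necessary = false;
	for (size_t i = 0; i < n; i++) {
		if (ctx->free_counters > 0) {
			ctx->free_counters--;
			evs[i].state = EV_ACTIVE;
		} else {
			evs[i].state = EV_INACTIVE;
			ctx->rotate_necessary = true;
		}
	}
}
```

With four counters, scheduling four events leaves the flag clear, while a fifth event sets it, which corresponds to the more-than-4-events condition in the test case.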

[PATCH v3 0/2] Add CCPI2 PMU support

2019-07-23 Thread Ganapatrao Kulkarni

Add Cavium Coherent Processor Interconnect (CCPI2) PMU
support in ThunderX2 Uncore driver.

v3: Rebased to 5.3-rc1

v2: Updated with review comments [1]

[1] https://lkml.org/lkml/2019/6/14/965

v1: initial patch

Ganapatrao Kulkarni (2):
  Documentation: perf: Update documentation for ThunderX2 PMU uncore
driver
  drivers/perf: Add CCPI2 PMU support in ThunderX2 UNCORE driver.

 .../admin-guide/perf/thunderx2-pmu.rst|  20 +-
 drivers/perf/thunderx2_pmu.c  | 248 +++---
 2 files changed, 225 insertions(+), 43 deletions(-)

-- 
2.17.1

Re: [PATCH 2/2] drivers/perf: Add CCPI2 PMU support in ThunderX2 UNCORE driver.

2019-06-27 Thread Ganapatrao Kulkarni

Hi Will,

On Thu, Jun 27, 2019 at 3:27 PM Will Deacon  wrote:
>
> Hi Ganapat,
>
> On Fri, Jun 14, 2019 at 05:42:46PM +, Ganapatrao Kulkarni wrote:
> > CCPI2 is a low-latency high-bandwidth serial interface for connecting
> > ThunderX2 processors. This patch adds support to capture CCPI2 perf events.
> >
> > Signed-off-by: Ganapatrao Kulkarni 
> > ---
> >  drivers/perf/thunderx2_pmu.c | 179 ++-
> >  1 file changed, 157 insertions(+), 22 deletions(-)
> >
> > diff --git a/drivers/perf/thunderx2_pmu.c b/drivers/perf/thunderx2_pmu.c
> > index 43d76c85da56..3791ac9b897d 100644
> > --- a/drivers/perf/thunderx2_pmu.c
> > +++ b/drivers/perf/thunderx2_pmu.c
> > @@ -16,7 +16,9 @@
> >   * they need to be sampled before overflow(i.e, at every 2 seconds).
> >   */
> >
> > -#define TX2_PMU_MAX_COUNTERS 4
> > +#define TX2_PMU_DMC_L3C_MAX_COUNTERS 4
>
> I find it odd that you rename this...

I am not sure how to avoid this, since DMC/L3C have 4 counters and CCPI2 has 8.
I will try to make this better in v2.
>
> > +#define TX2_PMU_CCPI2_MAX_COUNTERS   8
> > +
> >  #define TX2_PMU_DMC_CHANNELS 8
> >  #define TX2_PMU_L3_TILES 16
> >
> > @@ -28,11 +30,22 @@
> >*/
> >  #define DMC_EVENT_CFG(idx, val)  ((val) << (((idx) * 8) + 1))
> >
> > +#define GET_EVENTID_CCPI2(ev)  ((ev->hw.config) & 0x1ff)
> > +/* bits[3:0] to select counters, starts from 8, bit[3] set always. */
> > +#define GET_COUNTERID_CCPI2(ev)  ((ev->hw.idx) & 0x7)
> > +#define CCPI2_COUNTER_OFFSET 8
>
>
> ... but leave GET_EVENTID alone, even though it only applies to DMC/L3C
> events. Saying that, if you have the event you can figure out its type,
> so could you avoid the need for additional macros entirely and just use
> the correct masks based on the corresponding PMU type? That might also
> avoid some of the conditional control flow you're introducing elsewhere.

Sure, I will add the mask as an argument to the macro.
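
A rough userspace sketch of what a mask-taking accessor could look like, for illustration only. The toy_* structs stand in for the kernel's perf_event and hw_perf_event, the two mask names are hypothetical, and the mask values are read off the quoted patch (config:0-4 for DMC/L3C, 0x1ff for CCPI2).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the kernel structures; only the fields used here. */
struct toy_hw_event {
	uint64_t config;
	int idx;
};

struct toy_perf_event {
	struct toy_hw_event hw;
};

/* Hypothetical mask names; values follow the quoted patch. */
#define DMC_L3C_EVENTID_MASK	0x1f
#define CCPI2_EVENTID_MASK	0x1ff

/* One accessor for all PMU types; the caller passes the type's mask. */
#define GET_EVENTID(ev, mask)	((ev)->hw.config & (mask))

static uint32_t event_id(const struct toy_perf_event *ev, uint32_t mask)
{
	assert(ev);
	return (uint32_t)GET_EVENTID(ev, mask);
}
```

The caller selects the mask from the PMU type once, so the per-type GET_EVENTID_CCPI2 variant is no longer needed.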
>
> >  #define L3C_COUNTER_CTL  0xA8
> >  #define L3C_COUNTER_DATA 0xAC
> >  #define DMC_COUNTER_CTL  0x234
> >  #define DMC_COUNTER_DATA 0x240
> >
> > +#define CCPI2_PERF_CTL  0x108
> > +#define CCPI2_COUNTER_CTL  0x10C
> > +#define CCPI2_COUNTER_SEL  0x12c
> > +#define CCPI2_COUNTER_DATA_L  0x130
> > +#define CCPI2_COUNTER_DATA_H  0x134
> > +
> >  /* L3C event IDs */
> >  #define L3_EVENT_READ_REQ  0xD
> >  #define L3_EVENT_WRITEBACK_REQ  0xE
> > @@ -51,9 +64,16 @@
> >  #define DMC_EVENT_READ_TXNS  0xF
> >  #define DMC_EVENT_MAX  0x10
> >
> > +#define CCPI2_EVENT_REQ_PKT_SENT  0x3D
> > +#define CCPI2_EVENT_SNOOP_PKT_SENT  0x65
> > +#define CCPI2_EVENT_DATA_PKT_SENT  0x105
> > +#define CCPI2_EVENT_GIC_PKT_SENT  0x12D
> > +#define CCPI2_EVENT_MAX  0x200
> > +
> >  enum tx2_uncore_type {
> >   PMU_TYPE_L3C,
> >   PMU_TYPE_DMC,
> > + PMU_TYPE_CCPI2,
> >   PMU_TYPE_INVALID,
> >  };
> >
> > @@ -73,8 +93,8 @@ struct tx2_uncore_pmu {
> >   u32 max_events;
> >   u64 hrtimer_interval;
> >   void __iomem *base;
> > - DECLARE_BITMAP(active_counters, TX2_PMU_MAX_COUNTERS);
> > - struct perf_event *events[TX2_PMU_MAX_COUNTERS];
> > + DECLARE_BITMAP(active_counters, TX2_PMU_CCPI2_MAX_COUNTERS);
> > + struct perf_event *events[TX2_PMU_DMC_L3C_MAX_COUNTERS];
>
> Hmm, that looks very odd. Why are they now different sizes?

events[] holds references to active events for use in the timer
callback, which is not applicable to CCPI2, hence 4.
active_counters is sized to the maximum of both, i.e., 8. I will try to
make it more readable in v2.
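
As an illustration of the sizing argument, the sketch below keeps one bitmap wide enough for the larger PMU and bounds allocation by a per-PMU max_counters field. It is a userspace model with hypothetical names (toy_pmu, alloc_counter, free_counter), not the driver's code, which uses the kernel bitmap helpers.

```c
#include <assert.h>
#include <stdint.h>

#define DMC_L3C_MAX_COUNTERS	4
#define CCPI2_MAX_COUNTERS	8

struct toy_pmu {
	unsigned int max_counters;	/* 4 for DMC/L3C, 8 for CCPI2 */
	uint8_t active_counters;	/* one bit per counter, sized for 8 */
};

/* Claim the lowest free counter; return its index or -1 if all are busy. */
static int alloc_counter(struct toy_pmu *pmu)
{
	assert(pmu->max_counters <= CCPI2_MAX_COUNTERS);

	for (unsigned int i = 0; i < pmu->max_counters; i++) {
		if (!(pmu->active_counters & (1u << i))) {
			pmu->active_counters |= (uint8_t)(1u << i);
			return (int)i;
		}
	}
	return -1;
}

/* Release a previously claimed counter. */
static void free_counter(struct toy_pmu *pmu, int idx)
{
	assert(idx >= 0 && (unsigned int)idx < pmu->max_counters);
	pmu->active_counters &= (uint8_t)~(1u << idx);
}
```

A DMC/L3C instance never hands out an index above 3 even though the bitmap has room for 8, which is the relationship between the two sizes described above.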

>
> >   struct device *dev;
> >   struct hrtimer hrtimer;
> >   const struct attribute_group **attr_groups;
> > @@ -92,7 +112,21 @@ static inline struct tx2_uncore_pmu *pmu_to_tx2_pmu(struct pmu *pmu)
> >   return container_of(pmu, struct tx2_uncore_pmu, pmu);
> >  }
> >
> > -PMU_FORMAT_ATTR(event,   "config:0-4");
> > +#define TX2_PMU_FORMAT_ATTR(_var, _name, _format)\
> > +static ssize_t \
> > +__tx2_pmu_##_var##_show(struct device *dev,  \
> > +str

Re: [PATCH v3] irqchip: gicv3-its: Use NUMA aware memory allocation for ITS tables

2019-01-10 Thread Ganapatrao Kulkarni

Hi Shameer,

Patch looks OK to me, please feel free to add,
Reviewed-by: Ganapatrao Kulkarni 

On Thu, Dec 13, 2018 at 5:25 PM Marc Zyngier  wrote:
>
> On 13/12/2018 10:59, Shameer Kolothum wrote:
> > From: Shanker Donthineni 
> >
> > The NUMA node information is visible to ITS driver but not being used
> > other than handling hardware errata. ITS/GICR hardware accesses to the
> > local NUMA node is usually quicker than the remote NUMA node. How slow
> > the remote NUMA accesses are depends on the implementation details.
> >
> > This patch allocates memory for ITS management tables and command
> > queue from the corresponding NUMA node using the appropriate NUMA
> > aware functions. This change improves the performance of the ITS
> > tables read latency on systems where it has more than one ITS block,
> > and with the slower inter node accesses.
> >
> > Apache Web server benchmarking using ab tool on a HiSilicon D06
> > board with multiple numa mem nodes shows Time per request and
> > Transfer rate improvements of ~3.6% with this patch.
> >
> > Signed-off-by: Shanker Donthineni 
> > Signed-off-by: Hanjun Guo 
> > Signed-off-by: Shameer Kolothum 
> > ---
> >
> > This is to revive the patch originally sent by Shanker[1] and
> > to back it up with a benchmark test. Any further testing of
> > this is most welcome.
> >
> > v2-->v3
> >  -Addressed comments to use page_address().
> >  -Added Benchmark results to commit log.
> >  -Removed T-by from Ganapatrao for now.
> >
> > v1-->v2
> >  -Edited commit text.
> >  -Added Ganapatrao's tested-by.
> >
> > Benchmark test details:
> > 
> > Test Setup:
> > -D06 with dimm on node 0(Sock#0) and 3 (Sock#1).
> > -ITS belongs to numa node 0.
> > -Filesystem mounted on a PCIe NVMe based disk.
> > -Apache server installed on D06.
> > -Running ab benchmark test in concurrency mode from a remote m/c
> >  connected to D06 via  hns3(PCIe) n/w port.
> >  "ab -k -c 750 -n 200 http://10.202.225.188/"
> >
> > Test results are avg. of 15 runs.
> >
> > For 4.20-rc1  Kernel,
> > 
> > Time per request(mean, concurrent)  = 0.02753[ms]
> > Transfer Rate = 416501[Kbytes/sec]
> >
> > For 4.20-rc1 +  this patch,
> > --
> > Time per request(mean, concurrent)  = 0.02653[ms]
> > Transfer Rate = 431954[Kbytes/sec]
> >
> > % improvement ~3.6%
> >
> > vmstat shows around 170K-200K interrupts per second.
> >
> > ~# vmstat 1 -w
> > procs ---memory-- -  -system--
> >  r  b swpd freein
> >  5  00 30166724  102794
> >  9  00 30141828  171148
> >  5  00 30150160  207185
> > 13  00 30145924  175691
> > 15  00 30140792  145250
> > 13  00 30135556  201879
> > 13  00 30134864  192391
> > 10  00 30133632  168880
> > 
> >
> > [1] https://patchwork.kernel.org/patch/989/
>
> The figures certainly look convincing. I'd need someone from Cavium to
> benchmark it on their hardware and come back with results so that we can
> make a decision on this.

Hi Marc,
My setup got altered during the lab migration from the Cavium office to the Marvell office.
I don't think I will have the same setup anytime soon.

>
> Thanks,
>
> M.
> --
> Jazz is not dead. It just smells funny...

Thanks,
Ganapat
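
The improvement quoted in the commit log can be rechecked from the two pairs of figures. The helpers below are a small sketch of that arithmetic; the function names are illustrative. With the quoted numbers, time per request improves by roughly 3.6% and transfer rate by roughly 3.7%, in line with the "~3.6%" stated in the patch.

```c
#include <assert.h>

/* Percentage gain for a metric where a lower value is better (latency). */
static double pct_gain_lower_is_better(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Percentage gain for a metric where a higher value is better (throughput). */
static double pct_gain_higher_is_better(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

For example, pct_gain_lower_is_better(0.02753, 0.02653) covers the time-per-request figures and pct_gain_higher_is_better(416501.0, 431954.0) covers the transfer rate.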

Re: [PATCH v11 0/2] Add ThunderX2 SoC Performance Monitoring Unit driver

2018-12-06 Thread Ganapatrao Kulkarni

On Thu, Dec 6, 2018 at 6:04 PM Will Deacon  wrote:
>
> Hi Ganapat,
>
> On Thu, Dec 06, 2018 at 11:51:24AM +, Kulkarni, Ganapatrao wrote:
> > This patchset adds PMU driver for Cavium's ThunderX2 SoC UNCORE devices.
> > The SoC has PMU support in L3 cache controller (L3C) and in the
> > DDR4 Memory Controller (DMC).
> >
> >
> > v11:
> >   Updated Patch 2 with minor comments.
>
> Thanks. I've pushed this out on my perf/updates branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=perf/updates
>
> but note that I did make some spelling and formatting changes to the
> Documentation patch, as well as removal of the event descriptions (they
> should be part of the perf tool, as is done for the other system PMUs).

Thanks Will.
>
> Will

Thanks,
Ganapat

Re: [PATCH v6 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

2018-10-11 Thread Ganapatrao Kulkarni

Hi Suzuki,

On Wed, Oct 10, 2018 at 3:22 PM Suzuki K Poulose  wrote:
>
> Hi Ganapatrao,
>
> On 21/06/18 07:33, Ganapatrao Kulkarni wrote:
> > This patch adds a perf driver for the PMU UNCORE devices DDR4 Memory
> > Controller(DMC) and Level 3 Cache(L3C).
> >
> > ThunderX2 has 8 independent DMC PMUs to capture performance events
> > corresponding to 8 channels of DDR4 Memory Controller and 16 independent
> > L3C PMUs to capture events corresponding to 16 tiles of L3 cache.
> > Each PMU supports up to 4 counters. All counters lack overflow interrupt
> > and are sampled periodically.
> >
> > Signed-off-by: Ganapatrao Kulkarni 
> > ---
> >   drivers/perf/Kconfig |   8 +
> >   drivers/perf/Makefile|   1 +
> >   drivers/perf/thunderx2_pmu.c | 949 +++
> >   include/linux/cpuhotplug.h   |   1 +
> >   4 files changed, 959 insertions(+)
> >   create mode 100644 drivers/perf/thunderx2_pmu.c
> >
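
Since the counters have no overflow interrupt and are sampled periodically, each sample has to be folded into a wider software total. The sketch below shows the usual unsigned-delta technique, assuming 32-bit hardware counters (consistent with the reg_readl accessors in the quoted code) and at most one wrap between samples. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct toy_counter {
	uint32_t prev;	/* last raw hardware value seen */
	uint64_t total;	/* accumulated 64-bit count */
};

/*
 * Fold a new raw sample into the running total. Correct as long as the
 * hardware counter wrapped at most once since the previous sample, which
 * is why the sampling period must be shorter than the overflow time.
 */
static void counter_sample(struct toy_counter *c, uint32_t raw)
{
	assert(c);

	/* Unsigned subtraction is modulo 2^32, so a single wrap is absorbed. */
	uint32_t delta = raw - c->prev;

	c->prev = raw;
	c->total += delta;
}
```

If the previous raw value was 0xFFFFFFF0 and the next sample reads 0x10, the delta is 0x20 even though the raw value went down.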
>
>
> > +
> > +/*
> > + * DMC and L3 counter interface is muxed across all channels.
> > + * hence we need to select the channel before accessing counter
> > + * data/control registers.
> > + *
> > + *  L3 Tile and DMC channel selection is through SMC call
> > + *  SMC call arguments,
> > + *   x0 = THUNDERX2_SMC_CALL_ID  (Vendor SMC call Id)
> > + *   x1 = THUNDERX2_SMC_SET_CHANNEL  (Id to set DMC/L3C channel)
> > + *   x2 = Node id
> > + *   x3 = DMC(1)/L3C(0)
> > + *   x4 = channel Id
> > + */
> > +static void uncore_select_channel(struct perf_event *event)
> > +{
> > + struct arm_smccc_res res;
> > + struct thunderx2_pmu_uncore_channel *pmu_uncore =
> > + pmu_to_thunderx2_pmu_uncore(event->pmu);
> > + struct thunderx2_pmu_uncore_dev *uncore_dev =
> > + pmu_uncore->uncore_dev;
> > +
> > + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
> > + uncore_dev->node, uncore_dev->type,
> > + pmu_uncore->channel, 0, 0, 0, &res);
> > + if (res.a0) {
> > + dev_err(uncore_dev->dev,
> > + "SMC to Select channel failed for PMU UNCORE[%s]\n",
> > + pmu_uncore->uncore_dev->name);
> > + }
> > +}
> > +
>
> > +static void uncore_start_event_l3c(struct perf_event *event, int flags)
> > +{
> > + u32 val;
> > + struct hw_perf_event *hwc = &event->hw;
> > +
> > + /* event id encoded in bits [07:03] */
> > + val = GET_EVENTID(event) << 3;
> > + reg_writel(val, hwc->config_base);
> > + local64_set(&hwc->prev_count, 0);
> > + reg_writel(0, hwc->event_base);
> > +}
> > +
> > +static void uncore_stop_event_l3c(struct perf_event *event)
> > +{
> > + reg_writel(0, event->hw.config_base);
> > +}
> > +
> > +static void uncore_start_event_dmc(struct perf_event *event, int flags)
> > +{
> > + u32 val;
> > + struct hw_perf_event *hwc = &event->hw;
> > + int idx = GET_COUNTERID(event);
> > + int event_type = GET_EVENTID(event);
> > +
> > + /* enable and start counters.
> > +  * 8 bits for each counter, bits[05:01] of a counter to set event type.
> > +  */
> > + val = reg_readl(hwc->config_base);
> > + val &= ~DMC_EVENT_CFG(idx, 0x1f);
> > + val |= DMC_EVENT_CFG(idx, event_type);
> > + reg_writel(val, hwc->config_base);
> > + local64_set(&hwc->prev_count, 0);
> > + reg_writel(0, hwc->event_base);
> > +}
> > +
> > +static void uncore_stop_event_dmc(struct perf_event *event)
> > +{
> > + u32 val;
> > + struct hw_perf_event *hwc = &event->hw;
> > + int idx = GET_COUNTERID(event);
> > +
> > + /* clear event type(bits[05:01]) to stop counter */
> > + val = reg_readl(hwc->config_base);
> > + val &= ~DMC_EVENT_CFG(idx, 0x1f);
> > + reg_writel(val, hwc->config_base);
> > +}
> > +
> > +static void init_cntr_base_l3c(struct perf_event *event,
> > + struct thunderx2_pmu_uncore_dev *uncore_dev)
> > +{
> > + struct hw_perf_event *hwc = &event->hw;
> > +
> > + /* counter ctrl/data reg offset at 8 */
> > + hwc->config_base = (unsigned long)uncore_dev->base
> > + + L3C_COUNTER_CTL + (8 * GET_COUNTERID(event));
> > + hwc->event_base =  (unsigned long)uncore_dev->base
&
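
The DMC control register layout described in the quoted code (8 bits per counter, event type in bits [5:1]) can be modelled in userspace as below. DMC_EVENT_CFG is copied from the quoted patch with an added cast; dmc_set_event and dmc_clear_event are hypothetical helpers that operate on a plain value instead of an MMIO register.

```c
#include <assert.h>
#include <stdint.h>

/* From the quoted patch: 8 bits per counter, event type in bits [5:1]. */
#define DMC_EVENT_CFG(idx, val)	((uint32_t)(val) << (((idx) * 8) + 1))

/* Program an event type into counter idx, leaving other counters alone. */
static uint32_t dmc_set_event(uint32_t reg, unsigned int idx,
			      uint32_t event_type)
{
	assert(idx < 4 && event_type <= 0x1f);

	reg &= ~DMC_EVENT_CFG(idx, 0x1f);	/* clear the old event type */
	reg |= DMC_EVENT_CFG(idx, event_type);	/* set the new one */
	return reg;
}

/* Clear the event type of counter idx, which stops that counter. */
static uint32_t dmc_clear_event(uint32_t reg, unsigned int idx)
{
	assert(idx < 4);
	return reg & ~DMC_EVENT_CFG(idx, 0x1f);
}
```

This mirrors the read-modify-write sequence in uncore_start_event_dmc() and uncore_stop_event_dmc(): only the 5-bit field of the selected counter changes.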

Re: [PATCH v6 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

2018-10-11 Thread Ganapatrao Kulkarni

Hi Suzuki,

On Wed, Oct 10, 2018 at 3:22 PM Suzuki K Poulose  wrote:
>
> Hi Ganapatrao,
>
> On 21/06/18 07:33, Ganapatrao Kulkarni wrote:
> > This patch adds a perf driver for the PMU UNCORE devices DDR4 Memory
> > Controller(DMC) and Level 3 Cache(L3C).
> >
> > ThunderX2 has 8 independent DMC PMUs to capture performance events
> > corresponding to 8 channels of DDR4 Memory Controller and 16 independent
> > L3C PMUs to capture events corresponding to 16 tiles of L3 cache.
> > Each PMU supports up to 4 counters. All counters lack overflow interrupt
> > and are sampled periodically.
> >
> > Signed-off-by: Ganapatrao Kulkarni 
> > ---
> >   drivers/perf/Kconfig |   8 +
> >   drivers/perf/Makefile|   1 +
> >   drivers/perf/thunderx2_pmu.c | 949 
> > +++
> >   include/linux/cpuhotplug.h   |   1 +
> >   4 files changed, 959 insertions(+)
> >   create mode 100644 drivers/perf/thunderx2_pmu.c
> >
>
>
> > +
> > +/*
> > + * DMC and L3 counter interface is muxed across all channels.
> > + * hence we need to select the channel before accessing counter
> > + * data/control registers.
> > + *
> > + *  L3 Tile and DMC channel selection is through SMC call
> > + *  SMC call arguments,
> > + *   x0 = THUNDERX2_SMC_CALL_ID  (Vendor SMC call Id)
> > + *   x1 = THUNDERX2_SMC_SET_CHANNEL  (Id to set DMC/L3C channel)
> > + *   x2 = Node id
> > + *   x3 = DMC(1)/L3C(0)
> > + *   x4 = channel Id
> > + */
> > +static void uncore_select_channel(struct perf_event *event)
> > +{
> > + struct arm_smccc_res res;
> > + struct thunderx2_pmu_uncore_channel *pmu_uncore =
> > + pmu_to_thunderx2_pmu_uncore(event->pmu);
> > + struct thunderx2_pmu_uncore_dev *uncore_dev =
> > + pmu_uncore->uncore_dev;
> > +
> > + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
> > + uncore_dev->node, uncore_dev->type,
> > + pmu_uncore->channel, 0, 0, 0, );
> > + if (res.a0) {
> > + dev_err(uncore_dev->dev,
> > + "SMC to Select channel failed for PMU UNCORE[%s]\n",
> > + pmu_uncore->uncore_dev->name);
> > + }
> > +}
> > +
>
> > +static void uncore_start_event_l3c(struct perf_event *event, int flags)
> > +{
> > + u32 val;
> > + struct hw_perf_event *hwc = >hw;
> > +
> > + /* event id encoded in bits [07:03] */
> > + val = GET_EVENTID(event) << 3;
> > + reg_writel(val, hwc->config_base);
> > + local64_set(>prev_count, 0);
> > + reg_writel(0, hwc->event_base);
> > +}
> > +
> > +static void uncore_stop_event_l3c(struct perf_event *event)
> > +{
> > + reg_writel(0, event->hw.config_base);
> > +}
> > +
> > +static void uncore_start_event_dmc(struct perf_event *event, int flags)
> > +{
> > + u32 val;
> > + struct hw_perf_event *hwc = &event->hw;
> > + int idx = GET_COUNTERID(event);
> > + int event_type = GET_EVENTID(event);
> > +
> > + /* enable and start counters.
> > +  * 8 bits for each counter, bits[05:01] of a counter to set event type.
> > +  */
> > + val = reg_readl(hwc->config_base);
> > + val &= ~DMC_EVENT_CFG(idx, 0x1f);
> > + val |= DMC_EVENT_CFG(idx, event_type);
> > + reg_writel(val, hwc->config_base);
> > + local64_set(&hwc->prev_count, 0);
> > + reg_writel(0, hwc->event_base);
> > +}
> > +
> > +static void uncore_stop_event_dmc(struct perf_event *event)
> > +{
> > + u32 val;
> > + struct hw_perf_event *hwc = &event->hw;
> > + int idx = GET_COUNTERID(event);
> > +
> > + /* clear event type(bits[05:01]) to stop counter */
> > + val = reg_readl(hwc->config_base);
> > + val &= ~DMC_EVENT_CFG(idx, 0x1f);
> > + reg_writel(val, hwc->config_base);
> > +}
> > +
> > +static void init_cntr_base_l3c(struct perf_event *event,
> > + struct thunderx2_pmu_uncore_dev *uncore_dev)
> > +{
> > + struct hw_perf_event *hwc = &event->hw;
> > +
> > + /* counter ctrl/data reg offset at 8 */
> > + hwc->config_base = (unsigned long)uncore_dev->base
> > + + L3C_COUNTER_CTL + (8 * GET_COUNTERID(event));
> > + hwc->event_base =  (unsigned long)uncore_dev->base
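
The register encodings described in the quoted comments (L3C event id in bits [07:03]; DMC with 8 bits per counter and the event type in bits [05:01]) can be sketched in plain C as below. This is only an illustration under stated assumptions: the helper names are hypothetical, the shift in dmc_event_cfg() is derived from the comment rather than copied from the driver's DMC_EVENT_CFG macro, and the limit of four counters is assumed from a 32-bit register holding 8-bit fields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: place a 5-bit value into bits [5:1] of the 8-bit
 * field that belongs to counter `idx` (derived from the quoted comment). */
static inline uint32_t dmc_event_cfg(unsigned int idx, uint32_t val)
{
    assert(idx < 4); /* assumption: four 8-bit fields in a 32-bit register */
    return val << (idx * 8 + 1);
}

/* Select `event_type` for counter `idx`, leaving other counters untouched. */
static inline uint32_t dmc_set_event(uint32_t cfg, unsigned int idx,
                                     uint32_t event_type)
{
    cfg &= ~dmc_event_cfg(idx, 0x1f);             /* clear old event type */
    cfg |= dmc_event_cfg(idx, event_type & 0x1f); /* set new event type */
    return cfg;
}

/* Clearing the event type field is what stops a DMC counter. */
static inline uint32_t dmc_clear_event(uint32_t cfg, unsigned int idx)
{
    return cfg & ~dmc_event_cfg(idx, 0x1f);
}

/* L3C: the event id is encoded in bits [7:3] of the control register. */
static inline uint32_t l3c_event_cfg(uint32_t event_id)
{
    return (event_id & 0x1f) << 3;
}
```

The read-modify-write in dmc_set_event() mirrors why the DMC start/stop paths read config_base first: several counters share one control register, whereas the L3C path writes a whole register per counter.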

Re: [PATCH v6 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

2018-10-08 Thread Ganapatrao Kulkarni

Hi Pranith,

On Sat, Jul 7, 2018 at 11:22 AM Pranith Kumar  wrote:
>
> Hi Ganapatrao,
>
>
> On Wed, Jun 20, 2018 at 11:33 PM, Ganapatrao Kulkarni
>  wrote:
>
> > +
> > +enum thunderx2_uncore_l3_events {
> > +   L3_EVENT_NONE,
> > +   L3_EVENT_NBU_CANCEL,
> > +   L3_EVENT_DIB_RETRY,
> > +   L3_EVENT_DOB_RETRY,
> > +   L3_EVENT_DIB_CREDIT_RETRY,
> > +   L3_EVENT_DOB_CREDIT_RETRY,
> > +   L3_EVENT_FORCE_RETRY,
> > +   L3_EVENT_IDX_CONFLICT_RETRY,
> > +   L3_EVENT_EVICT_CONFLICT_RETRY,
> > +   L3_EVENT_BANK_CONFLICT_RETRY,
> > +   L3_EVENT_FILL_ENTRY_RETRY,
> > +   L3_EVENT_EVICT_NOT_READY_RETRY,
> > +   L3_EVENT_L3_RETRY,
> > +   L3_EVENT_READ_REQ,
> > +   L3_EVENT_WRITE_BACK_REQ,
> > +   L3_EVENT_INVALIDATE_NWRITE_REQ,
> > +   L3_EVENT_INV_REQ,
> > +   L3_EVENT_SELF_REQ,
> > +   L3_EVENT_REQ,
> > +   L3_EVENT_EVICT_REQ,
> > +   L3_EVENT_INVALIDATE_NWRITE_HIT,
> > +   L3_EVENT_INVALIDATE_HIT,
> > +   L3_EVENT_SELF_HIT,
> > +   L3_EVENT_READ_HIT,
> > +   L3_EVENT_MAX,
> > +};
> > +
> > +enum thunderx2_uncore_dmc_events {
> > +   DMC_EVENT_NONE,
> > +   DMC_EVENT_COUNT_CYCLES,
> > +   DMC_EVENT_RES2,
> > +   DMC_EVENT_RES3,
> > +   DMC_EVENT_RES4,
> > +   DMC_EVENT_RES5,
> > +   DMC_EVENT_RES6,
> > +   DMC_EVENT_RES7,
> > +   DMC_EVENT_RES8,
> > +   DMC_EVENT_READ_64B_TXNS,
> > +   DMC_EVENT_READ_BELOW_64B_TXNS,
> > +   DMC_EVENT_WRITE_TXNS,
> > +   DMC_EVENT_TXN_CYCLES,
> > +   DMC_EVENT_DATA_TRANSFERS,
> > +   DMC_EVENT_CANCELLED_READ_TXNS,
> > +   DMC_EVENT_CONSUMED_READ_TXNS,
> > +   DMC_EVENT_MAX,
> > +};
>
> Can you please provide a link to where these counters are documented?
> It is not clear what each counter does especially for the L3C events.

I will add a brief description of each event in Documentation.
>
> Also, what counter do you need to use to get L3 hit/miss ratio? I
> think this is the most useful counter related to L3.

L3C cache Hit Ratio = (L3_READ_HIT + L3_INV_N_WRITE_HIT +
L3_INVALIDATE_HIT) / (L3_READ_REQ + L3_INV_N_WRITE_REQ +
L3_INVALIDATE_REQ)
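
As a rough sketch, the ratio above could be computed from six counter readings as follows. The struct and function names are hypothetical, and how the six counts are collected (for example, summed over all L3C PMU instances) is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the six counts used in the formula above. */
struct l3c_counts {
    uint64_t read_hit, inv_nwrite_hit, invalidate_hit;
    uint64_t read_req, inv_nwrite_req, invalidate_req;
};

/* Hit ratio = sum of hit counts / sum of request counts.
 * Returns 0.0 when no requests were counted, to avoid dividing by zero. */
static double l3c_hit_ratio(const struct l3c_counts *c)
{
    assert(c);
    uint64_t hits = c->read_hit + c->inv_nwrite_hit + c->invalidate_hit;
    uint64_t reqs = c->read_req + c->inv_nwrite_req + c->invalidate_req;

    return reqs ? (double)hits / (double)reqs : 0.0;
}
```

The miss ratio is then simply 1.0 minus the returned value whenever at least one request was counted.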

>
> Thanks,
> --
> Pranith

thanks
Ganapat

Re: [PATCH] arm_pmu: Delete incorrect cache event mapping for some armv8_pmuv3 events.

2018-10-04 Thread Ganapatrao Kulkarni

Hi Will,

On Thu, Oct 4, 2018 at 5:51 PM Will Deacon  wrote:
>
> Hi Ganapat,
>
> On Thu, Oct 04, 2018 at 11:12:09AM +0530, Ganapatrao Kulkarni wrote:
> > can you please pull this patch?
>
> I still don't like the idea of just removing events like this, especially
> when other architectures (including some x86 and Power CPUs afaict) play
> similar games for generic events, and these events do actually appear in
> user code.
>
> I also don't understand why you remove the TLB events. I think that logic
> would imply we should remove all of the events, because we can't distinguish
> prefetches from reads either. If we want to be consistent, then I think
> we should just remove the OP_WRITE events for L1D and BPU -- would you be
> ok with that instead?

IIUC, dTLB-load-misses is mapped to
ARMV8_PMUV3_PERFCTR_L1D_TLB_REFILL (event 0x05) and dTLB-loads is
mapped to ARMV8_PMUV3_PERFCTR_L1D_TLB (0x25). As per the spec, both
count TLB accesses/misses for memory-read as well as memory-write
operations.

IMO, it won't help to keep these events knowing that their mapping is
not accurate. The only thing I can say to users is: don't use events
that are marked as "Hardware cache event".

>
> Also, looking at the code, I think our PMCEID parsing is broken for 8.1
> parts, where the upper 32 bits of the register are offset by 0x4000 in the
> event numbering space.

Yes, I did not find any mapping in the PMCEIDx registers for
implementation defined events; otherwise we would have remapped them
at runtime.
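
For reference, the PMCEID layout Will describes can be sketched as a bit-to-event-number conversion. This is a minimal illustration, not the kernel's parsing code: the function names are hypothetical, and it assumes the 64-bit view of PMCEID0/PMCEID1 in which the lower 32 bits cover events 0x00-0x1F and 0x20-0x3F, and the upper 32 bits cover the same ranges offset by 0x4000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map (register index, bit position) to a common event number.
 * reg = 0 for PMCEID0, reg = 1 for PMCEID1. */
static uint16_t pmceid_bit_to_event(unsigned int reg, unsigned int bit)
{
    assert(reg < 2 && bit < 64);
    unsigned int base = reg * 0x20;

    if (bit < 32)
        return (uint16_t)(base + bit);             /* 0x0000..0x003F */
    return (uint16_t)(0x4000 + base + (bit - 32)); /* 0x4000..0x403F */
}

/* Collect the event numbers advertised by one PMCEIDn value.
 * Returns how many entries were written to `events` (at most `max`). */
static size_t pmceid_decode(unsigned int reg, uint64_t val,
                            uint16_t *events, size_t max)
{
    size_t n = 0;

    for (unsigned int bit = 0; bit < 64 && n < max; bit++)
        if (val & (UINT64_C(1) << bit))
            events[n++] = pmceid_bit_to_event(reg, bit);
    return n;
}
```

With this numbering, bit 5 of PMCEID0 is event 0x05 and bit 5 of PMCEID1 is event 0x25, the two TLB events discussed above; treating bit 32 of PMCEID0 as event 0x20 instead of 0x4000 is the kind of parsing error being pointed out.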

>
> Will

thanks
Ganapat

Re: [PATCH] arm_pmu: Delete incorrect cache event mapping for some armv8_pmuv3 events.

2018-10-03 Thread Ganapatrao Kulkarni

Hi Will,

can you please pull this patch?

On Mon, Oct 1, 2018 at 10:09 PM Ganapatrao Kulkarni  wrote:
>
> Hi Will,
>
> On Mon, Oct 1, 2018 at 7:58 PM Will Deacon  wrote:
> >
> > Hi Ganapat,
> >
> > On Mon, Oct 01, 2018 at 10:07:43AM +, Kulkarni, Ganapatrao wrote:
> > > Perf events L1-dcache-load-misses, L1-dcache-store-misses are mapped to
> > > armv8_pmuv3 (both DT and ACPI) event L1D_CACHE_REFILL. This is incorrect,
> > > since L1D_CACHE_REFILL counts both load and store misses.
> > > Similarly the events L1-dcache-loads, L1-dcache-stores, dTLB-load-misses
> > > and dTLB-loads are wrongly mapped. Hence Deleting all these cache events
> > > from armv8_pmuv3 cache mapping.
> > >
> > > Signed-off-by: Ganapatrao Kulkarni 
> > > ---
> > >  arch/arm64/kernel/perf_event.c | 8 
> > >  1 file changed, 8 deletions(-)
> >
> > The "generic" events are really implemented on a best-effort basis, as
> > they rarely tend to map exactly to what the hardware supports. I think
> > they originally stemmed from the x86 CPU PMU, but that doesn't really
> > help us.
>
> This works fairly well for DT-based boots, since almost all SoCs have
> added remapping using a custom DT object binding.
> However, we concluded in the past to drop SoC-specific handling from the
> ACPI mapping and to use JSON to add SoC/micro-architecture-specific
> event support.
> At present, when we boot with ACPI, the mapping is misleading for these events.
>
> In fact, this has been pointed out internally by the benchmark team, who
> reported it as a hardware bug!
> IMHO, it would be much simpler to delete these misleading event
> mappings rather than explain them to perf tool users.
>
> We already have proper mapping for these events,
> armv8_pmuv3_0/l1d_cache_refill/
> armv8_pmuv3_0/l1d_cache/
> [core imp def:]
> l1d_cache_rd
> l1d_cache_wr
> l1d_cache_refill_rd
> l1d_cache_refill_wr
>
> >
> > I had a discussion with Ingo back when we originally implemented perf
> > because I actually preferred not to implement the generic events at all.
> > However, he was strongly of the opinion that a best-effort approach was
> > sufficient to get casual users going with the tool, so that's what we went
> > with.
>
> Thanks for the background info. The generic mapping works fairly well
> except for these events.
>
> >
> > Will
>
> thanks,
> Ganapat

thanks,
Ganapat

Re: [PATCH] arm_pmu: Delete incorrect cache event mapping for some armv8_pmuv3 events.

2018-10-01 Thread Ganapatrao Kulkarni

Hi Will,

On Mon, Oct 1, 2018 at 7:58 PM Will Deacon  wrote:
>
> Hi Ganapat,
>
> On Mon, Oct 01, 2018 at 10:07:43AM +, Kulkarni, Ganapatrao wrote:
> > Perf events L1-dcache-load-misses, L1-dcache-store-misses are mapped to
> > armv8_pmuv3 (both DT and ACPI) event L1D_CACHE_REFILL. This is incorrect,
> > since L1D_CACHE_REFILL counts both load and store misses.
> > Similarly the events L1-dcache-loads, L1-dcache-stores, dTLB-load-misses
> > and dTLB-loads are wrongly mapped. Hence Deleting all these cache events
> > from armv8_pmuv3 cache mapping.
> >
> > Signed-off-by: Ganapatrao Kulkarni 
> > ---
> >  arch/arm64/kernel/perf_event.c | 8 
> >  1 file changed, 8 deletions(-)
>
> The "generic" events are really implemented on a best-effort basis, as
> they rarely tend to map exactly to what the hardware supports. I think
> they originally stemmed from the x86 CPU PMU, but that doesn't really
> help us.

This works fairly well for DT-based boots, since almost all SoCs have
added remapping using a custom DT object binding.
However, we concluded in the past to drop SoC-specific handling from the
ACPI mapping and to use JSON to add SoC/micro-architecture-specific
event support.
At present, when we boot with ACPI, the mapping is misleading for these events.

In fact, this has been pointed out internally by the benchmark team, who
reported it as a hardware bug!
IMHO, it would be much simpler to delete these misleading event
mappings rather than explain them to perf tool users.

We already have proper mapping for these events,
armv8_pmuv3_0/l1d_cache_refill/
armv8_pmuv3_0/l1d_cache/
[core imp def:]
l1d_cache_rd
l1d_cache_wr
l1d_cache_refill_rd
l1d_cache_refill_wr
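
For anyone wanting to count the read/write split events directly rather than through the generic cache events, a minimal sketch of the attribute setup is below. The EV_* names and raw_event_attr() are hypothetical; the event numbers are my reading of the ARMv8 recommended implementation defined events and should be checked against armv8-recommended.json before use.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Assumed event numbers (verify against armv8-recommended.json). */
enum {
    EV_L1D_CACHE_RD        = 0x40,
    EV_L1D_CACHE_WR        = 0x41,
    EV_L1D_CACHE_REFILL_RD = 0x42,
    EV_L1D_CACHE_REFILL_WR = 0x43,
};

/* Fill a perf_event_attr for a raw CPU PMU event, created disabled so
 * that the caller decides when counting starts. */
static void raw_event_attr(struct perf_event_attr *attr, uint64_t event)
{
    assert(attr);
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_RAW;
    attr->size = sizeof(*attr);
    attr->config = event;
    attr->disabled = 1;
}
```

The filled attribute would then be handed to perf_event_open(2) and enabled with the PERF_EVENT_IOC_ENABLE ioctl; the sketch stops short of that so it has no hardware dependency.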

>
> I had a discussion with Ingo back when we originally implemented perf
> because I actually preferred not to implement the generic events at all.
> However, he was strongly of the opinion that a best-effort approach was
> sufficient to get casual users going with the tool, so that's what we went
> with.

Thanks for the background info. The generic mapping works fairly well
except for these events.

>
> Will

thanks,
Ganapat

[tip:perf/core] perf vendor events arm64: Update ThunderX2 implementation defined pmu core events

2018-08-02 Thread tip-bot for Ganapatrao Kulkarni

Commit-ID:  b9b77222d4ff6b5bb8f5d87fca20de0910618bb9
Gitweb: https://git.kernel.org/tip/b9b77222d4ff6b5bb8f5d87fca20de0910618bb9
Author: Ganapatrao Kulkarni 
AuthorDate: Tue, 31 Jul 2018 15:32:51 +0530
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Tue, 31 Jul 2018 11:28:44 -0300

perf vendor events arm64: Update ThunderX2 implementation defined pmu core events

Signed-off-by: Ganapatrao Kulkarni 
Cc: Alexander Shishkin 
Cc: Ganapatrao Kulkarni 
Cc: Jan Glauber 
Cc: Jayachandran C 
Cc: Jiri Olsa 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Mark Rutland 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Robert Richter 
Cc: Vadim Lomovtsev 
Cc: Will Deacon 
Link: http://lkml.kernel.org/r/20180731100251.23575-1-ganapatrao.kulka...@cavium.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 .../arch/arm64/cavium/thunderx2/core-imp-def.json  | 87 +-
 1 file changed, 84 insertions(+), 3 deletions(-)

diff --git a/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json b/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
index bc03c06c3918..752e47eb6977 100644
--- a/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
+++ b/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
@@ -11,6 +11,21 @@
 {
 "ArchStdEvent": "L1D_CACHE_REFILL_WR",
 },
+{
+"ArchStdEvent": "L1D_CACHE_REFILL_INNER",
+},
+{
+"ArchStdEvent": "L1D_CACHE_REFILL_OUTER",
+},
+{
+"ArchStdEvent": "L1D_CACHE_WB_VICTIM",
+},
+{
+"ArchStdEvent": "L1D_CACHE_WB_CLEAN",
+},
+{
+"ArchStdEvent": "L1D_CACHE_INVAL",
+},
 {
 "ArchStdEvent": "L1D_TLB_REFILL_RD",
 },
@@ -23,10 +38,76 @@
 {
 "ArchStdEvent": "L1D_TLB_WR",
 },
+{
+"ArchStdEvent": "L2D_TLB_REFILL_RD",
+},
+{
+"ArchStdEvent": "L2D_TLB_REFILL_WR",
+},
+{
+"ArchStdEvent": "L2D_TLB_RD",
+},
+{
+"ArchStdEvent": "L2D_TLB_WR",
+},
 {
 "ArchStdEvent": "BUS_ACCESS_RD",
-   },
-   {
+},
+{
 "ArchStdEvent": "BUS_ACCESS_WR",
-   }
+},
+{
+"ArchStdEvent": "MEM_ACCESS_RD",
+},
+{
+"ArchStdEvent": "MEM_ACCESS_WR",
+},
+{
+"ArchStdEvent": "UNALIGNED_LD_SPEC",
+},
+{
+"ArchStdEvent": "UNALIGNED_ST_SPEC",
+},
+{
+"ArchStdEvent": "UNALIGNED_LDST_SPEC",
+},
+{
+"ArchStdEvent": "EXC_UNDEF",
+},
+{
+"ArchStdEvent": "EXC_SVC",
+},
+{
+"ArchStdEvent": "EXC_PABORT",
+},
+{
+"ArchStdEvent": "EXC_DABORT",
+},
+{
+"ArchStdEvent": "EXC_IRQ",
+},
+{
+"ArchStdEvent": "EXC_FIQ",
+},
+{
+"ArchStdEvent": "EXC_SMC",
+},
+{
+"ArchStdEvent": "EXC_HVC",
+},
+{
+"ArchStdEvent": "EXC_TRAP_PABORT",
+},
+{
+"ArchStdEvent": "EXC_TRAP_DABORT",
+},
+{
+"ArchStdEvent": "EXC_TRAP_OTHER",
+},
+{
+"ArchStdEvent": "EXC_TRAP_IRQ",
+},
+{
+"ArchStdEvent": "EXC_TRAP_FIQ",
+}
 ]

Re: [PATCH] perf vendor events arm64: Update ThunderX2 implementation defined pmu core events

2018-07-31 Thread Ganapatrao Kulkarni

Hi Arnaldo,


On Tue, Jul 31, 2018 at 10:59 PM, Arnaldo Carvalho de Melo
 wrote:
> On Tue, Jul 31, 2018 at 08:40:51PM +0530, Ganapatrao Kulkarni wrote:
>> Hi Arnaldo,
>>
>> On Tue, Jul 31, 2018 at 7:58 PM, Arnaldo Carvalho de Melo
>>  wrote:
>> > On Tue, Jul 31, 2018 at 03:32:51PM +0530, Ganapatrao Kulkarni wrote:
>> >> Signed-off-by: Ganapatrao Kulkarni 
>> >
>> > Can you please consider to provide an example of such counters being
>> > used, i.e. with a simple C synthetic test that causes these events to
>> > take place, then run it via 'perf stat' to show that indeed, they are
>> > being programmed and read correctly?
>> >
>> > Ideally for all of them, but if that becomes too burdensome, for a few
>> > of them?
>>
>> It may be tedious for all, certainly I will provide the test
>> results/log for some of them(as many as possible).
>
> Right, we do try to test some of the events via 'perf test', for
> instance:
>
> [root@jouet perf]# perf test openat
>  2: Detect openat syscall event   : Ok
>  3: Detect openat syscall event on all cpus   : Ok
> 15: syscalls:sys_enter_openat event fields: Ok
> [root@jouet perf]#

We have not tried perf test; we will look into this test suite to keep
it compliant on our hardware too!

>
> Things like setting up evsels for some events, then forking + calling a
> syscall, then checking if that event appeared on the ring buffer, check
> if the payload for the event, as read using the tracefs format fields
> matches the parameters we passed in the syscall, etc.
>
> See tools/perf/tests/openat-syscall-tp-fields.c for that
> "syscalls:sys_enter_openat event fields" specific source code.
>
> So doing some of these synthetic tests when updating the event files may
> help us in the direction of having tests that run on those specific
> hardwares (ThunderX2 in this case) everytime we run 'perf test', so that
> we can detect failures sooner.
>
> I.e. first write a simple test for one of those events, use it as
> documentation, at some point, as time permits, turn those into a 'perf
> test' entry.

All these events are implemented as per "ARMv8, The Performance
Monitors Extension specification" [1].
A brief explanation of each of these events is already captured in
tools/perf/pmu-events/arch/arm64/armv8-recommended.json

[1] https://static.docs.arm.com/ddi0487/a/DDI0487A_j_armv8_arm.pdf?_ga=2.104377475.2065785066.1533095452-1490247355.1441251141

I have used LTP test cases as the workload to test some of the events;
the log is below:

root@SBR-26>ganapat>> perf stat -e
unaligned_ld_spec,unaligned_st_spec,unaligned_ldst_spec,mem_access_rd,mem_access_wr,armv8_pmuv3_0/mem_access/
ltp/testcases/kernel/mem/mtest001 -p80
mtest01 0  TINFO  :  Total memory already used on system = 11849792 kbytes
mtest01 0  TINFO  :  Total memory used needed to reach maximum =
214325040 kbytes
mtest01 0  TINFO  :  Filling up 80% of ram which is 202475248 kbytes
mtest01 1  TPASS  :  202475248 kbytes allocated only.

 Performance counter stats for 'ltp/testcases/kernel/mem/mtest01/mtest01 -p80':

 2,573  unaligned_ld_spec
 3,976  unaligned_st_spec
 6,549  unaligned_ldst_spec
 1,525,489  mem_access_rd
 1,549,531  mem_access_wr
 3,075,020  armv8_pmuv3_0/mem_access/

   0.006368837 seconds time elapsed

   0.0 seconds user
   0.00639 seconds sys


root@SBR-26>ganapat>> perf stat -e
l1d_cache_refill_rd,l1d_cache_refill_wr,armv8_pmuv3_0/l1d_cache_refill/
./ltp/testcases/kernel/mem/mtest01/mtest01 -p80
mtest01 0  TINFO  :  Total memory already used on system = 11851520 kbytes
mtest01 0  TINFO  :  Total memory used needed to reach maximum =
214325040 kbytes
mtest01 0  TINFO  :  Filling up 80% of ram which is 202473520 kbytes
mtest01 1  TPASS  :  202473520 kbytes allocated only.

 Performance counter stats for
'./ltp/testcases/kernel/mem/mtest01/mtest01 -p80':

   257,128  l1d_cache_refill_rd
   162,151  l1d_cache_refill_wr
   419,279  armv8_pmuv3_0/l1d_cache_refill/

   0.006118645 seconds time elapsed

   0.0 seconds user
   0.006141000 seconds sys


root@SBR-26>ganapat>> perf stat -e exc_svc
./ltp/testcases/kernel/syscalls/brk/brk01
tst_test.c:1015: INFO: Timeout per run is 0h 05m 00s
brk01.c:67: PASS: brk() works fine

Summary:
passed   1
failed   0
skipped  0
warnings 0

 Performance counter stats for './ltp/testcases/kernel/syscalls/brk/brk01':

   100  exc_svc

   0.000887222 seconds time elapsed

   0.00095 seconds user
   0.0 seconds sys


root@SBR-26>ganapat>>
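
As a quick cross-check of the logs above, the read and write variants should add up to the combined counter when all three are counted over the same run. The sketch below encodes the three triples from the log; the struct and function names are hypothetical, and in general a small drift between counters would not be surprising, so an exact match is a property of this run rather than a guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* One read/write/combined triple taken from a perf stat run. */
struct split_sample {
    const char *name;
    uint64_t rd, wr, total;
};

/* Values copied from the perf stat output in the log above. */
static const struct split_sample log_samples[] = {
    { "unaligned_ldst_spec", 2573,    3976,    6549    },
    { "mem_access",          1525489, 1549531, 3075020 },
    { "l1d_cache_refill",    257128,  162151,  419279  },
};

/* Returns 1 when the split counters sum exactly to the combined one. */
static int split_matches_total(const struct split_sample *s)
{
    assert(s);
    return s->rd + s->wr == s->total;
}
```

All three triples from the log satisfy the check, which supports the claim that the read/write events are programmed and read consistently with the combined events.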

Re: [PATCH] perf vendor events arm64: Update ThunderX2 implementation defined pmu core events

2018-07-31 Thread Ganapatrao Kulkarni

Hi Arnaldo,


On Tue, Jul 31, 2018 at 10:59 PM, Arnaldo Carvalho de Melo
 wrote:
> Em Tue, Jul 31, 2018 at 08:40:51PM +0530, Ganapatrao Kulkarni escreveu:
>> Hi Arnaldo,
>>
>> On Tue, Jul 31, 2018 at 7:58 PM, Arnaldo Carvalho de Melo
>>  wrote:
>> > Em Tue, Jul 31, 2018 at 03:32:51PM +0530, Ganapatrao Kulkarni escreveu:
>> >> Signed-off-by: Ganapatrao Kulkarni 
>> >
>> > Can you please consider to provide an example of such counters being
>> > used, i.e. with a simple C synthetic test that causes these events to
>> > take place, then run it via 'perf stat' to show that indeed, they are
>> > being programmed and read correctly?
>> >
>> > Ideally for all of them, but if that becomes too burdensome, for a few
>> > of them?
>>
>> It may be tedious for all, certainly I will provide the test
>> results/log for some of them(as many as possible).
>
> Right, we do try to test some of the events via 'perf test', for
> instance:
>
> [root@jouet perf]# perf test openat
>  2: Detect openat syscall event   : Ok
>  3: Detect openat syscall event on all cpus   : Ok
> 15: syscalls:sys_enter_openat event fields: Ok
> [root@jouet perf]#

we have not tried perf test, will look in to this test suite to keep
it complaint on our hardware too!

>
> Things like setting up evsels for some events, then forking + calling a
> syscall, then checking if that event appeared on the ring buffer, check
> if the payload for the event, as read using the tracefs format fields
> matches the parameters we passed in the syscall, etc.
>
> See tools/perf/tests/openat-syscall-tp-fields.c for that
> "syscalls:sys_enter_openat event fields" specific source code.
>
> So doing some of these synthetic tests when updating the event files may
> help us in the direction of having tests that run on those specific
> hardwares (ThunderX2 in this case) everytime we run 'perf test', so that
> we can detect failures sooner.
>
> I.e. first write a simple test for one of those events, use it as
> documentation, at some point, as time permits, turn those into a 'perf
> test' entry.

All these events are implemented as per "ARMv8, The Performance
Monitors Extension specification" [1].
Brief explanation of each of these events is already captured at
tools/perf/pmu-events/arch/arm64/armv8-recommended.json

[1] 
https://static.docs.arm.com/ddi0487/a/DDI0487A_j_armv8_arm.pdf?_ga=2.104377475.2065785066.1533095452-1490247355.1441251141

i have used ltp testcases as workload to test some of the events and
log is below,

root@SBR-26>ganapat>> perf stat -e
unaligned_ld_spec,unaligned_st_spec,unaligned_ldst_spec,mem_access_rd,mem_access_wr,armv8_pmuv3_0/mem_access/
ltp/testcases/kernel/mem/mtest001 -p80
mtest01 0  TINFO  :  Total memory already used on system = 11849792 kbytes
mtest01 0  TINFO  :  Total memory used needed to reach maximum =
214325040 kbytes
mtest01 0  TINFO  :  Filling up 80% of ram which is 202475248 kbytes
mtest01 1  TPASS  :  202475248 kbytes allocated only.

 Performance counter stats for 'ltp/testcases/kernel/mem/mtest01/mtest01 -p80':

 2,573  unaligned_ld_spec
 3,976  unaligned_st_spec
 6,549  unaligned_ldst_spec
 1,525,489  mem_access_rd
 1,549,531  mem_access_wr
 3,075,020  armv8_pmuv3_0/mem_access/

   0.006368837 seconds time elapsed

   0.0 seconds user
   0.00639 seconds sys


root@SBR-26>ganapat>> perf stat -e
l1d_cache_refill_rd,l1d_cache_refill_wr,armv8_pmuv3_0/l1d_cache_refill/
./ltp/testcases/kernel/mem/mtest01/mtest01 -p80
mtest01 0  TINFO  :  Total memory already used on system = 11851520 kbytes
mtest01 0  TINFO  :  Total memory used needed to reach maximum =
214325040 kbytes
mtest01 0  TINFO  :  Filling up 80% of ram which is 202473520 kbytes
mtest01 1  TPASS  :  202473520 kbytes allocated only.

 Performance counter stats for
'./ltp/testcases/kernel/mem/mtest01/mtest01 -p80':

   257,128  l1d_cache_refill_rd
   162,151  l1d_cache_refill_wr
   419,279  armv8_pmuv3_0/l1d_cache_refill/

   0.006118645 seconds time elapsed

   0.0 seconds user
   0.006141000 seconds sys


root@SBR-26>ganapat>> perf stat -e exc_svc
./ltp/testcases/kernel/syscalls/brk/brk01
tst_test.c:1015: INFO: Timeout per run is 0h 05m 00s
brk01.c:67: PASS: brk() works fine

Summary:
passed   1
failed   0
skipped  0
warnings 0

 Performance counter stats for './ltp/testcases/kernel/syscalls/brk/brk01':

   100  exc_svc

   0.000887222 seconds time elapsed

   0.00095 seconds user
   0.0 seconds sys


root@SBR-26>ganapat>>
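
A synthetic C test of the kind requested earlier in the thread could look
roughly like the sketch below. It is only an illustration for the
unaligned_ld_spec/unaligned_st_spec events: the function name
unaligned_workload() and its parameters are hypothetical, and whether the
compiler really emits unaligned LDR/STR instructions for the memcpy() calls
is an assumption about code generation on arm64, not something the thread
confirms.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical synthetic workload: store and reload 64-bit values at
 * odd byte offsets of a buffer. On arm64 the compiler is free to turn
 * each memcpy() into a single unaligned STR/LDR, which is what the
 * unaligned_*_spec events are meant to count. That is an assumption
 * about code generation, not a guarantee.
 */
uint64_t unaligned_workload(unsigned char *buf, size_t len, unsigned int rounds)
{
	uint64_t sum = 0;

	for (unsigned int r = 0; r < rounds; r++) {
		/* Start at offset 1 so an 8-byte-aligned buffer is always hit misaligned. */
		for (size_t off = 1; off + sizeof(uint64_t) <= len;
		     off += sizeof(uint64_t)) {
			uint64_t v = (uint64_t)off + r;
			uint64_t back;

			memcpy(buf + off, &v, sizeof(v));	/* unaligned store */
			memcpy(&back, buf + off, sizeof(back));	/* unaligned load */
			sum += back;
		}
	}
	return sum;
}
```

Calling this from a small main() with a large buffer and many rounds, and
running the binary under 'perf stat -e unaligned_ld_spec,unaligned_st_spec',
would give a workload whose unaligned accesses are known in advance. The
returned checksum exists only so the accesses are not optimized away.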

Re: [PATCH] perf vendor events arm64: Update ThunderX2 implementation defined pmu core events

2018-07-31 Thread Ganapatrao Kulkarni

Hi Arnaldo,

On Tue, Jul 31, 2018 at 7:58 PM, Arnaldo Carvalho de Melo
 wrote:
> On Tue, Jul 31, 2018 at 03:32:51PM +0530, Ganapatrao Kulkarni wrote:
>> Signed-off-by: Ganapatrao Kulkarni 
>
> Can you please consider to provide an example of such counters being
> used, i.e. with a simple C synthetic test that causes these events to
> take place, then run it via 'perf stat' to show that indeed, they are
> being programmed and read correctly?
>
> Ideally for all of them, but if that becomes too burdensome, for a few
> of them?

Doing it for all of them may be tedious, but I will certainly provide test
results/logs for some of them (as many as possible).

>
> Thanks,
>
> - Arnaldo
>
>> ---
>>  .../arch/arm64/cavium/thunderx2/core-imp-def.json  | 87 +-
>>  1 file changed, 84 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json b/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
>> index bc03c06..752e47e 100644
>> --- a/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
>> +++ b/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
>> @@ -12,6 +12,21 @@
>>  "ArchStdEvent": "L1D_CACHE_REFILL_WR",
>>  },
>>  {
>> +"ArchStdEvent": "L1D_CACHE_REFILL_INNER",
>> +},
>> +{
>> +"ArchStdEvent": "L1D_CACHE_REFILL_OUTER",
>> +},
>> +{
>> +"ArchStdEvent": "L1D_CACHE_WB_VICTIM",
>> +},
>> +{
>> +"ArchStdEvent": "L1D_CACHE_WB_CLEAN",
>> +},
>> +{
>> +"ArchStdEvent": "L1D_CACHE_INVAL",
>> +},
>> +{
>>  "ArchStdEvent": "L1D_TLB_REFILL_RD",
>>  },
>>  {
>> @@ -24,9 +39,75 @@
>>  "ArchStdEvent": "L1D_TLB_WR",
>>  },
>>  {
>> +"ArchStdEvent": "L2D_TLB_REFILL_RD",
>> +},
>> +{
>> +"ArchStdEvent": "L2D_TLB_REFILL_WR",
>> +},
>> +{
>> +"ArchStdEvent": "L2D_TLB_RD",
>> +},
>> +{
>> +"ArchStdEvent": "L2D_TLB_WR",
>> +},
>> +{
>>  "ArchStdEvent": "BUS_ACCESS_RD",
>> -   },
>> -   {
>> +},
>> +{
>>  "ArchStdEvent": "BUS_ACCESS_WR",
>> -   }
>> +},
>> +{
>> +"ArchStdEvent": "MEM_ACCESS_RD",
>> +},
>> +{
>> +"ArchStdEvent": "MEM_ACCESS_WR",
>> +},
>> +{
>> +"ArchStdEvent": "UNALIGNED_LD_SPEC",
>> +},
>> +{
>> +"ArchStdEvent": "UNALIGNED_ST_SPEC",
>> +},
>> +{
>> +"ArchStdEvent": "UNALIGNED_LDST_SPEC",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_UNDEF",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_SVC",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_PABORT",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_DABORT",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_IRQ",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_FIQ",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_SMC",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_HVC",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_TRAP_PABORT",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_TRAP_DABORT",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_TRAP_OTHER",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_TRAP_IRQ",
>> +},
>> +{
>> +"ArchStdEvent": "EXC_TRAP_FIQ",
>> +}
>>  ]
>> --
>> 2.9.4

thanks
Ganapat

[PATCH] perf vendor events arm64: Update ThunderX2 implementation defined pmu core events

2018-07-31 Thread Ganapatrao Kulkarni

Signed-off-by: Ganapatrao Kulkarni 
---
 .../arch/arm64/cavium/thunderx2/core-imp-def.json  | 87 +-
 1 file changed, 84 insertions(+), 3 deletions(-)

diff --git a/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json b/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
index bc03c06..752e47e 100644
--- a/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
+++ b/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
@@ -12,6 +12,21 @@
 "ArchStdEvent": "L1D_CACHE_REFILL_WR",
 },
 {
+"ArchStdEvent": "L1D_CACHE_REFILL_INNER",
+},
+{
+"ArchStdEvent": "L1D_CACHE_REFILL_OUTER",
+},
+{
+"ArchStdEvent": "L1D_CACHE_WB_VICTIM",
+},
+{
+"ArchStdEvent": "L1D_CACHE_WB_CLEAN",
+},
+{
+"ArchStdEvent": "L1D_CACHE_INVAL",
+},
+{
 "ArchStdEvent": "L1D_TLB_REFILL_RD",
 },
 {
@@ -24,9 +39,75 @@
 "ArchStdEvent": "L1D_TLB_WR",
 },
 {
+"ArchStdEvent": "L2D_TLB_REFILL_RD",
+},
+{
+"ArchStdEvent": "L2D_TLB_REFILL_WR",
+},
+{
+"ArchStdEvent": "L2D_TLB_RD",
+},
+{
+"ArchStdEvent": "L2D_TLB_WR",
+},
+{
 "ArchStdEvent": "BUS_ACCESS_RD",
-   },
-   {
+},
+{
 "ArchStdEvent": "BUS_ACCESS_WR",
-   }
+},
+{
+"ArchStdEvent": "MEM_ACCESS_RD",
+},
+{
+"ArchStdEvent": "MEM_ACCESS_WR",
+},
+{
+"ArchStdEvent": "UNALIGNED_LD_SPEC",
+},
+{
+"ArchStdEvent": "UNALIGNED_ST_SPEC",
+},
+{
+"ArchStdEvent": "UNALIGNED_LDST_SPEC",
+},
+{
+"ArchStdEvent": "EXC_UNDEF",
+},
+{
+"ArchStdEvent": "EXC_SVC",
+},
+{
+"ArchStdEvent": "EXC_PABORT",
+},
+{
+"ArchStdEvent": "EXC_DABORT",
+},
+{
+"ArchStdEvent": "EXC_IRQ",
+},
+{
+"ArchStdEvent": "EXC_FIQ",
+},
+{
+"ArchStdEvent": "EXC_SMC",
+},
+{
+"ArchStdEvent": "EXC_HVC",
+},
+{
+"ArchStdEvent": "EXC_TRAP_PABORT",
+},
+{
+"ArchStdEvent": "EXC_TRAP_DABORT",
+},
+{
+"ArchStdEvent": "EXC_TRAP_OTHER",
+},
+{
+"ArchStdEvent": "EXC_TRAP_IRQ",
+},
+{
+"ArchStdEvent": "EXC_TRAP_FIQ",
+}
 ]
-- 
2.9.4

Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

2018-05-21 Thread Ganapatrao Kulkarni

On Mon, May 21, 2018 at 4:10 PM, Mark Rutland <mark.rutl...@arm.com> wrote:
> On Mon, May 21, 2018 at 11:37:12AM +0100, Mark Rutland wrote:
>> Hi Ganapat,
>>
>>
>> Sorry for the delay in replying; I was away most of last week.
>>
>> On Tue, May 15, 2018 at 04:03:19PM +0530, Ganapatrao Kulkarni wrote:
>> > On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni <gklkm...@gmail.com> 
>> > wrote:
>> > > On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <mark.rutl...@arm.com> 
>> > > wrote:
>> > >> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>>
>> > >>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
>> > >>> +{
>> > >>> + int counter;
>> > >>> +
>> > >>> + raw_spin_lock(&pmu_uncore->lock);
>> > >>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
>> > >>> + pmu_uncore->uncore_dev->max_counters);
>> > >>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
>> > >>> + raw_spin_unlock(&pmu_uncore->lock);
>> > >>> + return -ENOSPC;
>> > >>> + }
>> > >>> + set_bit(counter, pmu_uncore->counter_mask);
>> > >>> + raw_spin_unlock(&pmu_uncore->lock);
>> > >>> + return counter;
>> > >>> +}
>> > >>> +
>> > >>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
>> > >>> + int counter)
>> > >>> +{
>> > >>> + raw_spin_lock(&pmu_uncore->lock);
>> > >>> + clear_bit(counter, pmu_uncore->counter_mask);
>> > >>> + raw_spin_unlock(&pmu_uncore->lock);
>> > >>> +}
>> > >>
>> > >> I don't believe that locking is required in either of these, as the perf
>> > >> core serializes pmu::add() and pmu::del(), where these get called.
>> >
>> > without this locking, i am seeing "BUG: scheduling while atomic" when
>> > i run perf with more events together than the maximum counters
>> > supported
>>
>> Did you manage to get to the bottom of this?
>>
>> Do you have a backtrace?
>>
>> It looks like in your latest posting you reserve counters through the
>> userspace ABI, which doesn't seem right to me, and I'd like to
>> understand the problem.
>
> Looks like I misunderstood -- those are still allocated kernel-side.
>
> I'll follow that up in the v5 posting.

please review v5.
>
> Thanks,
> Mark.

thanks
Ganapat
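
For illustration, the allocation scheme discussed in this thread (take the
lowest free bit in a mask, return -ENOSPC when every counter is in use) can
be sketched in plain userspace C. This is a minimal sketch, not the driver
code: struct counter_pool, pool_alloc_counter() and pool_free_counter() are
hypothetical names, MAX_COUNTERS is fixed at the 4 counters per PMU
mentioned in the v5 description, and no locking is shown on the assumption,
stated by Mark, that the perf core already serializes pmu::add() and
pmu::del().

```c
#include <errno.h>
#include <stdint.h>

#define MAX_COUNTERS 4	/* counters per PMU, per the v5 patch description */

/* Hypothetical stand-in for the per-channel counter bookkeeping. */
struct counter_pool {
	uint32_t mask;	/* bit n set => counter n is in use */
};

/* Return the lowest free counter index, or -ENOSPC when all are taken. */
int pool_alloc_counter(struct counter_pool *pool)
{
	for (int i = 0; i < MAX_COUNTERS; i++) {
		if (!(pool->mask & (1u << i))) {
			pool->mask |= 1u << i;
			return i;
		}
	}
	return -ENOSPC;
}

/* Mark a counter free again; out-of-range indexes are ignored. */
void pool_free_counter(struct counter_pool *pool, int counter)
{
	if (counter >= 0 && counter < MAX_COUNTERS)
		pool->mask &= ~(1u << counter);
}
```

Requesting more events than there are counters simply yields -ENOSPC on the
fifth allocation, which is the case reported as triggering the "scheduling
while atomic" splat in the original code.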

Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

2018-05-21 Thread Ganapatrao Kulkarni

Hi Mark,

On Mon, May 21, 2018 at 4:25 PM, Mark Rutland <mark.rutl...@arm.com> wrote:
> On Sat, May 05, 2018 at 12:16:13AM +0530, Ganapatrao Kulkarni wrote:
>> On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <mark.rutl...@arm.com> wrote:
>> > On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>
>> >> + *
>> >> + *  L3 Tile and DMC channel selection is through SMC call
>> >> + *  SMC call arguments,
>> >> + *   x0 = THUNDERX2_SMC_CALL_ID  (Vendor SMC call Id)
>> >> + *   x1 = THUNDERX2_SMC_SET_CHANNEL  (Id to set DMC/L3C channel)
>> >> + *   x2 = Node id
>> >
>> > How do we map Linux node IDs to the firmware's view of node IDs?
>> >
>> > I don't believe the two are necessarily the same -- Linux's node IDs are
>> > a Linux-specific construct.
>>
>> both are same, it is numa node id from ACPI/firmware.
>
> I am very wary about assuming that the Linux nid will always be the same
> as the ACPI node id.
>
> For that to *potentially* be true, this driver should depend on
> CONFIG_NUMA, NUMA must not be disabled on the command line, etc, or the
> node id will always be NUMA_NO_NODE.

OK, I can check the node id that we get from the ACPI helpers in probe.
If it is NUMA_NO_NODE, should I init only the first socket's uncore and
always pass zero as the nid param to the firmware?

>
> I would be *much* happier if we had an explicit mapping somewhere to the
> ID the FW expects.
>
>> > It would be much nicer if we could pass something based on the MPIDR,
>> > which is a known HW construct, or if this implicitly affected the
>> > current node.
>>
>> IMO,  node id is sufficient.
>
> I agree that *a* node ID is sufficient, I just don't think that we're
> guaranteed to have the specific node ID the FW wants.

For ThunderX2, which is a 2-socket-only platform, the PXM and nid should be
the same (either 0 or 1).
However, I can send the PXM id (node_to_pxm) to firmware to make it more sane.

>
>> > It would be vastly more sane for this to not be muxed at all. :/
>>
>> i am helpless due to crappy hw design!
>
> I'm certainly not blaming you for this! :)
>
> I hope the HW designers don't make the same mistake in future, though...
>
> Thanks,
> Mark.

thanks
Ganapat
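
The explicit mapping Mark asks for, combined with the NUMA_NO_NODE fallback
proposed in the reply, might be sketched as follows. Everything here is
illustrative: fw_node_id() and the nid_to_fw_id[] table are hypothetical,
the table is hard-coded rather than filled from ACPI (for example through
node_to_pxm()), and the choice of socket 0 as the fallback is taken from the
suggestion in this message rather than from any agreed design.

```c
#define NUMA_NO_NODE	(-1)	/* same value the kernel uses */
#define MAX_SOCKETS	2	/* ThunderX2 is described as a 2-socket platform */

/*
 * Hypothetical explicit table from Linux node id to the id the firmware
 * expects. A real driver would populate this from ACPI; it is hard-coded
 * here for illustration only.
 */
static const int nid_to_fw_id[MAX_SOCKETS] = { 0, 1 };

/*
 * Translate a Linux nid to a firmware node id.
 * NUMA_NO_NODE falls back to socket 0; other out-of-range ids return -1.
 */
int fw_node_id(int nid)
{
	if (nid == NUMA_NO_NODE)
		return nid_to_fw_id[0];
	if (nid < 0 || nid >= MAX_SOCKETS)
		return -1;
	return nid_to_fw_id[nid];
}
```

Keeping the translation in one function means the value passed in x2 of the
SMC call never depends on Linux's own node numbering by accident.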

Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA

2018-05-20 Thread Ganapatrao Kulkarni

On Thu, Apr 26, 2018 at 3:15 PM, Ganapatrao Kulkarni <gklkm...@gmail.com> wrote:
> Hi Robin,
>
> On Mon, Apr 23, 2018 at 11:11 PM, Ganapatrao Kulkarni
> <gklkm...@gmail.com> wrote:
>> On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.mur...@arm.com> wrote:
>>> On 19/04/18 18:12, Ganapatrao Kulkarni wrote:
>>>>
>>>> The performance drop is observed with long hours iperf testing using 40G
>>>> cards. This is mainly due to long iterations in finding the free iova
>>>> range in 32bit address space.
>>>>
>>>> In current implementation for 64bit PCI devices, there is always first
>>>> attempt to allocate iova from 32bit(SAC preferred over DAC) address
>>>> range. Once we run out 32bit range, there is allocation from higher range,
>>>> however due to cached32_node optimization it does not suppose to be
>>>> painful. cached32_node always points to recently allocated 32-bit node.
>>>> When address range is full, it will be pointing to last allocated node
>>>> (leaf node), so walking rbtree to find the available range is not
>>>> expensive affair. However this optimization does not behave well when
>>>> one of the middle node is freed. In that case cached32_node is updated
>>>> to point to next iova range. The next iova allocation will consume free
>>>> range and again update cached32_node to itself. From now on, walking
>>>> over 32-bit range is more expensive.
>>>>
>>>> This patch adds fix to update cached node to leaf node when there are no
>>>> iova free range left, which avoids unnecessary long iterations.
>>>
>>>
>>> The only trouble with this is that "allocation failed" doesn't uniquely mean
>>> "space full". Say that after some time the 32-bit space ends up empty except
>>> for one page at 0x1000 and one at 0x8000, then somebody tries to
>>> allocate 2GB. If we move the cached node down to the leftmost entry when
>>> that fails, all subsequent allocation attempts are now going to fail despite
>>> the space being 99.% free!
>>>
>>> I can see a couple of ways to solve that general problem of free space above
>>> the cached node getting lost, but neither of them helps with the case where
>>> there is genuinely insufficient space (and if anything would make it even
>>> slower). In terms of the optimisation you want here, i.e. fail fast when an
>>> allocation cannot possibly succeed, the only reliable idea which comes to
>>> mind is free-PFN accounting. I might give that a go myself to see how ugly
>>> it looks.
>
> For this testing, dual port intel 40G card(XL710) used and both ports
> were connected in loop-back. Ran iperf server and clients on both
> ports(used NAT to route packets out on intended ports).There were 10
> iperf clients invoked every 60 seconds in loop for hours for each
> port. Initially the performance on both ports is seen close to line
> rate, however after test ran about 4 to 6 hours, the performance
> started dropping  to very low (to few hundred Mbps) on both
> connections.
>
> IMO,  this is common bug and should happen on any other platforms too
> and needs to be fixed at the earliest.
> Please let me know if you have better way to fix this,  i am happy to
> test your patch!

Any update on this issue?
>
>>
>> i see 2 problems in current implementation,
>> 1. We don't replenish the 32 bits range, until first attempt of second
>> allocation(64 bit) fails.
>> 2. Having  per cpu cache might not yield good hit on platforms with
>> more number of CPUs.
>>
>> however irrespective of current issues, It makes sense to update
>> cached node as done in this patch , when there is failure to get iova
>> range using current cached pointer which is forcing for the
>> unnecessary time consuming do-while iterations until any replenish
>> happens!
>>
>> thanks
>> Ganapat
>>
>>>
>>> Robin.
>>>
>>>
>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>>>> ---
>>>>   drivers/iommu/iova.c | 6 ++
>>>>   1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>>> index 83fe262..e6ee2ea 100644
>>>> --- a/drivers/iommu/iova.c
>>>> +++ b/drivers/iommu/iova.c
>>>> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct
>>>> iova_domain *iovad,
>>>> } while (curr && new_pfn <= curr_iova->pfn_hi);
>>>> if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>>>> +   /* No more cached node points to free hole, update to leaf node.
>>>> +*/
>>>> +   struct iova *prev_iova;
>>>> +
>>>> +   prev_iova = rb_entry(prev, struct iova, node);
>>>> +   __cached_rbnode_insert_update(iovad, prev_iova);
>>>> spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>>> return -ENOMEM;
>>>> }
>>>>
>>>
>
> thanks
> Ganapat
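
The free-PFN accounting idea Robin mentions can be illustrated with a small
userspace sketch. It is not the iova.c implementation and makes strong
simplifying assumptions: struct pfn_space, pfn_space_reserve() and
pfn_space_release() are hypothetical names, only a page count is tracked,
and no address is ever chosen. The point is just the fail-fast check: a
request larger than the number of free pages can be rejected without walking
the rbtree at all.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for an IOVA domain. */
struct pfn_space {
	uint64_t total_pfns;	/* size of the 32-bit window, in pages */
	uint64_t free_pfns;	/* pages not currently allocated */
};

/*
 * Fail-fast accounting: a request larger than the number of free pages
 * cannot succeed, so return -ENOMEM immediately. A request that passes
 * this check could still fail in a real allocator because of
 * fragmentation; this sketch only does the bookkeeping.
 */
int pfn_space_reserve(struct pfn_space *s, uint64_t size)
{
	if (size == 0 || size > s->free_pfns)
		return -ENOMEM;
	s->free_pfns -= size;
	return 0;
}

/* Return pages to the pool, clamping so the count never exceeds the total. */
void pfn_space_release(struct pfn_space *s, uint64_t size)
{
	s->free_pfns += size;
	if (s->free_pfns > s->total_pfns)
		s->free_pfns = s->total_pfns;
}
```

Unlike moving the cached node on failure, this check cannot hide free space:
in Robin's example of a nearly empty 32-bit range, a later small allocation
would still pass the accounting and go on to search the tree.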

Re: [PATCH v5 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

2018-05-18 Thread Ganapatrao Kulkarni

On Thu, May 17, 2018 at 4:42 PM, John Garry <john.ga...@huawei.com> wrote:
> On 16/05/2018 05:55, Ganapatrao Kulkarni wrote:
>>
>> This patch adds a perf driver for the PMU UNCORE devices DDR4 Memory
>> Controller(DMC) and Level 3 Cache(L3C).
>>
>
> Hi,
>
> Just some coding comments below:
>
>> ThunderX2 has 8 independent DMC PMUs to capture performance events
>> corresponding to 8 channels of DDR4 Memory Controller and 16 independent
>> L3C PMUs to capture events corresponding to 16 tiles of L3 cache.
>> Each PMU supports up to 4 counters. All counters lack overflow interrupt
>> and are sampled periodically.
>>
>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>> ---
>>  drivers/perf/Kconfig |   8 +
>>  drivers/perf/Makefile|   1 +
>>  drivers/perf/thunderx2_pmu.c | 965
>> +++
>>  include/linux/cpuhotplug.h   |   1 +
>>  4 files changed, 975 insertions(+)
>>  create mode 100644 drivers/perf/thunderx2_pmu.c
>>
>> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
>> index 28bb5a0..eafd0fc 100644
>> --- a/drivers/perf/Kconfig
>> +++ b/drivers/perf/Kconfig
>> @@ -85,6 +85,14 @@ config QCOM_L3_PMU
>>Adds the L3 cache PMU into the perf events subsystem for
>>monitoring L3 cache events.
>>
>> +config THUNDERX2_PMU
>> +bool "Cavium ThunderX2 SoC PMU UNCORE"
>> +depends on ARCH_THUNDER2 && PERF_EVENTS && ACPI
>
>
> Is the explicit dependency for PERF_EVENTS required, since we're under the
> PERF_EVENTS menu?

Not really; I can drop this.
>
> And IIRC for other perf drivers we required a dependency on ARM64 - is that
> required here also? I see arm_smccc_smc() calls in the code...

ok.
>
>
>> +   help
>> + Provides support for ThunderX2 UNCORE events.
>> + The SoC has PMU support in its L3 cache controller (L3C) and
>> + in the DDR4 Memory Controller (DMC).
>> +
>>  config XGENE_PMU
>>  depends on ARCH_XGENE
>>  bool "APM X-Gene SoC PMU"
>> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
>> index b3902bd..909f27f 100644
>> --- a/drivers/perf/Makefile
>> +++ b/drivers/perf/Makefile
>> @@ -7,5 +7,6 @@ obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o
>>  obj-$(CONFIG_HISI_PMU) += hisilicon/
>>  obj-$(CONFIG_QCOM_L2_PMU)  += qcom_l2_pmu.o
>>  obj-$(CONFIG_QCOM_L3_PMU) += qcom_l3_pmu.o
>> +obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
>>  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
>>  obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
>> diff --git a/drivers/perf/thunderx2_pmu.c b/drivers/perf/thunderx2_pmu.c
>> new file mode 100644
>> index 000..0401443
>> --- /dev/null
>> +++ b/drivers/perf/thunderx2_pmu.c
>> @@ -0,0 +1,965 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * CAVIUM THUNDERX2 SoC PMU UNCORE
>> + *
>> + * Copyright (C) 2018 Cavium Inc.
>> + * Author: Ganapatrao Kulkarni <gkulka...@cavium.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
>> + */
>
>
> Isn't this the same as the SPDX?

OK, I will remove it.
>
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +/* L3c and DMC has 16 and 8 channels per socket respectively.
>
>
> L3C, right?

ok
>
>> + * Each Channel supports UNCORE PMU device and consists of
>> + * 4 independent programmable counters. Counters are 32 bit
>> + * and does not support overflow interrupt, they needs to be
>
>
> /s/needs/need/, /s/does/do/

ok
>
>> + * sampled before overflow(i.e, at every 2 seconds).
>
>
> how can you ensure that this value is low enough?
>
> I saw this comment in the previous patch:
>> Given that all channels compete for access to the muxed register
>> interface, I suspect we need to

[PATCH v5 1/2] perf: uncore: Adding documentation for ThunderX2 pmu uncore driver

2018-05-15 Thread Ganapatrao Kulkarni

Documentation for the UNCORE PMUs on Cavium's ThunderX2 SoC.
The SoC has PMU support in its L3 cache controller (L3C) and in the
DDR4 Memory Controller (DMC).

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
---
 Documentation/perf/thunderx2-pmu.txt | 66 
 1 file changed, 66 insertions(+)
 create mode 100644 Documentation/perf/thunderx2-pmu.txt

diff --git a/Documentation/perf/thunderx2-pmu.txt b/Documentation/perf/thunderx2-pmu.txt
new file mode 100644
index 000..7d89935
--- /dev/null
+++ b/Documentation/perf/thunderx2-pmu.txt
@@ -0,0 +1,66 @@
+
+Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
+=============================================================
+
+ThunderX2 SoC PMU consists of independent system wide per Socket PMUs such
+as Level 3 Cache(L3C) and DDR4 Memory Controller(DMC).
+
+It has 8 independent DMC PMUs to capture performance events corresponding
+to 8 channels of DDR4 Memory Controller. There are 16 independent L3C PMUs
+to capture events corresponding to 16 tiles of L3 cache. Each PMU supports
+up to 4 counters.
+
+Counters are independently programmable and can be started and stopped
+individually. Each counter can be set to sample specific perf events.
+Counters are 32 bit and do not support overflow interrupt; they are
+sampled every 2 seconds. Access to the counter registers is multiplexed
+across the channels of DMC and L3C. The muxing (channel select) is done by
+writing to a Secure register using SMC calls.
+
+PMU UNCORE (perf) driver:
+
+The thunderx2-pmu driver registers several perf PMUs for DMC and L3C devices.
+Each of the PMUs provides description of its available events
+and configuration options in sysfs.
+   see /sys/devices/uncore_
+
+S is socket id and X represents channel number.
+Each PMU can be used to sample up to 4 events simultaneously.
+
+The "format" directory describes format of the config (event ID).
+The "events" directory provides configuration templates for all
+supported event types that can be used with perf tool.
+
+For example, "uncore_dmc_0_0/cnt_cycles/" is an
+equivalent of "uncore_dmc_0_0/config=0x1/".
+
+Each perf driver also provides a "cpumask" sysfs attribute, which contains a
+single CPU ID of the processor which is likely to be used to handle all the
+PMU events. It will be the first online CPU from the NUMA node of PMU device.
+
+Example for perf tool use:
+
+perf stat -a -e \
+uncore_dmc_0_0/cnt_cycles/,\
+uncore_dmc_0_1/cnt_cycles/,\
+uncore_dmc_0_2/cnt_cycles/,\
+uncore_dmc_0_3/cnt_cycles/,\
+uncore_dmc_0_4/cnt_cycles/,\
+uncore_dmc_0_5/cnt_cycles/,\
+uncore_dmc_0_6/cnt_cycles/,\
+uncore_dmc_0_7/cnt_cycles/ sleep 1
+
+perf stat -a -e \
+uncore_dmc_0_0/cancelled_read_txns/,\
+uncore_dmc_0_0/cnt_cycles/,\
+uncore_dmc_0_0/consumed_read_txns/,\
+uncore_dmc_0_0/data_transfers/ sleep 1
+
+perf stat -a -e \
+uncore_l3c_0_0/l3_retry/,\
+uncore_l3c_0_0/read_hit/,\
+uncore_l3c_0_0/read_request/,\
+uncore_l3c_0_0/inv_request/ sleep 1
+
+The driver does not support sampling, therefore "perf record" will
+not work. Per-task (without "-a") perf sessions are not supported.
-- 
2.9.4

[PATCH v5 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

2018-05-15 Thread Ganapatrao Kulkarni

This patch adds a perf driver for the PMU UNCORE devices DDR4 Memory
Controller(DMC) and Level 3 Cache(L3C).

ThunderX2 has 8 independent DMC PMUs to capture performance events
corresponding to 8 channels of DDR4 Memory Controller and 16 independent
L3C PMUs to capture events corresponding to 16 tiles of L3 cache.
Each PMU supports up to 4 counters. All counters lack overflow interrupt
and are sampled periodically.

Signed-off-by: Ganapatrao Kulkarni 
---
 drivers/perf/Kconfig |   8 +
 drivers/perf/Makefile|   1 +
 drivers/perf/thunderx2_pmu.c | 965 +++
 include/linux/cpuhotplug.h   |   1 +
 4 files changed, 975 insertions(+)
 create mode 100644 drivers/perf/thunderx2_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 28bb5a0..eafd0fc 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -85,6 +85,14 @@ config QCOM_L3_PMU
   Adds the L3 cache PMU into the perf events subsystem for
   monitoring L3 cache events.
 
+config THUNDERX2_PMU
+bool "Cavium ThunderX2 SoC PMU UNCORE"
+depends on ARCH_THUNDER2 && PERF_EVENTS && ACPI
+   help
+ Provides support for ThunderX2 UNCORE events.
+ The SoC has PMU support in its L3 cache controller (L3C) and
+ in the DDR4 Memory Controller (DMC).
+
 config XGENE_PMU
 depends on ARCH_XGENE
 bool "APM X-Gene SoC PMU"
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index b3902bd..909f27f 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -7,5 +7,6 @@ obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o
 obj-$(CONFIG_HISI_PMU) += hisilicon/
 obj-$(CONFIG_QCOM_L2_PMU)  += qcom_l2_pmu.o
 obj-$(CONFIG_QCOM_L3_PMU) += qcom_l3_pmu.o
+obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
 obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
 obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
diff --git a/drivers/perf/thunderx2_pmu.c b/drivers/perf/thunderx2_pmu.c
new file mode 100644
index 000..0401443
--- /dev/null
+++ b/drivers/perf/thunderx2_pmu.c
@@ -0,0 +1,965 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CAVIUM THUNDERX2 SoC PMU UNCORE
+ *
+ * Copyright (C) 2018 Cavium Inc.
+ * Author: Ganapatrao Kulkarni 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* L3c and DMC has 16 and 8 channels per socket respectively.
+ * Each Channel supports UNCORE PMU device and consists of
+ * 4 independent programmable counters. Counters are 32 bit
+ * and does not support overflow interrupt, they needs to be
+ * sampled before overflow(i.e, at every 2 seconds).
+ */
+
+#define UNCORE_MAX_COUNTERS		4
+#define UNCORE_L3_MAX_TILES		16
+#define UNCORE_DMC_MAX_CHANNELS		8
+
+#define UNCORE_HRTIMER_INTERVAL		(2 * NSEC_PER_SEC)
+#define GET_EVENTID(ev)			((ev->hw.config) & 0x1ff)
+#define GET_COUNTERID(ev)		((ev->hw.idx) & 0xf)
+#define GET_CHANNELID(pmu_uncore)	(pmu_uncore->channel)
+#define DMC_EVENT_CFG(idx, val)		((val) << (((idx) * 8) + 1))
+
+#define DMC_COUNTER_CTL			0x234
+#define DMC_COUNTER_DATA		0x240
+#define L3C_COUNTER_CTL			0xA8
+#define L3C_COUNTER_DATA		0xAC
+
+#define THUNDERX2_SMC_CALL_ID		0xC200FF00
+#define THUNDERX2_SMC_SET_CHANNEL	0xB010
+
+enum thunderx2_uncore_l3_events {
+   L3_EVENT_NONE,
+   L3_EVENT_NBU_CANCEL,
+   L3_EVENT_DIB_RETRY,
+   L3_EVENT_DOB_RETRY,
+   L3_EVENT_DIB_CREDIT_RETRY,
+   L3_EVENT_DOB_CREDIT_RETRY,
+   L3_EVENT_FORCE_RETRY,
+   L3_EVENT_IDX_CONFLICT_RETRY,
+   L3_EVENT_EVICT_CONFLICT_RETRY,
+   L3_EVENT_BANK_CONFLICT_RETRY,
+   L3_EVENT_FILL_ENTRY_RETRY,
+   L3_EVENT_EVICT_NOT_READY_RETRY,
+   L3_EVENT_L3_RETRY,
+   L3_EVENT_READ_REQ,
+   L3_EVENT_WRITE_BACK_REQ,
+   L3_EVENT_INVALIDATE_NWRITE_REQ,
+   L3_EVENT_INV_REQ,
+   L3_EVENT_SELF_REQ,
+   L3_EVENT_REQ,
+   L3_EVENT_EVICT_REQ,
+   L3_EVENT_INVALIDATE_NWRITE_HIT,
+   L3_EVENT_INVALIDATE_HIT,
+   L3_EVENT_SELF_HIT,
+   L3_EVENT_READ_HIT,
+   L3_EVENT_MAX,
+};
+
+enum thunderx2_uncore_dmc_events {
+   DMC_EVENT_NONE,
+   DMC_EVENT_COUNT_CYCLES,
+   DMC_EVENT_RES2,
+

[PATCH v5 0/2] Add ThunderX2 SoC Performance Monitoring Unit driver

2018-05-15 Thread Ganapatrao Kulkarni

This patchset adds PMU driver for Cavium's ThunderX2 SoC UNCORE devices.
The SoC has PMU support in L3 cache controller (L3C) and in the
DDR4 Memory Controller (DMC).

v5:
 - Incorporated review comments from Mark Rutland[2]
v4:
 - Incorporated review comments from Mark Rutland[1]

[1] https://www.spinics.net/lists/arm-kernel/msg588563.html
[2] https://lkml.org/lkml/2018/4/26/376

v3:
 - fixed warning reported by kbuild robot

v2:
 - rebased to 4.12-rc1
 - Removed Arch VULCAN dependency.
 - update SMC call parameters as per latest firmware.

v1:
 -Initial patch

Ganapatrao Kulkarni (2):
  perf: uncore: Adding documentation for ThunderX2 pmu uncore driver
  ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

 Documentation/perf/thunderx2-pmu.txt |  66 +++
 drivers/perf/Kconfig |   8 +
 drivers/perf/Makefile|   1 +
 drivers/perf/thunderx2_pmu.c | 965 +++
 include/linux/cpuhotplug.h   |   1 +
 5 files changed, 1041 insertions(+)
 create mode 100644 Documentation/perf/thunderx2-pmu.txt
 create mode 100644 drivers/perf/thunderx2_pmu.c

-- 
2.9.4

Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

2018-05-15 Thread Ganapatrao Kulkarni

Hi Mark,


On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni <gklkm...@gmail.com> wrote:
> Hi Mark,
>
> On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <mark.rutl...@arm.com> wrote:
>> Hi,
>>
>> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>>> +
>>> +/* L3c and DMC has 16 and 8 channels per socket respectively.
>>> + * Each Channel supports UNCORE PMU device and consists of
>>> + * 4 independent programmable counters. Counters are 32 bit
>>> + * and does not support overflow interrupt, they needs to be
>>> + * sampled before overflow(i.e, at every 2 seconds).
>>> + */
>>> +
>>> +#define UNCORE_MAX_COUNTERS  4
>>> +#define UNCORE_L3_MAX_TILES  16
>>> +#define UNCORE_DMC_MAX_CHANNELS  8
>>> +
>>> +#define UNCORE_HRTIMER_INTERVAL  (2 * NSEC_PER_SEC)
>>
>> How was a period of two seconds chosen?
>
> The hardware team suggested sampling before 3 to 4 seconds elapse.
>
>>
>> What's the maximum clock speed for the L3C and DMC?
>
> L3C at 1.5GHz and DMC at 1.2GHz
>>
>> Given that all channels compete for access to the muxed register
>> interface, I suspect we need to try more often than once every 2
>> seconds...
>
> 2 seconds seems to be sufficient. So far testing looks good.
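
As a rough check of the numbers in this exchange, here is a minimal sketch of the wrap-time arithmetic. It assumes the worst case of one increment per clock cycle; the helper names are made up for illustration and are not part of the driver. With the clocks quoted above, a 32-bit counter wraps after about 2.86 s at 1.5GHz (L3C) and about 3.58 s at 1.2GHz (DMC), which matches the 3 to 4 second guidance and leaves under a second of margin for the L3C at a 2 second period.

```c
#include <stdint.h>

/*
 * Seconds until a free-running 32-bit counter wraps, assuming the
 * worst case of one increment per clock cycle (an assumption; real
 * event rates are normally lower).
 */
static double counter_wrap_seconds(double clock_hz)
{
	return (double)(UINT64_C(1) << 32) / clock_hz;
}

/*
 * Nonzero if polling every period_s seconds always reads the counter
 * before it can wrap at the given clock. Timer latency is ignored.
 */
static int period_is_safe(double period_s, double clock_hz)
{
	return period_s < counter_wrap_seconds(clock_hz);
}
```

This does not model contention on the muxed register interface, so it only bounds the best case; any delay in servicing the timer eats into the margin.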
>
>>
>> [...]
>>
>>> +struct active_timer {
>>> + struct perf_event *event;
>>> + struct hrtimer hrtimer;
>>> +};
>>> +
>>> +/*
>>> + * pmu on each socket has 2 uncore devices(dmc and l3),
>>> + * each uncore device has up to 16 channels, each channel can sample
>>> + * events independently with counters up to 4.
>>> + *
>>> + * struct thunderx2_pmu_uncore_channel created per channel.
>>> + * struct thunderx2_pmu_uncore_dev per uncore device.
>>> + */
>>> +struct thunderx2_pmu_uncore_channel {
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + struct pmu pmu;
>>
>> Can we put the pmu first in the struct, please?
>
> ok
>>
>>> + int counter;
>>
>> AFAICT, this counter field is never used.
>
> hmm ok, will remove.
>>
>>> + int channel;
>>> + DECLARE_BITMAP(counter_mask, UNCORE_MAX_COUNTERS);
>>> + struct active_timer *active_timers;
>>
>> You should only need a single timer per channel, rather than one per
>> event.
>>
>> I think you can get rid of the active_timer structure, and have:
>>
>> struct perf_event *events[UNCORE_MAX_COUNTERS];
>> struct hrtimer timer;
>>
>
> thanks, will change as suggested.
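
To illustrate the layout suggested above (one events[] array and a single timer per channel), here is a small userspace model. It is only a sketch: struct model_event and struct model_channel are hypothetical stand-ins for struct perf_event and the channel structure, and the raw[] array stands in for register reads. The poll function is what the body of the single hrtimer callback could look like.

```c
#include <stddef.h>
#include <stdint.h>

#define UNCORE_MAX_COUNTERS 4

/* Hypothetical stand-in for struct perf_event: only what a poll needs. */
struct model_event {
	uint32_t prev_raw;	/* raw counter value seen at the last poll */
	uint64_t total;		/* accumulated 64-bit event count */
};

/* One channel: an event slot per hardware counter, polled together. */
struct model_channel {
	struct model_event *events[UNCORE_MAX_COUNTERS];
};

/*
 * Fold a new raw reading into the running total. Unsigned 32-bit
 * subtraction gives the correct delta across a single wrap.
 */
static void model_event_update(struct model_event *ev, uint32_t raw)
{
	uint32_t delta = raw - ev->prev_raw;

	ev->prev_raw = raw;
	ev->total += delta;
}

/* Body of the single per-channel timer callback: visit every active slot. */
static void model_channel_poll(struct model_channel *ch,
			       const uint32_t raw[UNCORE_MAX_COUNTERS])
{
	for (size_t i = 0; i < UNCORE_MAX_COUNTERS; i++) {
		if (ch->events[i])
			model_event_update(ch->events[i], raw[i]);
	}
}
```

The delta is only correct if the counter wrapped at most once between polls, which is why the polling period has to stay below the wrap time.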
>
>>> + /* to sync counter alloc/release */
>>> + raw_spinlock_t lock;
>>> +};
>>> +
>>> +struct thunderx2_pmu_uncore_dev {
>>> + char *name;
>>> + struct device *dev;
>>> + enum thunderx2_uncore_type type;
>>> + unsigned long base;
>>
>> This should be:
>>
>> void __iomem *base;
>
> ok
>>
>> [...]
>>
>>> +static ssize_t cpumask_show(struct device *dev,
>>> + struct device_attribute *attr, char *buf)
>>> +{
>>> + struct cpumask cpu_mask;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
>>> + pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev));
>>> +
>>> + /* Pick first online cpu from the node */
>>> + cpumask_clear(&cpu_mask);
>>> + cpumask_set_cpu(cpumask_first(
>>> + cpumask_of_node(pmu_uncore->uncore_dev->node)),
>>> + &cpu_mask);
>>> +
>>> + return cpumap_print_to_pagebuf(true, buf, &cpu_mask);
>>> +}
>>
>> AFAICT cpumask_of_node() returns a mask that can include offline CPUs.
>>
>> Regardless, I don't think that you can keep track of the management CPU
>> this way. Please keep track of the CPU the PMU should be managed by,
>> either with a cpumask or int embedded within the PMU structure.
>
> thanks, will add hotplug callbacks.
>>
>> At hotplug time, you'll need to update the management CPU, calling
>> perf_pmu_migrate_context() when it is offlined.
>
> ok
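
The bookkeeping requested above can be sketched in plain C as follows. This is a userspace model under stated assumptions, not the driver code: CPUs are bits in a 64-bit mask, struct model_pmu and the function names are hypothetical, and the point where the real driver would call perf_pmu_migrate_context() is marked in a comment.

```c
#include <stdint.h>

/* Index of the lowest set bit, or -1 when the mask is empty. */
static int first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/* Hypothetical per-PMU state: the CPU currently managing the PMU. */
struct model_pmu {
	uint64_t node_cpus;	/* CPUs belonging to the PMU's NUMA node */
	int cpu;		/* management CPU, -1 if none */
};

/* Pick the first CPU that is both in the node and online. */
static void model_pmu_init(struct model_pmu *pmu, uint64_t node_cpus,
			   uint64_t online)
{
	pmu->node_cpus = node_cpus;
	pmu->cpu = first_cpu(node_cpus & online);
}

/*
 * Offline callback: if the departing CPU manages the PMU, hand over to
 * another online CPU of the node. The real driver would call
 * perf_pmu_migrate_context() at this point. Returns the management CPU.
 */
static int model_pmu_cpu_offline(struct model_pmu *pmu, int cpu,
				 uint64_t online)
{
	if (pmu->cpu != cpu)
		return pmu->cpu;

	online &= ~(UINT64_C(1) << cpu);
	pmu->cpu = first_cpu(pmu->node_cpus & online);
	return pmu->cpu;
}
```

Storing the chosen CPU in the PMU structure means the cpumask attribute can simply report it, instead of recomputing a value from cpumask_of_node() that may name an offline CPU.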
>>
>> [...]
>>
>>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
>>> +{
>>> + int counter;
>>> +
>>> + raw_spin_lock(&pmu_uncore->lock);
>>> + counter = find_first_zero_bit(pmu_un

Re: fd3e45436660 ("ACPI / NUMA: ia64: Parse all entries of SRAT memory affinity table")

2018-05-10 Thread Ganapatrao Kulkarni

On Thu, May 10, 2018 at 1:00 PM, Michal Hocko <mho...@kernel.org> wrote:
> On Thu 10-05-18 08:27:35, Ganapatrao Kulkarni wrote:
>> On Wed, May 9, 2018 at 6:26 PM, Michal Hocko <mho...@kernel.org> wrote:
>> > On Wed 09-05-18 18:07:16, Ganapatrao Kulkarni wrote:
>> >> Hi Michal
>> >>
>> >>
>> >> On Wed, May 9, 2018 at 5:54 PM, Michal Hocko <mho...@kernel.org> wrote:
>> >> > On Wed 11-04-18 12:48:32, Michal Hocko wrote:
>> >> >> Hi,
>> >> >> my attention was brought to the %subj commit and either I am missing
>> >> >> something or the patch is quite dubious. What is it actually trying to
>> >> >> fix? If a BIOS/FW provides more memblocks than the limit then we would
>> >> >> get misleading numa topology (numactl -H output) but is the situation
>> >> >> much better with it applied? Numa init code will refuse to init more
>> >> >> memblocks than the limit and falls back to dummy_numa_init (AFAICS)
>> >> >> which will break the topology again and numactl -H will have a
>> >> >> misleading output anyway.
>> >>
>> >> IIRC, the MEMBLOCK beyond max limit getting dropped from visible
>> >> memory(partial drop from a node).
>> >> this patch removed any upper limit on memblocks and allowed to parse
>> >> all entries of SRAT.
>> >
>> > Yeah I've understood that much. My question is, however, why do we care
>> > about parsing the NUMA topology when we fallback into a single NUMA node
>> > anyway? Or do I misunderstand the code? I do not have any platform with
>> > that many memblocks.
>>
>> IMHO, this fix is very much logical by removing the SRAT parsing restriction.
>> below is the crash log which made us to debug and eventually fix with
>> this patch.
>
> Ohh, I am not saying that the current code handles too many memblocks
> correctly. I just think that your fix is not correct or incomplete at
> least. Assuming that my understanding is correct which you haven't
> disputed yet. So can we focus on the proper solution now? Do we actually
> need the memblock restrictions? We do not need those for reserved
> memblocks so I do not see any real reason not to simply remove the
> restriction altogether. Have you explored that option?

My logic when I added this patch was simple: the cap on max memblocks
is arch specific, so why restrict SRAT parsing, which is not arch
specific?
The argument the other way around is: why was the restriction added in
the first place?

> --
> Michal Hocko
> SUSE Labs

thanks
Ganapat

Re: fd3e45436660 ("ACPI / NUMA: ia64: Parse all entries of SRAT memory affinity table")

2018-05-09 Thread Ganapatrao Kulkarni

On Wed, May 9, 2018 at 6:26 PM, Michal Hocko <mho...@kernel.org> wrote:
> On Wed 09-05-18 18:07:16, Ganapatrao Kulkarni wrote:
>> Hi Michal
>>
>>
>> On Wed, May 9, 2018 at 5:54 PM, Michal Hocko <mho...@kernel.org> wrote:
>> > On Wed 11-04-18 12:48:32, Michal Hocko wrote:
>> >> Hi,
>> >> my attention was brought to the %subj commit and either I am missing
>> >> something or the patch is quite dubious. What is it actually trying to
>> >> fix? If a BIOS/FW provides more memblocks than the limit then we would
>> >> get misleading numa topology (numactl -H output) but is the situation
>> >> much better with it applied? Numa init code will refuse to init more
>> >> memblocks than the limit and falls back to dummy_numa_init (AFAICS)
>> >> which will break the topology again and numactl -H will have a
>> >> misleading output anyway.
>>
>> IIRC, the MEMBLOCK beyond max limit getting dropped from visible
>> memory(partial drop from a node).
>> this patch removed any upper limit on memblocks and allowed to parse
>> all entries of SRAT.
>
> Yeah I've understood that much. My question is, however, why do we care
> about parsing the NUMA topology when we fallback into a single NUMA node
> anyway? Or do I misunderstand the code? I do not have any platform with
> that many memblocks.

IMHO, removing the SRAT parsing restriction is a logical fix.
Below is the crash log that led us to debug the problem and eventually
fix it with this patch.

[0.00] NUMA: Adding memblock [0x8000 - 0xfeff] on node 0
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x8000-0xfeff]
[0.00] NUMA: Adding memblock [0x88000 - 0xffcff] on node 0
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x88000-0xffcff]
[0.00] NUMA: Adding memblock [0xffd00 - 0xf] on node 0
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0xffd00-0xf]
[0.00] NUMA: Adding memblock [0x88 - 0x8bfcff] on node 0
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x88-0x8bfcff]
[0.00] NUMA: Adding memblock [0x8bfd00 - 0x8ffcff] on node 0
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x8bfd00-0x8ffcff]
[0.00] NUMA: Adding memblock [0x8ffd00 - 0x93fcff] on node 0
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x8ffd00-0x93fcff]
[0.00] NUMA: Adding memblock [0x93fd00 - 0x9bfcff] on node 1
[0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x93fd00-0x9bfcff]
[0.00] NUMA: Adding memblock [0x9bfd00 - 0x9ffcff] on node 1
[0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x9bfd00-0x9ffcff]
[0.00] NUMA: Warning: invalid memblk node 4 [mem 0x9ffd00-0xa7fcff]
[0.00] NUMA: Faking a node at [mem 0x-0x00a7fcff]
[0.00] NUMA: Adding memblock [0x802f - 0x802f] on node 0
[0.00] NUMA: Adding memblock [0x8030 - 0xbfff] on node 0
[0.00] NUMA: Adding memblock [0xc400 - 0xf5ef] on node 0
[0.00] NUMA: Adding memblock [0xf5f0 - 0xf5f6] on node 0
[0.00] NUMA: Adding memblock [0xf5f7 - 0xf603] on node 0
[0.00] NUMA: Adding memblock [0xf604 - 0xf667] on node 0
[0.00] NUMA: Adding memblock [0xf668 - 0xfe45] on node 0
[0.00] NUMA: Adding memblock [0xfe46 - 0xfe4e] on node 0
[0.00] NUMA: Adding memblock [0xfe4f - 0xfe4f] on node 0
[0.00] NUMA: Adding memblock [0xfe50 - 0xfe61] on node 0
[0.00] NUMA: Adding memblock [0xfe62 - 0xfeff] on node 0
[0.00] NUMA: Adding memblock [0x88000 - 0xf] on node 0
[0.00] NUMA: Adding memblock [0x88 - 0x93fcff] on node 0
[0.00] NUMA: Adding memblock [0x93fd00 - 0x9ffcff] on node 0
[0.00] NUMA: Warning: invalid memblk node 4 [mem 0x9ffd00-0xa7fcff]
[0.00] Unable to handle kernel NULL pointer dereference at virtual address 1b40
[0.00] pgd = fc000957
[0.00] [1b40] *pgd=00a7fcfe0003, *pud=00a7fcfe0003, *pmd=00a7fcfe0003, *pte=
[0.00] Internal error: Oops: 9606 [#1] SMP
[0.00] Modules linked in:
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.12-11.cavium.ml.aarch64 #1
[0.00] Hardware name: (null) (DT)
[0.00] task: fc0008d35780 task.stack: fc0008cf
[0.00] PC is at sparse_early_usemaps_alloc_node+0x20/0xb4
[0.00] LR is at sparse_init+0xec/0x204
[0.00] pc : [] lr : [] pstate: 8089
[0.00] sp : fc0008cf3e40

thanks
Ganapat
>
> --
> Michal Hocko
> SUSE Labs

Re: fd3e45436660 ("ACPI / NUMA: ia64: Parse all entries of SRAT memory affinity table")

2018-05-09 Thread Ganapatrao Kulkarni

Hi Michal


On Wed, May 9, 2018 at 5:54 PM, Michal Hocko  wrote:
> On Wed 11-04-18 12:48:32, Michal Hocko wrote:
>> Hi,
>> my attention was brought to the %subj commit and either I am missing
>> something or the patch is quite dubious. What is it actually trying to
>> fix? If a BIOS/FW provides more memblocks than the limit then we would
>> get misleading numa topology (numactl -H output) but is the situation
>> much better with it applied? Numa init code will refuse to init more
>> memblocks than the limit and falls back to dummy_numa_init (AFAICS)
>> which will break the topology again and numactl -H will have a
>> misleading output anyway.

IIRC, memblocks beyond the max limit were getting dropped from visible
memory (a partial drop from a node).
This patch removed the upper limit on memblocks and allowed all
entries of the SRAT to be parsed.

>>
>> So why is the patch an improvement at all?
>
> ping? I would be tempted to simply revert the patch as a wrong fix.
> --
> Michal Hocko
> SUSE Labs

thanks
Ganapat
Sorry, I somehow missed your previous email.

Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

2018-05-04 Thread Ganapatrao Kulkarni

Hi Mark,

On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <mark.rutl...@arm.com> wrote:
> Hi,
>
> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>> +
>> +/* L3c and DMC has 16 and 8 channels per socket respectively.
>> + * Each Channel supports UNCORE PMU device and consists of
>> + * 4 independent programmable counters. Counters are 32 bit
>> + * and does not support overflow interrupt, they needs to be
>> + * sampled before overflow(i.e, at every 2 seconds).
>> + */
>> +
>> +#define UNCORE_MAX_COUNTERS  4
>> +#define UNCORE_L3_MAX_TILES  16
>> +#define UNCORE_DMC_MAX_CHANNELS  8
>> +
>> +#define UNCORE_HRTIMER_INTERVAL  (2 * NSEC_PER_SEC)
>
> How was a period of two seconds chosen?

The HW team suggested sampling before 3 to 4 seconds elapse.

>
> What's the maximum clock speed for the L3C and DMC?

L3C at 1.5GHz and DMC at 1.2GHz
>
> Given that all channels compete for access to the muxed register
> interface, I suspect we need to try more often than once every 2
> seconds...

2 seconds seems to be sufficient. So far testing looks good.
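
For reference, a rough sketch of the arithmetic behind this interval. It assumes a counter advances at most once per clock cycle, which is not stated in this thread; the helper name wrap_time_ns() is made up for illustration. Under that assumption a 32-bit counter wraps after 2^32 counts, i.e. after roughly 2.86 s at 1.5GHz (L3C) and roughly 3.58 s at 1.2GHz (DMC), so a 2 second sample period stays below both.

```c
#include <stdint.h>

/*
 * Hypothetical helper: worst-case time, in nanoseconds, for a 32-bit
 * counter to wrap, ASSUMING at most one count per clock cycle.
 */
static uint64_t wrap_time_ns(uint64_t clock_hz)
{
	const uint64_t counts = 1ULL << 32;	/* 2^32 counts until wrap */

	/* counts * 1e9 is about 4.3e18 and still fits in 64 bits. */
	return counts * 1000000000ULL / clock_hz;
}
```

Comparing wrap_time_ns(1500000000ULL) against 2 * NSEC_PER_SEC gives the margin for the faster of the two blocks.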

>
> [...]
>
>> +struct active_timer {
>> + struct perf_event *event;
>> + struct hrtimer hrtimer;
>> +};
>> +
>> +/*
>> + * pmu on each socket has 2 uncore devices(dmc and l3),
>> + * each uncore device has up to 16 channels, each channel can sample
>> + * events independently with counters up to 4.
>> + *
>> + * struct thunderx2_pmu_uncore_channel created per channel.
>> + * struct thunderx2_pmu_uncore_dev per uncore device.
>> + */
>> +struct thunderx2_pmu_uncore_channel {
>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>> + struct pmu pmu;
>
> Can we put the pmu first in the struct, please?

ok
>
>> + int counter;
>
> AFAICT, this counter field is never used.

hmm ok, will remove.
>
>> + int channel;
>> + DECLARE_BITMAP(counter_mask, UNCORE_MAX_COUNTERS);
>> + struct active_timer *active_timers;
>
> You should only need a single timer per channel, rather than one per
> event.
>
> I think you can get rid of the active_timer structure, and have:
>
> struct perf_event *events[UNCORE_MAX_COUNTERS];
> struct hrtimer timer;
>

thanks, will change as suggested.
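
A minimal userspace sketch of what the single per-channel timer could do on each tick, not the driver code itself: walk all counters of the channel and fold the raw 32-bit value into a 64-bit total. The struct and function names (channel_model, channel_tick) are illustrative assumptions, and the sketch relies on the timer firing before a counter can wrap more than once.

```c
#include <stdint.h>

#define UNCORE_MAX_COUNTERS 4

/* Userspace model of one channel; field names are illustrative only. */
struct channel_model {
	uint32_t prev[UNCORE_MAX_COUNTERS];	/* last raw value read */
	uint64_t total[UNCORE_MAX_COUNTERS];	/* accumulated 64-bit count */
	int active[UNCORE_MAX_COUNTERS];	/* non-zero if counter in use */
};

/*
 * One timer tick: update every active counter of the channel.
 * Unsigned 32-bit subtraction yields the correct delta across a
 * single wrap of the hardware counter.
 */
static void channel_tick(struct channel_model *ch,
			 const uint32_t raw[UNCORE_MAX_COUNTERS])
{
	for (int i = 0; i < UNCORE_MAX_COUNTERS; i++) {
		if (!ch->active[i])
			continue;
		ch->total[i] += (uint32_t)(raw[i] - ch->prev[i]);
		ch->prev[i] = raw[i];
	}
}
```

With prev = 0xfffffff0 and a new raw value of 0x10, the delta comes out as 0x20 even though the hardware counter wrapped in between.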

>> + /* to sync counter alloc/release */
>> + raw_spinlock_t lock;
>> +};
>> +
>> +struct thunderx2_pmu_uncore_dev {
>> + char *name;
>> + struct device *dev;
>> + enum thunderx2_uncore_type type;
>> + unsigned long base;
>
> This should be:
>
> void __iomem *base;

ok
>
> [...]
>
>> +static ssize_t cpumask_show(struct device *dev,
>> + struct device_attribute *attr, char *buf)
>> +{
>> + struct cpumask cpu_mask;
>> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
>> + pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev));
>> +
>> + /* Pick first online cpu from the node */
>> + cpumask_clear(&cpu_mask);
>> + cpumask_set_cpu(cpumask_first(
>> + cpumask_of_node(pmu_uncore->uncore_dev->node)),
>> + &cpu_mask);
>> +
>> + return cpumap_print_to_pagebuf(true, buf, &cpu_mask);
>> +}
>
> AFAICT cpumask_of_node() returns a mask that can include offline CPUs.
>
> Regardless, I don't think that you can keep track of the management CPU
> this way. Please keep track of the CPU the PMU should be managed by,
> either with a cpumask or int embedded within the PMU structure.

thanks, will add hotplug callbacks.
>
> At hotplug time, you'll need to update the management CPU, calling
> perf_pmu_migrate_context() when it is offlined.

ok
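
A small userspace model of the bookkeeping suggested above, under stated assumptions: CPUs are bits in a 64-bit mask, and pmu_model, first_cpu() and pmu_offline_cpu() are hypothetical names, not kernel APIs. It only shows the decision made when a CPU goes offline; in the kernel the handover point is where perf_pmu_migrate_context() would be called.

```c
#include <stdint.h>

/* Userspace model: bit n of a mask stands for CPU n. */
static int first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (1ULL << cpu))
			return cpu;
	return -1;	/* empty mask */
}

struct pmu_model {
	int cpu;		/* CPU managing this PMU, -1 if none */
	uint64_t node_mask;	/* CPUs on the PMU's NUMA node */
};

/*
 * Model of an offline callback: if the departing CPU manages the PMU,
 * hand over to another online CPU of the same node.
 * Returns the management CPU after the call.
 */
static int pmu_offline_cpu(struct pmu_model *pmu, int cpu,
			   uint64_t online_mask)
{
	uint64_t candidates;

	if (pmu->cpu != cpu)
		return pmu->cpu;	/* another CPU manages this PMU */

	candidates = pmu->node_mask & online_mask & ~(1ULL << cpu);
	pmu->cpu = first_cpu(candidates);
	/* Kernel code would migrate the perf context to pmu->cpu here. */
	return pmu->cpu;
}
```

Keeping the management CPU in the PMU structure also lets cpumask_show() print that stored value instead of recomputing it from cpumask_of_node().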
>
> [...]
>
>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
>> +{
>> + int counter;
>> +
>> + raw_spin_lock(&pmu_uncore->lock);
>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
>> + pmu_uncore->uncore_dev->max_counters);
>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
>> + raw_spin_unlock(&pmu_uncore->lock);
>> + return -ENOSPC;
>> + }
>> + set_bit(counter, pmu_uncore->counter_mask);
>> + raw_spin_unlock(&pmu_uncore->lock);
>> + return counter;
>> +}
>> +
>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
>> + int counter)
>> +{
>> +
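
The counter allocation quoted above can be modelled in userspace as a plain bitmask search. This is a sketch under assumptions, not the patch's code: the _model names are invented, a uint32_t stands in for the kernel bitmap, and the raw spinlock is omitted, so it is only valid single-threaded.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_COUNTERS 4

/*
 * Allocate the lowest free counter. Bit n set in *mask means counter n
 * is in use. Returns the counter index, or -ENOSPC if all are taken.
 */
static int alloc_counter_model(uint32_t *mask)
{
	for (int i = 0; i < MAX_COUNTERS; i++) {
		if (!(*mask & (1u << i))) {
			*mask |= 1u << i;
			return i;
		}
	}
	return -ENOSPC;
}

/* Release a counter obtained from alloc_counter_model(). */
static void free_counter_model(uint32_t *mask, int counter)
{
	*mask &= ~(1u << counter);
}
```

Four allocations on an empty mask return 0 to 3, a fifth fails with -ENOSPC, and freeing counter 1 makes it the next one handed out.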

Re: [PATCH v4 1/2] perf: uncore: Adding documentation for ThunderX2 pmu uncore driver

2018-04-26 Thread Ganapatrao Kulkarni

On Fri, Apr 27, 2018 at 2:25 AM, Randy Dunlap <rdun...@infradead.org> wrote:
> Hi,
>
> Just a few typo corrections...
>
> On 04/25/2018 02:00 AM, Ganapatrao Kulkarni wrote:
>> Documentation for the UNCORE PMUs on Cavium's ThunderX2 SoC.
>> The SoC has PMU support in its L3 cache controller (L3C) and in the
>> DDR4 Memory Controller (DMC).
>>
>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>> ---
>>  Documentation/perf/thunderx2-pmu.txt | 66
>>  1 file changed, 66 insertions(+)
>>  create mode 100644 Documentation/perf/thunderx2-pmu.txt
>>
>> diff --git a/Documentation/perf/thunderx2-pmu.txt b/Documentation/perf/thunderx2-pmu.txt
>> new file mode 100644
>> index 000..9e9f535
>> --- /dev/null
>> +++ b/Documentation/perf/thunderx2-pmu.txt
>> @@ -0,0 +1,66 @@
>> +
>> +Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
>> +==
>> +
>> +ThunderX2 SoC PMU consists of independent system wide per Socket PMUs such
>> +as Level 3 Cache(L3C) and DDR4 Memory Controller(DMC).
>> +
>> +It has 8 independent DMC PMUs to capture performance events corresponding
>> +to 8 channels of DDR4 Memory Controller. There are 16 independent L3C PMUs
>> +to capture events corresponding to 16 tiles of L3 cache. Each PMU supports
>> +up to 4 counters.
>> +
>> +Counters are independent programmable and can be started and stopped
>
> independently

thanks, will update.
>
>> +individually. Each counter can be set to sample specific perf events.
>> +Counters are 32 bit and does not support overflow interrupt, they are
>
>                           do not                    interrupt; they are
>
>
>> +sampled at every 2 seconds. The Counters register access are multiplexed
>> +across channels of DMC and L3C. The muxing(select channel) is done through
>> +write to a Secure register using smcc calls.
>> +
>> +PMU UNCORE (perf) driver:
>> +
>> +The thunderx2-pmu driver registers several perf PMUs for DMC and L3C devices.
>> +Each of the PMU provides description of its available events
>
> of the PMUs
>
>> +and configuration options in sysfs.
>> + see /sys/devices/uncore_
>> +
>> +S is socket id and X represents channel number.
>> +Each PMU can be used to sample up to 4 events simultaneously.
>> +
>> +The "format" directory describes format of the config (event ID).
>> +The "events" directory provides configuration templates for all
>> +supported event types that can be used with perf tool.
>> +
>> +For example, "uncore_dmc_0_0/cnt_cycles/" is an
>> +equivalent of "uncore_dmc_0_0/config=0x1/".
>> +
>> +Each perf driver also provides a "cpumask" sysfs attribute, which contains a
>> +single CPU ID of the processor which is likely to be used to handle all the
>> +PMU events. It will be the first online CPU from the NUMA node of PMU device.
>> +
>> +Example for perf tool use:
>> +
>> +perf stat -a -e \
>> +uncore_dmc_0_0/cnt_cycles/,\
>> +uncore_dmc_0_1/cnt_cycles/,\
>> +uncore_dmc_0_2/cnt_cycles/,\
>> +uncore_dmc_0_3/cnt_cycles/,\
>> +uncore_dmc_0_4/cnt_cycles/,\
>> +uncore_dmc_0_5/cnt_cycles/,\
>> +uncore_dmc_0_6/cnt_cycles/,\
>> +uncore_dmc_0_7/cnt_cycles/ sleep 1
>> +
>> +perf stat -a -e \
>> +uncore_dmc_0_0/cancelled_read_txns/,\
>> +uncore_dmc_0_0/cnt_cycles/,\
>> +uncore_dmc_0_0/consumed_read_txns/,\
>> +uncore_dmc_0_0/data_transfers/ sleep 1
>> +
>> +perf stat -a -e \
>> +uncore_l3c_0_0/l3_retry/,\
>> +uncore_l3c_0_0/read_hit/,\
>> +uncore_l3c_0_0/read_request/,\
>> +uncore_l3c_0_0/inv_request/ sleep 1
>> +
>> +The driver does not support sampling, therefore "perf record" will
>> +not work. Per-task (without "-a") perf sessions are not supported.
>>
>
>
> --
> ~Randy

thanks
Ganapat

Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA

2018-04-26 Thread Ganapatrao Kulkarni

Hi Robin,

On Mon, Apr 23, 2018 at 11:11 PM, Ganapatrao Kulkarni
<gklkm...@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.mur...@arm.com> wrote:
>> On 19/04/18 18:12, Ganapatrao Kulkarni wrote:
>>>
>>> The performance drop is observed with long hours iperf testing using 40G
>>> cards. This is mainly due to long iterations in finding the free iova
>>> range in 32bit address space.
>>>
>>> In current implementation for 64bit PCI devices, there is always first
>>> attempt to allocate iova from 32bit(SAC preferred over DAC) address
>>> range. Once we run out 32bit range, there is allocation from higher range,
>>> however due to cached32_node optimization it does not suppose to be
>>> painful. cached32_node always points to recently allocated 32-bit node.
>>> When address range is full, it will be pointing to last allocated node
>>> (leaf node), so walking rbtree to find the available range is not
>>> expensive affair. However this optimization does not behave well when
>>> one of the middle node is freed. In that case cached32_node is updated
>>> to point to next iova range. The next iova allocation will consume free
>>> range and again update cached32_node to itself. From now on, walking
>>> over 32-bit range is more expensive.
>>>
>>> This patch adds fix to update cached node to leaf node when there are no
>>> iova free range left, which avoids unnecessary long iterations.
>>
>>
>> The only trouble with this is that "allocation failed" doesn't uniquely mean
>> "space full". Say that after some time the 32-bit space ends up empty except
>> for one page at 0x1000 and one at 0x8000, then somebody tries to
>> allocate 2GB. If we move the cached node down to the leftmost entry when
>> that fails, all subsequent allocation attempts are now going to fail despite
>> the space being 99.% free!
>>
>> I can see a couple of ways to solve that general problem of free space above
>> the cached node getting lost, but neither of them helps with the case where
>> there is genuinely insufficient space (and if anything would make it even
>> slower). In terms of the optimisation you want here, i.e. fail fast when an
>> allocation cannot possibly succeed, the only reliable idea which comes to
>> mind is free-PFN accounting. I might give that a go myself to see how ugly
>> it looks.

For this testing, a dual-port Intel 40G card (XL710) was used, with both
ports connected in loopback. I ran iperf servers and clients on both
ports (using NAT to route packets out on the intended ports). Ten
iperf clients were invoked every 60 seconds in a loop, for hours, on each
port. Initially the performance on both ports was close to line
rate; however, after the test had run for about 4 to 6 hours, the
performance dropped to a few hundred Mbps on both
connections.

IMO, this is a common bug that should happen on other platforms too,
and it needs to be fixed at the earliest.
Please let me know if you have a better way to fix this; I am happy to
test your patch!

>
> i see 2 problems in current implementation,
> 1. We don't replenish the 32 bits range, until first attempt of second
> allocation(64 bit) fails.
> 2. Having  per cpu cache might not yield good hit on platforms with
> more number of CPUs.
>
> however irrespective of current issues, It makes sense to update
> cached node as done in this patch , when there is failure to get iova
> range using current cached pointer which is forcing for the
> unnecessary time consuming do-while iterations until any replenish
> happens!
>
> thanks
> Ganapat
>
>>
>> Robin.
>>
>>
>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>>> ---
>>>   drivers/iommu/iova.c | 6 ++
>>>   1 file changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index 83fe262..e6ee2ea 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct
>>> iova_domain *iovad,
>>> } while (curr && new_pfn <= curr_iova->pfn_hi);
>>> if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>>> +   /* No more cached node points to free hole, update to leaf
>>> node.
>>> +*/
>>> +   struct iova *prev_iova;
>>> +
>>> +   prev_iova = rb_entry(prev, struct iova, node);
>>> +   __cached_rbnode_insert_update(iovad, prev_iova);
>>> spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>> return -ENOMEM;
>>> }
>>>
>>

thanks
Ganapat
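
Note on the thread above: Robin's objection (a failed allocation does not
imply a full space) can be illustrated with a toy model. The sketch below is
not the kernel's rbtree-based allocator. It assumes a 64-PFN space tracked in
a plain array, and every name in it (toy_domain, toy_alloc, cached_top,
move_cache_on_fail) is hypothetical. It only mimics the idea of a cached
search start that is moved to the lowest allocated entry when an allocation
fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPFN      64	/* toy address space: PFNs 0..63 */
#define START_PFN 1	/* lowest allocatable PFN */

/* Hypothetical toy domain; the real iova_domain uses an rbtree. */
struct toy_domain {
	bool used[NPFN];
	int cached_top;			/* search only below this PFN (exclusive) */
	bool move_cache_on_fail;	/* the policy under discussion */
};

static void toy_init(struct toy_domain *d, bool move_cache_on_fail)
{
	memset(d, 0, sizeof(*d));
	d->cached_top = NPFN;
	d->move_cache_on_fail = move_cache_on_fail;
	d->used[1] = true;	/* two stray pages, as in Robin's example */
	d->used[8] = true;
}

static int toy_free_count(const struct toy_domain *d)
{
	int n = 0;

	for (int p = START_PFN; p < NPFN; p++)
		n += !d->used[p];
	return n;
}

/* Top-down first fit below cached_top; returns the lowest PFN or -1. */
static int toy_alloc(struct toy_domain *d, int size)
{
	assert(size > 0);

	for (int hi = d->cached_top - 1; hi - size + 1 >= START_PFN; hi--) {
		int lo = hi - size + 1;
		bool fits = true;

		for (int p = lo; p <= hi; p++) {
			if (d->used[p]) {
				fits = false;
				break;
			}
		}
		if (fits) {
			for (int p = lo; p <= hi; p++)
				d->used[p] = true;
			d->cached_top = lo;
			return lo;
		}
	}

	if (d->move_cache_on_fail) {
		/* Mimic "move the cached node to the lowest entry" on failure. */
		for (int p = START_PFN; p < NPFN; p++) {
			if (d->used[p]) {
				d->cached_top = p;
				break;
			}
		}
	}
	return -1;
}
```

In this model, with move_cache_on_fail set, one oversized request
(toy_alloc(&d, 128)) fails and moves cached_top down to PFN 1. A following
one-page request then also fails, although toy_free_count() still reports 61
free PFNs. With the policy off, the same one-page request succeeds at PFN 63.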

[PATCH v4 1/2] perf: uncore: Adding documentation for ThunderX2 pmu uncore driver

2018-04-25 Thread Ganapatrao Kulkarni

Documentation for the UNCORE PMUs on Cavium's ThunderX2 SoC.
The SoC has PMU support in its L3 cache controller (L3C) and in the
DDR4 Memory Controller (DMC).

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
---
 Documentation/perf/thunderx2-pmu.txt | 66 
 1 file changed, 66 insertions(+)
 create mode 100644 Documentation/perf/thunderx2-pmu.txt

diff --git a/Documentation/perf/thunderx2-pmu.txt 
b/Documentation/perf/thunderx2-pmu.txt
new file mode 100644
index 000..9e9f535
--- /dev/null
+++ b/Documentation/perf/thunderx2-pmu.txt
@@ -0,0 +1,66 @@
+
+Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
+=============================================================
+
+The ThunderX2 SoC PMU consists of independent, system-wide, per-socket PMUs such
+as the Level 3 Cache (L3C) and the DDR4 Memory Controller (DMC).
+
+It has 8 independent DMC PMUs to capture performance events corresponding
+to the 8 channels of the DDR4 Memory Controller. There are 16 independent L3C PMUs
+to capture events corresponding to the 16 tiles of the L3 cache. Each PMU supports
+up to 4 counters.
+
+Counters are independently programmable and can be started and stopped
+individually. Each counter can be set to count a specific perf event.
+Counters are 32 bit and do not support an overflow interrupt; they are
+sampled every 2 seconds. Counter register accesses are multiplexed
+across the channels of DMC and L3C. The muxing (channel select) is done by
+writing to a Secure register using SMC calls.
+
+PMU UNCORE (perf) driver:
+
+The thunderx2-pmu driver registers several perf PMUs for DMC and L3C devices.
+Each PMU provides a description of its available events
+and configuration options in sysfs.
+   see /sys/devices/uncore_
+
+S is the socket ID and X represents the channel number.
+Each PMU can be used to count up to 4 events simultaneously.
+
+The "format" directory describes the format of the config (event ID).
+The "events" directory provides configuration templates for all
+supported event types that can be used with the perf tool.
+
+For example, "uncore_dmc_0_0/cnt_cycles/" is
+equivalent to "uncore_dmc_0_0/config=0x1/".
+
+Each perf driver also provides a "cpumask" sysfs attribute, which contains the
+single CPU ID of the processor that is likely to be used to handle all the
+PMU events. It will be the first online CPU from the NUMA node of the PMU device.
+
+Example for perf tool use:
+
+perf stat -a -e \
+uncore_dmc_0_0/cnt_cycles/,\
+uncore_dmc_0_1/cnt_cycles/,\
+uncore_dmc_0_2/cnt_cycles/,\
+uncore_dmc_0_3/cnt_cycles/,\
+uncore_dmc_0_4/cnt_cycles/,\
+uncore_dmc_0_5/cnt_cycles/,\
+uncore_dmc_0_6/cnt_cycles/,\
+uncore_dmc_0_7/cnt_cycles/ sleep 1
+
+perf stat -a -e \
+uncore_dmc_0_0/cancelled_read_txns/,\
+uncore_dmc_0_0/cnt_cycles/,\
+uncore_dmc_0_0/consumed_read_txns/,\
+uncore_dmc_0_0/data_transfers/ sleep 1
+
+perf stat -a -e \
+uncore_l3c_0_0/l3_retry/,\
+uncore_l3c_0_0/read_hit/,\
+uncore_l3c_0_0/read_request/,\
+uncore_l3c_0_0/inv_request/ sleep 1
+
+The driver does not support sampling, therefore "perf record" will
+not work. Per-task (without "-a") perf sessions are not supported.
-- 
2.9.4
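
Note on the document above: it states that the counters are 32 bit, have no
overflow interrupt, and are therefore sampled periodically. A common way to
turn such readings into a 64-bit total is to accumulate the modulo-2^32
difference between successive reads. The sketch below shows that technique in
isolation. It is an assumption about how such polling can be done, not code
taken from the driver, and the names toy_counter and toy_counter_update are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-counter software state; not the driver's real struct. */
struct toy_counter {
	uint32_t prev_raw;	/* last value read from the 32-bit hardware counter */
	uint64_t total;		/* 64-bit running total kept in software */
};

/*
 * Fold a new raw reading into the 64-bit total.
 *
 * The unsigned subtraction is performed modulo 2^32, so the delta is
 * correct even if the hardware counter wrapped once since the last read.
 * It is only correct if the counter wraps at most once between two calls,
 * which is why the polling interval must be shorter than the wrap time.
 */
static uint64_t toy_counter_update(struct toy_counter *c, uint32_t raw)
{
	uint32_t delta = raw - c->prev_raw;

	c->prev_raw = raw;
	c->total += delta;
	return c->total;
}
```

For example, if the previous reading was 0xFFFFFFF0 and the next one is 0x10,
the computed delta is 0x20 (32 events), even though the raw value went down.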

[PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

2018-04-25 Thread Ganapatrao Kulkarni

This patch adds a perf driver for the PMU UNCORE devices DDR4 Memory
Controller(DMC) and Level 3 Cache(L3C).

ThunderX2 has 8 independent DMC PMUs to capture performance events
corresponding to 8 channels of DDR4 Memory Controller and 16 independent
L3C PMUs to capture events corresponding to 16 tiles of L3 cache.
Each PMU supports up to 4 counters. All counters lack overflow interrupt
and are sampled periodically.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
---
 drivers/perf/Kconfig |   8 +
 drivers/perf/Makefile|   1 +
 drivers/perf/thunderx2_pmu.c | 958 +++
 3 files changed, 967 insertions(+)
 create mode 100644 drivers/perf/thunderx2_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 28bb5a0..eafd0fc 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -85,6 +85,14 @@ config QCOM_L3_PMU
   Adds the L3 cache PMU into the perf events subsystem for
   monitoring L3 cache events.
 
+config THUNDERX2_PMU
+bool "Cavium ThunderX2 SoC PMU UNCORE"
+depends on ARCH_THUNDER2 && PERF_EVENTS && ACPI
+   help
+ Provides support for ThunderX2 UNCORE events.
+ The SoC has PMU support in its L3 cache controller (L3C) and
+ in the DDR4 Memory Controller (DMC).
+
 config XGENE_PMU
 depends on ARCH_XGENE
 bool "APM X-Gene SoC PMU"
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index b3902bd..909f27f 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -7,5 +7,6 @@ obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o
 obj-$(CONFIG_HISI_PMU) += hisilicon/
 obj-$(CONFIG_QCOM_L2_PMU)  += qcom_l2_pmu.o
 obj-$(CONFIG_QCOM_L3_PMU) += qcom_l3_pmu.o
+obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
 obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
 obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
diff --git a/drivers/perf/thunderx2_pmu.c b/drivers/perf/thunderx2_pmu.c
new file mode 100644
index 000..92d19b5
--- /dev/null
+++ b/drivers/perf/thunderx2_pmu.c
@@ -0,0 +1,958 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CAVIUM THUNDERX2 SoC PMU UNCORE
+ *
+ * Copyright (C) 2018 Cavium Inc.
+ * Author: Ganapatrao Kulkarni <gkulka...@cavium.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+/* L3C and DMC have 16 and 8 channels per socket respectively.
+ * Each channel supports an UNCORE PMU device and consists of
+ * 4 independently programmable counters. Counters are 32 bit
+ * and do not support an overflow interrupt, so they need to be
+ * sampled before overflow (i.e., every 2 seconds).
+ */
+
+#define UNCORE_MAX_COUNTERS             4
+#define UNCORE_L3_MAX_TILES             16
+#define UNCORE_DMC_MAX_CHANNELS         8
+
+#define UNCORE_HRTIMER_INTERVAL         (2 * NSEC_PER_SEC)
+#define GET_EVENTID(ev)                 ((ev->hw.config) & 0x1ff)
+#define GET_COUNTERID(ev)               ((ev->hw.idx) & 0xf)
+#define GET_CHANNELID(pmu_uncore)       (pmu_uncore->channel)
+
+#define DMC_COUNTER_CTL                 0x234
+#define DMC_COUNTER_DATA                0x240
+#define L3C_COUNTER_CTL                 0xA8
+#define L3C_COUNTER_DATA                0xAC
+
+#define THUNDERX2_SMC_CALL_ID  0xC200FF00
+#define THUNDERX2_SMC_SET_CHANNEL  0xB010
+
+
+enum thunderx2_uncore_l3_events {
+   L3_EVENT_NONE,
+   L3_EVENT_NBU_CANCEL,
+   L3_EVENT_DIB_RETRY,
+   L3_EVENT_DOB_RETRY,
+   L3_EVENT_DIB_CREDIT_RETRY,
+   L3_EVENT_DOB_CREDIT_RETRY,
+   L3_EVENT_FORCE_RETRY,
+   L3_EVENT_IDX_CONFLICT_RETRY,
+   L3_EVENT_EVICT_CONFLICT_RETRY,
+   L3_EVENT_BANK_CONFLICT_RETRY,
+   L3_EVENT_FILL_ENTRY_RETRY,
+   L3_EVENT_EVICT_NOT_READY_RETRY,
+   L3_EVENT_L3_RETRY,
+   L3_EVENT_READ_REQ,
+   L3_EVENT_WRITE_BACK_REQ,
+   L3_EVENT_INVALIDATE_NWRITE_REQ,
+   L3_EVENT_INV_REQ,
+   L3_EVENT_SELF_REQ,
+   L3_EVENT_REQ,
+   L3_EVENT_EVICT_REQ,
+   L3_EVENT_INVALIDATE_NWRITE_HIT,
+   L3_EVENT_INVALIDATE_HIT,
+   L3_EVENT_SELF_HIT,
+   L3_EVENT_READ_HIT,
+   L3_EVENT_MAX,
+};
+
+enum thunderx2_uncore_dmc_events {
+   DMC_EVENT_NONE,
+   DMC_EVENT_COUNT_CYCLES,
+   DMC_EVENT_RES2,
+   DMC_EVENT_RES3,
+   DMC_EVENT_RES4,
+   DMC_EVENT_RES5,
+   DMC_EVENT_RES6,
+   DMC_EVENT_RE

[PATCH v4 0/2] Add ThunderX2 SoC Performance Monitoring Unit driver

2018-04-25 Thread Ganapatrao Kulkarni

This patchset adds PMU driver for Cavium's ThunderX2 SoC UNCORE devices.
The SoC has PMU support in L3 cache controller (L3C) and in the
DDR4 Memory Controller (DMC).

v4:
 - Incorporated review comments from Mark Rutland [1]

[1] https://www.spinics.net/lists/arm-kernel/msg588563.html

v3:
 - fixed warning reported by kbuild robot

v2:
 - rebased to 4.12-rc1
 - Removed Arch VULCAN dependency.
 - update SMC call parameters as per latest firmware.

v1:
 -Initial patch

Ganapatrao Kulkarni (2):
  perf: uncore: Adding documentation for ThunderX2 pmu uncore driver
  ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

 Documentation/perf/thunderx2-pmu.txt |  66 +++
 drivers/perf/Kconfig |   8 +
 drivers/perf/Makefile|   1 +
 drivers/perf/thunderx2_pmu.c | 958 +++
 4 files changed, 1033 insertions(+)
 create mode 100644 Documentation/perf/thunderx2-pmu.txt
 create mode 100644 drivers/perf/thunderx2_pmu.c

-- 
2.9.4

Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA

2018-04-23 Thread Ganapatrao Kulkarni

On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.mur...@arm.com> wrote:
> On 19/04/18 18:12, Ganapatrao Kulkarni wrote:
>>
>> The performance drop is observed with long hours iperf testing using 40G
>> cards. This is mainly due to long iterations in finding the free iova
>> range in 32bit address space.
>>
>> In current implementation for 64bit PCI devices, there is always first
>> attempt to allocate iova from 32bit(SAC preferred over DAC) address
>> range. Once we run out 32bit range, there is allocation from higher range,
>> however due to cached32_node optimization it does not suppose to be
>> painful. cached32_node always points to recently allocated 32-bit node.
>> When address range is full, it will be pointing to last allocated node
>> (leaf node), so walking rbtree to find the available range is not
>> expensive affair. However this optimization does not behave well when
>> one of the middle node is freed. In that case cached32_node is updated
>> to point to next iova range. The next iova allocation will consume free
>> range and again update cached32_node to itself. From now on, walking
>> over 32-bit range is more expensive.
>>
>> This patch adds fix to update cached node to leaf node when there are no
>> iova free range left, which avoids unnecessary long iterations.
>
>
> The only trouble with this is that "allocation failed" doesn't uniquely mean
> "space full". Say that after some time the 32-bit space ends up empty except
> for one page at 0x1000 and one at 0x8000, then somebody tries to
> allocate 2GB. If we move the cached node down to the leftmost entry when
> that fails, all subsequent allocation attempts are now going to fail despite
> the space being 99.% free!
>
> I can see a couple of ways to solve that general problem of free space above
> the cached node getting lost, but neither of them helps with the case where
> there is genuinely insufficient space (and if anything would make it even
> slower). In terms of the optimisation you want here, i.e. fail fast when an
> allocation cannot possibly succeed, the only reliable idea which comes to
> mind is free-PFN accounting. I might give that a go myself to see how ugly
> it looks.

I see two problems in the current implementation:
1. We don't replenish the 32-bit range until the first attempt of the second
allocation (64 bit) fails.
2. Having a per-CPU cache might not yield a good hit rate on platforms with
a large number of CPUs.

However, irrespective of these issues, it makes sense to update the
cached node as done in this patch when an attempt to get an iova
range using the current cached pointer fails. Otherwise every allocation
is forced through unnecessary, time-consuming do-while iterations until a
replenish happens!

thanks
Ganapat

>
> Robin.
>
>
>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>> ---
>>   drivers/iommu/iova.c | 6 ++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 83fe262..e6ee2ea 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct
>> iova_domain *iovad,
>> } while (curr && new_pfn <= curr_iova->pfn_hi);
>> if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>> +   /* No more cached node points to free hole, update to leaf
>> node.
>> +*/
>> +   struct iova *prev_iova;
>> +
>> +   prev_iova = rb_entry(prev, struct iova, node);
>> +   __cached_rbnode_insert_update(iovad, prev_iova);
>> spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>> return -ENOMEM;
>> }
>>
>
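
Note on the thread above: Robin mentions free-PFN accounting as a way to
fail fast when an allocation cannot possibly succeed. One possible reading of
that idea is sketched below: keep a running count of free PFNs and reject any
request larger than that count before walking the tree at all. This is only an
illustration under that assumption, not Robin's patch, and the names
(toy_pfn_account, toy_can_possibly_fit, toy_note_alloc, toy_note_free) are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical accounting state for one address range. */
struct toy_pfn_account {
	unsigned long free_pfns;	/* PFNs not covered by any allocation */
};

/*
 * Cheap pre-check: a request larger than the total free count can never
 * succeed, so the caller can fail without searching.
 * A "true" result is necessary but not sufficient, because the free
 * PFNs may be fragmented into holes that are each too small.
 */
static bool toy_can_possibly_fit(const struct toy_pfn_account *a,
				 unsigned long size)
{
	return size <= a->free_pfns;
}

/* Call after a successful allocation of 'size' PFNs. */
static void toy_note_alloc(struct toy_pfn_account *a, unsigned long size)
{
	assert(size <= a->free_pfns);
	a->free_pfns -= size;
}

/* Call after freeing a range of 'size' PFNs. */
static void toy_note_free(struct toy_pfn_account *a, unsigned long size)
{
	a->free_pfns += size;
}
```

The check costs O(1) per request. It does not move any cached pointer, so it
avoids the "space mostly free but every allocation fails" situation that
Robin describes, at the price of not detecting failures caused purely by
fragmentation.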

Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA

2018-04-23 Thread Ganapatrao Kulkarni

On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy  wrote:
> On 19/04/18 18:12, Ganapatrao Kulkarni wrote:
>>
>> The performance drop is observed with long hours iperf testing using 40G
>> cards. This is mainly due to long iterations in finding the free iova
>> range in 32bit address space.
>>
>> In current implementation for 64bit PCI devices, there is always first
>> attempt to allocate iova from 32bit(SAC preferred over DAC) address
>> range. Once we run out 32bit range, there is allocation from higher range,
>> however due to cached32_node optimization it does not suppose to be
>> painful. cached32_node always points to recently allocated 32-bit node.
>> When address range is full, it will be pointing to last allocated node
>> (leaf node), so walking rbtree to find the available range is not
>> expensive affair. However this optimization does not behave well when
>> one of the middle node is freed. In that case cached32_node is updated
>> to point to next iova range. The next iova allocation will consume free
>> range and again update cached32_node to itself. From now on, walking
>> over 32-bit range is more expensive.
>>
>> This patch adds fix to update cached node to leaf node when there are no
>> iova free range left, which avoids unnecessary long iterations.
>
>
> The only trouble with this is that "allocation failed" doesn't uniquely mean
> "space full". Say that after some time the 32-bit space ends up empty except
> for one page at 0x1000 and one at 0x8000, then somebody tries to
> allocate 2GB. If we move the cached node down to the leftmost entry when
> that fails, all subsequent allocation attempts are now going to fail despite
> the space being 99.% free!
>
> I can see a couple of ways to solve that general problem of free space above
> the cached node getting lost, but neither of them helps with the case where
> there is genuinely insufficient space (and if anything would make it even
> slower). In terms of the optimisation you want here, i.e. fail fast when an
> allocation cannot possibly succeed, the only reliable idea which comes to
> mind is free-PFN accounting. I might give that a go myself to see how ugly
> it looks.

i see 2 problems in current implementation,
1. We don't replenish the 32 bits range, until first attempt of second
allocation(64 bit) fails.
2. Having  per cpu cache might not yield good hit on platforms with
more number of CPUs.

however irrespective of current issues, It makes sense to update
cached node as done in this patch , when there is failure to get iova
range using current cached pointer which is forcing for the
unnecessary time consuming do-while iterations until any replenish
happens!

thanks
Ganapat

>
> Robin.
>
>
>> Signed-off-by: Ganapatrao Kulkarni 
>> ---
>>   drivers/iommu/iova.c | 6 ++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 83fe262..e6ee2ea 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>  	} while (curr && new_pfn <= curr_iova->pfn_hi);
>>  	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>> +		/* No more cached node points to free hole, update to leaf node.
>> +		 */
>> +		struct iova *prev_iova;
>> +
>> +		prev_iova = rb_entry(prev, struct iova, node);
>> +		__cached_rbnode_insert_update(iovad, prev_iova);
>>  		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>  		return -ENOMEM;
>>  	}
>>
>
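
A rough sketch, in C, of the free-PFN accounting idea Robin mentions in the quoted text. The struct and function names are hypothetical; nothing here is taken from drivers/iommu/iova.c. The counter only allows a fast failure when the request is larger than the total free space, and it says nothing about fragmentation.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for one address range; not a kernel structure. */
struct pfn_account {
    unsigned long total;   /* PFNs covered by the range */
    unsigned long used;    /* PFNs currently allocated */
};

/* Record an allocation of `size` PFNs that has already succeeded. */
static void pfn_account_alloc(struct pfn_account *acct, unsigned long size)
{
    acct->used += size;
}

/* Record that `size` PFNs were freed. */
static void pfn_account_free(struct pfn_account *acct, unsigned long size)
{
    acct->used -= size;
}

/*
 * Necessary (not sufficient) condition for an allocation to succeed.
 * false: the request cannot possibly fit, so the tree walk can be skipped.
 * true:  the walk is still needed, because the free space may be fragmented.
 */
static bool pfn_account_may_fit(const struct pfn_account *acct,
                                unsigned long size)
{
    return size <= acct->total - acct->used;
}
```

Assuming 4K pages, the 32-bit space has 1UL << 20 PFNs. In Robin's example two pages are in use, so a 2GB request (1UL << 19 PFNs) passes the check and must still be searched for, while a request for the entire space is rejected immediately.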

[PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA

2018-04-19 Thread Ganapatrao Kulkarni

A performance drop is observed during long-running iperf tests with 40G
cards. It is mainly due to long iterations when searching for a free iova
range in the 32-bit address space.

In the current implementation, for 64-bit PCI devices there is always a
first attempt to allocate an iova from the 32-bit (SAC preferred over DAC)
address range. Once the 32-bit range runs out, allocation falls back to the
higher range; thanks to the cached32_node optimization this is not supposed
to be painful. cached32_node always points to the most recently allocated
32-bit node. When the address range is full, it points to the last
allocated node (the leaf node), so walking the rbtree to find an available
range is not an expensive affair. However, this optimization does not
behave well when one of the middle nodes is freed. In that case
cached32_node is updated to point to the next iova range. The next iova
allocation consumes the free range and again updates cached32_node to
itself. From then on, walking over the 32-bit range is more expensive.

This patch updates the cached node to the leaf node when there is no free
iova range left, which avoids unnecessarily long iterations.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
---
 drivers/iommu/iova.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 83fe262..e6ee2ea 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	} while (curr && new_pfn <= curr_iova->pfn_hi);
 
 	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
+		/* No more cached node points to free hole, update to leaf node.
+		 */
+		struct iova *prev_iova;
+
+		prev_iova = rb_entry(prev, struct iova, node);
+		__cached_rbnode_insert_update(iovad, prev_iova);
 		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 		return -ENOMEM;
 	}
-- 
2.9.4
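
To illustrate the cost the commit message describes, here is a toy model in C. It is only a sketch under simplifying assumptions: allocated ranges live in a sorted array instead of an rbtree, and the type and function names are hypothetical rather than taken from iova.c. It counts how many entries a top-down search visits when it starts at a given cached entry.

```c
#include <stddef.h>

/* One allocated range with inclusive PFN bounds (hypothetical model type). */
struct range {
    unsigned long lo;
    unsigned long hi;
};

/*
 * Walk downwards from the cached entry looking for a hole of `size` PFNs.
 * `r` is sorted by ascending address and does not overlap; `start` is the
 * lowest usable PFN and must not exceed r[0].lo.
 * Returns the number of entries visited; *found is 1 if a hole was found.
 */
static size_t walk_from_cached(const struct range *r, size_t cached,
                               unsigned long start, unsigned long size,
                               int *found)
{
    unsigned long limit = r[cached].lo;  /* first PFN above the candidate hole */
    size_t visited = 0;
    size_t i = cached;

    *found = 0;
    while (i > 0) {
        const struct range *below = &r[--i];

        visited++;
        if (limit - (below->hi + 1) >= size) {
            *found = 1;
            return visited;
        }
        limit = below->lo;
    }
    /* Last candidate: the hole between `start` and the lowest entry. */
    if (limit - start >= size)
        *found = 1;
    return visited;
}
```

With four single-page ranges filling the space, a failing request that starts at the topmost entry visits three entries on every attempt, whereas one that starts at the lowest entry visits none. That is the saving the patch is after. The same model also shows the trade-off: a search started at the lowest entry never looks at holes above it.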

[tip:perf/core] perf vendor events arm64: Enable JSON events for ThunderX2 B0

2018-03-20 Thread tip-bot for Ganapatrao Kulkarni

Commit-ID:  a8685f088819d21cd5aea5de4c184de427c3625d
Gitweb: https://git.kernel.org/tip/a8685f088819d21cd5aea5de4c184de427c3625d
Author: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
AuthorDate: Wed, 7 Mar 2018 16:38:03 +0530
Committer:  Arnaldo Carvalho de Melo <a...@redhat.com>
CommitDate: Fri, 16 Mar 2018 13:55:41 -0300

perf vendor events arm64: Enable JSON events for ThunderX2 B0

There is MIDR change on ThunderX2 B0, adding an entry to mapfile to
enable JSON events for B0.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
Acked-by: Will Deacon <will.dea...@arm.com>
Cc: Alexander Shishkin <alexander.shish...@linux.intel.com>
Cc: Ganapatrao Kulkarni <gpkulka...@gklkml16.com>
Cc: Jayachandran C <jn...@caviumnetworks.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: John Garry <john.ga...@huawei.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Robert Richter <robert.rich...@cavium.com>
Cc: William Cohen <wco...@redhat.com>
Cc: linux-arm-ker...@lists.infradead.org
Link: http://lkml.kernel.org/r/20180307110803.32418-1-ganapatrao.kulka...@cavium.com
[ Fixup wrt recent patchset by John Garry ]
Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
---
 tools/perf/pmu-events/arch/arm64/mapfile.csv | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/pmu-events/arch/arm64/mapfile.csv b/tools/perf/pmu-events/arch/arm64/mapfile.csv
index 8f11aeb003a9..f03e26ecb658 100644
--- a/tools/perf/pmu-events/arch/arm64/mapfile.csv
+++ b/tools/perf/pmu-events/arch/arm64/mapfile.csv
@@ -14,4 +14,5 @@
 #Family-model,Version,Filename,EventType
 0x410fd03[[:xdigit:]],v1,arm/cortex-a53,core
 0x420f5160,v1,cavium/thunderx2,core
+0x430f0af0,v1,cavium/thunderx2,core
 0x480fd010,v1,hisilicon/hip08,core
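
The first mapfile column is written as a pattern (note the [[:xdigit:]] class in the Cortex-A53 row). The C sketch below shows how such rows could be matched against a CPU ID string, assuming POSIX extended regular expressions and a whole-string match. The table type and the function name are hypothetical, and the real perf lookup code may differ in detail.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* One mapfile row: CPU ID pattern and the event table it selects. */
struct map_entry {
    const char *cpuid_re;
    const char *table;
};

/* Rows as they appear in the mapfile after the patch above. */
static const struct map_entry mapfile[] = {
    { "0x410fd03[[:xdigit:]]", "arm/cortex-a53" },
    { "0x420f5160",            "cavium/thunderx2" },
    { "0x430f0af0",            "cavium/thunderx2" },
    { "0x480fd010",            "hisilicon/hip08" },
};

/* Return the table for `cpuid`, or NULL if no row matches it entirely. */
static const char *mapfile_lookup(const char *cpuid)
{
    size_t i;

    for (i = 0; i < sizeof(mapfile) / sizeof(mapfile[0]); i++) {
        regex_t re;
        regmatch_t m;
        int rc;

        if (regcomp(&re, mapfile[i].cpuid_re, REG_EXTENDED) != 0)
            continue;   /* skip rows whose pattern does not compile */
        rc = regexec(&re, cpuid, 1, &m, 0);
        regfree(&re);
        /* Accept only a match that covers the whole CPU ID string. */
        if (rc == 0 && m.rm_so == 0 && (size_t)m.rm_eo == strlen(cpuid))
            return mapfile[i].table;
    }
    return NULL;
}
```

Both 0x420f5160 and 0x430f0af0 resolve to cavium/thunderx2 here, which is the effect of adding the B0 row rather than replacing the existing one.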

Re: [PATCH] perf vendor events arm64: Enable JSON events for ThunderX2 B0

2018-03-09 Thread Ganapatrao Kulkarni

On Fri, Mar 9, 2018 at 11:32 PM, Arnaldo Carvalho de Melo
<arnaldo.m...@gmail.com> wrote:
> Em Fri, Mar 09, 2018 at 03:00:40PM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Fri, Mar 09, 2018 at 11:15:16PM +0530, Ganapatrao Kulkarni escreveu:
>> > On Fri, Mar 9, 2018 at 11:03 PM, Arnaldo Carvalho de Melo
>> > > +++ b/tools/perf/pmu-events/arch/arm64/mapfile.csv
>> > > @@ -13,5 +13,5 @@
>> > >  #
>> > >  #Family-model,Version,Filename,EventType
>> > >  0x410fd03[[:xdigit:]],v1,arm/cortex-a53,core
>> > > -0x420f5160,v1,cavium/thunderx2,core
>> > > +0x430f0af0,v1,cavium/thunderx2,core
>> > >  0x480fd010,v1,hisilicon/hip08,core
>
>> > please do not delete existing entry,  add additional entry.
>> > > +0x430f0af0,v1,cavium/thunderx2,core
>
>> Ok, my bad, I think my eyes are failing me, I swear I saw them in the
>> original patch, which is not the case...
>
>> Ok, will add, together with Will's Acked-by
>
> So this is how it ended up:
>
>
> commit 9245299469de2e02d9c3cb167c0e52f75f1cc180
> Author: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
> Date:   Wed Mar 7 16:38:03 2018 +0530
>
> perf vendor events arm64: Enable JSON events for ThunderX2 B0
>
> There is MIDR change on ThunderX2 B0, adding an entry to mapfile to
> enable JSON events for B0.
>
>     Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
> Acked-by: Will Deacon <will.dea...@arm.com>
> Cc: Alexander Shishkin <alexander.shish...@linux.intel.com>
> Cc: Ganapatrao Kulkarni <gpkulka...@gklkml16.com>
> Cc: Jayachandran C <jn...@caviumnetworks.com>
> Cc: Jiri Olsa <jo...@redhat.com>
> Cc: John Garry <john.ga...@huawei.com>
> Cc: Mark Rutland <mark.rutl...@arm.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Robert Richter <robert.rich...@cavium.com>
> Cc: William Cohen <wco...@redhat.com>
> Cc: linux-arm-ker...@lists.infradead.org
> Link: http://lkml.kernel.org/r/20180307110803.32418-1-ganapatrao.kulka...@cavium.com
> [ Fixup wrt recent patchset by John Garry ]
> Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
>
> diff --git a/tools/perf/pmu-events/arch/arm64/mapfile.csv b/tools/perf/pmu-events/arch/arm64/mapfile.csv
> index 8f11aeb003a9..f03e26ecb658 100644
> --- a/tools/perf/pmu-events/arch/arm64/mapfile.csv
> +++ b/tools/perf/pmu-events/arch/arm64/mapfile.csv
> @@ -14,4 +14,5 @@
>  #Family-model,Version,Filename,EventType
>  0x410fd03[[:xdigit:]],v1,arm/cortex-a53,core
>  0x420f5160,v1,cavium/thunderx2,core
> +0x430f0af0,v1,cavium/thunderx2,core
>  0x480fd010,v1,hisilicon/hip08,core

Thanks for the help, Arnaldo!

thanks
ganapat

Re: [PATCH] perf vendor events arm64: Enable JSON events for ThunderX2 B0

2018-03-09 Thread Ganapatrao Kulkarni

Hi Arnaldo,

On Fri, Mar 9, 2018 at 11:03 PM, Arnaldo Carvalho de Melo
<arnaldo.m...@gmail.com> wrote:
> Em Fri, Mar 09, 2018 at 03:58:09PM +, Will Deacon escreveu:
>> On Fri, Mar 09, 2018 at 11:34:15AM -0300, Arnaldo Carvalho de Melo wrote:
>> > Em Fri, Mar 09, 2018 at 07:57:04PM +0530, Ganapatrao Kulkarni escreveu:
>> > > Hi Arnaldo,
>> > >
>> > > can you please pull-in this patch?
>> >
>> > So everybody is Ok with this? Can I have some Acked-by: from subject
>> > matter experts?
>>
>> The original patch looks fine to me:
>>
>> Acked-by: Will Deacon <will.dea...@arm.com>
>
> Ok, so, as John mentioned in another message in his thread, this
> conflicts with a series he sent and that Ganapatrao acked and I already
> applied, I looked at it and think this patch should turn into the patch
> at the end of this message, which I'm applying with the commit log
> message in the original patch in this thread, Ack?

>
> - Arnaldo
>
> diff --git a/tools/perf/pmu-events/arch/arm64/mapfile.csv b/tools/perf/pmu-events/arch/arm64/mapfile.csv
> index 8f11aeb003a9..624e8cae6e86 100644
> --- a/tools/perf/pmu-events/arch/arm64/mapfile.csv
> +++ b/tools/perf/pmu-events/arch/arm64/mapfile.csv
> @@ -13,5 +13,5 @@
>  #
>  #Family-model,Version,Filename,EventType
>  0x410fd03[[:xdigit:]],v1,arm/cortex-a53,core
> -0x420f5160,v1,cavium/thunderx2,core
> +0x430f0af0,v1,cavium/thunderx2,core
>  0x480fd010,v1,hisilicon/hip08,core

Please do not delete the existing entry; add an additional entry.
> +0x430f0af0,v1,cavium/thunderx2,core


thanks
Ganapat

Re: [PATCH v3 09/11] perf vendor events arm64: fixup ThunderX2 to use recommended events

2018-03-09 Thread Ganapatrao Kulkarni

Hi John,

On Thu, Mar 8, 2018 at 4:28 PM, John Garry <john.ga...@huawei.com> wrote:
> This patch fixes the Cavium ThunderX2 JSON to use event definitions
> from the ARMv8 recommended events.
>
> Cc: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
> Signed-off-by: John Garry <john.ga...@huawei.com>
> ---
>  .../arch/arm64/cavium/thunderx2/core-imp-def.json  | 50 
> +-
>  1 file changed, 10 insertions(+), 40 deletions(-)
>
> diff --git a/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json b/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
> index 2db45c4..bc03c06 100644
> --- a/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
> +++ b/tools/perf/pmu-events/arch/arm64/cavium/thunderx2/core-imp-def.json
> @@ -1,62 +1,32 @@
>  [
>  {
> -"PublicDescription": "Attributable Level 1 data cache access, read",
> -"EventCode": "0x40",
> -"EventName": "l1d_cache_rd",
> -"BriefDescription": "L1D cache read",
> +"ArchStdEvent": "L1D_CACHE_RD",
>  },
>  {
> -"PublicDescription": "Attributable Level 1 data cache access, write ",
> -"EventCode": "0x41",
> -"EventName": "l1d_cache_wr",
> -"BriefDescription": "L1D cache write",
> +"ArchStdEvent": "L1D_CACHE_WR",
>  },
>  {
> -"PublicDescription": "Attributable Level 1 data cache refill, read",
> -"EventCode": "0x42",
> -"EventName": "l1d_cache_refill_rd",
> -"BriefDescription": "L1D cache refill read",
> +"ArchStdEvent": "L1D_CACHE_REFILL_RD",
>  },
>  {
> -"PublicDescription": "Attributable Level 1 data cache refill, write",
> -"EventCode": "0x43",
> -"EventName": "l1d_cache_refill_wr",
> -"BriefDescription": "L1D refill write",
> +"ArchStdEvent": "L1D_CACHE_REFILL_WR",
>  },
>  {
> -"PublicDescription": "Attributable Level 1 data TLB refill, read",
> -"EventCode": "0x4C",
> -"EventName": "l1d_tlb_refill_rd",
> -"BriefDescription": "L1D tlb refill read",
> +"ArchStdEvent": "L1D_TLB_REFILL_RD",
>  },
>  {
> -"PublicDescription": "Attributable Level 1 data TLB refill, write",
> -"EventCode": "0x4D",
> -"EventName": "l1d_tlb_refill_wr",
> -"BriefDescription": "L1D tlb refill write",
> +"ArchStdEvent": "L1D_TLB_REFILL_WR",
>  },
>  {
> -"PublicDescription": "Attributable Level 1 data or unified TLB access, read",
> -"EventCode": "0x4E",
> -"EventName": "l1d_tlb_rd",
> -"BriefDescription": "L1D tlb read",
> +"ArchStdEvent": "L1D_TLB_RD",
>  },
>  {
> -"PublicDescription": "Attributable Level 1 data or unified TLB access, write",
> -"EventCode": "0x4F",
> -"EventName": "l1d_tlb_wr",
> -"BriefDescription": "L1D tlb write",
> +"ArchStdEvent": "L1D_TLB_WR",
>  },
>  {
> -"PublicDescription": "Bus access read",
> -"EventCode": "0x60",
> -"EventName": "bus_access_rd",
> -"BriefDescription": "Bus access read",
> +"ArchStdEvent": "BUS_ACCESS_RD",
> },
> {
> -"PublicDescription": "Bus access write",
> -"EventCode": "0x61",
> -"EventName": "bus_access_wr",
> -"BriefDescription": "Bus access write",
> +"ArchStdEvent": "BUS_ACCESS_WR",
> }
>  ]

This patch looks OK to me.
I have tried a few events on ThunderX2 and they work fine.

Tested-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>

> --
> 1.9.1
>

thanks
Ganapat
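
The quoted patch swaps explicit event definitions for ArchStdEvent references. As a sketch of what each reference has to resolve to, the C table below pairs every name with the EventCode from the lines the patch removes. The struct and the lookup function are hypothetical; how perf itself resolves these names is not shown here.

```c
#include <stddef.h>
#include <string.h>

/* Name/code pair, with codes taken from the removed JSON entries. */
struct arch_std_event {
    const char *name;
    unsigned int code;
};

static const struct arch_std_event thunderx2_events[] = {
    { "L1D_CACHE_RD",        0x40 },
    { "L1D_CACHE_WR",        0x41 },
    { "L1D_CACHE_REFILL_RD", 0x42 },
    { "L1D_CACHE_REFILL_WR", 0x43 },
    { "L1D_TLB_REFILL_RD",   0x4C },
    { "L1D_TLB_REFILL_WR",   0x4D },
    { "L1D_TLB_RD",          0x4E },
    { "L1D_TLB_WR",          0x4F },
    { "BUS_ACCESS_RD",       0x60 },
    { "BUS_ACCESS_WR",       0x61 },
};

/* Return the event code for `name`, or -1 if the name is not in the table. */
static int arch_std_event_code(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(thunderx2_events) / sizeof(thunderx2_events[0]); i++) {
        if (strcmp(thunderx2_events[i].name, name) == 0)
            return (int)thunderx2_events[i].code;
    }
    return -1;
}
```

Keeping only the name in the per-CPU JSON means the code and descriptions are defined once, which is why the patch shrinks the file from 40 deleted lines to 10 added ones.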

Re: [PATCH] perf vendor events arm64: Enable JSON events for ThunderX2 B0

2018-03-09 Thread Ganapatrao Kulkarni

Hi Arnaldo,

Can you please pull in this patch?

On Thu, Mar 8, 2018 at 9:44 AM, Ganapatrao Kulkarni <gklkm...@gmail.com> wrote:
> On Thu, Mar 8, 2018 at 12:01 AM, William Cohen <wco...@redhat.com> wrote:
>> On 03/07/2018 12:35 PM, Ganapatrao Kulkarni wrote:
>>> Hi Will Cohen,
>>>
>>> On Wed, Mar 7, 2018 at 8:08 PM, Arnaldo Carvalho de Melo
>>> <a...@kernel.org> wrote:
>>>> Em Wed, Mar 07, 2018 at 09:32:05AM -0500, William Cohen escreveu:
>>>>> On 03/07/2018 06:08 AM, Ganapatrao Kulkarni wrote:
>>>>>> There is MIDR change on ThunderX2 B0, adding an entry to mapfile
>>>>>> to enable JSON events for B0.
>>>>>>
>>>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>>>>
>>>> Ganapatrao, can you please take this in consideration and if agreeing
>>>> send a v2 patch?
>>>>
>>>> With that I can add an Acked-by: wcohen, Right?
>>>>
>>>> - Arnaldo
>>>>>> ---
>>>>>>  tools/perf/pmu-events/arch/arm64/mapfile.csv | 1 +
>>>>>>  1 file changed, 1 insertion(+)
>>>>>>
>>>>>> diff --git a/tools/perf/pmu-events/arch/arm64/mapfile.csv b/tools/perf/pmu-events/arch/arm64/mapfile.csv
>>>>>> index e61c9ca..93c5d14 100644
>>>>>> --- a/tools/perf/pmu-events/arch/arm64/mapfile.csv
>>>>>> +++ b/tools/perf/pmu-events/arch/arm64/mapfile.csv
>>>>>> @@ -13,4 +13,5 @@
>>>>>>  #
>>>>>>  #Family-model,Version,Filename,EventType
>>>>>>  0x420f5160,v1,cavium,core
>>>>>> +0x430f0af0,v1,cavium,core
>>>>>>  0x410fd03[[:xdigit:]],v1,cortex-a53,core
>>>>>>
>>>>>
>>>>> Hi,
>>>>> Like the cortex-a53 the last digit '0' of the match for the MIDR should 
>>>>> be replaced with [[:xdigit:]] to allow for possible future revisions of 
>>>>> chip:
>>>
>>> for arm64 implementation,  bits 3:0(Revision) and bits 23:20(Variant)
>>> are ignored/dont-care.
>>
>> Thanks for pointing that out.  See the code masking out those bits in 
>> linux/toos/perf/arch/util/header.c. For the ppc64 it just copies the 
>> equivalent of the MIDR including the revision bits. Thus, the need for 
>> regular expression matching to avoid having to create a new entry for each 
>> revision.
>
> It is same for arm64 too, there is no need to add an entry for every
> revision change,  need to add when part number changes.
> This patch is not intended to add entry for revision change, the fact
> of the matter is that, there  is complete MIDR change (vulcan to
> thunderx2) in B0.
> as per current arm64
> implementation(.tools/perf/arch/arm64/util/header.c), it is not
> required to have any dontcare marking in mapfile for revision/variant
> bits.
>
> thanks
> Ganapat
>
>>
>> -Will
>>
>>>
>>>>>
>>>>> 0x430f0af[[:xdigit:]],v1,cavium,core
>>>>>
>>>>>
>>>>> -Will Cohen
>>>>
>>>
>>> thanks
>>> Ganapat
>>>> ___
>>>> linux-arm-kernel mailing list
>>>> linux-arm-ker...@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>

thanks
Ganapat

Re: [PATCH] perf vendor events arm64: Enable JSON events for ThunderX2 B0

2018-03-07 Thread Ganapatrao Kulkarni

On Thu, Mar 8, 2018 at 12:01 AM, William Cohen <wco...@redhat.com> wrote:
> On 03/07/2018 12:35 PM, Ganapatrao Kulkarni wrote:
>> Hi Will Cohen,
>>
>> On Wed, Mar 7, 2018 at 8:08 PM, Arnaldo Carvalho de Melo
>> <a...@kernel.org> wrote:
>>> Em Wed, Mar 07, 2018 at 09:32:05AM -0500, William Cohen escreveu:
>>>> On 03/07/2018 06:08 AM, Ganapatrao Kulkarni wrote:
>>>>> There is MIDR change on ThunderX2 B0, adding an entry to mapfile
>>>>> to enable JSON events for B0.
>>>>>
>>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>>>
>>> Ganapatrao, can you please take this in consideration and if agreeing
>>> send a v2 patch?
>>>
>>> With that I can add an Acked-by: wcohen, Right?
>>>
>>> - Arnaldo
>>>>> ---
>>>>>  tools/perf/pmu-events/arch/arm64/mapfile.csv | 1 +
>>>>>  1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/tools/perf/pmu-events/arch/arm64/mapfile.csv b/tools/perf/pmu-events/arch/arm64/mapfile.csv
>>>>> index e61c9ca..93c5d14 100644
>>>>> --- a/tools/perf/pmu-events/arch/arm64/mapfile.csv
>>>>> +++ b/tools/perf/pmu-events/arch/arm64/mapfile.csv
>>>>> @@ -13,4 +13,5 @@
>>>>>  #
>>>>>  #Family-model,Version,Filename,EventType
>>>>>  0x420f5160,v1,cavium,core
>>>>> +0x430f0af0,v1,cavium,core
>>>>>  0x410fd03[[:xdigit:]],v1,cortex-a53,core
>>>>>
>>>>
>>>> Hi,
>>>> Like the cortex-a53 the last digit '0' of the match for the MIDR should be 
>>>> replaced with [[:xdigit:]] to allow for possible future revisions of chip:
>>
>> for arm64 implementation,  bits 3:0(Revision) and bits 23:20(Variant)
>> are ignored/dont-care.
>
> Thanks for pointing that out.  See the code masking out those bits in 
> linux/toos/perf/arch/util/header.c. For the ppc64 it just copies the 
> equivalent of the MIDR including the revision bits. Thus, the need for 
> regular expression matching to avoid having to create a new entry for each 
> revision.

It is the same for arm64 too: there is no need to add an entry for every
revision change; an entry is needed only when the part number changes.
This patch is not intended to add an entry for a revision change. The fact
of the matter is that there is a complete MIDR change (Vulcan to
ThunderX2) in B0.
As per the current arm64 implementation
(tools/perf/arch/arm64/util/header.c), it is not required to have any
don't-care marking in the mapfile for the revision/variant bits.

thanks
Ganapat

>
> -Will
>
>>
>>>>
>>>> 0x430f0af[[:xdigit:]],v1,cavium,core
>>>>
>>>>
>>>> -Will Cohen
>>>
>>
>> thanks
>> Ganapat
>>> ___
>>> linux-arm-kernel mailing list
>>> linux-arm-ker...@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
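
The point made in this message, that Revision (bits 3:0) and Variant (bits 23:20) are don't-care on arm64, can be sketched in C as a simple mask. The macro and function names below are local to this example and are not copied from the perf sources.

```c
#include <stdint.h>

/* MIDR fields that are ignored when matching, per the discussion above. */
#define EX_MIDR_REVISION_MASK 0x0000000fu   /* bits 3:0 */
#define EX_MIDR_VARIANT_MASK  0x00f00000u   /* bits 23:20 */

/* Clear the Revision and Variant fields, keeping implementer and part number. */
static uint32_t midr_strip_rev_var(uint32_t midr)
{
    return midr & ~(EX_MIDR_REVISION_MASK | EX_MIDR_VARIANT_MASK);
}
```

For example, 0x431f0af1 (a made-up later variant and revision) reduces to 0x430f0af0, the value listed in the mapfile, so no pattern is needed for it. 0x420f5160 and 0x430f0af0 remain distinct after masking, which is why B0 needs its own entry.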

Re: [PATCH] perf vendor events arm64: Enable JSON events for ThunderX2 B0

2018-03-07 Thread Ganapatrao Kulkarni

Hi Will Cohen,

On Wed, Mar 7, 2018 at 8:08 PM, Arnaldo Carvalho de Melo
<a...@kernel.org> wrote:
> Em Wed, Mar 07, 2018 at 09:32:05AM -0500, William Cohen escreveu:
>> On 03/07/2018 06:08 AM, Ganapatrao Kulkarni wrote:
>> > There is MIDR change on ThunderX2 B0, adding an entry to mapfile
>> > to enable JSON events for B0.
>> >
>> > Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>
> Ganapatrao, can you please take this in consideration and if agreeing
> send a v2 patch?
>
> With that I can add an Acked-by: wcohen, Right?
>
> - Arnaldo
>> > ---
>> >  tools/perf/pmu-events/arch/arm64/mapfile.csv | 1 +
>> >  1 file changed, 1 insertion(+)
>> >
>> > diff --git a/tools/perf/pmu-events/arch/arm64/mapfile.csv 
>> > b/tools/perf/pmu-events/arch/arm64/mapfile.csv
>> > index e61c9ca..93c5d14 100644
>> > --- a/tools/perf/pmu-events/arch/arm64/mapfile.csv
>> > +++ b/tools/perf/pmu-events/arch/arm64/mapfile.csv
>> > @@ -13,4 +13,5 @@
>> >  #
>> >  #Family-model,Version,Filename,EventType
>> >  0x420f5160,v1,cavium,core
>> > +0x430f0af0,v1,cavium,core
>> >  0x410fd03[[:xdigit:]],v1,cortex-a53,core
>> >
>>
>> Hi,
>> Like the cortex-a53 the last digit '0' of the match for the MIDR should be 
>> replaced with [[:xdigit:]] to allow for possible future revisions of chip:

For the arm64 implementation, bits 3:0 (Revision) and bits 23:20 (Variant)
are ignored (don't-care).
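
A minimal sketch of that masking, assuming the architectural MIDR_EL1
layout (Revision in bits 3:0, Variant in bits 23:20). The macro and function
names below are made up for illustration and are not taken from the perf
sources.

```c
#include <stdint.h>

/* Illustrative names; field positions follow the MIDR_EL1 layout. */
#define MIDR_REVISION_MASK 0x0000000fu /* bits 3:0   */
#define MIDR_VARIANT_MASK  0x00f00000u /* bits 23:20 */

/* Clear Revision and Variant so all steppings of a part compare equal. */
static uint32_t midr_strip_rev_var(uint32_t midr)
{
    return midr & ~(MIDR_REVISION_MASK | MIDR_VARIANT_MASK);
}
```

With this, a hypothetical later stepping such as 0x431f0af2 reduces to
0x430f0af0, which is the value listed in the mapfile entry.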

>>
>> 0x430f0af[[:xdigit:]],v1,cavium,core
>>
>>
>> -Will Cohen
>

thanks
Ganapat

[PATCH] perf vendor events arm64: Enable JSON events for ThunderX2 B0

2018-03-07 Thread Ganapatrao Kulkarni

There is MIDR change on ThunderX2 B0, adding an entry to mapfile
to enable JSON events for B0.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
---
 tools/perf/pmu-events/arch/arm64/mapfile.csv | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/pmu-events/arch/arm64/mapfile.csv 
b/tools/perf/pmu-events/arch/arm64/mapfile.csv
index e61c9ca..93c5d14 100644
--- a/tools/perf/pmu-events/arch/arm64/mapfile.csv
+++ b/tools/perf/pmu-events/arch/arm64/mapfile.csv
@@ -13,4 +13,5 @@
 #
 #Family-model,Version,Filename,EventType
 0x420f5160,v1,cavium,core
+0x430f0af0,v1,cavium,core
 0x410fd03[[:xdigit:]],v1,cortex-a53,core
-- 
2.9.4
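
For readers unfamiliar with the mapfile layout, the sketch below splits one
line into the four columns named in the header comment
(Family-model, Version, Filename, EventType). The struct and function names
are hypothetical and the buffer sizes are arbitrary; this is not how
jevents parses the file, only an illustration of the format.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one mapfile.csv row. */
struct map_entry {
    char cpuid[64];   /* Family-model column, may be a regex */
    char version[16]; /* Version column, e.g. "v1" */
    char dir[64];     /* Filename column, e.g. "cavium" */
    char type[16];    /* EventType column, e.g. "core" */
};

/* Returns 0 on success, -1 for comment, blank or malformed lines. */
static int parse_mapfile_line(const char *line, struct map_entry *e)
{
    if (line[0] == '#' || line[0] == '\n' || line[0] == '\0')
        return -1;

    /* Four comma-separated fields; the last one ends at newline. */
    if (sscanf(line, "%63[^,],%15[^,],%63[^,],%15[^,\n]",
               e->cpuid, e->version, e->dir, e->type) != 4)
        return -1;

    return 0;
}
```

Feeding it the added line "0x430f0af0,v1,cavium,core" yields the cpuid
0x430f0af0 and the directory cavium, i.e. the B0 part reuses the existing
Cavium JSON events.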

Re: [PATCH v2 00/11] perf events patches for improved ARM64 support

2018-03-02 Thread Ganapatrao Kulkarni

Hi John,

On Fri, Mar 2, 2018 at 9:35 PM, William Cohen  wrote:
> On 03/02/2018 03:24 AM, John Garry wrote:
>> On 27/02/2018 09:50, Jiri Olsa wrote:
>>> On Sat, Feb 24, 2018 at 12:05:21AM +0800, John Garry wrote:
>>>> This patchset adds support for some perf events features,
>>>> targeted at ARM64, implemented in a generic fashion.
>>>>
>>>> The two main features are as follows:
>>>> - support for arch/vendor/platform pmu events directory structure
>>>>    - to support this, topic subdirectory support needs to be dropped
>>>> - support for parsing standard architecture pmu events
>>>>
>>>> On the back of these, the Cavium ThunderX2, ARM Cortex-A53,
>>>> and HiSilicon hip08 JSONs are relocated/added/updated.
>>>>
>>>> In addition, there is a patch to drop mutli-mapfile.csv support and
>>>> also a bugfix in jevents.c for an error code value.
>>>>
>>>> Differences to v1:
>>>> - Address coding issues from Jiri Olsa in adding arch std event
>>>>   support (https://lkml.org/lkml/2018/2/6/501)
>>>> - add patch to drop topic subdirectory support
>>>> - add patch for bug fix in json_events()
>>>> - add review tags from Jiri Olsa
>>>
>>> can't tell if those json file changes are ok, but for all the code changes:
>>>
>>
>> Hi William, Ganapatrao,
>>
>> Can you check the modifications to the ARM64 JSONs you originally submitted 
>> in the patchset please?
>>
>> If they are not checked, I'll have to see if the maintainers will accept 
>> without your review. If not, I'll have to drop them.

I am seeing an issue (log below) with this patchset on our platform.
I tried using your v2 branch [1].

root@borg-1>perf_acme>> ./perf --version
perf version 4.16.rc1.g087f7ca
root@borg-1>perf_acme>> ./perf stat -e bus_access_rd sleep 1

 Performance counter stats for 'sleep 1':

23,099  bus_access_rd

   1.000708516 seconds time elapsed

root@borg-1>perf_acme>> cd -
/ganapat/perf/linux-hisi/tools/perf
root@borg-1>perf>> ./perf --version
perf version 4.16.rc1.gcb5a74
root@borg-1>perf>> ./perf stat -e bus_access_rd sleep 1

 Performance counter stats for 'sleep 1':

 0  bus_access_rd

   1.000709162 seconds time elapsed

root@borg-1>perf>>


[1] https://github.com/hisilicon/linux-hisi.git

>
> Hi John,
>
> I will take a look at the patches this weekend and give feedback beginning of 
> next week. -Will
>
>>
>> Thanks,
>> John
>>
>>> Acked-by: Jiri Olsa 
>>>
>>> thanks,
>>> jirka
>>>
>>> .

thanks
Ganapat
>>>
>>
>>
>
>

Re: [PATCH v2 00/11] perf events patches for improved ARM64 support

2018-03-02 Thread Ganapatrao Kulkarni

Hi John,

On Fri, Mar 2, 2018 at 1:54 PM, John Garry  wrote:
> On 27/02/2018 09:50, Jiri Olsa wrote:
>>
>> On Sat, Feb 24, 2018 at 12:05:21AM +0800, John Garry wrote:
>>>
>>> This patchset adds support for some perf events features,
>>> targeted at ARM64, implemented in a generic fashion.
>>>
>>> The two main features are as follows:
>>> - support for arch/vendor/platform pmu events directory structure
>>>- to support this, topic subdirectory support needs to be dropped
>>> - support for parsing standard architecture pmu events
>>>
>>> On the back of these, the Cavium ThunderX2, ARM Cortex-A53,
>>> and HiSilicon hip08 JSONs are relocated/added/updated.
>>>
>>> In addition, there is a patch to drop mutli-mapfile.csv support and
>>> also a bugfix in jevents.c for an error code value.
>>>
>>> Differences to v1:
>>> - Address coding issues from Jiri Olsa in adding arch std event
>>>support (https://lkml.org/lkml/2018/2/6/501)
>>> - add patch to drop topic subdirectory support
>>> - add patch for bug fix in json_events()
>>> - add review tags from Jiri Olsa
>>
>>
>> can't tell if those json file changes are ok, but for all the code
>> changes:
>>
>
> Hi William, Ganapatrao,
>
> Can you check the modifications to the ARM64 JSONs you originally submitted
> in the patchset please?

Sorry, I missed this. I will go through this series and
share my feedback in a couple of days.

>
> If they are not checked, I'll have to see if the maintainers will accept
> without your review. If not, I'll have to drop them.
>
> Thanks,
> John
>
>> Acked-by: Jiri Olsa 
>>
>> thanks,
>> jirka
>>
>> .

thanks
Ganapat
>>
>
>
>

Re: [PATCH v2] irqchip/gic-v3-its: Add workaround for ThunderX2 erratum #174

2018-01-19 Thread Ganapatrao Kulkarni

On Fri, Jan 19, 2018 at 5:53 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
> On 18/01/18 05:28, Ganapatrao Kulkarni wrote:
>> This erratum is observed on the ThunderX2 GICv3 ITS. When a
>> MOVI command is used to change affinity of a LPI to a collection/cpu
>> on another node, the LPI is not delivered to the cpu.
>> An additional INV command is required after the MOVI to deliver
>> the LPI to the new destination.
>>
>> If we add INV after MOVI, there is a chance that we lose LPIs which
>> are raised when the affinity is changed. So for now, adding workaround fix
>> to disable inter node affinity change.
>>
>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>> ---
>>
>> v2: Added workaround to avoid inter node affinity change.
>>
>> v1: Initial patch
>>
>>  Documentation/arm64/silicon-errata.txt |  1 +
>>  arch/arm64/Kconfig | 10 ++
>>  drivers/irqchip/irq-gic-v3-its.c   | 21 -
>>  3 files changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/arm64/silicon-errata.txt 
>> b/Documentation/arm64/silicon-errata.txt
>> index fc1c884..fb27cb5 100644
>> --- a/Documentation/arm64/silicon-errata.txt
>> +++ b/Documentation/arm64/silicon-errata.txt
>> @@ -63,6 +63,7 @@ stable kernels.
>>  | Cavium         | ThunderX Core   | #27456          | CAVIUM_ERRATUM_27456        |
>>  | Cavium         | ThunderX Core   | #30115          | CAVIUM_ERRATUM_30115        |
>>  | Cavium         | ThunderX SMMUv2 | #27704          | N/A                         |
>> +| Cavium         | ThunderX2 ITS   | #174            | CAVIUM_ERRATUM_174          |
>>  | Cavium         | ThunderX2 SMMUv3| #74             | N/A                         |
>>  | Cavium         | ThunderX2 SMMUv3| #126            | N/A                         |
>>  |                |                 |                 |                             |
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index c9a7e9e..0dbf3bd 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -461,6 +461,16 @@ config ARM64_ERRATUM_843419
>>
>> If unsure, say Y.
>>
>> +config CAVIUM_ERRATUM_174
>> + bool "Cavium ThunderX2 erratum 174"
>> + default y
>> + help
>> +   Cavium ThunderX2 dual socket systems may loose interrupts
>> +   on affinity change to a cpu on other node.
>> +   This workaround fix avoids inter node affinity change.
>> +
>> +   If unsure, say Y.
>> +
>>  config CAVIUM_ERRATUM_22375
>>   bool "Cavium erratum 22375, 24313"
>>   default y
>> diff --git a/drivers/irqchip/irq-gic-v3-its.c 
>> b/drivers/irqchip/irq-gic-v3-its.c
>> index 06f025f..b0cb528 100644
>> --- a/drivers/irqchip/irq-gic-v3-its.c
>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>> @@ -46,6 +46,7 @@
>>  #define ITS_FLAGS_CMDQ_NEEDS_FLUSHING(1ULL << 0)
>>  #define ITS_FLAGS_WORKAROUND_CAVIUM_22375(1ULL << 1)
>>  #define ITS_FLAGS_WORKAROUND_CAVIUM_23144(1ULL << 2)
>> +#define ITS_FLAGS_WORKAROUND_CAVIUM_174  (1ULL << 3)
>
> Instead of inventing a new flag, please rename the existing one to
> ITS_FLAG_WORKAROUND_RESTRICT_NODE_AFFINITY (or something similar). There
> is really no need to have two flags that do the exact same thing,

#23144 is also used to restrict the ITS-to-collection mapping,
whereas #174 only restricts cross-node affinity.
Having said that, since we are restricting affinity for #174,
I see no use in mapping an ITS to collections on the other node.
There should not be any issue if we merge the flags. I will post this
change in the next version.
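
A rough sketch of what sharing one flag could look like, written as
standalone C rather than as a driver patch. The flag name is only one
possible spelling of the reviewer's suggestion, and struct its_node_sketch
and the helper names are hypothetical stand-ins, not the driver's types.

```c
#include <stdbool.h>
#include <stdint.h>

#define ITS_FLAGS_CMDQ_NEEDS_FLUSHING               (1ULL << 0)
#define ITS_FLAGS_WORKAROUND_CAVIUM_22375           (1ULL << 1)
/* Hypothetical merged name replacing the per-erratum 23144 flag. */
#define ITS_FLAGS_WORKAROUND_RESTRICT_NODE_AFFINITY (1ULL << 2)

/* Stand-in for the driver's struct its_node; only the fields used here. */
struct its_node_sketch {
    uint64_t flags;
    int numa_node; /* negative when the node is unknown */
};

/* Both errata enable the same behaviour, so both set the same bit. */
static void enable_quirk_23144(struct its_node_sketch *its)
{
    its->flags |= ITS_FLAGS_WORKAROUND_RESTRICT_NODE_AFFINITY;
}

static void enable_quirk_174(struct its_node_sketch *its)
{
    its->flags |= ITS_FLAGS_WORKAROUND_RESTRICT_NODE_AFFINITY;
}

/* Single test at the use site instead of OR-ing two flag checks. */
static bool its_restricts_node_affinity(const struct its_node_sketch *its)
{
    return (its->flags & ITS_FLAGS_WORKAROUND_RESTRICT_NODE_AFFINITY) &&
           its->numa_node >= 0;
}
```

The use site then needs one check regardless of which erratum set the bit,
which is the simplification being asked for in the review.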

>
>>
>>  #define RDIST_FLAGS_PROPBASE_NEEDS_FLUSHING  (1 << 0)
>>
>> @@ -1102,7 +1103,8 @@ static int its_set_affinity(struct irq_data *d, const 
>> struct cpumask *mask_val,
>>   return -EINVAL;
>>
>> /* lpi cannot be routed to a redistributor that is on a foreign node 
>> */
>> - if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144) {
>> + if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144 ||
>> + its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_174) 
>> {
>>   if (its_dev->its->numa_node >= 0) {
>>   cpu_mask = cpumask_of_node(its_dev->its->numa_node);
>>   if (!cp

[PATCH v2] irqchip/gic-v3-its: Add workaround for ThunderX2 erratum #174

2018-01-17 Thread Ganapatrao Kulkarni

This erratum is observed on the ThunderX2 GICv3 ITS. When a
MOVI command is used to change affinity of a LPI to a collection/cpu
on another node, the LPI is not delivered to the cpu.
An additional INV command is required after the MOVI to deliver
the LPI to the new destination.

If we add INV after MOVI, there is a chance that we lose LPIs which
are raised when the affinity is changed. So for now, adding workaround fix
to disable inter node affinity change.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
---

v2: Added workaround to avoid inter node affinity change.

v1: Initial patch

 Documentation/arm64/silicon-errata.txt |  1 +
 arch/arm64/Kconfig | 10 ++
 drivers/irqchip/irq-gic-v3-its.c   | 21 -
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/Documentation/arm64/silicon-errata.txt 
b/Documentation/arm64/silicon-errata.txt
index fc1c884..fb27cb5 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -63,6 +63,7 @@ stable kernels.
 | Cavium         | ThunderX Core   | #27456          | CAVIUM_ERRATUM_27456        |
 | Cavium         | ThunderX Core   | #30115          | CAVIUM_ERRATUM_30115        |
 | Cavium         | ThunderX SMMUv2 | #27704          | N/A                         |
+| Cavium         | ThunderX2 ITS   | #174            | CAVIUM_ERRATUM_174          |
 | Cavium         | ThunderX2 SMMUv3| #74             | N/A                         |
 | Cavium         | ThunderX2 SMMUv3| #126            | N/A                         |
 |                |                 |                 |                             |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c9a7e9e..0dbf3bd 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -461,6 +461,16 @@ config ARM64_ERRATUM_843419
 
  If unsure, say Y.
 
+config CAVIUM_ERRATUM_174
+   bool "Cavium ThunderX2 erratum 174"
+   default y
+   help
+ Cavium ThunderX2 dual socket systems may loose interrupts
+ on affinity change to a cpu on other node.
+ This workaround fix avoids inter node affinity change.
+
+ If unsure, say Y.
+
 config CAVIUM_ERRATUM_22375
bool "Cavium erratum 22375, 24313"
default y
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 06f025f..b0cb528 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -46,6 +46,7 @@
 #define ITS_FLAGS_CMDQ_NEEDS_FLUSHING  (1ULL << 0)
 #define ITS_FLAGS_WORKAROUND_CAVIUM_22375  (1ULL << 1)
 #define ITS_FLAGS_WORKAROUND_CAVIUM_23144  (1ULL << 2)
+#define ITS_FLAGS_WORKAROUND_CAVIUM_174(1ULL << 3)
 
 #define RDIST_FLAGS_PROPBASE_NEEDS_FLUSHING(1 << 0)
 
@@ -1102,7 +1103,8 @@ static int its_set_affinity(struct irq_data *d, const 
struct cpumask *mask_val,
return -EINVAL;
 
/* lpi cannot be routed to a redistributor that is on a foreign node */
-   if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144) {
+   if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144 ||
+   its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_174) {
if (its_dev->its->numa_node >= 0) {
cpu_mask = cpumask_of_node(its_dev->its->numa_node);
if (!cpumask_intersects(mask_val, cpu_mask))
@@ -2904,6 +2906,15 @@ static int its_force_quiescent(void __iomem *base)
}
 }
 
+static bool __maybe_unused its_enable_quirk_cavium_174(void *data)
+{
+   struct its_node *its = data;
+
+   its->flags |= ITS_FLAGS_WORKAROUND_CAVIUM_174;
+
+   return true;
+}
+
 static bool __maybe_unused its_enable_quirk_cavium_22375(void *data)
 {
struct its_node *its = data;
@@ -3031,6 +3042,14 @@ static const struct gic_quirk its_quirks[] = {
.init   = its_enable_quirk_hip07_161600802,
},
 #endif
+#ifdef CONFIG_CAVIUM_ERRATUM_174
+   {
+   .desc   = "ITS: Cavium ThunderX2 erratum 174",
+   .iidr   = 0x13f,/* ThunderX2 pass A1/A2/B0 */
+   .mask   = 0x,
+   .init   = its_enable_quirk_cavium_174,
+   },
+#endif
{
}
 };
-- 
2.9.4
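
The effect of the workaround can be modelled outside the kernel as follows.
This is a toy sketch under stated assumptions: CPUs are represented as bits
in a 64-bit word instead of struct cpumask, the names are hypothetical, and
narrowing the request to the local node's CPUs is an assumption added for
illustration (the hunk above only shows the intersection test).

```c
#include <errno.h>
#include <stdint.h>

/* Toy model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_sketch_t;

/*
 * Returns 0 and stores the usable CPUs in *out, or -EINVAL when the
 * request names no CPU on the ITS's own node (a cross-node move).
 */
static int restrict_to_node(cpumask_sketch_t requested,
                            cpumask_sketch_t node_cpus,
                            cpumask_sketch_t *out)
{
    if ((requested & node_cpus) == 0)
        return -EINVAL;

    *out = requested & node_cpus;
    return 0;
}
```

With node 0 owning CPUs 0-15 (mask 0x0000ffff), a request for CPU 16 only is
rejected, while a request for CPUs 0 and 16 is served from CPU 0, so the LPI
never has to be moved to the other socket.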

Re: [PATCH] irqchip/gic-v3-its: Add workaround for ThunderX2 erratum #174

2018-01-03 Thread Ganapatrao Kulkarni

On Wed, Jan 3, 2018 at 5:06 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
> On 03/01/18 11:20, Ganapatrao Kulkarni wrote:
>> On Wed, Jan 3, 2018 at 3:43 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
>>> On 03/01/18 09:35, Ganapatrao Kulkarni wrote:
>>>> Hi Marc,
>>>>
>>>> On Wed, Jan 3, 2018 at 2:17 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
>>>>> On 03/01/18 06:32, Ganapatrao Kulkarni wrote:
>>>>>> When an interrupt is moved across node collections on ThunderX2
>>>>>
>>>>> node collections?
>>>>
>>>> ok, i will rephrase it.
>>>>  i was intended to say cross NUMA node collection/cpu affinity change.
>>>>
>>>>>
>>>>>> multi Socket platform, an interrupt stops routed to new collection
>>>>>> and results in loss of interrupts.
>>>>>>
>>>>>> Adding workaround to issue INV after MOVI for cross-node collection
>>>>>> move to flush out the cached entry.
>>>>>>
>>>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>>>>>> ---
>>>>>>  Documentation/arm64/silicon-errata.txt |  1 +
>>>>>>  arch/arm64/Kconfig | 11 +++
>>>>>>  drivers/irqchip/irq-gic-v3-its.c   | 24 
>>>>>>  3 files changed, 36 insertions(+)
>>>>>>
>>>>>> diff --git a/Documentation/arm64/silicon-errata.txt 
>>>>>> b/Documentation/arm64/silicon-errata.txt
>>>>>> index fc1c884..fb27cb5 100644
>>>>>> --- a/Documentation/arm64/silicon-errata.txt
>>>>>> +++ b/Documentation/arm64/silicon-errata.txt
>>>>>> @@ -63,6 +63,7 @@ stable kernels.
>>>>>>  | Cavium | ThunderX Core   | #27456  | 
>>>>>> CAVIUM_ERRATUM_27456|
>>>>>>  | Cavium | ThunderX Core   | #30115  | 
>>>>>> CAVIUM_ERRATUM_30115|
>>>>>>  | Cavium | ThunderX SMMUv2 | #27704  | N/A  
>>>>>>|
>>>>>> +| Cavium | ThunderX2 ITS   | #174| 
>>>>>> CAVIUM_ERRATUM_174  |
>>>>>>  | Cavium | ThunderX2 SMMUv3| #74 | N/A  
>>>>>>|
>>>>>>  | Cavium | ThunderX2 SMMUv3| #126| N/A  
>>>>>>|
>>>>>>  || | |  
>>>>>>|
>>>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>>>> index c9a7e9e..71a7e30 100644
>>>>>> --- a/arch/arm64/Kconfig
>>>>>> +++ b/arch/arm64/Kconfig
>>>>>> @@ -461,6 +461,17 @@ config ARM64_ERRATUM_843419
>>>>>>
>>>>>> If unsure, say Y.
>>>>>>
>>>>>> +config CAVIUM_ERRATUM_174
>>>>>> + bool "Cavium ThunderX2 erratum 174"
>>>>>> + depends on NUMA
>>>>>
>>>>> Why? This system will be affected no matter whether NUMA is selected or 
>>>>> not.
>>>>
>>>> it does not makes sense to enable on non-NUMA/single socket platforms.
>>>> By default NUMA is enabled on ThunderX2 dual socket platforms.
>>>
>>> 
>>> config ARCH_THUNDER2
>>> bool "Cavium ThunderX2 Server Processors"
>>> select GPIOLIB
>>> help
>>>   This enables support for Cavium's ThunderX2 CN99XX family of
>>>   server processors.
>>> 
>>>
>>> Do you see any NUMA here? I can perfectly compile a kernel with both
>>> sockets, and not using NUMA. NUMA has to do with memory, and not interrupts.
>>
>> OK, I will remove it.
>>>
>>>>
>>>>>
>>>>>> + default y
>>>>>> + help
>>>>>> +   LPI stops routed to redistributors after inter node collection
>>>>>> +   move in ITS. Enable workaround to invalidate ITS entry after
>>>>>> +   inter-node collection move.
>>>>>
>>>>> That's a very terse description. Nobody knows what an LPI, a
>>>>> redis

Re: [PATCH] irqchip/gic-v3-its: Add workaround for ThunderX2 erratum #174

2018-01-03 Thread Ganapatrao Kulkarni

On Wed, Jan 3, 2018 at 3:43 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
> On 03/01/18 09:35, Ganapatrao Kulkarni wrote:
>> Hi Marc,
>>
>> On Wed, Jan 3, 2018 at 2:17 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
>>> On 03/01/18 06:32, Ganapatrao Kulkarni wrote:
>>>> When an interrupt is moved across node collections on ThunderX2
>>>
>>> node collections?
>>
>> ok, i will rephrase it.
>>  i was intended to say cross NUMA node collection/cpu affinity change.
>>
>>>
>>>> multi Socket platform, an interrupt stops routed to new collection
>>>> and results in loss of interrupts.
>>>>
>>>> Adding workaround to issue INV after MOVI for cross-node collection
>>>> move to flush out the cached entry.
>>>>
>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>>>> ---
>>>>  Documentation/arm64/silicon-errata.txt |  1 +
>>>>  arch/arm64/Kconfig | 11 +++
>>>>  drivers/irqchip/irq-gic-v3-its.c   | 24 
>>>>  3 files changed, 36 insertions(+)
>>>>
>>>> diff --git a/Documentation/arm64/silicon-errata.txt 
>>>> b/Documentation/arm64/silicon-errata.txt
>>>> index fc1c884..fb27cb5 100644
>>>> --- a/Documentation/arm64/silicon-errata.txt
>>>> +++ b/Documentation/arm64/silicon-errata.txt
>>>> @@ -63,6 +63,7 @@ stable kernels.
>>>>  | Cavium | ThunderX Core   | #27456  | 
>>>> CAVIUM_ERRATUM_27456|
>>>>  | Cavium | ThunderX Core   | #30115  | 
>>>> CAVIUM_ERRATUM_30115|
>>>>  | Cavium | ThunderX SMMUv2 | #27704  | N/A
>>>>  |
>>>> +| Cavium | ThunderX2 ITS   | #174| CAVIUM_ERRATUM_174 
>>>>  |
>>>>  | Cavium | ThunderX2 SMMUv3| #74 | N/A
>>>>  |
>>>>  | Cavium | ThunderX2 SMMUv3| #126| N/A
>>>>  |
>>>>  || | |
>>>>  |
>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>> index c9a7e9e..71a7e30 100644
>>>> --- a/arch/arm64/Kconfig
>>>> +++ b/arch/arm64/Kconfig
>>>> @@ -461,6 +461,17 @@ config ARM64_ERRATUM_843419
>>>>
>>>> If unsure, say Y.
>>>>
>>>> +config CAVIUM_ERRATUM_174
>>>> + bool "Cavium ThunderX2 erratum 174"
>>>> + depends on NUMA
>>>
>>> Why? This system will be affected no matter whether NUMA is selected or not.
>>
>> it does not makes sense to enable on non-NUMA/single socket platforms.
>> By default NUMA is enabled on ThunderX2 dual socket platforms.
>
> 
> config ARCH_THUNDER2
> bool "Cavium ThunderX2 Server Processors"
> select GPIOLIB
> help
>   This enables support for Cavium's ThunderX2 CN99XX family of
>   server processors.
> 
>
> Do you see any NUMA here? I can perfectly compile a kernel with both
> sockets, and not using NUMA. NUMA has to do with memory, and not interrupts.

OK, I will remove it.
>
>>
>>>
>>>> + default y
>>>> + help
>>>> +   LPI stops routed to redistributors after inter node collection
>>>> +   move in ITS. Enable workaround to invalidate ITS entry after
>>>> +   inter-node collection move.
>>>
>>> That's a very terse description. Nobody knows what an LPI, a
>>> redistributor or a collection is. Please explain what the erratum is in
>>> layman's terms (Cavium ThunderX2 systems may loose interrupts on
>>> affinity change) so that people understand whether or not they are affected.
>>
>> ok, i will rephrase it in next version.
>>>
>>>> +
>>>> +   If unsure, say Y.
>>>> +
>>>>  config CAVIUM_ERRATUM_22375
>>>>   bool "Cavium erratum 22375, 24313"
>>>>   default y
>>>> diff --git a/drivers/irqchip/irq-gic-v3-its.c 
>>>> b/drivers/irqchip/irq-gic-v3-its.c
>>>> index 06f025f..d8b9c96 100644
>>>> --- a/drivers/irqchip/irq-gic-v3-its.c
>>>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>>>> @@ -46,6 +46,7 @@
>>>>  #define

Re: [PATCH] irqchip/gic-v3-its: Add workaround for ThunderX2 erratum #174

2018-01-03 Thread Ganapatrao Kulkarni

Hi Marc,

On Wed, Jan 3, 2018 at 2:17 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
> On 03/01/18 06:32, Ganapatrao Kulkarni wrote:
>> When an interrupt is moved across node collections on ThunderX2
>
> node collections?

OK, I will rephrase it.
I intended to say a cross-NUMA-node collection/CPU affinity change.

>
>> multi Socket platform, an interrupt stops routed to new collection
>> and results in loss of interrupts.
>>
>> Adding workaround to issue INV after MOVI for cross-node collection
>> move to flush out the cached entry.
>>
>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>> ---
>>  Documentation/arm64/silicon-errata.txt |  1 +
>>  arch/arm64/Kconfig | 11 +++
>>  drivers/irqchip/irq-gic-v3-its.c   | 24 
>>  3 files changed, 36 insertions(+)
>>
>> diff --git a/Documentation/arm64/silicon-errata.txt 
>> b/Documentation/arm64/silicon-errata.txt
>> index fc1c884..fb27cb5 100644
>> --- a/Documentation/arm64/silicon-errata.txt
>> +++ b/Documentation/arm64/silicon-errata.txt
>> @@ -63,6 +63,7 @@ stable kernels.
>>  | Cavium | ThunderX Core   | #27456  | CAVIUM_ERRATUM_27456 
>>|
>>  | Cavium | ThunderX Core   | #30115  | CAVIUM_ERRATUM_30115 
>>|
>>  | Cavium | ThunderX SMMUv2 | #27704  | N/A  
>>|
>> +| Cavium | ThunderX2 ITS   | #174| CAVIUM_ERRATUM_174   
>>|
>>  | Cavium | ThunderX2 SMMUv3| #74 | N/A  
>>|
>>  | Cavium | ThunderX2 SMMUv3| #126| N/A  
>>|
>>  || | |  
>>|
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index c9a7e9e..71a7e30 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -461,6 +461,17 @@ config ARM64_ERRATUM_843419
>>
>> If unsure, say Y.
>>
>> +config CAVIUM_ERRATUM_174
>> + bool "Cavium ThunderX2 erratum 174"
>> + depends on NUMA
>
> Why? This system will be affected no matter whether NUMA is selected or not.

It does not make sense to enable this on non-NUMA/single-socket platforms.
By default, NUMA is enabled on ThunderX2 dual-socket platforms.

>
>> + default y
>> + help
>> +   LPI stops routed to redistributors after inter node collection
>> +   move in ITS. Enable workaround to invalidate ITS entry after
>> +   inter-node collection move.
>
> That's a very terse description. Nobody knows what an LPI, a
> redistributor or a collection is. Please explain what the erratum is in
> layman's terms (Cavium ThunderX2 systems may loose interrupts on
> affinity change) so that people understand whether or not they are affected.

OK, I will rephrase it in the next version.
>
>> +
>> +   If unsure, say Y.
>> +
>>  config CAVIUM_ERRATUM_22375
>>   bool "Cavium erratum 22375, 24313"
>>   default y
>> diff --git a/drivers/irqchip/irq-gic-v3-its.c 
>> b/drivers/irqchip/irq-gic-v3-its.c
>> index 06f025f..d8b9c96 100644
>> --- a/drivers/irqchip/irq-gic-v3-its.c
>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>> @@ -46,6 +46,7 @@
>>  #define ITS_FLAGS_CMDQ_NEEDS_FLUSHING  (1ULL << 0)
>>  #define ITS_FLAGS_WORKAROUND_CAVIUM_22375  (1ULL << 1)
>>  #define ITS_FLAGS_WORKAROUND_CAVIUM_23144  (1ULL << 2)
>> +#define ITS_FLAGS_WORKAROUND_CAVIUM_174  (1ULL << 3)
>>
>>  #define RDIST_FLAGS_PROPBASE_NEEDS_FLUSHING  (1 << 0)
>>
>> @@ -1119,6 +1120,12 @@ static int its_set_affinity(struct irq_data *d, const 
>> struct cpumask *mask_val,
>>   if (cpu != its_dev->event_map.col_map[id]) {
>>   target_col = &its_dev->its->collections[cpu];
>>   its_send_movi(its_dev, target_col, id);
>> + if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_174) {
>> + /* Issue INV for cross node collection move. */
>> + if (cpu_to_node(cpu) !=
>> + cpu_to_node(its_dev->event_map.col_map[id]))
>> + its_send_inv(its_dev, id);
>> + }
>
> What happens if an interrupt happens after the MOV, but before the INV?

there can be drop,  if interrupt happens before INV, however, it is
highly unlikely that we will hit the issue since MOVI and INV are
executed back to back. this workaround fixed issue seen on c

[PATCH] irqchip/gic-v3-its: Add workaround for ThunderX2 erratum #174

2018-01-02 Thread Ganapatrao Kulkarni

When an interrupt is moved across node collections on a ThunderX2
multi-socket platform, the interrupt stops being routed to the new
collection, which results in loss of interrupts.

Add a workaround that issues INV after MOVI for a cross-node collection
move, to flush out the cached entry.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
---
 Documentation/arm64/silicon-errata.txt |  1 +
 arch/arm64/Kconfig | 11 +++
 drivers/irqchip/irq-gic-v3-its.c   | 24 
 3 files changed, 36 insertions(+)

diff --git a/Documentation/arm64/silicon-errata.txt 
b/Documentation/arm64/silicon-errata.txt
index fc1c884..fb27cb5 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -63,6 +63,7 @@ stable kernels.
 | Cavium         | ThunderX Core   | #27456          | CAVIUM_ERRATUM_27456        |
 | Cavium         | ThunderX Core   | #30115          | CAVIUM_ERRATUM_30115        |
 | Cavium         | ThunderX SMMUv2 | #27704          | N/A                         |
+| Cavium         | ThunderX2 ITS   | #174            | CAVIUM_ERRATUM_174          |
 | Cavium         | ThunderX2 SMMUv3| #74             | N/A                         |
 | Cavium         | ThunderX2 SMMUv3| #126            | N/A                         |
 |                |                 |                 |                             |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c9a7e9e..71a7e30 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -461,6 +461,17 @@ config ARM64_ERRATUM_843419
 
  If unsure, say Y.
 
+config CAVIUM_ERRATUM_174
+   bool "Cavium ThunderX2 erratum 174"
+   depends on NUMA
+   default y
+   help
+ LPIs stop being routed to redistributors after an inter-node
+ collection move in the ITS. Enable this workaround to invalidate
+ the ITS entry after an inter-node collection move.
+
+ If unsure, say Y.
+
 config CAVIUM_ERRATUM_22375
bool "Cavium erratum 22375, 24313"
default y
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 06f025f..d8b9c96 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -46,6 +46,7 @@
 #define ITS_FLAGS_CMDQ_NEEDS_FLUSHING  (1ULL << 0)
 #define ITS_FLAGS_WORKAROUND_CAVIUM_22375  (1ULL << 1)
 #define ITS_FLAGS_WORKAROUND_CAVIUM_23144  (1ULL << 2)
+#define ITS_FLAGS_WORKAROUND_CAVIUM_174  (1ULL << 3)
 
 #define RDIST_FLAGS_PROPBASE_NEEDS_FLUSHING  (1 << 0)
 
@@ -1119,6 +1120,12 @@ static int its_set_affinity(struct irq_data *d, const 
struct cpumask *mask_val,
if (cpu != its_dev->event_map.col_map[id]) {
target_col = &its_dev->its->collections[cpu];
its_send_movi(its_dev, target_col, id);
+   if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_174) {
+   /* Issue INV for cross node collection move. */
+   if (cpu_to_node(cpu) !=
+   cpu_to_node(its_dev->event_map.col_map[id]))
+   its_send_inv(its_dev, id);
+   }
its_dev->event_map.col_map[id] = cpu;
irq_data_update_effective_affinity(d, cpumask_of(cpu));
}
@@ -2904,6 +2911,15 @@ static int its_force_quiescent(void __iomem *base)
}
 }
 
+static bool __maybe_unused its_enable_quirk_cavium_174(void *data)
+{
+   struct its_node *its = data;
+
+   its->flags |= ITS_FLAGS_WORKAROUND_CAVIUM_174;
+
+   return true;
+}
+
 static bool __maybe_unused its_enable_quirk_cavium_22375(void *data)
 {
struct its_node *its = data;
@@ -3031,6 +3047,14 @@ static const struct gic_quirk its_quirks[] = {
.init   = its_enable_quirk_hip07_161600802,
},
 #endif
+#ifdef CONFIG_CAVIUM_ERRATUM_174
+   {
+   .desc   = "ITS: Cavium ThunderX2 erratum 174",
+   .iidr   = 0x13f, /* ThunderX2 pass A1/A2/B0 */
+   .mask   = 0x,
+   .init   = its_enable_quirk_cavium_174,
+   },
+#endif
{
}
 };
-- 
2.9.4
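
Editorial sketch of the workaround in the its_set_affinity() hunk above: a MOVI is always sent when the target CPU changes, and an INV follows only when the old and new CPUs sit on different nodes and the erratum flag is set. This is a user-space model that only counts commands. The struct, the function names, and the fixed CPU-to-node table are assumptions made for the example and stand in for its_send_movi(), its_send_inv() and cpu_to_node().

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 8

/* ASSUMPTION: a fixed CPU-to-node table standing in for cpu_to_node(). */
static const int cpu_node_sketch[NR_CPUS_SKETCH] = { 0, 0, 0, 0, 1, 1, 1, 1 };

struct irq_route_sketch {
	int cpu;	/* CPU whose collection currently receives the event */
	int movi_sent;	/* number of simulated MOVI commands */
	int inv_sent;	/* number of simulated INV commands */
};

static int node_of(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	return cpu_node_sketch[cpu];
}

/* Move the event to new_cpu; with the workaround on, a cross-node MOVI is followed by an INV. */
static void set_affinity_sketch(struct irq_route_sketch *r, int new_cpu,
				bool workaround)
{
	if (new_cpu == r->cpu)
		return;			/* same collection: nothing to send */

	r->movi_sent++;			/* stands in for its_send_movi() */
	if (workaround && node_of(new_cpu) != node_of(r->cpu))
		r->inv_sent++;		/* stands in for its_send_inv() */

	r->cpu = new_cpu;		/* record the new target last, as in the patch */
}
```

The node comparison has to happen before the stored CPU is updated, which is why the patch places the check ahead of the `col_map[id] = cpu` assignment. The sketch says nothing about the window between MOVI and INV that is raised in review.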

Re: [PATCH] irqchip/gic-v3-its: Flush GICR caching for a cross node collection move of an irq

2017-12-20 Thread Ganapatrao Kulkarni

On Wed, Dec 20, 2017 at 6:42 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
> On 20/12/17 09:34, Ganapatrao Kulkarni wrote:
>> Hi Marc,
>>
>> On Wed, Dec 20, 2017 at 2:56 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
>>> On 20/12/17 09:15, Ganapatrao Kulkarni wrote:
>>>> When an interrupt is moved, it is possible that an implementation that
>>>> supports caching might still have cached data for a previous
>>>> (no longer valid) mapping of the interrupt. In particular, in a distributed
>>>> GIC implementation like multi-socket SoC platforms. Hence it is necessary
>>>> to flush cached entries after cross node collection migration.
>>>>
>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>>>> ---
>>>>  drivers/irqchip/irq-gic-v3-its.c | 6 ++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/irqchip/irq-gic-v3-its.c 
>>>> b/drivers/irqchip/irq-gic-v3-its.c
>>>> index 4039e64..ea849a1 100644
>>>> --- a/drivers/irqchip/irq-gic-v3-its.c
>>>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>>>> @@ -1119,6 +1119,12 @@ static int its_set_affinity(struct irq_data *d, 
>>>> const struct cpumask *mask_val,
>>>>   if (cpu != its_dev->event_map.col_map[id]) {
>>>>   target_col = &its_dev->its->collections[cpu];
>>>>   its_send_movi(its_dev, target_col, id);
>>>> + /* Issue INV for cross node collection move on
>>>> +  * multi socket systems.
>>>> +  */
>>>> + if (cpu_to_node(cpu) !=
>>>> + cpu_to_node(its_dev->event_map.col_map[id]))
>>>> + its_send_inv(its_dev, id);
>>>>   its_dev->event_map.col_map[id] = cpu;
>>>>   irq_data_update_effective_affinity(d, cpumask_of(cpu));
>>>>   }
>>>>
>>>
>>> The MOVI command doesn't have any such requirement (it only mandates
>>> synchronization), and doesn't say anything about distributed vs monolithic.
>>
>> GIC-v3 spec do mention to issue ITS INV command or a write to GICR_INVLPIR.
>> pasting below snippet of MOVI command description.
>>
>> "When an interrupt is moved to a collection, it is possible that an
>> implementation that supports speculative caching
>> might still have cached data for a previous (no longer valid) mapping
>> of the interrupt. Hence, implementations
>> must take care to invalidate any data associated with an interrupt
>> when it is moved. In particular, in a distributed
>> implementation, the ITS must write to the appropriate GICR_* register
>> to perform the invalidation in the redistributor."
>
> Doing some documentation archaeology, I found that this wording has been
> dropped from the engineering specification in August 2014, and was never
> included in the architecture specification. I suggest you start using a
> slightly more up-to-date set of documentation...

Thanks, Marc, for digging into the archive.

>
> Now, back to your point: what it says in the bit of (confidential)
> document that you quoted is that the *HW* must perform the invalidation
> (that's what the words "implementations" and "ITS" refer to), not some
> random bits of SW.
>
> If you know of an implementation that suffers from this, please resend a
> patch that handles this as a quirk, with a proper erratum number.

Sure, this is being discussed internally, and I will repost it as an
erratum fix at the earliest.

>
> Thanks,
>
> M.
> --
> Jazz is not dead. It just smells funny...

thanks
Ganapat

Re: [PATCH] irqchip/gic-v3-its: Flush GICR caching for a cross node collection move of an irq

2017-12-20 Thread Ganapatrao Kulkarni

On Wed, Dec 20, 2017 at 6:42 PM, Marc Zyngier  wrote:
> On 20/12/17 09:34, Ganapatrao Kulkarni wrote:
>> Hi Marc,
>>
>> On Wed, Dec 20, 2017 at 2:56 PM, Marc Zyngier  wrote:
>>> On 20/12/17 09:15, Ganapatrao Kulkarni wrote:
>>>> When an interrupt is moved, an implementation that supports caching
>>>> might still hold cached data for a previous (no longer valid) mapping
>>>> of the interrupt, in particular in a distributed GIC implementation
>>>> such as a multi-socket SoC platform. Hence it is necessary to flush
>>>> cached entries after a cross-node collection migration.
>>>>
>>>> Signed-off-by: Ganapatrao Kulkarni 
>>>> ---
>>>>  drivers/irqchip/irq-gic-v3-its.c | 6 ++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
>>>> index 4039e64..ea849a1 100644
>>>> --- a/drivers/irqchip/irq-gic-v3-its.c
>>>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>>>> @@ -1119,6 +1119,12 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
>>>>          if (cpu != its_dev->event_map.col_map[id]) {
>>>>                  target_col = &its_dev->its->collections[cpu];
>>>>                  its_send_movi(its_dev, target_col, id);
>>>> +                /* Issue INV for cross node collection move on
>>>> +                 * multi socket systems.
>>>> +                 */
>>>> +                if (cpu_to_node(cpu) !=
>>>> +                        cpu_to_node(its_dev->event_map.col_map[id]))
>>>> +                        its_send_inv(its_dev, id);
>>>>                  its_dev->event_map.col_map[id] = cpu;
>>>>                  irq_data_update_effective_affinity(d, cpumask_of(cpu));
>>>>          }
>>>>
>>>
>>> The MOVI command doesn't have any such requirement (it only mandates
>>> synchronization), and doesn't say anything about distributed vs monolithic.
>>
>> The GIC-v3 spec does mention issuing an ITS INV command or a write to
>> GICR_INVLPIR. Pasting below a snippet of the MOVI command description.
>>
>> "When an interrupt is moved to a collection, it is possible that an
>> implementation that supports speculative caching might still have cached
>> data for a previous (no longer valid) mapping of the interrupt. Hence,
>> implementations must take care to invalidate any data associated with an
>> interrupt when it is moved. In particular, in a distributed implementation,
>> the ITS must write to the appropriate GICR_* register to perform the
>> invalidation in the redistributor."
>
> Doing some documentation archaeology, I found that this wording has been
> dropped from the engineering specification in August 2014, and was never
> included in the architecture specification. I suggest you start using a
> slightly more up-to-date set of documentation...

Thanks, Marc, for digging into the archive.

>
> Now, back to your point: what it says in the bit of (confidential)
> document that you quoted is that the *HW* must perform the invalidation
> (that's what the words "implementations" and "ITS" refer to), not some
> random bits of SW.
>
> If you know of an implementation that suffers from this, please resend a
> patch that handles this as a quirk, with a proper erratum number.

Sure, this is being discussed internally; I will repost it as an erratum
fix at the earliest.

>
> Thanks,
>
> M.
> --
> Jazz is not dead. It just smells funny...

thanks
Ganapat

Re: [PATCH] irqchip/gic-v3-its: Flush GICR caching for a cross node collection move of an irq

2017-12-20 Thread Ganapatrao Kulkarni

Hi Marc,

On Wed, Dec 20, 2017 at 2:56 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
> On 20/12/17 09:15, Ganapatrao Kulkarni wrote:
>> When an interrupt is moved, an implementation that supports caching
>> might still hold cached data for a previous (no longer valid) mapping
>> of the interrupt, in particular in a distributed GIC implementation
>> such as a multi-socket SoC platform. Hence it is necessary to flush
>> cached entries after a cross-node collection migration.
>>
>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulka...@cavium.com>
>> ---
>>  drivers/irqchip/irq-gic-v3-its.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
>> index 4039e64..ea849a1 100644
>> --- a/drivers/irqchip/irq-gic-v3-its.c
>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>> @@ -1119,6 +1119,12 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
>>          if (cpu != its_dev->event_map.col_map[id]) {
>>                  target_col = &its_dev->its->collections[cpu];
>>                  its_send_movi(its_dev, target_col, id);
>> +                /* Issue INV for cross node collection move on
>> +                 * multi socket systems.
>> +                 */
>> +                if (cpu_to_node(cpu) !=
>> +                        cpu_to_node(its_dev->event_map.col_map[id]))
>> +                        its_send_inv(its_dev, id);
>>                  its_dev->event_map.col_map[id] = cpu;
>>                  irq_data_update_effective_affinity(d, cpumask_of(cpu));
>>          }
>>
>
> The MOVI command doesn't have any such requirement (it only mandates
> synchronization), and doesn't say anything about distributed vs monolithic.

The GIC-v3 spec does mention issuing an ITS INV command or a write to
GICR_INVLPIR. Pasting below a snippet of the MOVI command description.

"When an interrupt is moved to a collection, it is possible that an
implementation that supports speculative caching might still have cached
data for a previous (no longer valid) mapping of the interrupt. Hence,
implementations must take care to invalidate any data associated with an
interrupt when it is moved. In particular, in a distributed implementation,
the ITS must write to the appropriate GICR_* register to perform the
invalidation in the redistributor."

>
> What am I missing?
>
> M.
> --
> Jazz is not dead. It just smells funny...

thanks
Ganapat
