[PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
kswapd is woken to reclaim a node based on a failed allocation request from any eligible zone. Once reclaiming in balance_pgdat(), it will continue reclaiming until there is an eligible zone available for the zone it was woken for. kswapd tracks what zone it was recently woken for in pgdat->kswapd_classzone_idx. If it has not been woken recently, this zone will be 0. However, the decision on whether to sleep is made on kswapd_classzone_idx which is 0 without a recent wakeup request and that classzone does not account for lowmem reserves. This allows kswapd to sleep when a low small zone such as ZONE_DMA is balanced for a GFP_DMA request even if a stream of allocations cannot use that zone. While kswapd may be woken again shortly in the near future there are two consequences -- the pgdat bits that control congestion are cleared prematurely and direct reclaim is more likely as kswapd slept prematurely. This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid index) when there has been no recent wakeups. If there are no wakeups, it'll decide whether to sleep based on the highest possible zone available (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced" decisions during reclaim and when deciding to sleep are the same. If there is a mismatch, kswapd can stay awake continually trying to balance tiny zones. simoop was used to evaluate it again. Two of the preparation patches regressed the workload so they are included as the second set of results. Otherwise this patch looks artifically excellent 4.11.0-rc14.11.0-rc1 4.11.0-rc1 vanilla clear-v2 keepawake-v2 Ameanp50-Read 21670074.18 ( 0.00%) 19786774.76 ( 8.69%) 22668332.52 ( -4.61%) Ameanp95-Read 25456267.64 ( 0.00%) 24101956.27 ( 5.32%) 26738688.00 ( -5.04%) Ameanp99-Read 29369064.73 ( 0.00%) 27691872.71 ( 5.71%) 30991404.52 ( -5.52%) Ameanp50-Write1390.30 ( 0.00%) 1011.91 ( 27.22%) 924.91 ( 33.47%) Ameanp95-Write 412901.57 ( 0.00%)34874.98 ( 91.55%) 1362.62 ( 99.67%) Ameanp99-Write 6668722.09 ( 0.00%) 575449.60 ( 91.37%) 16854.04 ( 99.75%) Ameanp50-Allocation 78714.31 ( 0.00%)84246.26 ( -7.03%) 74729.74 ( 5.06%) Ameanp95-Allocation 175533.51 ( 0.00%) 400058.43 (-127.91%) 101609.74 ( 42.11%) Ameanp99-Allocation 247003.02 ( 0.00%) 10905600.00 (-4315.17%) 125765.57 ( 49.08%) With this patch on top, write and allocation latencies are massively improved. The read latencies are slightly impaired but it's worth noting that this is mostly due to the IO scheduler and not directly related to reclaim. The vmstats are a bit of a mix but the relevant ones are as follows; 4.10.0-rc7 4.10.0-rc7 4.10.0-rc7 mmots-20170209 clear-v1r25keepawake-v1r25 Swap Ins 0 0 0 Swap Outs0 608 0 Direct pages scanned 6910672 3132699 6357298 Kswapd pages scanned 570369468248866556986286 Kswapd pages reclaimed559934886347432955939113 Direct pages reclaimed 6905990 2964843 6352115 Kswapd efficiency 98% 76% 98% Kswapd velocity 12494.375 17597.507 12488.065 Direct efficiency 99% 94% 99% Direct velocity 1513.835 668.3061393.148 Page writes by reclaim 0.000 4410243.000 0.000 Page writes file 0 4409635 0 Page writes anon 0 608 0 Page reclaim immediate 103679214175203 1042571 4.11.0-rc1 4.11.0-rc1 4.11.0-rc1 vanilla clear-v2 keepawake-v2 Swap Ins 0 12 0 Swap Outs0 838 0 Direct pages scanned 6579706 3237270 6256811 Kswapd pages scanned 618537027996148654837791 Kswapd pages reclaimed607687646075578853849586 Direct pages reclaimed 6579055 2987453 6256151 Kswapd efficiency 98% 75% 98% Page writes by reclaim 0.000 4389496.000 0.000 Page writes file 0 4388658 0 Page writes anon 0 838 0 Page reclaim immediate 107357314473009 982507 Swap-outs are equivalent to baseline. Direct reclaim is reduced but not eliminated. It's worth noting that there are two periods of direct reclaim for this workload. The first is when it switches from preparing the files for
[PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
kswapd is woken to reclaim a node based on a failed allocation request from any eligible zone. Once reclaiming in balance_pgdat(), it will continue reclaiming until there is an eligible zone available for the zone it was woken for. kswapd tracks what zone it was recently woken for in pgdat->kswapd_classzone_idx. If it has not been woken recently, this zone will be 0. However, the decision on whether to sleep is made on kswapd_classzone_idx which is 0 without a recent wakeup request and that classzone does not account for lowmem reserves. This allows kswapd to sleep when a low small zone such as ZONE_DMA is balanced for a GFP_DMA request even if a stream of allocations cannot use that zone. While kswapd may be woken again shortly in the near future there are two consequences -- the pgdat bits that control congestion are cleared prematurely and direct reclaim is more likely as kswapd slept prematurely. This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid index) when there has been no recent wakeups. If there are no wakeups, it'll decide whether to sleep based on the highest possible zone available (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced" decisions during reclaim and when deciding to sleep are the same. If there is a mismatch, kswapd can stay awake continually trying to balance tiny zones. simoop was used to evaluate it again. Two of the preparation patches regressed the workload so they are included as the second set of results. Otherwise this patch looks artifically excellent 4.11.0-rc14.11.0-rc1 4.11.0-rc1 vanilla clear-v2 keepawake-v2 Ameanp50-Read 21670074.18 ( 0.00%) 19786774.76 ( 8.69%) 22668332.52 ( -4.61%) Ameanp95-Read 25456267.64 ( 0.00%) 24101956.27 ( 5.32%) 26738688.00 ( -5.04%) Ameanp99-Read 29369064.73 ( 0.00%) 27691872.71 ( 5.71%) 30991404.52 ( -5.52%) Ameanp50-Write1390.30 ( 0.00%) 1011.91 ( 27.22%) 924.91 ( 33.47%) Ameanp95-Write 412901.57 ( 0.00%)34874.98 ( 91.55%) 1362.62 ( 99.67%) Ameanp99-Write 6668722.09 ( 0.00%) 575449.60 ( 91.37%) 16854.04 ( 99.75%) Ameanp50-Allocation 78714.31 ( 0.00%)84246.26 ( -7.03%) 74729.74 ( 5.06%) Ameanp95-Allocation 175533.51 ( 0.00%) 400058.43 (-127.91%) 101609.74 ( 42.11%) Ameanp99-Allocation 247003.02 ( 0.00%) 10905600.00 (-4315.17%) 125765.57 ( 49.08%) With this patch on top, write and allocation latencies are massively improved. The read latencies are slightly impaired but it's worth noting that this is mostly due to the IO scheduler and not directly related to reclaim. The vmstats are a bit of a mix but the relevant ones are as follows; 4.10.0-rc7 4.10.0-rc7 4.10.0-rc7 mmots-20170209 clear-v1r25keepawake-v1r25 Swap Ins 0 0 0 Swap Outs0 608 0 Direct pages scanned 6910672 3132699 6357298 Kswapd pages scanned 570369468248866556986286 Kswapd pages reclaimed559934886347432955939113 Direct pages reclaimed 6905990 2964843 6352115 Kswapd efficiency 98% 76% 98% Kswapd velocity 12494.375 17597.507 12488.065 Direct efficiency 99% 94% 99% Direct velocity 1513.835 668.3061393.148 Page writes by reclaim 0.000 4410243.000 0.000 Page writes file 0 4409635 0 Page writes anon 0 608 0 Page reclaim immediate 103679214175203 1042571 4.11.0-rc1 4.11.0-rc1 4.11.0-rc1 vanilla clear-v2 keepawake-v2 Swap Ins 0 12 0 Swap Outs0 838 0 Direct pages scanned 6579706 3237270 6256811 Kswapd pages scanned 618537027996148654837791 Kswapd pages reclaimed607687646075578853849586 Direct pages reclaimed 6579055 2987453 6256151 Kswapd efficiency 98% 75% 98% Page writes by reclaim 0.000 4389496.000 0.000 Page writes file 0 4388658 0 Page writes anon 0 838 0 Page reclaim immediate 107357314473009 982507 Swap-outs are equivalent to baseline. Direct reclaim is reduced but not eliminated. It's worth noting that there are two periods of direct reclaim for this workload. The first is when it switches from preparing the files for
[PATCH 0/3] Reduce amount of time kswapd sleeps prematurely v2
Changelog since v1 o Rebase to 4.11-rc1 o Add small clarifying comment based on review The series is unusual in that the first patch fixes one problem and introduces of other issues that are noted in the changelog. Patch 2 makes a minor modification that is worth considering on its own but leaves the kernel in a state where it behaves badly. It's not until patch 3 that there is an improvement against baseline. This was mostly motivated by examining Chris Mason's "simoop" benchmark which puts the VM under similar pressure to HADOOP. It has been reported that the benchmark has regressed severely during the last number of releases. While I cannot reproduce all the same problems Chris experienced due to hardware limitations, there was a number of problems on a 2-socket machine with a single disk. simoop latencies 4.11.0-rc14.11.0-rc1 vanilla keepawake-v2 Ameanp50-Read 21670074.18 ( 0.00%) 22668332.52 ( -4.61%) Ameanp95-Read 25456267.64 ( 0.00%) 26738688.00 ( -5.04%) Ameanp99-Read 29369064.73 ( 0.00%) 30991404.52 ( -5.52%) Ameanp50-Write1390.30 ( 0.00%) 924.91 ( 33.47%) Ameanp95-Write 412901.57 ( 0.00%) 1362.62 ( 99.67%) Ameanp99-Write 6668722.09 ( 0.00%)16854.04 ( 99.75%) Ameanp50-Allocation 78714.31 ( 0.00%)74729.74 ( 5.06%) Ameanp95-Allocation 175533.51 ( 0.00%) 101609.74 ( 42.11%) Ameanp99-Allocation 247003.02 ( 0.00%) 125765.57 ( 49.08%) These are latencies. Read/write are threads reading fixed-size random blocks from a simulated database. The allocation latency is mmaping and faulting regions of memory. The p50, 95 and p99 reports the worst latencies for 50% of the samples, 95% and 99% respectively. For example, the report indicates that while the test was running 99% of writes completed 99.75% faster. It's worth noting that on a UMA machine that no difference in performance with simoop was observed so milage will vary. It's noted that there is a slight impact to read latencies but it's mostly due to IO scheduler decisions and offset by the large reduction in other latencies. mm/memory_hotplug.c | 2 +- mm/vmscan.c | 136 ++-- 2 files changed, 79 insertions(+), 59 deletions(-) -- 2.11.0
[PATCH 0/3] Reduce amount of time kswapd sleeps prematurely v2
Changelog since v1 o Rebase to 4.11-rc1 o Add small clarifying comment based on review The series is unusual in that the first patch fixes one problem and introduces of other issues that are noted in the changelog. Patch 2 makes a minor modification that is worth considering on its own but leaves the kernel in a state where it behaves badly. It's not until patch 3 that there is an improvement against baseline. This was mostly motivated by examining Chris Mason's "simoop" benchmark which puts the VM under similar pressure to HADOOP. It has been reported that the benchmark has regressed severely during the last number of releases. While I cannot reproduce all the same problems Chris experienced due to hardware limitations, there was a number of problems on a 2-socket machine with a single disk. simoop latencies 4.11.0-rc14.11.0-rc1 vanilla keepawake-v2 Ameanp50-Read 21670074.18 ( 0.00%) 22668332.52 ( -4.61%) Ameanp95-Read 25456267.64 ( 0.00%) 26738688.00 ( -5.04%) Ameanp99-Read 29369064.73 ( 0.00%) 30991404.52 ( -5.52%) Ameanp50-Write1390.30 ( 0.00%) 924.91 ( 33.47%) Ameanp95-Write 412901.57 ( 0.00%) 1362.62 ( 99.67%) Ameanp99-Write 6668722.09 ( 0.00%)16854.04 ( 99.75%) Ameanp50-Allocation 78714.31 ( 0.00%)74729.74 ( 5.06%) Ameanp95-Allocation 175533.51 ( 0.00%) 101609.74 ( 42.11%) Ameanp99-Allocation 247003.02 ( 0.00%) 125765.57 ( 49.08%) These are latencies. Read/write are threads reading fixed-size random blocks from a simulated database. The allocation latency is mmaping and faulting regions of memory. The p50, 95 and p99 reports the worst latencies for 50% of the samples, 95% and 99% respectively. For example, the report indicates that while the test was running 99% of writes completed 99.75% faster. It's worth noting that on a UMA machine that no difference in performance with simoop was observed so milage will vary. It's noted that there is a slight impact to read latencies but it's mostly due to IO scheduler decisions and offset by the large reduction in other latencies. mm/memory_hotplug.c | 2 +- mm/vmscan.c | 136 ++-- 2 files changed, 79 insertions(+), 59 deletions(-) -- 2.11.0
[PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
From: Shantanu GoelThe check in prepare_kswapd_sleep needs to match the one in balance_pgdat since the latter will return as soon as any one of the zones in the classzone is above the watermark. This is specially important for higher order allocations since balance_pgdat will typically reset the order to zero relying on compaction to create the higher order pages. Without this patch, prepare_kswapd_sleep fails to wake up kcompactd since the zone balance check fails. It was first reported against 4.9.7 that kswapd is failing to wake up kcompactd due to a mismatch in the zone balance check between balance_pgdat() and prepare_kswapd_sleep(). balance_pgdat() returns as soon as a single zone satisfies the allocation but prepare_kswapd_sleep() requires all zones to do +the same. This causes prepare_kswapd_sleep() to never succeed except in the order == 0 case and consequently, wakeup_kcompactd() is never called. For the machine that originally motivated this patch, the state of compaction from /proc/vmstat looked this way after a day and a half +of uptime: compact_migrate_scanned 240496 compact_free_scanned 76238632 compact_isolated 123472 compact_stall 1791 compact_fail 29 compact_success 1762 compact_daemon_wake 0 After applying the patch and about 10 hours of uptime the state looks like this: compact_migrate_scanned 59927299 compact_free_scanned 2021075136 compact_isolated 640926 compact_stall 4 compact_fail 2 compact_success 2 compact_daemon_wake 5160 Further notes from Mel that motivated him to pick this patch up and resend it; It was observed for the simoop workload (pressures the VM similar to HADOOP) that kswapd was failing to keep ahead of direct reclaim. The investigation noted that there was a need to rationalise kswapd decisions to reclaim with kswapd decisions to sleep. With this patch on a 2-socket box, there was a 49% reduction in direct reclaim scanning. However, the impact otherwise is extremely negative. Kswapd reclaim efficiency dropped from 98% to 76%. simoop has three latency-related metrics for read, write and allocation (an anonymous mmap and fault). 4.11.0-rc14.11.0-rc1 vanilla fixcheck-v2 Ameanp50-Read 21670074.18 ( 0.00%) 20464344.18 ( 5.56%) Ameanp95-Read 25456267.64 ( 0.00%) 25721423.64 ( -1.04%) Ameanp99-Read 29369064.73 ( 0.00%) 30174230.76 ( -2.74%) Ameanp50-Write1390.30 ( 0.00%) 1395.28 ( -0.36%) Ameanp95-Write 412901.57 ( 0.00%)37737.74 ( 90.86%) Ameanp99-Write 6668722.09 ( 0.00%) 666489.04 ( 90.01%) Ameanp50-Allocation 78714.31 ( 0.00%)86286.22 ( -9.62%) Ameanp95-Allocation 175533.51 ( 0.00%) 351812.27 (-100.42%) Ameanp99-Allocation 247003.02 ( 0.00%) 6291171.56 (-2447.00%) Of greater concern is that the patch causes swapping and page writes from kswapd context rose from 0 pages to 4189753 pages during the hour the workload ran for. By and large, the patch has very bad behaviour but easily missed as the impact on a UMA machine is negligible. This patch is included with the data in case a bisection leads to this area. This patch is also a pre-requisite for the rest of the series. Signed-off-by: Shantanu Goel Signed-off-by: Mel Gorman Acked-by: Hillf Danton --- mm/vmscan.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index bc8031ef994d..4ea444142c2e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3134,11 +3134,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx) if (!managed_zone(zone)) continue; - if (!zone_balanced(zone, order, classzone_idx)) - return false; + if (zone_balanced(zone, order, classzone_idx)) + return true; } - return true; + return false; } /* @@ -3331,7 +3331,13 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o prepare_to_wait(>kswapd_wait, , TASK_INTERRUPTIBLE); - /* Try to sleep for a short interval */ + /* +* Try to sleep for a short interval. Note that kcompactd will only be +* woken if it is possible to sleep for a short interval. This is +* deliberate on the assumption that if reclaim cannot keep an +* eligible zone balanced that it's also unlikely that compaction will +* succeed. +*/ if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) { /* * Compaction records what page blocks it recently failed to -- 2.11.0
[PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced
A pgdat tracks if recent reclaim encountered too many dirty, writeback or congested pages. The flags control whether kswapd writes pages back from reclaim context, tags pages for immediate reclaim when IO completes, whether processes block on wait_iff_congested and whether kswapd blocks when too many pages marked for immediate reclaim are encountered. The state is cleared in a check function with side-effects. With the patch "mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the timing of when the bits get cleared changed. Due to the way the check works, it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA allocation because it does not account for lowmem reserves properly. For the simoop workload, kswapd is not stalling when it should due to the premature clearing, writing pages from reclaim context like crazy and generally being unhelpful. This patch resets the pgdat bits related to page reclaim only when kswapd is going to sleep. The comparison with simoop is then 4.11.0-rc14.11.0-rc1 4.11.0-rc1 vanilla fixcheck-v2 clear-v2 Ameanp50-Read 21670074.18 ( 0.00%) 20464344.18 ( 5.56%) 19786774.76 ( 8.69%) Ameanp95-Read 25456267.64 ( 0.00%) 25721423.64 ( -1.04%) 24101956.27 ( 5.32%) Ameanp99-Read 29369064.73 ( 0.00%) 30174230.76 ( -2.74%) 27691872.71 ( 5.71%) Ameanp50-Write1390.30 ( 0.00%) 1395.28 ( -0.36%) 1011.91 ( 27.22%) Ameanp95-Write 412901.57 ( 0.00%)37737.74 ( 90.86%) 34874.98 ( 91.55%) Ameanp99-Write 6668722.09 ( 0.00%) 666489.04 ( 90.01%) 575449.60 ( 91.37%) Ameanp50-Allocation 78714.31 ( 0.00%)86286.22 ( -9.62%) 84246.26 ( -7.03%) Ameanp95-Allocation 175533.51 ( 0.00%) 351812.27 (-100.42%) 400058.43 (-127.91%) Ameanp99-Allocation 247003.02 ( 0.00%) 6291171.56 (-2447.00%) 10905600.00 (-4315.17%) Read latency is improved, write latency is mostly improved but allocation latency is regressed. kswapd is still reclaiming inefficiently, pages are being written back from writeback context and a host of other issues. However, given the change, it needed to be spelled out why the side-effect was moved. Signed-off-by: Mel Gorman--- mm/vmscan.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 4ea444142c2e..17b1afbce88e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3091,17 +3091,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx) if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx)) return false; - /* -* If any eligible zone is balanced then the node is not considered -* to be congested or dirty -*/ - clear_bit(PGDAT_CONGESTED, >zone_pgdat->flags); - clear_bit(PGDAT_DIRTY, >zone_pgdat->flags); - clear_bit(PGDAT_WRITEBACK, >zone_pgdat->flags); - return true; } +/* Clear pgdat state for congested, dirty or under writeback. */ +static void clear_pgdat_congested(pg_data_t *pgdat) +{ + clear_bit(PGDAT_CONGESTED, >flags); + clear_bit(PGDAT_DIRTY, >flags); + clear_bit(PGDAT_WRITEBACK, >flags); +} + /* * Prepare kswapd for sleeping. This verifies that there are no processes * waiting in throttle_direct_reclaim() and that watermarks have been met. @@ -3134,8 +3134,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx) if (!managed_zone(zone)) continue; - if (zone_balanced(zone, order, classzone_idx)) + if (zone_balanced(zone, order, classzone_idx)) { + clear_pgdat_congested(pgdat); return true; + } } return false; -- 2.11.0
[PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
From: Shantanu Goel The check in prepare_kswapd_sleep needs to match the one in balance_pgdat since the latter will return as soon as any one of the zones in the classzone is above the watermark. This is specially important for higher order allocations since balance_pgdat will typically reset the order to zero relying on compaction to create the higher order pages. Without this patch, prepare_kswapd_sleep fails to wake up kcompactd since the zone balance check fails. It was first reported against 4.9.7 that kswapd is failing to wake up kcompactd due to a mismatch in the zone balance check between balance_pgdat() and prepare_kswapd_sleep(). balance_pgdat() returns as soon as a single zone satisfies the allocation but prepare_kswapd_sleep() requires all zones to do +the same. This causes prepare_kswapd_sleep() to never succeed except in the order == 0 case and consequently, wakeup_kcompactd() is never called. For the machine that originally motivated this patch, the state of compaction from /proc/vmstat looked this way after a day and a half +of uptime: compact_migrate_scanned 240496 compact_free_scanned 76238632 compact_isolated 123472 compact_stall 1791 compact_fail 29 compact_success 1762 compact_daemon_wake 0 After applying the patch and about 10 hours of uptime the state looks like this: compact_migrate_scanned 59927299 compact_free_scanned 2021075136 compact_isolated 640926 compact_stall 4 compact_fail 2 compact_success 2 compact_daemon_wake 5160 Further notes from Mel that motivated him to pick this patch up and resend it; It was observed for the simoop workload (pressures the VM similar to HADOOP) that kswapd was failing to keep ahead of direct reclaim. The investigation noted that there was a need to rationalise kswapd decisions to reclaim with kswapd decisions to sleep. With this patch on a 2-socket box, there was a 49% reduction in direct reclaim scanning. However, the impact otherwise is extremely negative. Kswapd reclaim efficiency dropped from 98% to 76%. simoop has three latency-related metrics for read, write and allocation (an anonymous mmap and fault). 4.11.0-rc14.11.0-rc1 vanilla fixcheck-v2 Ameanp50-Read 21670074.18 ( 0.00%) 20464344.18 ( 5.56%) Ameanp95-Read 25456267.64 ( 0.00%) 25721423.64 ( -1.04%) Ameanp99-Read 29369064.73 ( 0.00%) 30174230.76 ( -2.74%) Ameanp50-Write1390.30 ( 0.00%) 1395.28 ( -0.36%) Ameanp95-Write 412901.57 ( 0.00%)37737.74 ( 90.86%) Ameanp99-Write 6668722.09 ( 0.00%) 666489.04 ( 90.01%) Ameanp50-Allocation 78714.31 ( 0.00%)86286.22 ( -9.62%) Ameanp95-Allocation 175533.51 ( 0.00%) 351812.27 (-100.42%) Ameanp99-Allocation 247003.02 ( 0.00%) 6291171.56 (-2447.00%) Of greater concern is that the patch causes swapping and page writes from kswapd context rose from 0 pages to 4189753 pages during the hour the workload ran for. By and large, the patch has very bad behaviour but easily missed as the impact on a UMA machine is negligible. This patch is included with the data in case a bisection leads to this area. This patch is also a pre-requisite for the rest of the series. Signed-off-by: Shantanu Goel Signed-off-by: Mel Gorman Acked-by: Hillf Danton --- mm/vmscan.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index bc8031ef994d..4ea444142c2e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3134,11 +3134,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx) if (!managed_zone(zone)) continue; - if (!zone_balanced(zone, order, classzone_idx)) - return false; + if (zone_balanced(zone, order, classzone_idx)) + return true; } - return true; + return false; } /* @@ -3331,7 +3331,13 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o prepare_to_wait(>kswapd_wait, , TASK_INTERRUPTIBLE); - /* Try to sleep for a short interval */ + /* +* Try to sleep for a short interval. Note that kcompactd will only be +* woken if it is possible to sleep for a short interval. This is +* deliberate on the assumption that if reclaim cannot keep an +* eligible zone balanced that it's also unlikely that compaction will +* succeed. +*/ if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) { /* * Compaction records what page blocks it recently failed to -- 2.11.0
[PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced
A pgdat tracks if recent reclaim encountered too many dirty, writeback or congested pages. The flags control whether kswapd writes pages back from reclaim context, tags pages for immediate reclaim when IO completes, whether processes block on wait_iff_congested and whether kswapd blocks when too many pages marked for immediate reclaim are encountered. The state is cleared in a check function with side-effects. With the patch "mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the timing of when the bits get cleared changed. Due to the way the check works, it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA allocation because it does not account for lowmem reserves properly. For the simoop workload, kswapd is not stalling when it should due to the premature clearing, writing pages from reclaim context like crazy and generally being unhelpful. This patch resets the pgdat bits related to page reclaim only when kswapd is going to sleep. The comparison with simoop is then 4.11.0-rc14.11.0-rc1 4.11.0-rc1 vanilla fixcheck-v2 clear-v2 Ameanp50-Read 21670074.18 ( 0.00%) 20464344.18 ( 5.56%) 19786774.76 ( 8.69%) Ameanp95-Read 25456267.64 ( 0.00%) 25721423.64 ( -1.04%) 24101956.27 ( 5.32%) Ameanp99-Read 29369064.73 ( 0.00%) 30174230.76 ( -2.74%) 27691872.71 ( 5.71%) Ameanp50-Write1390.30 ( 0.00%) 1395.28 ( -0.36%) 1011.91 ( 27.22%) Ameanp95-Write 412901.57 ( 0.00%)37737.74 ( 90.86%) 34874.98 ( 91.55%) Ameanp99-Write 6668722.09 ( 0.00%) 666489.04 ( 90.01%) 575449.60 ( 91.37%) Ameanp50-Allocation 78714.31 ( 0.00%)86286.22 ( -9.62%) 84246.26 ( -7.03%) Ameanp95-Allocation 175533.51 ( 0.00%) 351812.27 (-100.42%) 400058.43 (-127.91%) Ameanp99-Allocation 247003.02 ( 0.00%) 6291171.56 (-2447.00%) 10905600.00 (-4315.17%) Read latency is improved, write latency is mostly improved but allocation latency is regressed. kswapd is still reclaiming inefficiently, pages are being written back from writeback context and a host of other issues. However, given the change, it needed to be spelled out why the side-effect was moved. Signed-off-by: Mel Gorman --- mm/vmscan.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 4ea444142c2e..17b1afbce88e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3091,17 +3091,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx) if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx)) return false; - /* -* If any eligible zone is balanced then the node is not considered -* to be congested or dirty -*/ - clear_bit(PGDAT_CONGESTED, >zone_pgdat->flags); - clear_bit(PGDAT_DIRTY, >zone_pgdat->flags); - clear_bit(PGDAT_WRITEBACK, >zone_pgdat->flags); - return true; } +/* Clear pgdat state for congested, dirty or under writeback. */ +static void clear_pgdat_congested(pg_data_t *pgdat) +{ + clear_bit(PGDAT_CONGESTED, >flags); + clear_bit(PGDAT_DIRTY, >flags); + clear_bit(PGDAT_WRITEBACK, >flags); +} + /* * Prepare kswapd for sleeping. This verifies that there are no processes * waiting in throttle_direct_reclaim() and that watermarks have been met. @@ -3134,8 +3134,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx) if (!managed_zone(zone)) continue; - if (zone_balanced(zone, order, classzone_idx)) + if (zone_balanced(zone, order, classzone_idx)) { + clear_pgdat_congested(pgdat); return true; + } } return false; -- 2.11.0
Re: [TRIVIAL PATCH 2/2] kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL
On Wed, 8 Mar 2017 22:34:15 +0530 "Naveen N. Rao"wrote: > commit fc62d0207ae0 ("kprobes: Introduce weak variant of > kprobe_exceptions_notify()") used the __kprobes annotation to exclude > kprobe_exceptions_notify from being probed. Since NOKPROBE_SYMBOL() is a > better way to do this enabling the symbol to be discovered as being > blacklisted, change over to using NOKPROBE_SYMBOL(). > Oops, yes it should be. Acked-by: Masami Hiramatsu Thanks, > Signed-off-by: Naveen N. Rao > --- > kernel/kprobes.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c > index 699c5bc51a92..b52d952d6d41 100644 > --- a/kernel/kprobes.c > +++ b/kernel/kprobes.c > @@ -1740,11 +1740,12 @@ void unregister_kprobes(struct kprobe **kps, int num) > } > EXPORT_SYMBOL_GPL(unregister_kprobes); > > -int __weak __kprobes kprobe_exceptions_notify(struct notifier_block *self, > - unsigned long val, void *data) > +int __weak kprobe_exceptions_notify(struct notifier_block *self, > + unsigned long val, void *data) > { > return NOTIFY_DONE; > } > +NOKPROBE_SYMBOL(kprobe_exceptions_notify); > > static struct notifier_block kprobe_exceptions_nb = { > .notifier_call = kprobe_exceptions_notify, > -- > 2.11.1 > -- Masami Hiramatsu
Re: [TRIVIAL PATCH 2/2] kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL
On Wed, 8 Mar 2017 22:34:15 +0530 "Naveen N. Rao" wrote: > commit fc62d0207ae0 ("kprobes: Introduce weak variant of > kprobe_exceptions_notify()") used the __kprobes annotation to exclude > kprobe_exceptions_notify from being probed. Since NOKPROBE_SYMBOL() is a > better way to do this enabling the symbol to be discovered as being > blacklisted, change over to using NOKPROBE_SYMBOL(). > Oops, yes it should be. Acked-by: Masami Hiramatsu Thanks, > Signed-off-by: Naveen N. Rao > --- > kernel/kprobes.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c > index 699c5bc51a92..b52d952d6d41 100644 > --- a/kernel/kprobes.c > +++ b/kernel/kprobes.c > @@ -1740,11 +1740,12 @@ void unregister_kprobes(struct kprobe **kps, int num) > } > EXPORT_SYMBOL_GPL(unregister_kprobes); > > -int __weak __kprobes kprobe_exceptions_notify(struct notifier_block *self, > - unsigned long val, void *data) > +int __weak kprobe_exceptions_notify(struct notifier_block *self, > + unsigned long val, void *data) > { > return NOTIFY_DONE; > } > +NOKPROBE_SYMBOL(kprobe_exceptions_notify); > > static struct notifier_block kprobe_exceptions_nb = { > .notifier_call = kprobe_exceptions_notify, > -- > 2.11.1 > -- Masami Hiramatsu
Re: [RFC PATCH net] net: Work around lockdep limitation in sockets that use sockets
David Millerwrote: > I guess this is fine, but I think you can use one of the two "sk_padding" > bits in struct sock instead of making the structure larger. It shouldn't make the structure larger since there's a hole in the structure: unsigned intsk_padding : 2, sk_no_check_tx : 1, sk_no_check_rx : 1, sk_userlocks : 4, sk_protocol : 8, sk_type : 16; #define SK_PROTOCOL_MAX U8_MAX kmemcheck_bitfield_end(flags); u16 sk_gso_max_segs; ---> 2-byte hole here unsigned long sk_lingertime; but I'll quite happily shift it to sk_padding. Thanks, David
Re: [RFC PATCH net] net: Work around lockdep limitation in sockets that use sockets
David Miller wrote: > I guess this is fine, but I think you can use one of the two "sk_padding" > bits in struct sock instead of making the structure larger. It shouldn't make the structure larger since there's a hole in the structure: unsigned intsk_padding : 2, sk_no_check_tx : 1, sk_no_check_rx : 1, sk_userlocks : 4, sk_protocol : 8, sk_type : 16; #define SK_PROTOCOL_MAX U8_MAX kmemcheck_bitfield_end(flags); u16 sk_gso_max_segs; ---> 2-byte hole here unsigned long sk_lingertime; but I'll quite happily shift it to sk_padding. Thanks, David
Re: [PATCH v7 kernel 5/5] This patch contains two parts:
On 03/06/2017 09:23 PM, David Hildenbrand wrote:Am 03.03.2017 um 06:40 schrieb Wei Wang: From: Liang LiSorry, I just saw the message due to an email issue. I'd prefer to split this into two parts then and to create proper subjects. Agree, will do. If I remember correctly, the general concept was accepted by most reviewers. Yes, that's also what I was told. Best, Wei
Re: [PATCH v7 kernel 5/5] This patch contains two parts:
On 03/06/2017 09:23 PM, David Hildenbrand wrote:Am 03.03.2017 um 06:40 schrieb Wei Wang: From: Liang Li Sorry, I just saw the message due to an email issue. I'd prefer to split this into two parts then and to create proper subjects. Agree, will do. If I remember correctly, the general concept was accepted by most reviewers. Yes, that's also what I was told. Best, Wei
[PATCH] drivers, scsi: convert fc_fcp_pkt.ref_cnt from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena ReshetovaSigned-off-by: Hans Liljestrand Signed-off-by: Kees Cook Signed-off-by: David Windsor Acked-by: Johannes Thumshirn --- drivers/scsi/libfc/fc_fcp.c | 6 +++--- include/scsi/libfc.h| 3 ++- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c index 0e67621..a808e8e 100644 --- a/drivers/scsi/libfc/fc_fcp.c +++ b/drivers/scsi/libfc/fc_fcp.c @@ -154,7 +154,7 @@ static struct fc_fcp_pkt *fc_fcp_pkt_alloc(struct fc_lport *lport, gfp_t gfp) memset(fsp, 0, sizeof(*fsp)); fsp->lp = lport; fsp->xfer_ddp = FC_XID_UNKNOWN; - atomic_set(>ref_cnt, 1); + refcount_set(>ref_cnt, 1); init_timer(>timer); fsp->timer.data = (unsigned long)fsp; INIT_LIST_HEAD(>list); @@ -175,7 +175,7 @@ static struct fc_fcp_pkt *fc_fcp_pkt_alloc(struct fc_lport *lport, gfp_t gfp) */ static void fc_fcp_pkt_release(struct fc_fcp_pkt *fsp) { - if (atomic_dec_and_test(>ref_cnt)) { + if (refcount_dec_and_test(>ref_cnt)) { struct fc_fcp_internal *si = fc_get_scsi_internal(fsp->lp); mempool_free(fsp, si->scsi_pkt_pool); @@ -188,7 +188,7 @@ static void fc_fcp_pkt_release(struct fc_fcp_pkt *fsp) */ static void fc_fcp_pkt_hold(struct fc_fcp_pkt *fsp) { - atomic_inc(>ref_cnt); + refcount_inc(>ref_cnt); } /** diff --git a/include/scsi/libfc.h b/include/scsi/libfc.h index da5033d..2109844 100644 --- a/include/scsi/libfc.h +++ b/include/scsi/libfc.h @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -321,7 +322,7 @@ struct fc_seq_els_data { */ struct fc_fcp_pkt { spinlock_tscsi_pkt_lock; - atomic_t ref_cnt; + refcount_tref_cnt; /* SCSI command and data transfer information */ u32 data_len; -- 2.7.4
[PATCH] drivers, scsi: convert fc_fcp_pkt.ref_cnt from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova Signed-off-by: Hans Liljestrand Signed-off-by: Kees Cook Signed-off-by: David Windsor Acked-by: Johannes Thumshirn --- drivers/scsi/libfc/fc_fcp.c | 6 +++--- include/scsi/libfc.h| 3 ++- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c index 0e67621..a808e8e 100644 --- a/drivers/scsi/libfc/fc_fcp.c +++ b/drivers/scsi/libfc/fc_fcp.c @@ -154,7 +154,7 @@ static struct fc_fcp_pkt *fc_fcp_pkt_alloc(struct fc_lport *lport, gfp_t gfp) memset(fsp, 0, sizeof(*fsp)); fsp->lp = lport; fsp->xfer_ddp = FC_XID_UNKNOWN; - atomic_set(>ref_cnt, 1); + refcount_set(>ref_cnt, 1); init_timer(>timer); fsp->timer.data = (unsigned long)fsp; INIT_LIST_HEAD(>list); @@ -175,7 +175,7 @@ static struct fc_fcp_pkt *fc_fcp_pkt_alloc(struct fc_lport *lport, gfp_t gfp) */ static void fc_fcp_pkt_release(struct fc_fcp_pkt *fsp) { - if (atomic_dec_and_test(>ref_cnt)) { + if (refcount_dec_and_test(>ref_cnt)) { struct fc_fcp_internal *si = fc_get_scsi_internal(fsp->lp); mempool_free(fsp, si->scsi_pkt_pool); @@ -188,7 +188,7 @@ static void fc_fcp_pkt_release(struct fc_fcp_pkt *fsp) */ static void fc_fcp_pkt_hold(struct fc_fcp_pkt *fsp) { - atomic_inc(>ref_cnt); + refcount_inc(>ref_cnt); } /** diff --git a/include/scsi/libfc.h b/include/scsi/libfc.h index da5033d..2109844 100644 --- a/include/scsi/libfc.h +++ b/include/scsi/libfc.h @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -321,7 +322,7 @@ struct fc_seq_els_data { */ struct fc_fcp_pkt { spinlock_tscsi_pkt_lock; - atomic_t ref_cnt; + refcount_tref_cnt; /* SCSI command and data transfer information */ u32 data_len; -- 2.7.4
Re: kexec regression since 4.9 caused by efi
On Thu, Mar 09, 2017 at 10:21:36AM +0800, Dave Young wrote: > I have no esrt machine to test, can you share the full kernel log with > efi=debug in kernel cmdline? > > *) normal boot kernel log without the reverting > *) kexec boot log with and without the reverting Attached. [0.00] Linux version 4.11.0-rc1 (osan...@devbig561.prn1.facebook.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #53 SMP Wed Mar 8 12:07:16 PST 2017 [0.00] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc1 ro root=LABEL=/ ipv6.autoconf=0 erst_disable biosdevname=0 net.ifnames=0 fsck.repair=yes pcie_pme=nomsi netconsole=+@2401:db00:0011:b03e:face::0009:/eth0,1514@2401:db00:eef0:a59::/02:90:fb:5b:b7:1e crashkernel=128M console=tty0 console=ttyS1,57600 efi=debug [0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' [0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' [0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' [0.00] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256 [0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format. [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009] usable [0.00] BIOS-e820: [mem 0x0010-0x750bdfff] usable [0.00] BIOS-e820: [mem 0x750be000-0x75ddbfff] reserved [0.00] BIOS-e820: [mem 0x75ddc000-0x75e32fff] ACPI data [0.00] BIOS-e820: [mem 0x75e33000-0x7642dfff] ACPI NVS [0.00] BIOS-e820: [mem 0x7642e000-0x7bcd9fff] reserved [0.00] BIOS-e820: [mem 0x7bcda000-0x7bcdafff] usable [0.00] BIOS-e820: [mem 0x7bcdb000-0x7bd60fff] reserved [0.00] BIOS-e820: [mem 0x7bd61000-0x7bff] usable [0.00] BIOS-e820: [mem 0x7c00-0x8fff] reserved [0.00] BIOS-e820: [mem 0xfed1c000-0xfed44fff] reserved [0.00] BIOS-e820: [mem 0xff00-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x00407fff] usable [0.00] NX (Execute Disable) protection: active [0.00] extended physical RAM map: [0.00] reserve setup_data: [mem 0x-0x0009] usable [0.00] reserve setup_data: [mem 0x0010-0x709fb017] usable [0.00] reserve setup_data: [mem 0x709fb018-0x70a4b857] usable [0.00] reserve setup_data: [mem 0x70a4b858-0x70a4c017] usable [0.00] reserve setup_data: [mem 0x70a4c018-0x70a53c57] usable [0.00] reserve setup_data: [mem 0x70a53c58-0x750bdfff] usable [0.00] reserve setup_data: [mem 0x750be000-0x75ddbfff] reserved [0.00] reserve setup_data: [mem 0x75ddc000-0x75e32fff] ACPI data [0.00] reserve setup_data: [mem 0x75e33000-0x7642dfff] ACPI NVS [0.00] reserve setup_data: [mem 0x7642e000-0x7bcd9fff] reserved [0.00] reserve setup_data: [mem 0x7bcda000-0x7bcdafff] usable [0.00] reserve setup_data: [mem 0x7bcdb000-0x7bd60fff] reserved [0.00] reserve setup_data: [mem 0x7bd61000-0x7bff] usable [0.00] reserve setup_data: [mem 0x7c00-0x8fff] reserved [0.00] reserve setup_data: [mem 0xfed1c000-0xfed44fff] reserved [0.00] reserve setup_data: [mem 0xff00-0x] reserved [0.00] reserve setup_data: [mem 0x0001-0x00407fff] usable [0.00] efi: EFI v2.40 by American Megatrends [0.00] efi: ACPI=0x75f5c000 ACPI 2.0=0x75f5c000 ESRT=0x7bc4d018 SMBIOS=0xf05e0 SMBIOS 3.0=0x7bb2f000 MPS=0xfc9e0 [0.00] efi: mem00: [Boot Code | | | | | | | | |WB|WT|WC|UC] range=[0x-0x7fff] (0MB) [0.00] efi: mem01: [Loader Data| | | | | | | | |WB|WT|WC|UC] range=[0x8000-0x8fff] (0MB) [0.00] efi: mem02: [Conventional Memory| | | | | | | | |WB|WT|WC|UC] range=[0x9000-0x0003] (0MB) [0.00] efi: mem03: [Boot Code | | | | | | | | |WB|WT|WC|UC] range=[0x0004-0x0009] (0MB) [0.00] efi: mem04: [Conventional Memory| | | | | | | | |WB|WT|WC|UC] range=[0x0010-0x00ff] (15MB) [0.00] efi: mem05: [Loader Data| | | | | | | | |WB|WT|WC|UC] range=[0x0100-0x024cdfff] (20MB) [0.00] efi: mem06: [Conventional Memory| | | | | | | | |WB|WT|WC|UC]
Re: kexec regression since 4.9 caused by efi
On Thu, Mar 09, 2017 at 10:21:36AM +0800, Dave Young wrote: > I have no esrt machine to test, can you share the full kernel log with > efi=debug in kernel cmdline? > > *) normal boot kernel log without the reverting > *) kexec boot log with and without the reverting Attached. [0.00] Linux version 4.11.0-rc1 (osan...@devbig561.prn1.facebook.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #53 SMP Wed Mar 8 12:07:16 PST 2017 [0.00] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc1 ro root=LABEL=/ ipv6.autoconf=0 erst_disable biosdevname=0 net.ifnames=0 fsck.repair=yes pcie_pme=nomsi netconsole=+@2401:db00:0011:b03e:face::0009:/eth0,1514@2401:db00:eef0:a59::/02:90:fb:5b:b7:1e crashkernel=128M console=tty0 console=ttyS1,57600 efi=debug [0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' [0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' [0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' [0.00] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256 [0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format. [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009] usable [0.00] BIOS-e820: [mem 0x0010-0x750bdfff] usable [0.00] BIOS-e820: [mem 0x750be000-0x75ddbfff] reserved [0.00] BIOS-e820: [mem 0x75ddc000-0x75e32fff] ACPI data [0.00] BIOS-e820: [mem 0x75e33000-0x7642dfff] ACPI NVS [0.00] BIOS-e820: [mem 0x7642e000-0x7bcd9fff] reserved [0.00] BIOS-e820: [mem 0x7bcda000-0x7bcdafff] usable [0.00] BIOS-e820: [mem 0x7bcdb000-0x7bd60fff] reserved [0.00] BIOS-e820: [mem 0x7bd61000-0x7bff] usable [0.00] BIOS-e820: [mem 0x7c00-0x8fff] reserved [0.00] BIOS-e820: [mem 0xfed1c000-0xfed44fff] reserved [0.00] BIOS-e820: [mem 0xff00-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x00407fff] usable [0.00] NX (Execute Disable) protection: active [0.00] extended physical RAM map: [0.00] reserve setup_data: [mem 0x-0x0009] usable [0.00] reserve setup_data: [mem 0x0010-0x709fb017] usable [0.00] reserve setup_data: [mem 0x709fb018-0x70a4b857] usable [0.00] reserve setup_data: [mem 0x70a4b858-0x70a4c017] usable [0.00] reserve setup_data: [mem 0x70a4c018-0x70a53c57] usable [0.00] reserve setup_data: [mem 0x70a53c58-0x750bdfff] usable [0.00] reserve setup_data: [mem 0x750be000-0x75ddbfff] reserved [0.00] reserve setup_data: [mem 0x75ddc000-0x75e32fff] ACPI data [0.00] reserve setup_data: [mem 0x75e33000-0x7642dfff] ACPI NVS [0.00] reserve setup_data: [mem 0x7642e000-0x7bcd9fff] reserved [0.00] reserve setup_data: [mem 0x7bcda000-0x7bcdafff] usable [0.00] reserve setup_data: [mem 0x7bcdb000-0x7bd60fff] reserved [0.00] reserve setup_data: [mem 0x7bd61000-0x7bff] usable [0.00] reserve setup_data: [mem 0x7c00-0x8fff] reserved [0.00] reserve setup_data: [mem 0xfed1c000-0xfed44fff] reserved [0.00] reserve setup_data: [mem 0xff00-0x] reserved [0.00] reserve setup_data: [mem 0x0001-0x00407fff] usable [0.00] efi: EFI v2.40 by American Megatrends [0.00] efi: ACPI=0x75f5c000 ACPI 2.0=0x75f5c000 ESRT=0x7bc4d018 SMBIOS=0xf05e0 SMBIOS 3.0=0x7bb2f000 MPS=0xfc9e0 [0.00] efi: mem00: [Boot Code | | | | | | | | |WB|WT|WC|UC] range=[0x-0x7fff] (0MB) [0.00] efi: mem01: [Loader Data| | | | | | | | |WB|WT|WC|UC] range=[0x8000-0x8fff] (0MB) [0.00] efi: mem02: [Conventional Memory| | | | | | | | |WB|WT|WC|UC] range=[0x9000-0x0003] (0MB) [0.00] efi: mem03: [Boot Code | | | | | | | | |WB|WT|WC|UC] range=[0x0004-0x0009] (0MB) [0.00] efi: mem04: [Conventional Memory| | | | | | | | |WB|WT|WC|UC] range=[0x0010-0x00ff] (15MB) [0.00] efi: mem05: [Loader Data| | | | | | | | |WB|WT|WC|UC] range=[0x0100-0x024cdfff] (20MB) [0.00] efi: mem06: [Conventional Memory| | | | | | | | |WB|WT|WC|UC]
Re: [PATCH 2/4] ubifs: Fix unlink code wrt. double hash lookups
Richard, this patch works well. but i found some trivial mistakes. On Thu, Feb 09, 2017 at 10:28:35PM +0100, Richard Weinberger wrote: > When removing an encrypted file with a long anem and without having > the key we have to be able to locate and remove the directory entry > via a double hash. This corner case was simply forgotten. > > Reported-by: David Oberhollenzer> Signed-off-by: Richard Weinberger > --- > fs/ubifs/journal.c | 10 - > fs/ubifs/tnc.c | 129 > - > fs/ubifs/ubifs.h | 2 + > 3 files changed, 117 insertions(+), 24 deletions(-) > > diff --git a/fs/ubifs/journal.c b/fs/ubifs/journal.c > index f3b620cbdda4..7aef413ea2a9 100644 > --- a/fs/ubifs/journal.c > +++ b/fs/ubifs/journal.c > @@ -585,7 +585,10 @@ int ubifs_jnl_update(struct ubifs_info *c, const struct > inode *dir, > > if (!xent) { > dent->ch.node_type = UBIFS_DENT_NODE; > - dent_key_init(c, _key, dir->i_ino, nm); > + if (nm->hash) > + dent_key_init_hash(c, _key, dir->i_ino, nm->hash); > + else > + dent_key_init(c, _key, dir->i_ino, nm); > } else { > dent->ch.node_type = UBIFS_XENT_NODE; > xent_key_init(c, _key, dir->i_ino, nm); > @@ -629,7 +632,10 @@ int ubifs_jnl_update(struct ubifs_info *c, const struct > inode *dir, > kfree(dent); > > if (deletion) { > - err = ubifs_tnc_remove_nm(c, _key, nm); > + if (nm->hash) > + err = ubifs_tnc_remove_dh(c, _key, nm->minor_hash); > + else > + err = ubifs_tnc_remove_nm(c, _key, nm); > if (err) > goto out_ro; > err = ubifs_add_dirt(c, lnum, dlen); > diff --git a/fs/ubifs/tnc.c b/fs/ubifs/tnc.c > index 709aa098dd46..d84f4ba467a3 100644 > --- a/fs/ubifs/tnc.c > +++ b/fs/ubifs/tnc.c > @@ -1880,48 +1880,65 @@ int ubifs_tnc_lookup_nm(struct ubifs_info *c, const > union ubifs_key *key, > return do_lookup_nm(c, key, node, nm); > } > > -static int do_lookup_dh(struct ubifs_info *c, const union ubifs_key *key, > - struct ubifs_dent_node *dent, uint32_t cookie) > +static int search_dh_cookie(struct ubifs_info *c, const union ubifs_key *key, > + struct ubifs_dent_node *dent, uint32_t cookie, > + struct ubifs_znode **zn, int *n) > { > - int n, err, type = key_type(c, key); > - struct ubifs_znode *znode; > + int err; i guess that err should be initialized with -ENOENT to avoid the first call of tnc_next(c, , n). > + struct ubifs_znode *znode = *zn; > struct ubifs_zbranch *zbr; > - union ubifs_key *dkey, start_key; > - > - ubifs_assert(is_hash_key(c, key)); > - > - lowest_dent_key(c, _key, key_inum(c, key)); > - > - mutex_lock(>tnc_mutex); > - err = ubifs_lookup_level0(c, _key, , ); > - if (unlikely(err < 0)) > - goto out_unlock; > + union ubifs_key *dkey; > > for (;;) { > if (!err) { > - err = tnc_next(c, , ); > + err = tnc_next(c, , n); > if (err) > - goto out_unlock; > + goto out; > } > > - zbr = >zbranch[n]; > + zbr = >zbranch[*n]; > dkey = >key; > > if (key_inum(c, dkey) != key_inum(c, key) || > - key_type(c, dkey) != type) { > + key_type(c, dkey) != key_type(c, key)) { > err = -ENOENT; > - goto out_unlock; > + goto out; > } > > err = tnc_read_hashed_node(c, zbr, dent); > if (err) > - goto out_unlock; > + goto out; > > if (key_hash(c, key) == key_hash(c, dkey) && > - le32_to_cpu(dent->cookie) == cookie) > - goto out_unlock; > + le32_to_cpu(dent->cookie) == cookie) { > + *zn = znode; > + goto out; > + } > } > > +out: > + > + return err; > +} > + > +static int do_lookup_dh(struct ubifs_info *c, const union ubifs_key *key, > + struct ubifs_dent_node *dent, uint32_t cookie) > +{ > + int n, err; > + struct ubifs_znode *znode; > + union ubifs_key start_key; > + > + ubifs_assert(is_hash_key(c, key)); > + > + lowest_dent_key(c, _key, key_inum(c, key)); > + > + mutex_lock(>tnc_mutex); > + err = ubifs_lookup_level0(c, _key, , ); > + if (unlikely(err < 0)) > + goto out_unlock; > + > + err = search_dh_cookie(c, key, dent, cookie, , ); > + > out_unlock: > mutex_unlock(>tnc_mutex); > return err;
Re: linux-4.11-rc1/drivers/gpu/drm/exynos/exynos5433_drm_decon.c:681: suspicious mask ?
Hello David, Thanks for report. 2017년 03월 06일 19:05에 David Binderman 이(가) 쓴 글: > Hello there, > > linux-4.11-rc1/drivers/gpu/drm/exynos/exynos5433_drm_decon.c:681]: (warning) > Result of operator '|' is always true if one operand is non-zero. Did you > intend to use '&'? > Right. this is known issue and below patch fixes this, http://www.spinics.net/lists/dri-devel/msg132589.html This patch will go to -fixes. > Source code is > > if (ctx->out_type | I80_HW_TRG) { > > Also in the same file: > > [drivers/gpu/drm/exynos/exynos5433_drm_decon.c:131]: (style) Same expression > on both sides of '|'. > > Source code is > >writel(TRIGCON_TE_AUTO_MASK | TRIGCON_SWTRIGEN >| TRIGCON_TE_AUTO_MASK | TRIGCON_SWTRIGEN, In this case, only problem is two flags are set in duplicate. This should be cleaned up. Thanks. :) > > Regards > > David Binderman > >
Re: [PATCH 2/4] ubifs: Fix unlink code wrt. double hash lookups
Richard, this patch works well. but i found some trivial mistakes. On Thu, Feb 09, 2017 at 10:28:35PM +0100, Richard Weinberger wrote: > When removing an encrypted file with a long anem and without having > the key we have to be able to locate and remove the directory entry > via a double hash. This corner case was simply forgotten. > > Reported-by: David Oberhollenzer > Signed-off-by: Richard Weinberger > --- > fs/ubifs/journal.c | 10 - > fs/ubifs/tnc.c | 129 > - > fs/ubifs/ubifs.h | 2 + > 3 files changed, 117 insertions(+), 24 deletions(-) > > diff --git a/fs/ubifs/journal.c b/fs/ubifs/journal.c > index f3b620cbdda4..7aef413ea2a9 100644 > --- a/fs/ubifs/journal.c > +++ b/fs/ubifs/journal.c > @@ -585,7 +585,10 @@ int ubifs_jnl_update(struct ubifs_info *c, const struct > inode *dir, > > if (!xent) { > dent->ch.node_type = UBIFS_DENT_NODE; > - dent_key_init(c, _key, dir->i_ino, nm); > + if (nm->hash) > + dent_key_init_hash(c, _key, dir->i_ino, nm->hash); > + else > + dent_key_init(c, _key, dir->i_ino, nm); > } else { > dent->ch.node_type = UBIFS_XENT_NODE; > xent_key_init(c, _key, dir->i_ino, nm); > @@ -629,7 +632,10 @@ int ubifs_jnl_update(struct ubifs_info *c, const struct > inode *dir, > kfree(dent); > > if (deletion) { > - err = ubifs_tnc_remove_nm(c, _key, nm); > + if (nm->hash) > + err = ubifs_tnc_remove_dh(c, _key, nm->minor_hash); > + else > + err = ubifs_tnc_remove_nm(c, _key, nm); > if (err) > goto out_ro; > err = ubifs_add_dirt(c, lnum, dlen); > diff --git a/fs/ubifs/tnc.c b/fs/ubifs/tnc.c > index 709aa098dd46..d84f4ba467a3 100644 > --- a/fs/ubifs/tnc.c > +++ b/fs/ubifs/tnc.c > @@ -1880,48 +1880,65 @@ int ubifs_tnc_lookup_nm(struct ubifs_info *c, const > union ubifs_key *key, > return do_lookup_nm(c, key, node, nm); > } > > -static int do_lookup_dh(struct ubifs_info *c, const union ubifs_key *key, > - struct ubifs_dent_node *dent, uint32_t cookie) > +static int search_dh_cookie(struct ubifs_info *c, const union ubifs_key *key, > + struct ubifs_dent_node *dent, uint32_t cookie, > + struct ubifs_znode **zn, int *n) > { > - int n, err, type = key_type(c, key); > - struct ubifs_znode *znode; > + int err; i guess that err should be initialized with -ENOENT to avoid the first call of tnc_next(c, , n). > + struct ubifs_znode *znode = *zn; > struct ubifs_zbranch *zbr; > - union ubifs_key *dkey, start_key; > - > - ubifs_assert(is_hash_key(c, key)); > - > - lowest_dent_key(c, _key, key_inum(c, key)); > - > - mutex_lock(>tnc_mutex); > - err = ubifs_lookup_level0(c, _key, , ); > - if (unlikely(err < 0)) > - goto out_unlock; > + union ubifs_key *dkey; > > for (;;) { > if (!err) { > - err = tnc_next(c, , ); > + err = tnc_next(c, , n); > if (err) > - goto out_unlock; > + goto out; > } > > - zbr = >zbranch[n]; > + zbr = >zbranch[*n]; > dkey = >key; > > if (key_inum(c, dkey) != key_inum(c, key) || > - key_type(c, dkey) != type) { > + key_type(c, dkey) != key_type(c, key)) { > err = -ENOENT; > - goto out_unlock; > + goto out; > } > > err = tnc_read_hashed_node(c, zbr, dent); > if (err) > - goto out_unlock; > + goto out; > > if (key_hash(c, key) == key_hash(c, dkey) && > - le32_to_cpu(dent->cookie) == cookie) > - goto out_unlock; > + le32_to_cpu(dent->cookie) == cookie) { > + *zn = znode; > + goto out; > + } > } > > +out: > + > + return err; > +} > + > +static int do_lookup_dh(struct ubifs_info *c, const union ubifs_key *key, > + struct ubifs_dent_node *dent, uint32_t cookie) > +{ > + int n, err; > + struct ubifs_znode *znode; > + union ubifs_key start_key; > + > + ubifs_assert(is_hash_key(c, key)); > + > + lowest_dent_key(c, _key, key_inum(c, key)); > + > + mutex_lock(>tnc_mutex); > + err = ubifs_lookup_level0(c, _key, , ); > + if (unlikely(err < 0)) > + goto out_unlock; > + > + err = search_dh_cookie(c, key, dent, cookie, , ); > + > out_unlock: > mutex_unlock(>tnc_mutex); > return err; > @@ -2663,6 +2680,74 @@ int
Re: linux-4.11-rc1/drivers/gpu/drm/exynos/exynos5433_drm_decon.c:681: suspicious mask ?
Hello David, Thanks for report. 2017년 03월 06일 19:05에 David Binderman 이(가) 쓴 글: > Hello there, > > linux-4.11-rc1/drivers/gpu/drm/exynos/exynos5433_drm_decon.c:681]: (warning) > Result of operator '|' is always true if one operand is non-zero. Did you > intend to use '&'? > Right. this is known issue and below patch fixes this, http://www.spinics.net/lists/dri-devel/msg132589.html This patch will go to -fixes. > Source code is > > if (ctx->out_type | I80_HW_TRG) { > > Also in the same file: > > [drivers/gpu/drm/exynos/exynos5433_drm_decon.c:131]: (style) Same expression > on both sides of '|'. > > Source code is > >writel(TRIGCON_TE_AUTO_MASK | TRIGCON_SWTRIGEN >| TRIGCON_TE_AUTO_MASK | TRIGCON_SWTRIGEN, In this case, only problem is two flags are set in duplicate. This should be cleaned up. Thanks. :) > > Regards > > David Binderman > >
Re: [RFC PATCH net] net: Work around lockdep limitation in sockets that use sockets
From: David HowellsDate: Mon, 06 Mar 2017 15:04:44 + > Fix the general case by: > > (1) Double up all the locking keys used in sockets so that one set are > used if the socket is created by userspace and the other set is used > if the socket is created by the kernel. > > (2) Store the kern parameter passed to sk_alloc() in a variable in the > sock struct (sk_kern_sock). This informs sock_lock_init(), > sock_init_data() and sk_clone_lock() as to the lock keys to be used. > > Note that the child created by sk_clone_lock() inherits the parent's > kern setting. > > (3) Add a 'kern' parameter to ->accept() that is analogous to the one > passed in to ->create() that distinguishes whether kernel_accept() or > sys_accept4() was the caller and can be passed to sk_alloc(). > > Note that a lot of accept functions merely dequeue an already > allocated socket. I haven't touched these as the new socket already > exists before we get the parameter. > > Note also that there are a couple of places where I've made the accepted > socket unconditionally kernel-based: > > irda_accept() > rds_rcp_accept_one() > tcp_accept_from_sock() > > because they follow a sock_create_kern() and accept off of that. I guess this is fine, but I think you can use one of the two "sk_padding" bits in struct sock instead of making the structure larger.
Re: [RFC PATCH net] net: Work around lockdep limitation in sockets that use sockets
From: David Howells Date: Mon, 06 Mar 2017 15:04:44 + > Fix the general case by: > > (1) Double up all the locking keys used in sockets so that one set are > used if the socket is created by userspace and the other set is used > if the socket is created by the kernel. > > (2) Store the kern parameter passed to sk_alloc() in a variable in the > sock struct (sk_kern_sock). This informs sock_lock_init(), > sock_init_data() and sk_clone_lock() as to the lock keys to be used. > > Note that the child created by sk_clone_lock() inherits the parent's > kern setting. > > (3) Add a 'kern' parameter to ->accept() that is analogous to the one > passed in to ->create() that distinguishes whether kernel_accept() or > sys_accept4() was the caller and can be passed to sk_alloc(). > > Note that a lot of accept functions merely dequeue an already > allocated socket. I haven't touched these as the new socket already > exists before we get the parameter. > > Note also that there are a couple of places where I've made the accepted > socket unconditionally kernel-based: > > irda_accept() > rds_rcp_accept_one() > tcp_accept_from_sock() > > because they follow a sock_create_kern() and accept off of that. I guess this is fine, but I think you can use one of the two "sk_padding" bits in struct sock instead of making the structure larger.
RE: [PATCH 22/29] drivers, scsi: convert iscsi_task.refcount from atomic_t to refcount_t
> On Mon, Mar 06, 2017 at 04:21:09PM +0200, Elena Reshetova wrote: > > refcount_t type and corresponding API should be > > used instead of atomic_t when the variable is used as > > a reference counter. This allows to avoid accidental > > refcounter overflows that might lead to use-after-free > > situations. > > > > Signed-off-by: Elena Reshetova> > Signed-off-by: Hans Liljestrand > > Signed-off-by: Kees Cook > > Signed-off-by: David Windsor > > This looks OK to me. > > Acked-by: Chris Leech Thank you for review! Do you have a tree that can take this change? Best Regards, Elena. > > > --- > > drivers/scsi/libiscsi.c| 8 > > drivers/scsi/qedi/qedi_iscsi.c | 2 +- > > include/scsi/libiscsi.h| 3 ++- > > 3 files changed, 7 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c > > index 834d121..7eb1d2c 100644 > > --- a/drivers/scsi/libiscsi.c > > +++ b/drivers/scsi/libiscsi.c > > @@ -516,13 +516,13 @@ static void iscsi_free_task(struct iscsi_task *task) > > > > void __iscsi_get_task(struct iscsi_task *task) > > { > > - atomic_inc(>refcount); > > + refcount_inc(>refcount); > > } > > EXPORT_SYMBOL_GPL(__iscsi_get_task); > > > > void __iscsi_put_task(struct iscsi_task *task) > > { > > - if (atomic_dec_and_test(>refcount)) > > + if (refcount_dec_and_test(>refcount)) > > iscsi_free_task(task); > > } > > EXPORT_SYMBOL_GPL(__iscsi_put_task); > > @@ -744,7 +744,7 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct > iscsi_hdr *hdr, > > * released by the lld when it has transmitted the task for > > * pdus we do not expect a response for. > > */ > > - atomic_set(>refcount, 1); > > + refcount_set(>refcount, 1); > > task->conn = conn; > > task->sc = NULL; > > INIT_LIST_HEAD(>running); > > @@ -1616,7 +1616,7 @@ static inline struct iscsi_task > > *iscsi_alloc_task(struct > iscsi_conn *conn, > > sc->SCp.phase = conn->session->age; > > sc->SCp.ptr = (char *) task; > > > > - atomic_set(>refcount, 1); > > + refcount_set(>refcount, 1); > > task->state = ISCSI_TASK_PENDING; > > task->conn = conn; > > task->sc = sc; > > diff --git a/drivers/scsi/qedi/qedi_iscsi.c b/drivers/scsi/qedi/qedi_iscsi.c > > index b9f79d3..3895bd5 100644 > > --- a/drivers/scsi/qedi/qedi_iscsi.c > > +++ b/drivers/scsi/qedi/qedi_iscsi.c > > @@ -1372,7 +1372,7 @@ static void qedi_cleanup_task(struct iscsi_task *task) > > { > > if (!task->sc || task->state == ISCSI_TASK_PENDING) { > > QEDI_INFO(NULL, QEDI_LOG_IO, "Returning > ref_cnt=%d\n", > > - atomic_read(>refcount)); > > + refcount_read(>refcount)); > > return; > > } > > > > diff --git a/include/scsi/libiscsi.h b/include/scsi/libiscsi.h > > index b0e275d..24d74b5 100644 > > --- a/include/scsi/libiscsi.h > > +++ b/include/scsi/libiscsi.h > > @@ -29,6 +29,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -139,7 +140,7 @@ struct iscsi_task { > > > > /* state set/tested under session->lock */ > > int state; > > - atomic_trefcount; > > + refcount_t refcount; > > struct list_headrunning;/* running cmd list */ > > void*dd_data; /* > driver/transport data */ > > }; > > -- > > 2.7.4 > >
RE: [Xen-devel] [PATCH 29/29] drivers, xen: convert grant_map.users from atomic_t to refcount_t
> On 03/08/2017 08:49 AM, Reshetova, Elena wrote: > >> On 03/06/2017 09:21 AM, Elena Reshetova wrote: > >>> refcount_t type and corresponding API should be > >>> used instead of atomic_t when the variable is used as > >>> a reference counter. This allows to avoid accidental > >>> refcounter overflows that might lead to use-after-free > >>> situations. > >>> > >>> Signed-off-by: Elena Reshetova> >>> Signed-off-by: Hans Liljestrand > >>> Signed-off-by: Kees Cook > >>> Signed-off-by: David Windsor > >>> --- > >>> drivers/xen/gntdev.c | 11 ++- > >>> 1 file changed, 6 insertions(+), 5 deletions(-) > >> Reviewed-by: Boris Ostrovsky > > Is there a tree that can take this change? Turns out it is better to > > propagate > changes via separate trees and only leftovers can be taken via Greg's tree. > > > > Sure, we can take it via Xen tree for rc3. Thank you very much! Best Regards, Elena. > > -boris
RE: [PATCH 22/29] drivers, scsi: convert iscsi_task.refcount from atomic_t to refcount_t
> On Mon, Mar 06, 2017 at 04:21:09PM +0200, Elena Reshetova wrote: > > refcount_t type and corresponding API should be > > used instead of atomic_t when the variable is used as > > a reference counter. This allows to avoid accidental > > refcounter overflows that might lead to use-after-free > > situations. > > > > Signed-off-by: Elena Reshetova > > Signed-off-by: Hans Liljestrand > > Signed-off-by: Kees Cook > > Signed-off-by: David Windsor > > This looks OK to me. > > Acked-by: Chris Leech Thank you for review! Do you have a tree that can take this change? Best Regards, Elena. > > > --- > > drivers/scsi/libiscsi.c| 8 > > drivers/scsi/qedi/qedi_iscsi.c | 2 +- > > include/scsi/libiscsi.h| 3 ++- > > 3 files changed, 7 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c > > index 834d121..7eb1d2c 100644 > > --- a/drivers/scsi/libiscsi.c > > +++ b/drivers/scsi/libiscsi.c > > @@ -516,13 +516,13 @@ static void iscsi_free_task(struct iscsi_task *task) > > > > void __iscsi_get_task(struct iscsi_task *task) > > { > > - atomic_inc(>refcount); > > + refcount_inc(>refcount); > > } > > EXPORT_SYMBOL_GPL(__iscsi_get_task); > > > > void __iscsi_put_task(struct iscsi_task *task) > > { > > - if (atomic_dec_and_test(>refcount)) > > + if (refcount_dec_and_test(>refcount)) > > iscsi_free_task(task); > > } > > EXPORT_SYMBOL_GPL(__iscsi_put_task); > > @@ -744,7 +744,7 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct > iscsi_hdr *hdr, > > * released by the lld when it has transmitted the task for > > * pdus we do not expect a response for. > > */ > > - atomic_set(>refcount, 1); > > + refcount_set(>refcount, 1); > > task->conn = conn; > > task->sc = NULL; > > INIT_LIST_HEAD(>running); > > @@ -1616,7 +1616,7 @@ static inline struct iscsi_task > > *iscsi_alloc_task(struct > iscsi_conn *conn, > > sc->SCp.phase = conn->session->age; > > sc->SCp.ptr = (char *) task; > > > > - atomic_set(>refcount, 1); > > + refcount_set(>refcount, 1); > > task->state = ISCSI_TASK_PENDING; > > task->conn = conn; > > task->sc = sc; > > diff --git a/drivers/scsi/qedi/qedi_iscsi.c b/drivers/scsi/qedi/qedi_iscsi.c > > index b9f79d3..3895bd5 100644 > > --- a/drivers/scsi/qedi/qedi_iscsi.c > > +++ b/drivers/scsi/qedi/qedi_iscsi.c > > @@ -1372,7 +1372,7 @@ static void qedi_cleanup_task(struct iscsi_task *task) > > { > > if (!task->sc || task->state == ISCSI_TASK_PENDING) { > > QEDI_INFO(NULL, QEDI_LOG_IO, "Returning > ref_cnt=%d\n", > > - atomic_read(>refcount)); > > + refcount_read(>refcount)); > > return; > > } > > > > diff --git a/include/scsi/libiscsi.h b/include/scsi/libiscsi.h > > index b0e275d..24d74b5 100644 > > --- a/include/scsi/libiscsi.h > > +++ b/include/scsi/libiscsi.h > > @@ -29,6 +29,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -139,7 +140,7 @@ struct iscsi_task { > > > > /* state set/tested under session->lock */ > > int state; > > - atomic_trefcount; > > + refcount_t refcount; > > struct list_headrunning;/* running cmd list */ > > void*dd_data; /* > driver/transport data */ > > }; > > -- > > 2.7.4 > >
RE: [Xen-devel] [PATCH 29/29] drivers, xen: convert grant_map.users from atomic_t to refcount_t
> On 03/08/2017 08:49 AM, Reshetova, Elena wrote: > >> On 03/06/2017 09:21 AM, Elena Reshetova wrote: > >>> refcount_t type and corresponding API should be > >>> used instead of atomic_t when the variable is used as > >>> a reference counter. This allows to avoid accidental > >>> refcounter overflows that might lead to use-after-free > >>> situations. > >>> > >>> Signed-off-by: Elena Reshetova > >>> Signed-off-by: Hans Liljestrand > >>> Signed-off-by: Kees Cook > >>> Signed-off-by: David Windsor > >>> --- > >>> drivers/xen/gntdev.c | 11 ++- > >>> 1 file changed, 6 insertions(+), 5 deletions(-) > >> Reviewed-by: Boris Ostrovsky > > Is there a tree that can take this change? Turns out it is better to > > propagate > changes via separate trees and only leftovers can be taken via Greg's tree. > > > > Sure, we can take it via Xen tree for rc3. Thank you very much! Best Regards, Elena. > > -boris
Re: [PATCH 1/3] usb: dwc3-omap: Fix missing break in dwc3_omap_set_mailbox()
Felipe, On 15/02/17 13:38, Roger Quadros wrote: > We need to break from all cases if we want to treat > each one of them separately. > > Reported-by: Gustavo A. R. Silva> Fixes: d2728fb3e01f ("usb: dwc3: omap: Pass VBUS and ID events transparently") > Cc: #v4.8+ > Signed-off-by: Roger Quadros Can you please pick this one for v4.11-rc? Thanks. > --- > drivers/usb/dwc3/dwc3-omap.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/usb/dwc3/dwc3-omap.c b/drivers/usb/dwc3/dwc3-omap.c > index eb1b9cb..35b6351 100644 > --- a/drivers/usb/dwc3/dwc3-omap.c > +++ b/drivers/usb/dwc3/dwc3-omap.c > @@ -250,6 +250,7 @@ static void dwc3_omap_set_mailbox(struct dwc3_omap *omap, > val = dwc3_omap_read_utmi_ctrl(omap); > val |= USBOTGSS_UTMI_OTG_CTRL_IDDIG; > dwc3_omap_write_utmi_ctrl(omap, val); > + break; > > case OMAP_DWC3_VBUS_OFF: > val = dwc3_omap_read_utmi_ctrl(omap); > -- cheers, -roger
Re: [PATCH 1/3] usb: dwc3-omap: Fix missing break in dwc3_omap_set_mailbox()
Felipe, On 15/02/17 13:38, Roger Quadros wrote: > We need to break from all cases if we want to treat > each one of them separately. > > Reported-by: Gustavo A. R. Silva > Fixes: d2728fb3e01f ("usb: dwc3: omap: Pass VBUS and ID events transparently") > Cc: #v4.8+ > Signed-off-by: Roger Quadros Can you please pick this one for v4.11-rc? Thanks. > --- > drivers/usb/dwc3/dwc3-omap.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/usb/dwc3/dwc3-omap.c b/drivers/usb/dwc3/dwc3-omap.c > index eb1b9cb..35b6351 100644 > --- a/drivers/usb/dwc3/dwc3-omap.c > +++ b/drivers/usb/dwc3/dwc3-omap.c > @@ -250,6 +250,7 @@ static void dwc3_omap_set_mailbox(struct dwc3_omap *omap, > val = dwc3_omap_read_utmi_ctrl(omap); > val |= USBOTGSS_UTMI_OTG_CTRL_IDDIG; > dwc3_omap_write_utmi_ctrl(omap, val); > + break; > > case OMAP_DWC3_VBUS_OFF: > val = dwc3_omap_read_utmi_ctrl(omap); > -- cheers, -roger
Re: [PATCH v2] selinux: check for address length in selinux_socket_bind()
From: Alexander PotapenkoDate: Mon, 6 Mar 2017 19:46:14 +0100 > KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of > uninitialized memory in selinux_socket_bind(): ... > (the line numbers are relative to 4.8-rc6, but the bug persists upstream) > > , when I run the following program as root: ... > (for different values of |size| other error reports are printed). > > This happens because bind() unconditionally copies |size| bytes of > |addr| to the kernel, leaving the rest uninitialized. Then > security_socket_bind() reads the IP address bytes, including the > uninitialized ones, to determine the port, or e.g. pass them further to > sel_netnode_find(), which uses them to calculate a hash. > > Signed-off-by: Alexander Potapenko Are the SELINUX folks going to pick this up or should I?
Re: [PATCH v5 04/19] net: e100: Replace PCI pool old API
On Wed, Mar 08, 2017 at 02:40:25PM -0800, Jeff Kirsher wrote: > On Wed, 2017-03-08 at 17:19 +0100, Romain Perier wrote: > > The PCI pool API is deprecated. This commit replaces the PCI pool old > > API by the appropriate function with the DMA pool API. > > > > Signed-off-by: Romain Perier> > Acked-by: Peter Senna Tschudin > > Tested-by: Peter Senna Tschudin > > --- > > drivers/net/ethernet/intel/e100.c | 12 ++-- > > 1 file changed, 6 insertions(+), 6 deletions(-) > > Acked-by: Jeff Kirsher > > My only concern is: > - what hardware did this get tested with? Since this affects all e100 > parts, it would be hard to believe that all the affected hardware was > used in testing. This was tested by compilation only(See https://lkml.org/lkml/2017/2/8/661). However this series removes macro definitions of the old pci_pool interface and replace call sites by what the macra was calling. Here are the macros that this series removes from include/pci.h: #define pci_pool dma_pool #define pci_pool_create(name, pdev, size, align, allocation) \ dma_pool_create(name, >dev, size, align, allocation) #define pci_pool_destroy(pool) dma_pool_destroy(pool) #define pci_pool_alloc(pool, flags, handle) dma_pool_alloc(pool, flags, handle) #define pci_pool_zalloc(pool, flags, handle) \ dma_pool_zalloc(pool, flags, handle) #define pci_pool_free(pool, vaddr, addr) dma_pool_free(pool, vaddr, add So this should not affect run time.
Re: [PATCH v2] selinux: check for address length in selinux_socket_bind()
From: Alexander Potapenko Date: Mon, 6 Mar 2017 19:46:14 +0100 > KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of > uninitialized memory in selinux_socket_bind(): ... > (the line numbers are relative to 4.8-rc6, but the bug persists upstream) > > , when I run the following program as root: ... > (for different values of |size| other error reports are printed). > > This happens because bind() unconditionally copies |size| bytes of > |addr| to the kernel, leaving the rest uninitialized. Then > security_socket_bind() reads the IP address bytes, including the > uninitialized ones, to determine the port, or e.g. pass them further to > sel_netnode_find(), which uses them to calculate a hash. > > Signed-off-by: Alexander Potapenko Are the SELINUX folks going to pick this up or should I?
Re: [PATCH v5 04/19] net: e100: Replace PCI pool old API
On Wed, Mar 08, 2017 at 02:40:25PM -0800, Jeff Kirsher wrote: > On Wed, 2017-03-08 at 17:19 +0100, Romain Perier wrote: > > The PCI pool API is deprecated. This commit replaces the PCI pool old > > API by the appropriate function with the DMA pool API. > > > > Signed-off-by: Romain Perier > > Acked-by: Peter Senna Tschudin > > Tested-by: Peter Senna Tschudin > > --- > > drivers/net/ethernet/intel/e100.c | 12 ++-- > > 1 file changed, 6 insertions(+), 6 deletions(-) > > Acked-by: Jeff Kirsher > > My only concern is: > - what hardware did this get tested with? Since this affects all e100 > parts, it would be hard to believe that all the affected hardware was > used in testing. This was tested by compilation only(See https://lkml.org/lkml/2017/2/8/661). However this series removes macro definitions of the old pci_pool interface and replace call sites by what the macra was calling. Here are the macros that this series removes from include/pci.h: #define pci_pool dma_pool #define pci_pool_create(name, pdev, size, align, allocation) \ dma_pool_create(name, >dev, size, align, allocation) #define pci_pool_destroy(pool) dma_pool_destroy(pool) #define pci_pool_alloc(pool, flags, handle) dma_pool_alloc(pool, flags, handle) #define pci_pool_zalloc(pool, flags, handle) \ dma_pool_zalloc(pool, flags, handle) #define pci_pool_free(pool, vaddr, addr) dma_pool_free(pool, vaddr, add So this should not affect run time.
[PATCH 3/6] mm/migrate: Add copy_pages_mthread function
From: Zi YanThis change adds a new function copy_pages_mthread to enable multi threaded page copy which can be utilized during migration. This function splits the page copy request into multiple threads which will handle individual chunk and send them as jobs to system_highpri_wq work queue. Signed-off-by: Zi Yan Signed-off-by: Anshuman Khandual --- * Updated cthread calculations, taking care of divide by zero issues, picking up the right single thread, defining NR_COPYTHREADS, fixing the build problem on i386 etc. include/linux/highmem.h | 2 ++ mm/Makefile | 2 ++ mm/copy_pages_mthread.c | 95 + 3 files changed, 99 insertions(+) create mode 100644 mm/copy_pages_mthread.c diff --git a/include/linux/highmem.h b/include/linux/highmem.h index bb3f329..e1f4f1b 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -236,6 +236,8 @@ static inline void copy_user_highpage(struct page *to, struct page *from, #endif +int copy_pages_mthread(struct page *to, struct page *from, int nr_pages); + static inline void copy_highpage(struct page *to, struct page *from) { char *vfrom, *vto; diff --git a/mm/Makefile b/mm/Makefile index 295bd7a..cc27e76 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -41,6 +41,8 @@ obj-y := filemap.o mempool.o oom_kill.o \ obj-y += init-mm.o +obj-y += copy_pages_mthread.o + ifdef CONFIG_NO_BOOTMEM obj-y += nobootmem.o else diff --git a/mm/copy_pages_mthread.c b/mm/copy_pages_mthread.c new file mode 100644 index 000..5af861c --- /dev/null +++ b/mm/copy_pages_mthread.c @@ -0,0 +1,95 @@ +/* + * This implements parallel page copy function through multi + * threaded work queues. + * + * Copyright (C) Zi Yan , Nov 2016 + * + * Licensed under the terms of the GNU GPL, version 2. + */ +#include +#include +#include +#include + +/* + * NR_COPYTHREADS can be the highest number of threads for given + * node on any architecture. The actual number of copy threads + * will be limited by the cpumask weight of the target node. + */ +#define NR_COPYTHREADS 8 + +struct copy_info { + struct work_struct copy_work; + char *to; + char *from; + unsigned long chunk_size; +}; + +static void copy_pages(char *vto, char *vfrom, unsigned long size) +{ + memcpy(vto, vfrom, size); +} + +static void copythread(struct work_struct *work) +{ + struct copy_info *info = (struct copy_info *) work; + + copy_pages(info->to, info->from, info->chunk_size); +} + +int copy_pages_mthread(struct page *to, struct page *from, int nr_pages) +{ + struct cpumask *cpumask; + struct copy_info *work_items; + char *vto, *vfrom; + unsigned long i, cthreads, cpu, node, chunk_size; + int cpu_id_list[NR_COPYTHREADS] = {0}; + + node = page_to_nid(to); + cpumask = (struct cpumask *) cpumask_of_node(node); + cthreads = min_t(unsigned int, NR_COPYTHREADS, cpumask_weight(cpumask)); + cthreads = (cthreads / 2) * 2; + if (!cthreads) + cthreads = 1; + + work_items = kcalloc(cthreads, sizeof(struct copy_info), GFP_KERNEL); + if (!work_items) + return -ENOMEM; + + /* +* XXX: On a memory-only CPU-less NUMA node it will +* just fallback using cpu[0] in a single threaded +* manner to do the page copy. On single CPU target +* node that CPU will be used for the page copy. +*/ + i = 0; + for_each_cpu(cpu, cpumask) { + if (i >= cthreads) + break; + cpu_id_list[i] = cpu; + ++i; + } + + vfrom = kmap(from); + vto = kmap(to); + chunk_size = PAGE_SIZE * nr_pages / cthreads; + + for (i = 0; i < cthreads; ++i) { + INIT_WORK((struct work_struct *) _items[i], copythread); + + work_items[i].to = vto + i * chunk_size; + work_items[i].from = vfrom + i * chunk_size; + work_items[i].chunk_size = chunk_size; + + queue_work_on(cpu_id_list[i], system_highpri_wq, + (struct work_struct *) _items[i]); + } + + for (i = 0; i < cthreads; ++i) + flush_work((struct work_struct *) _items[i]); + + kunmap(to); + kunmap(from); + kfree(work_items); + return 0; +} -- 2.1.4
Re: [PATCH v7 kernel 2/5] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER
On 03/08/2017 12:01 PM, Michael S. Tsirkin wrote: On Fri, Mar 03, 2017 at 01:40:27PM +0800, Wei Wang wrote: From: Liang LiAdd a new feature bit, VIRTIO_BALLOON_F_CHUNK_TRANSFER. Please check the implementation patch commit for details about this feature. better squash into next patch. OK, will do. Best, Wei
[PATCH 3/6] mm/migrate: Add copy_pages_mthread function
From: Zi Yan This change adds a new function copy_pages_mthread to enable multi threaded page copy which can be utilized during migration. This function splits the page copy request into multiple threads which will handle individual chunk and send them as jobs to system_highpri_wq work queue. Signed-off-by: Zi Yan Signed-off-by: Anshuman Khandual --- * Updated cthread calculations, taking care of divide by zero issues, picking up the right single thread, defining NR_COPYTHREADS, fixing the build problem on i386 etc. include/linux/highmem.h | 2 ++ mm/Makefile | 2 ++ mm/copy_pages_mthread.c | 95 + 3 files changed, 99 insertions(+) create mode 100644 mm/copy_pages_mthread.c diff --git a/include/linux/highmem.h b/include/linux/highmem.h index bb3f329..e1f4f1b 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -236,6 +236,8 @@ static inline void copy_user_highpage(struct page *to, struct page *from, #endif +int copy_pages_mthread(struct page *to, struct page *from, int nr_pages); + static inline void copy_highpage(struct page *to, struct page *from) { char *vfrom, *vto; diff --git a/mm/Makefile b/mm/Makefile index 295bd7a..cc27e76 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -41,6 +41,8 @@ obj-y := filemap.o mempool.o oom_kill.o \ obj-y += init-mm.o +obj-y += copy_pages_mthread.o + ifdef CONFIG_NO_BOOTMEM obj-y += nobootmem.o else diff --git a/mm/copy_pages_mthread.c b/mm/copy_pages_mthread.c new file mode 100644 index 000..5af861c --- /dev/null +++ b/mm/copy_pages_mthread.c @@ -0,0 +1,95 @@ +/* + * This implements parallel page copy function through multi + * threaded work queues. + * + * Copyright (C) Zi Yan , Nov 2016 + * + * Licensed under the terms of the GNU GPL, version 2. + */ +#include +#include +#include +#include + +/* + * NR_COPYTHREADS can be the highest number of threads for given + * node on any architecture. The actual number of copy threads + * will be limited by the cpumask weight of the target node. + */ +#define NR_COPYTHREADS 8 + +struct copy_info { + struct work_struct copy_work; + char *to; + char *from; + unsigned long chunk_size; +}; + +static void copy_pages(char *vto, char *vfrom, unsigned long size) +{ + memcpy(vto, vfrom, size); +} + +static void copythread(struct work_struct *work) +{ + struct copy_info *info = (struct copy_info *) work; + + copy_pages(info->to, info->from, info->chunk_size); +} + +int copy_pages_mthread(struct page *to, struct page *from, int nr_pages) +{ + struct cpumask *cpumask; + struct copy_info *work_items; + char *vto, *vfrom; + unsigned long i, cthreads, cpu, node, chunk_size; + int cpu_id_list[NR_COPYTHREADS] = {0}; + + node = page_to_nid(to); + cpumask = (struct cpumask *) cpumask_of_node(node); + cthreads = min_t(unsigned int, NR_COPYTHREADS, cpumask_weight(cpumask)); + cthreads = (cthreads / 2) * 2; + if (!cthreads) + cthreads = 1; + + work_items = kcalloc(cthreads, sizeof(struct copy_info), GFP_KERNEL); + if (!work_items) + return -ENOMEM; + + /* +* XXX: On a memory-only CPU-less NUMA node it will +* just fallback using cpu[0] in a single threaded +* manner to do the page copy. On single CPU target +* node that CPU will be used for the page copy. +*/ + i = 0; + for_each_cpu(cpu, cpumask) { + if (i >= cthreads) + break; + cpu_id_list[i] = cpu; + ++i; + } + + vfrom = kmap(from); + vto = kmap(to); + chunk_size = PAGE_SIZE * nr_pages / cthreads; + + for (i = 0; i < cthreads; ++i) { + INIT_WORK((struct work_struct *) _items[i], copythread); + + work_items[i].to = vto + i * chunk_size; + work_items[i].from = vfrom + i * chunk_size; + work_items[i].chunk_size = chunk_size; + + queue_work_on(cpu_id_list[i], system_highpri_wq, + (struct work_struct *) _items[i]); + } + + for (i = 0; i < cthreads; ++i) + flush_work((struct work_struct *) _items[i]); + + kunmap(to); + kunmap(from); + kfree(work_items); + return 0; +} -- 2.1.4
Re: [PATCH v7 kernel 2/5] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER
On 03/08/2017 12:01 PM, Michael S. Tsirkin wrote: On Fri, Mar 03, 2017 at 01:40:27PM +0800, Wei Wang wrote: From: Liang Li Add a new feature bit, VIRTIO_BALLOON_F_CHUNK_TRANSFER. Please check the implementation patch commit for details about this feature. better squash into next patch. OK, will do. Best, Wei
Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing
This is a reply to a thread from back in Nov 2016... Linus> The thing is, with function tracing, you *can* get the return value Linus> and arguments. Sure, you'll probably need to write eBPF and just Linus> attach it to that fentry call point, and yes, if something is inlined Linus> you're just screwed, but Christ, if you do debugging that way you Linus> shouldn't be writing kernel code in the first place. I've been thinking about this thread ever since then... lately I've been reading about eBPF and looking at Brendan Gregg's bcc tool, and reading about kprobes and I made some test tracepoints too... Orangefs has a function: orangefs_create(struct inode *dir, struct dentry *dentry, umode_t mode, bool exclusive) I found it was easy to use a kprobe to get a function's return code: echo 'r:orangefs_create_retprobe orangefs_create $retval' >> /sys/kernel/debug/tracing/kprobe_events It was also easy to use a kprobe to print the arguments to a function, but since arguments are often pointers, maybe the pointer value isn't really what you want to see, rather you might wish you could "see into" a structure that you have a pointer to... Here's a kprobe for looking at the arguments to a function, the funny di, si, dx, stack stuff has to do with a particular architecture's parameter passing mechanism as it relates to registers and the stack... I mention that because the example in kprobetrace.txt seems to be for a 32-bit computer and I'm (like most of us?) on a 64-bit computer. echo 'p:orangefs_create_probe orangefs_create dir=%di dentry=%si mode=%dx exclusive=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events I wrote a bcc program to extract the name of an object from the dentry struct passed into orangefs_create. The bcc program is kind of verbose, but is mostly "templatey", here it is... is this the kind of thing Linus was talking about? Either way, I think it is kind of cool ... #!/usr/bin/python # # Brendan Gregg's mdflush used as a template. # from __future__ import print_function from bcc import BPF from time import strftime import ctypes as ct # load BPF program b = BPF(text=""" #include #include #include #include #include struct data_t { u64 pid; char comm[TASK_COMM_LEN]; char objname[NAME_MAX]; }; BPF_PERF_OUTPUT(events); int kprobe__orangefs_create(struct pt_regs *ctx, void *dir, struct dentry *dentry) { struct data_t data = {}; u32 pid = bpf_get_current_pid_tgid(); data.pid = pid; bpf_get_current_comm(, sizeof(data.comm)); bpf_probe_read(, sizeof(data.objname), (void *)dentry->d_name.name); events.perf_submit(ctx, , sizeof(data)); return 0; } """) # event data TASK_COMM_LEN = 16 # linux/sched.h NAME_MAX = 255 # uapi/linux/limits.h class Data(ct.Structure): _fields_ = [ ("pid", ct.c_ulonglong), ("comm", ct.c_char * TASK_COMM_LEN), ("objname", ct.c_char * NAME_MAX) ] # header print("orangefs creates... Hit Ctrl-C to end.") print("%-8s %-6s %-16s %s" % ("TIME", "PID", "COMM", "OBJNAME")) # process event # print_event is the callback invoked from open_perf_buffers... # cpu refers to a set of per-cpu ring buffers that will receive the event data. def print_event(cpu, data, size): event = ct.cast(data, ct.POINTER(Data)).contents print("%-8s %-6d %-16s %s" % (strftime("%H:%M:%S"), event.pid, event.comm, event.objname)) # read events b["events"].open_perf_buffer(print_event) while 1: b.kprobe_poll() -Mike On Mon, Nov 28, 2016 at 9:02 PM, Dave Chinnerwrote: > On Mon, Nov 28, 2016 at 03:46:51PM -0700, Ross Zwisler wrote: >> On Fri, Nov 25, 2016 at 02:00:59PM +1100, Dave Chinner wrote: >> > On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote: >> > > Tracepoints are the standard way to capture debugging and tracing >> > > information in many parts of the kernel, including the XFS and ext4 >> > > filesystems. Create a tracepoint header for FS DAX and add the first DAX >> > > tracepoints to the PMD fault handler. This allows the tracing for DAX to >> > > be done in the same way as the filesystem tracing so that developers can >> > > look at them together and get a coherent idea of what the system is >> > > doing. >> > > >> > > I added both an entry and exit tracepoint because future patches will add >> > > tracepoints to child functions of dax_iomap_pmd_fault() like >> > > dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages >> > > to >> > > be wrapped by the parent function tracepoints so the code flow is more >> > > easily understood. Having entry and exit tracepoints for faults also >> > > allows us to easily see what filesystems functions were called during the >> > > fault. These filesystem functions get executed via iomap_begin() and >> > > iomap_end() calls, for
Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing
This is a reply to a thread from back in Nov 2016... Linus> The thing is, with function tracing, you *can* get the return value Linus> and arguments. Sure, you'll probably need to write eBPF and just Linus> attach it to that fentry call point, and yes, if something is inlined Linus> you're just screwed, but Christ, if you do debugging that way you Linus> shouldn't be writing kernel code in the first place. I've been thinking about this thread ever since then... lately I've been reading about eBPF and looking at Brendan Gregg's bcc tool, and reading about kprobes and I made some test tracepoints too... Orangefs has a function: orangefs_create(struct inode *dir, struct dentry *dentry, umode_t mode, bool exclusive) I found it was easy to use a kprobe to get a function's return code: echo 'r:orangefs_create_retprobe orangefs_create $retval' >> /sys/kernel/debug/tracing/kprobe_events It was also easy to use a kprobe to print the arguments to a function, but since arguments are often pointers, maybe the pointer value isn't really what you want to see, rather you might wish you could "see into" a structure that you have a pointer to... Here's a kprobe for looking at the arguments to a function, the funny di, si, dx, stack stuff has to do with a particular architecture's parameter passing mechanism as it relates to registers and the stack... I mention that because the example in kprobetrace.txt seems to be for a 32-bit computer and I'm (like most of us?) on a 64-bit computer. echo 'p:orangefs_create_probe orangefs_create dir=%di dentry=%si mode=%dx exclusive=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events I wrote a bcc program to extract the name of an object from the dentry struct passed into orangefs_create. The bcc program is kind of verbose, but is mostly "templatey", here it is... is this the kind of thing Linus was talking about? Either way, I think it is kind of cool ... #!/usr/bin/python # # Brendan Gregg's mdflush used as a template. # from __future__ import print_function from bcc import BPF from time import strftime import ctypes as ct # load BPF program b = BPF(text=""" #include #include #include #include #include struct data_t { u64 pid; char comm[TASK_COMM_LEN]; char objname[NAME_MAX]; }; BPF_PERF_OUTPUT(events); int kprobe__orangefs_create(struct pt_regs *ctx, void *dir, struct dentry *dentry) { struct data_t data = {}; u32 pid = bpf_get_current_pid_tgid(); data.pid = pid; bpf_get_current_comm(, sizeof(data.comm)); bpf_probe_read(, sizeof(data.objname), (void *)dentry->d_name.name); events.perf_submit(ctx, , sizeof(data)); return 0; } """) # event data TASK_COMM_LEN = 16 # linux/sched.h NAME_MAX = 255 # uapi/linux/limits.h class Data(ct.Structure): _fields_ = [ ("pid", ct.c_ulonglong), ("comm", ct.c_char * TASK_COMM_LEN), ("objname", ct.c_char * NAME_MAX) ] # header print("orangefs creates... Hit Ctrl-C to end.") print("%-8s %-6s %-16s %s" % ("TIME", "PID", "COMM", "OBJNAME")) # process event # print_event is the callback invoked from open_perf_buffers... # cpu refers to a set of per-cpu ring buffers that will receive the event data. def print_event(cpu, data, size): event = ct.cast(data, ct.POINTER(Data)).contents print("%-8s %-6d %-16s %s" % (strftime("%H:%M:%S"), event.pid, event.comm, event.objname)) # read events b["events"].open_perf_buffer(print_event) while 1: b.kprobe_poll() -Mike On Mon, Nov 28, 2016 at 9:02 PM, Dave Chinner wrote: > On Mon, Nov 28, 2016 at 03:46:51PM -0700, Ross Zwisler wrote: >> On Fri, Nov 25, 2016 at 02:00:59PM +1100, Dave Chinner wrote: >> > On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote: >> > > Tracepoints are the standard way to capture debugging and tracing >> > > information in many parts of the kernel, including the XFS and ext4 >> > > filesystems. Create a tracepoint header for FS DAX and add the first DAX >> > > tracepoints to the PMD fault handler. This allows the tracing for DAX to >> > > be done in the same way as the filesystem tracing so that developers can >> > > look at them together and get a coherent idea of what the system is >> > > doing. >> > > >> > > I added both an entry and exit tracepoint because future patches will add >> > > tracepoints to child functions of dax_iomap_pmd_fault() like >> > > dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages >> > > to >> > > be wrapped by the parent function tracepoints so the code flow is more >> > > easily understood. Having entry and exit tracepoints for faults also >> > > allows us to easily see what filesystems functions were called during the >> > > fault. These filesystem functions get executed via iomap_begin() and >> > > iomap_end() calls, for example, and will have
RE: [PATCH 3/5] fs, xfs: convert xlog_ticket.t_ref from atomic_t to refcount_t
> On Wed, Mar 8, 2017 at 7:50 AM, Christoph Hellwigwrote: > >> - ASSERT(atomic_read(>t_ref) > 0); > >> - atomic_inc(>t_ref); > >> + ASSERT(refcount_read(>t_ref) > 0); > >> + refcount_inc(>t_ref); > > > > With strict refcount semantics refcount_inc should check that > > the count is larger than 0, otherwise we'd need to use > > recount_inc_not_zero or whatever you're going to call it. > > > > Is that something the recount code does / could do? > > Yes, refcount_inc() will not increment from 0 and WARNs. It looks like > xfs's ASSERT is also a warn (though with XFS-specific formatting), so > perhaps the ASSERT could be dropped? IIUC, Elena's approach to these > changes was to be conservative about removing the existing checks. I am removing the existing WARNs now where they are 100% not needed, but leaving the ones like this ASSERT, because I am thinking that you might have test cases that are dependent on these ASSERTs with specific formatting and I don't want to break them. If you don't need them, I can remove them also. Best Regards, Elena > > -Kees > > -- > Kees Cook > Pixel Security
RE: [PATCH 3/5] fs, xfs: convert xlog_ticket.t_ref from atomic_t to refcount_t
> On Wed, Mar 8, 2017 at 7:50 AM, Christoph Hellwig wrote: > >> - ASSERT(atomic_read(>t_ref) > 0); > >> - atomic_inc(>t_ref); > >> + ASSERT(refcount_read(>t_ref) > 0); > >> + refcount_inc(>t_ref); > > > > With strict refcount semantics refcount_inc should check that > > the count is larger than 0, otherwise we'd need to use > > recount_inc_not_zero or whatever you're going to call it. > > > > Is that something the recount code does / could do? > > Yes, refcount_inc() will not increment from 0 and WARNs. It looks like > xfs's ASSERT is also a warn (though with XFS-specific formatting), so > perhaps the ASSERT could be dropped? IIUC, Elena's approach to these > changes was to be conservative about removing the existing checks. I am removing the existing WARNs now where they are 100% not needed, but leaving the ones like this ASSERT, because I am thinking that you might have test cases that are dependent on these ASSERTs with specific formatting and I don't want to break them. If you don't need them, I can remove them also. Best Regards, Elena > > -Kees > > -- > Kees Cook > Pixel Security
[PATCH 3/3] Documentation: cpu-freq: cpu-drivers: Fix repetition of word 'to'
The patch replaces 'to to' with 'to' in the documentation. Signed-off-by: sayli karnik--- Documentation/cpu-freq/cpu-drivers.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/cpu-freq/cpu-drivers.txt b/Documentation/cpu-freq/cpu-drivers.txt index f71e6be..434c49c 100644 --- a/Documentation/cpu-freq/cpu-drivers.txt +++ b/Documentation/cpu-freq/cpu-drivers.txt @@ -231,7 +231,7 @@ the reference implementation in drivers/cpufreq/longrun.c Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION unset. get_intermediate should return a stable intermediate frequency platform wants to -switch to, and target_intermediate() should set CPU to to that frequency, before +switch to, and target_intermediate() should set CPU to that frequency, before jumping to the frequency corresponding to 'index'. Core will take care of sending notifications and driver doesn't have to handle them in target_intermediate() or target_index(). -- 2.7.4
[PATCH 3/3] Documentation: cpu-freq: cpu-drivers: Fix repetition of word 'to'
The patch replaces 'to to' with 'to' in the documentation. Signed-off-by: sayli karnik --- Documentation/cpu-freq/cpu-drivers.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/cpu-freq/cpu-drivers.txt b/Documentation/cpu-freq/cpu-drivers.txt index f71e6be..434c49c 100644 --- a/Documentation/cpu-freq/cpu-drivers.txt +++ b/Documentation/cpu-freq/cpu-drivers.txt @@ -231,7 +231,7 @@ the reference implementation in drivers/cpufreq/longrun.c Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION unset. get_intermediate should return a stable intermediate frequency platform wants to -switch to, and target_intermediate() should set CPU to to that frequency, before +switch to, and target_intermediate() should set CPU to that frequency, before jumping to the frequency corresponding to 'index'. Core will take care of sending notifications and driver doesn't have to handle them in target_intermediate() or target_index(). -- 2.7.4
[PATCH 0/3] Fix unintentional repetition of words
Fix typos in the form of consecutive repetition of words. sayli karnik (3): Documentation: phy: Fix repetition of word 'the' Documentation: ABI: testing: sysfs-bus-pci: Fix repetition of word 'the' Documentation: cpu-freq: cpu-drivers: Fix repetition of word 'to' Documentation/ABI/testing/sysfs-bus-pci | 2 +- Documentation/cpu-freq/cpu-drivers.txt | 2 +- Documentation/phy.txt | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) -- 2.7.4
[PATCH 0/3] Fix unintentional repetition of words
Fix typos in the form of consecutive repetition of words. sayli karnik (3): Documentation: phy: Fix repetition of word 'the' Documentation: ABI: testing: sysfs-bus-pci: Fix repetition of word 'the' Documentation: cpu-freq: cpu-drivers: Fix repetition of word 'to' Documentation/ABI/testing/sysfs-bus-pci | 2 +- Documentation/cpu-freq/cpu-drivers.txt | 2 +- Documentation/phy.txt | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) -- 2.7.4
[PATCH 2/3] Documentation: ABI: testing: sysfs-bus-pci: Fix repetition of word 'the'
The patch replaces 'the the' with 'the' in the documentation. Signed-off-by: sayli karnik--- Documentation/ABI/testing/sysfs-bus-pci | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index 5a1732b..e4e9010 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -299,5 +299,5 @@ What: /sys/bus/pci/devices/.../revision Date: November 2016 Contact: Emil Velikov Description: - This file contains the revision field of the the PCI device. + This file contains the revision field of the PCI device. The value comes from device config space. The file is read only. -- 2.7.4
[PATCH 2/3] Documentation: ABI: testing: sysfs-bus-pci: Fix repetition of word 'the'
The patch replaces 'the the' with 'the' in the documentation. Signed-off-by: sayli karnik --- Documentation/ABI/testing/sysfs-bus-pci | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index 5a1732b..e4e9010 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -299,5 +299,5 @@ What: /sys/bus/pci/devices/.../revision Date: November 2016 Contact: Emil Velikov Description: - This file contains the revision field of the the PCI device. + This file contains the revision field of the PCI device. The value comes from device config space. The file is read only. -- 2.7.4
[PATCH 1/3] Documentation: phy: Fix repetition of word 'the'
The patch replaces 'the the' with 'the' in the documantation. Signed-off-by: sayli karnik--- Documentation/phy.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/phy.txt b/Documentation/phy.txt index 0aa994b..383cdd8 100644 --- a/Documentation/phy.txt +++ b/Documentation/phy.txt @@ -97,7 +97,7 @@ should contain the phy name as given in the dt data and in the case of non-dt boot, it should contain the label of the PHY. The two devm_phy_get associates the device with the PHY using devres on successful PHY get. On driver detach, release function is invoked on -the the devres data and devres data is freed. phy_optional_get and +the devres data and devres data is freed. phy_optional_get and devm_phy_optional_get should be used when the phy is optional. These two functions will never return -ENODEV, but instead returns NULL when the phy cannot be found.Some generic drivers, such as ehci, may use multiple -- 2.7.4
[PATCH 1/3] Documentation: phy: Fix repetition of word 'the'
The patch replaces 'the the' with 'the' in the documantation. Signed-off-by: sayli karnik --- Documentation/phy.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/phy.txt b/Documentation/phy.txt index 0aa994b..383cdd8 100644 --- a/Documentation/phy.txt +++ b/Documentation/phy.txt @@ -97,7 +97,7 @@ should contain the phy name as given in the dt data and in the case of non-dt boot, it should contain the label of the PHY. The two devm_phy_get associates the device with the PHY using devres on successful PHY get. On driver detach, release function is invoked on -the the devres data and devres data is freed. phy_optional_get and +the devres data and devres data is freed. phy_optional_get and devm_phy_optional_get should be used when the phy is optional. These two functions will never return -ENODEV, but instead returns NULL when the phy cannot be found.Some generic drivers, such as ehci, may use multiple -- 2.7.4
Re: [PATCH] pinctrl: samsung: fix segfault when using external interrupts on s3c24xx
On Thu, Mar 9, 2017 at 7:56 AM, Tomasz Figawrote: > 2017-03-09 1:34 GMT+09:00 Krzysztof Kozlowski : >> On Mon, Mar 06, 2017 at 09:15:16AM -0400, Sergio Prado wrote: >>> Hi Krzysztof, >>> >>> > > This is a regression from commit >>> > > 8b1bd11c1f8f529057369c5b3702d13fd24e2765. >>> > >>> > Checkpatch should complain here about commit format. >>> > >>> > > >>> > > Tested on FriendlyARM mini2440. >>> > > >>> > >>> > Please add: >>> > Fixes: 8b1bd11c1f8f ("pinctrl: samsung: Add the support the multiple >>> > IORESOURCE_MEM for one pin-bank") >>> > Cc: >>> > >>> >>> OK. >>> >>> > > Signed-off-by: Sergio Prado >>> > > --- >>> > > drivers/pinctrl/samsung/pinctrl-s3c24xx.c | 4 ++-- >>> > > 1 file changed, 2 insertions(+), 2 deletions(-) >>> > > >>> > > diff --git a/drivers/pinctrl/samsung/pinctrl-s3c24xx.c >>> > > b/drivers/pinctrl/samsung/pinctrl-s3c24xx.c >>> > > index b82a003546ae..1b8d887796e8 100644 >>> > > --- a/drivers/pinctrl/samsung/pinctrl-s3c24xx.c >>> > > +++ b/drivers/pinctrl/samsung/pinctrl-s3c24xx.c >>> > > @@ -356,8 +356,8 @@ static inline void s3c24xx_demux_eint(struct >>> > > irq_desc *desc, >>> > > { >>> > > struct s3c24xx_eint_data *data = irq_desc_get_handler_data(desc); >>> > > struct irq_chip *chip = irq_desc_get_chip(desc); >>> > > - struct irq_data *irqd = irq_desc_get_irq_data(desc); >>> > > - struct samsung_pin_bank *bank = irq_data_get_irq_chip_data(irqd); >>> > > + struct samsung_pinctrl_drv_data *d = data->drvdata; >>> > > + struct samsung_pin_bank *bank = d->pin_banks; >>> > >>> > I think 'pin_banks' point to all banks of given controller not to the >>> > currently accessed one. >>> >>> Understood. I think it worked in my tests because on s3c2440 all banks >>> have the same eint base address. >>> >>> So what do you think is the best approach to solve this problem? >> >> Maybe you can get to this through: >> s3c24xx_eint_domain_data = >> s3c24xx_eint_data->domains[virq].host_data; >> s3c24xx_eint_domain_data->bank >> >> It is getting slightly more complicated... > > How about the suggestions I made in my reply from March 4 (JST)? Yes, this also looks like solution. I am not sure how much you would like to revert but wouldn't it create duplicated members in pinctrl structures? One for Exynos and other for S3C? Best regards, Krzysztof
Re: [PATCH] pinctrl: samsung: fix segfault when using external interrupts on s3c24xx
On Thu, Mar 9, 2017 at 7:56 AM, Tomasz Figa wrote: > 2017-03-09 1:34 GMT+09:00 Krzysztof Kozlowski : >> On Mon, Mar 06, 2017 at 09:15:16AM -0400, Sergio Prado wrote: >>> Hi Krzysztof, >>> >>> > > This is a regression from commit >>> > > 8b1bd11c1f8f529057369c5b3702d13fd24e2765. >>> > >>> > Checkpatch should complain here about commit format. >>> > >>> > > >>> > > Tested on FriendlyARM mini2440. >>> > > >>> > >>> > Please add: >>> > Fixes: 8b1bd11c1f8f ("pinctrl: samsung: Add the support the multiple >>> > IORESOURCE_MEM for one pin-bank") >>> > Cc: >>> > >>> >>> OK. >>> >>> > > Signed-off-by: Sergio Prado >>> > > --- >>> > > drivers/pinctrl/samsung/pinctrl-s3c24xx.c | 4 ++-- >>> > > 1 file changed, 2 insertions(+), 2 deletions(-) >>> > > >>> > > diff --git a/drivers/pinctrl/samsung/pinctrl-s3c24xx.c >>> > > b/drivers/pinctrl/samsung/pinctrl-s3c24xx.c >>> > > index b82a003546ae..1b8d887796e8 100644 >>> > > --- a/drivers/pinctrl/samsung/pinctrl-s3c24xx.c >>> > > +++ b/drivers/pinctrl/samsung/pinctrl-s3c24xx.c >>> > > @@ -356,8 +356,8 @@ static inline void s3c24xx_demux_eint(struct >>> > > irq_desc *desc, >>> > > { >>> > > struct s3c24xx_eint_data *data = irq_desc_get_handler_data(desc); >>> > > struct irq_chip *chip = irq_desc_get_chip(desc); >>> > > - struct irq_data *irqd = irq_desc_get_irq_data(desc); >>> > > - struct samsung_pin_bank *bank = irq_data_get_irq_chip_data(irqd); >>> > > + struct samsung_pinctrl_drv_data *d = data->drvdata; >>> > > + struct samsung_pin_bank *bank = d->pin_banks; >>> > >>> > I think 'pin_banks' point to all banks of given controller not to the >>> > currently accessed one. >>> >>> Understood. I think it worked in my tests because on s3c2440 all banks >>> have the same eint base address. >>> >>> So what do you think is the best approach to solve this problem? >> >> Maybe you can get to this through: >> s3c24xx_eint_domain_data = >> s3c24xx_eint_data->domains[virq].host_data; >> s3c24xx_eint_domain_data->bank >> >> It is getting slightly more complicated... > > How about the suggestions I made in my reply from March 4 (JST)? Yes, this also looks like solution. I am not sure how much you would like to revert but wouldn't it create duplicated members in pinctrl structures? One for Exynos and other for S3C? Best regards, Krzysztof
Re: [RFC 08/11] mm: make ttu's return boolean
On 03/08/2017 10:37 PM, Minchan Kim wrote: >[...] I think it's the matter of taste. if (try_to_unmap(xxx)) something else something It's perfectly understandable to me. IOW, if try_to_unmap returns true, it means it did unmap successfully. Otherwise, failed. IMHO, SWAP_SUCCESS or TTU_RESULT_* seems to be an over-engineering. If the user want it, user can do it by introducing right variable name in his context. See below. I'm OK with that approach. Just something to avoid the "what does !ret mean in this function call" is what I was looking for... [...] forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL); - kill_procs(, forcekill, trapno, - ret != SWAP_SUCCESS, p, pfn, flags); + kill_procs(, forcekill, trapno, !ret , p, pfn, flags); The kill_procs() invocation was a little more readable before. Indeed but I think it's not a problem of try_to_unmap but ret variable name isn't good any more. How about this? bool unmap_success; unmap_success = try_to_unmap(hpage, ttu); .. kill_procs(, forcekill, trapno, !unmap_success , p, pfn, flags); .. return unmap_success; My point is user can introduce whatever variable name depends on his context. No need to make return variable complicated, IMHO. Yes, the local variable basically achieves what I was hoping for, so sure, works for me. [...] - case SWAP_FAIL: Again: the SWAP_FAIL makes it crystal clear which case we're in. To me, I don't feel it. To me, below is perfectly understandable. if (try_to_unmap()) do something That's why I think it's matter of taste. Okay, I admit I might be biased, too so I will consider what you suggested if others votes it. Yes, if it's really just a matter of taste, then not worth debating. Your change above is fine I think. thanks john h Thanks.
Re: [RFC 08/11] mm: make ttu's return boolean
On 03/08/2017 10:37 PM, Minchan Kim wrote: >[...] I think it's the matter of taste. if (try_to_unmap(xxx)) something else something It's perfectly understandable to me. IOW, if try_to_unmap returns true, it means it did unmap successfully. Otherwise, failed. IMHO, SWAP_SUCCESS or TTU_RESULT_* seems to be an over-engineering. If the user want it, user can do it by introducing right variable name in his context. See below. I'm OK with that approach. Just something to avoid the "what does !ret mean in this function call" is what I was looking for... [...] forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL); - kill_procs(, forcekill, trapno, - ret != SWAP_SUCCESS, p, pfn, flags); + kill_procs(, forcekill, trapno, !ret , p, pfn, flags); The kill_procs() invocation was a little more readable before. Indeed but I think it's not a problem of try_to_unmap but ret variable name isn't good any more. How about this? bool unmap_success; unmap_success = try_to_unmap(hpage, ttu); .. kill_procs(, forcekill, trapno, !unmap_success , p, pfn, flags); .. return unmap_success; My point is user can introduce whatever variable name depends on his context. No need to make return variable complicated, IMHO. Yes, the local variable basically achieves what I was hoping for, so sure, works for me. [...] - case SWAP_FAIL: Again: the SWAP_FAIL makes it crystal clear which case we're in. To me, I don't feel it. To me, below is perfectly understandable. if (try_to_unmap()) do something That's why I think it's matter of taste. Okay, I admit I might be biased, too so I will consider what you suggested if others votes it. Yes, if it's really just a matter of taste, then not worth debating. Your change above is fine I think. thanks john h Thanks.
[RESEND PATCH v3 6/7] PCI: dwc: designware: Move _unroll configurations to a separate function
No functional change. Rename dw_pcie_writel_unroll/dw_pcie_readl_unroll to dw_pcie_writel_ob_unroll/dw_pcie_readl_ob_unroll respectively as these functions are used to perform only outbound configurations. Also move these _unroll configurations to a separate function. Signed-off-by: Kishon Vijay Abraham I--- drivers/pci/dwc/pcie-designware.c | 112 ++--- 1 file changed, 67 insertions(+), 45 deletions(-) diff --git a/drivers/pci/dwc/pcie-designware.c b/drivers/pci/dwc/pcie-designware.c index 557ee53..6657a84 100644 --- a/drivers/pci/dwc/pcie-designware.c +++ b/drivers/pci/dwc/pcie-designware.c @@ -92,22 +92,64 @@ void dw_pcie_write_dbi(struct dw_pcie *pci, void __iomem *base, u32 reg, dev_err(pci->dev, "write DBI address failed\n"); } -static u32 dw_pcie_readl_unroll(struct dw_pcie *pci, void __iomem *base, - u32 index, u32 reg) +static u32 dw_pcie_readl_ob_unroll(struct dw_pcie *pci, void __iomem *base, + u32 index, u32 reg) { u32 offset = PCIE_GET_ATU_OUTB_UNR_REG_OFFSET(index); return dw_pcie_read_dbi(pci, base, offset + reg, 0x4); } -static void dw_pcie_writel_unroll(struct dw_pcie *pci, void __iomem *base, - u32 index, u32 reg, u32 val) +static void dw_pcie_writel_ob_unroll(struct dw_pcie *pci, void __iomem *base, +u32 index, u32 reg, u32 val) { u32 offset = PCIE_GET_ATU_OUTB_UNR_REG_OFFSET(index); dw_pcie_write_dbi(pci, base, offset + reg, 0x4, val); } +void dw_pcie_prog_outbound_atu_unroll(struct dw_pcie *pci, int index, int type, + u64 cpu_addr, u64 pci_addr, u32 size) +{ + u32 retries, val; + void __iomem *base = pci->dbi_base; + + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_LOWER_BASE, +lower_32_bits(cpu_addr)); + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_UPPER_BASE, +upper_32_bits(cpu_addr)); + dw_pcie_writel_ob_unroll(pci, base, index, PCIE_ATU_UNR_LIMIT, +lower_32_bits(cpu_addr + size - 1)); + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_LOWER_TARGET, +lower_32_bits(pci_addr)); + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_UPPER_TARGET, +upper_32_bits(pci_addr)); + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_REGION_CTRL1, +type); + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_REGION_CTRL2, +PCIE_ATU_ENABLE); + + /* +* Make sure ATU enable takes effect before any subsequent config +* and I/O accesses. +*/ + for (retries = 0; retries < LINK_WAIT_MAX_IATU_RETRIES; retries++) { + val = dw_pcie_readl_ob_unroll(pci, base, index, + PCIE_ATU_UNR_REGION_CTRL2); + if (val & PCIE_ATU_ENABLE) + return; + + usleep_range(LINK_WAIT_IATU_MIN, LINK_WAIT_IATU_MAX); + } + dev_err(pci->dev, "outbound iATU is not being enabled\n"); +} + void dw_pcie_prog_outbound_atu(struct dw_pcie *pci, int index, int type, u64 cpu_addr, u64 pci_addr, u32 size) { @@ -118,59 +160,39 @@ void dw_pcie_prog_outbound_atu(struct dw_pcie *pci, int index, int type, cpu_addr = pci->ops->cpu_addr_fixup(cpu_addr); if (pci->iatu_unroll_enabled) { - dw_pcie_writel_unroll(pci, base, index, PCIE_ATU_UNR_LOWER_BASE, - lower_32_bits(cpu_addr)); - dw_pcie_writel_unroll(pci, base, index, PCIE_ATU_UNR_UPPER_BASE, - upper_32_bits(cpu_addr)); - dw_pcie_writel_unroll(pci, base, index, PCIE_ATU_UNR_LIMIT, - lower_32_bits(cpu_addr + size - 1)); - dw_pcie_writel_unroll(pci, base, index, - PCIE_ATU_UNR_LOWER_TARGET, - lower_32_bits(pci_addr)); - dw_pcie_writel_unroll(pci, base, index, - PCIE_ATU_UNR_UPPER_TARGET, - upper_32_bits(pci_addr)); - dw_pcie_writel_unroll(pci, base, index, - PCIE_ATU_UNR_REGION_CTRL1, - type); - dw_pcie_writel_unroll(pci, base, index, -
[RESEND PATCH v3 6/7] PCI: dwc: designware: Move _unroll configurations to a separate function
No functional change. Rename dw_pcie_writel_unroll/dw_pcie_readl_unroll to dw_pcie_writel_ob_unroll/dw_pcie_readl_ob_unroll respectively as these functions are used to perform only outbound configurations. Also move these _unroll configurations to a separate function. Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pcie-designware.c | 112 ++--- 1 file changed, 67 insertions(+), 45 deletions(-) diff --git a/drivers/pci/dwc/pcie-designware.c b/drivers/pci/dwc/pcie-designware.c index 557ee53..6657a84 100644 --- a/drivers/pci/dwc/pcie-designware.c +++ b/drivers/pci/dwc/pcie-designware.c @@ -92,22 +92,64 @@ void dw_pcie_write_dbi(struct dw_pcie *pci, void __iomem *base, u32 reg, dev_err(pci->dev, "write DBI address failed\n"); } -static u32 dw_pcie_readl_unroll(struct dw_pcie *pci, void __iomem *base, - u32 index, u32 reg) +static u32 dw_pcie_readl_ob_unroll(struct dw_pcie *pci, void __iomem *base, + u32 index, u32 reg) { u32 offset = PCIE_GET_ATU_OUTB_UNR_REG_OFFSET(index); return dw_pcie_read_dbi(pci, base, offset + reg, 0x4); } -static void dw_pcie_writel_unroll(struct dw_pcie *pci, void __iomem *base, - u32 index, u32 reg, u32 val) +static void dw_pcie_writel_ob_unroll(struct dw_pcie *pci, void __iomem *base, +u32 index, u32 reg, u32 val) { u32 offset = PCIE_GET_ATU_OUTB_UNR_REG_OFFSET(index); dw_pcie_write_dbi(pci, base, offset + reg, 0x4, val); } +void dw_pcie_prog_outbound_atu_unroll(struct dw_pcie *pci, int index, int type, + u64 cpu_addr, u64 pci_addr, u32 size) +{ + u32 retries, val; + void __iomem *base = pci->dbi_base; + + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_LOWER_BASE, +lower_32_bits(cpu_addr)); + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_UPPER_BASE, +upper_32_bits(cpu_addr)); + dw_pcie_writel_ob_unroll(pci, base, index, PCIE_ATU_UNR_LIMIT, +lower_32_bits(cpu_addr + size - 1)); + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_LOWER_TARGET, +lower_32_bits(pci_addr)); + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_UPPER_TARGET, +upper_32_bits(pci_addr)); + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_REGION_CTRL1, +type); + dw_pcie_writel_ob_unroll(pci, base, index, +PCIE_ATU_UNR_REGION_CTRL2, +PCIE_ATU_ENABLE); + + /* +* Make sure ATU enable takes effect before any subsequent config +* and I/O accesses. +*/ + for (retries = 0; retries < LINK_WAIT_MAX_IATU_RETRIES; retries++) { + val = dw_pcie_readl_ob_unroll(pci, base, index, + PCIE_ATU_UNR_REGION_CTRL2); + if (val & PCIE_ATU_ENABLE) + return; + + usleep_range(LINK_WAIT_IATU_MIN, LINK_WAIT_IATU_MAX); + } + dev_err(pci->dev, "outbound iATU is not being enabled\n"); +} + void dw_pcie_prog_outbound_atu(struct dw_pcie *pci, int index, int type, u64 cpu_addr, u64 pci_addr, u32 size) { @@ -118,59 +160,39 @@ void dw_pcie_prog_outbound_atu(struct dw_pcie *pci, int index, int type, cpu_addr = pci->ops->cpu_addr_fixup(cpu_addr); if (pci->iatu_unroll_enabled) { - dw_pcie_writel_unroll(pci, base, index, PCIE_ATU_UNR_LOWER_BASE, - lower_32_bits(cpu_addr)); - dw_pcie_writel_unroll(pci, base, index, PCIE_ATU_UNR_UPPER_BASE, - upper_32_bits(cpu_addr)); - dw_pcie_writel_unroll(pci, base, index, PCIE_ATU_UNR_LIMIT, - lower_32_bits(cpu_addr + size - 1)); - dw_pcie_writel_unroll(pci, base, index, - PCIE_ATU_UNR_LOWER_TARGET, - lower_32_bits(pci_addr)); - dw_pcie_writel_unroll(pci, base, index, - PCIE_ATU_UNR_UPPER_TARGET, - upper_32_bits(pci_addr)); - dw_pcie_writel_unroll(pci, base, index, - PCIE_ATU_UNR_REGION_CTRL1, - type); - dw_pcie_writel_unroll(pci, base, index, - PCIE_ATU_UNR_REGION_CTRL2, -
Re: [PATCH] mm: Do not use double negation for testing page flags
Hi Vlastimil, On Wed, Mar 08, 2017 at 08:51:23AM +0100, Vlastimil Babka wrote: > On 03/08/2017 06:25 AM, Minchan Kim wrote: > > Hi Anshuman, > > > > On Tue, Mar 07, 2017 at 09:31:18PM +0530, Anshuman Khandual wrote: > >> On 03/07/2017 12:06 PM, Minchan Kim wrote: > >>> With the discussion[1], I found it seems there are every PageFlags > >>> functions return bool at this moment so we don't need double > >>> negation any more. > >>> Although it's not a problem to keep it, it makes future users > >>> confused to use dobule negation for them, too. > >>> > >>> Remove such possibility. > >> > >> A quick search of '!!Page' in the source tree does not show any other > >> place having this double negation. So I guess this is all which need > >> to be fixed. > > > > Yeb. That's the why my patch includes only khugepagd part but my > > concern is PageFlags returns int type not boolean so user might > > be confused easily and tempted to use dobule negation. > > > > Other side is they who create new custom PageXXX(e.g., PageMovable) > > should keep it in mind that they should return 0 or 1 although > > fucntion prototype's return value is int type. > > > It shouldn't be > > documented nowhere. > > Was this double negation intentional? :P Nice catch! It seems you have a crystal ball. ;-) > > > Although we can add a little description > > somewhere in page-flags.h, I believe changing to boolean is more > > clear/not-error-prone so Chen's work is enough worth, I think. > > Agree, unless some arches benefit from the int by performance > for some reason (no idea if it's possible). > > Anyway, to your original patch: > > Acked-by: Vlastimil BabkaThanks!
Re: [PATCH] mm: Do not use double negation for testing page flags
Hi Vlastimil, On Wed, Mar 08, 2017 at 08:51:23AM +0100, Vlastimil Babka wrote: > On 03/08/2017 06:25 AM, Minchan Kim wrote: > > Hi Anshuman, > > > > On Tue, Mar 07, 2017 at 09:31:18PM +0530, Anshuman Khandual wrote: > >> On 03/07/2017 12:06 PM, Minchan Kim wrote: > >>> With the discussion[1], I found it seems there are every PageFlags > >>> functions return bool at this moment so we don't need double > >>> negation any more. > >>> Although it's not a problem to keep it, it makes future users > >>> confused to use dobule negation for them, too. > >>> > >>> Remove such possibility. > >> > >> A quick search of '!!Page' in the source tree does not show any other > >> place having this double negation. So I guess this is all which need > >> to be fixed. > > > > Yeb. That's the why my patch includes only khugepagd part but my > > concern is PageFlags returns int type not boolean so user might > > be confused easily and tempted to use dobule negation. > > > > Other side is they who create new custom PageXXX(e.g., PageMovable) > > should keep it in mind that they should return 0 or 1 although > > fucntion prototype's return value is int type. > > > It shouldn't be > > documented nowhere. > > Was this double negation intentional? :P Nice catch! It seems you have a crystal ball. ;-) > > > Although we can add a little description > > somewhere in page-flags.h, I believe changing to boolean is more > > clear/not-error-prone so Chen's work is enough worth, I think. > > Agree, unless some arches benefit from the int by performance > for some reason (no idea if it's possible). > > Anyway, to your original patch: > > Acked-by: Vlastimil Babka Thanks!
[RESEND PATCH v3 0/7] PCI: dwc: Miscellaneous fixes and cleanups
Resending since it bounced from quite a few lists. This should be the final set of cleanups/fixes before endpoint support can be merged. Keerthy's patch is a general fix in dra7xx driver and is not directly related to endpoint mode. This v1 of this series was previously sent with a different cover letter $subject [1] Changes from v2: *) Kconfig changes that was spilled into a patch is removed. *) In addition to renaming _unroll() to _ob_unroll(), all the _unroll configurations is also moved a separate function. Changes from v1: *) included a patch to rename _unroll() to _ob_unroll() as similar thing has to be done for inbound window in the case of EP mode. *) used 'size_t' instead of 'int' for specifying the size in read_dbi/write_dbi function arguments. *) Populate cpu_addr_fixup ops for artpec6 as suggested by Niklas This series is based on 4.11-rc1 [1] -> https://lkml.org/lkml/2017/2/16/270 Keerthy (1): PCI: dwc: dra7xx: Push request_irq call to the bottom of probe Kishon Vijay Abraham I (6): PCI: dwc: designware: Add new *ops* for cpu addr fixup PCI: dwc: dra7xx: Populate cpu_addr_fixup ops PCI: dwc: artpec6: Populate cpu_addr_fixup ops PCI: dwc: all: Modify dbi accessors to take dbi_base as argument PCI: dwc: all: Modify dbi accessors to access data of 4/2/1 bytes PCI: dwc: designware: Move _unroll configurations to a separate function drivers/pci/dwc/pci-dra7xx.c | 35 +++ drivers/pci/dwc/pci-exynos.c | 14 +-- drivers/pci/dwc/pci-imx6.c | 62 +++-- drivers/pci/dwc/pci-keystone-dw.c | 16 ++-- drivers/pci/dwc/pcie-armada8k.c| 39 drivers/pci/dwc/pcie-artpec6.c | 22 +++-- drivers/pci/dwc/pcie-designware-host.c | 20 ++-- drivers/pci/dwc/pcie-designware.c | 156 +--- drivers/pci/dwc/pcie-designware.h | 13 ++- drivers/pci/dwc/pcie-hisi.c| 17 ++-- 10 files changed, 241 insertions(+), 153 deletions(-) -- 1.7.9.5
[RESEND PATCH v3 0/7] PCI: dwc: Miscellaneous fixes and cleanups
Resending since it bounced from quite a few lists. This should be the final set of cleanups/fixes before endpoint support can be merged. Keerthy's patch is a general fix in dra7xx driver and is not directly related to endpoint mode. This v1 of this series was previously sent with a different cover letter $subject [1] Changes from v2: *) Kconfig changes that was spilled into a patch is removed. *) In addition to renaming _unroll() to _ob_unroll(), all the _unroll configurations is also moved a separate function. Changes from v1: *) included a patch to rename _unroll() to _ob_unroll() as similar thing has to be done for inbound window in the case of EP mode. *) used 'size_t' instead of 'int' for specifying the size in read_dbi/write_dbi function arguments. *) Populate cpu_addr_fixup ops for artpec6 as suggested by Niklas This series is based on 4.11-rc1 [1] -> https://lkml.org/lkml/2017/2/16/270 Keerthy (1): PCI: dwc: dra7xx: Push request_irq call to the bottom of probe Kishon Vijay Abraham I (6): PCI: dwc: designware: Add new *ops* for cpu addr fixup PCI: dwc: dra7xx: Populate cpu_addr_fixup ops PCI: dwc: artpec6: Populate cpu_addr_fixup ops PCI: dwc: all: Modify dbi accessors to take dbi_base as argument PCI: dwc: all: Modify dbi accessors to access data of 4/2/1 bytes PCI: dwc: designware: Move _unroll configurations to a separate function drivers/pci/dwc/pci-dra7xx.c | 35 +++ drivers/pci/dwc/pci-exynos.c | 14 +-- drivers/pci/dwc/pci-imx6.c | 62 +++-- drivers/pci/dwc/pci-keystone-dw.c | 16 ++-- drivers/pci/dwc/pcie-armada8k.c| 39 drivers/pci/dwc/pcie-artpec6.c | 22 +++-- drivers/pci/dwc/pcie-designware-host.c | 20 ++-- drivers/pci/dwc/pcie-designware.c | 156 +--- drivers/pci/dwc/pcie-designware.h | 13 ++- drivers/pci/dwc/pcie-hisi.c| 17 ++-- 10 files changed, 241 insertions(+), 153 deletions(-) -- 1.7.9.5
[PATCH v2] serial: 8250_dw: Honor clk_round_rate errors in dw8250_set_termios
clk_round_rate returns a signed long and may possibly return errors in it, for example if there is no possible rate. Till now dw8250_set_termios ignored any error, the signednes and would just use the value as input to clk_set_rate. This of course falls apart if there is an actual error, so check for errors and only try to set a rate if the value is actually valid. This turned up on some Rockchip platforms after commit 6a171b299379 ("serial: 8250_dw: Allow hardware flow control to be used") enabled set_termios callback in all cases, not only ACPI. Fixes: 6a171b299379 ("serial: 8250_dw: Allow hardware flow control to be used") Signed-off-by: Heiko StuebnerReviewed-by: Andy Shevchenko --- There is also another patch floating around, fixing a separate issue on top of this one: "serial: 8250_dw: Fix breakage when HAVE_CLK=n" changes in v2: - adapt commit message to make it more explicit, that this is a somewhat critical fix - add Andy's Reviewed-by drivers/tty/serial/8250/8250_dw.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/tty/serial/8250/8250_dw.c b/drivers/tty/serial/8250/8250_dw.c index 6ee55a2d47bb..223ac234ddb2 100644 --- a/drivers/tty/serial/8250/8250_dw.c +++ b/drivers/tty/serial/8250/8250_dw.c @@ -257,7 +257,7 @@ static void dw8250_set_termios(struct uart_port *p, struct ktermios *termios, { unsigned int baud = tty_termios_baud_rate(termios); struct dw8250_data *d = p->private_data; - unsigned int rate; + long rate; int ret; if (IS_ERR(d->clk) || !old) @@ -265,7 +265,10 @@ static void dw8250_set_termios(struct uart_port *p, struct ktermios *termios, clk_disable_unprepare(d->clk); rate = clk_round_rate(d->clk, baud * 16); - ret = clk_set_rate(d->clk, rate); + if (rate < 0) + ret = rate; + else + ret = clk_set_rate(d->clk, rate); clk_prepare_enable(d->clk); if (!ret) -- 2.11.0
[PATCH v2] serial: 8250_dw: Honor clk_round_rate errors in dw8250_set_termios
clk_round_rate returns a signed long and may possibly return errors in it, for example if there is no possible rate. Till now dw8250_set_termios ignored any error, the signednes and would just use the value as input to clk_set_rate. This of course falls apart if there is an actual error, so check for errors and only try to set a rate if the value is actually valid. This turned up on some Rockchip platforms after commit 6a171b299379 ("serial: 8250_dw: Allow hardware flow control to be used") enabled set_termios callback in all cases, not only ACPI. Fixes: 6a171b299379 ("serial: 8250_dw: Allow hardware flow control to be used") Signed-off-by: Heiko Stuebner Reviewed-by: Andy Shevchenko --- There is also another patch floating around, fixing a separate issue on top of this one: "serial: 8250_dw: Fix breakage when HAVE_CLK=n" changes in v2: - adapt commit message to make it more explicit, that this is a somewhat critical fix - add Andy's Reviewed-by drivers/tty/serial/8250/8250_dw.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/tty/serial/8250/8250_dw.c b/drivers/tty/serial/8250/8250_dw.c index 6ee55a2d47bb..223ac234ddb2 100644 --- a/drivers/tty/serial/8250/8250_dw.c +++ b/drivers/tty/serial/8250/8250_dw.c @@ -257,7 +257,7 @@ static void dw8250_set_termios(struct uart_port *p, struct ktermios *termios, { unsigned int baud = tty_termios_baud_rate(termios); struct dw8250_data *d = p->private_data; - unsigned int rate; + long rate; int ret; if (IS_ERR(d->clk) || !old) @@ -265,7 +265,10 @@ static void dw8250_set_termios(struct uart_port *p, struct ktermios *termios, clk_disable_unprepare(d->clk); rate = clk_round_rate(d->clk, baud * 16); - ret = clk_set_rate(d->clk, rate); + if (rate < 0) + ret = rate; + else + ret = clk_set_rate(d->clk, rate); clk_prepare_enable(d->clk); if (!ret) -- 2.11.0
[RESEND PATCH v3 2/7] PCI: dwc: dra7xx: Populate cpu_addr_fixup ops
Populate cpu_addr_fixup ops to extract the least 28 bits of the corresponding cpu address. Acked-by: Joao PintoSigned-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pci-dra7xx.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c index 0984baf..07c45ec 100644 --- a/drivers/pci/dwc/pci-dra7xx.c +++ b/drivers/pci/dwc/pci-dra7xx.c @@ -88,6 +88,11 @@ static inline void dra7xx_pcie_writel(struct dra7xx_pcie *pcie, u32 offset, writel(value, pcie->base + offset); } +static u64 dra7xx_pcie_cpu_addr_fixup(u64 pci_addr) +{ + return pci_addr & DRA7XX_CPU_TO_BUS_ADDR; +} + static int dra7xx_pcie_link_up(struct dw_pcie *pci) { struct dra7xx_pcie *dra7xx = to_dra7xx_pcie(pci); @@ -152,11 +157,6 @@ static void dra7xx_pcie_host_init(struct pcie_port *pp) struct dw_pcie *pci = to_dw_pcie_from_pp(pp); struct dra7xx_pcie *dra7xx = to_dra7xx_pcie(pci); - pp->io_base &= DRA7XX_CPU_TO_BUS_ADDR; - pp->mem_base &= DRA7XX_CPU_TO_BUS_ADDR; - pp->cfg0_base &= DRA7XX_CPU_TO_BUS_ADDR; - pp->cfg1_base &= DRA7XX_CPU_TO_BUS_ADDR; - dw_pcie_setup_rc(pp); dra7xx_pcie_establish_link(dra7xx); @@ -329,6 +329,7 @@ static int __init dra7xx_add_pcie_port(struct dra7xx_pcie *dra7xx, } static const struct dw_pcie_ops dw_pcie_ops = { + .cpu_addr_fixup = dra7xx_pcie_cpu_addr_fixup, .link_up = dra7xx_pcie_link_up, }; -- 1.7.9.5
[RESEND PATCH v3 2/7] PCI: dwc: dra7xx: Populate cpu_addr_fixup ops
Populate cpu_addr_fixup ops to extract the least 28 bits of the corresponding cpu address. Acked-by: Joao Pinto Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pci-dra7xx.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c index 0984baf..07c45ec 100644 --- a/drivers/pci/dwc/pci-dra7xx.c +++ b/drivers/pci/dwc/pci-dra7xx.c @@ -88,6 +88,11 @@ static inline void dra7xx_pcie_writel(struct dra7xx_pcie *pcie, u32 offset, writel(value, pcie->base + offset); } +static u64 dra7xx_pcie_cpu_addr_fixup(u64 pci_addr) +{ + return pci_addr & DRA7XX_CPU_TO_BUS_ADDR; +} + static int dra7xx_pcie_link_up(struct dw_pcie *pci) { struct dra7xx_pcie *dra7xx = to_dra7xx_pcie(pci); @@ -152,11 +157,6 @@ static void dra7xx_pcie_host_init(struct pcie_port *pp) struct dw_pcie *pci = to_dw_pcie_from_pp(pp); struct dra7xx_pcie *dra7xx = to_dra7xx_pcie(pci); - pp->io_base &= DRA7XX_CPU_TO_BUS_ADDR; - pp->mem_base &= DRA7XX_CPU_TO_BUS_ADDR; - pp->cfg0_base &= DRA7XX_CPU_TO_BUS_ADDR; - pp->cfg1_base &= DRA7XX_CPU_TO_BUS_ADDR; - dw_pcie_setup_rc(pp); dra7xx_pcie_establish_link(dra7xx); @@ -329,6 +329,7 @@ static int __init dra7xx_add_pcie_port(struct dra7xx_pcie *dra7xx, } static const struct dw_pcie_ops dw_pcie_ops = { + .cpu_addr_fixup = dra7xx_pcie_cpu_addr_fixup, .link_up = dra7xx_pcie_link_up, }; -- 1.7.9.5
[RESEND PATCH v3 4/7] PCI: dwc: all: Modify dbi accessors to take dbi_base as argument
dwc has 2 dbi address space labeled dbics and dbics2. The existing helper to access dbi address space can access only dbics. However dbics2 has to be accessed for programming the BAR registers in the case of EP mode. This is in preparation for adding EP mode support to dwc driver. Cc: Jingoo HanCc: Richard Zhu Cc: Lucas Stach Cc: Murali Karicheri Cc: Thomas Petazzoni Cc: Niklas Cassel Cc: Jesper Nilsson Cc: Joao Pinto Cc: Zhou Wang Cc: Gabriele Paoloni Acked-by: Joao Pinto Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pci-dra7xx.c | 10 +++-- drivers/pci/dwc/pci-exynos.c | 10 +++-- drivers/pci/dwc/pci-imx6.c | 62 +++--- drivers/pci/dwc/pci-keystone-dw.c | 15 --- drivers/pci/dwc/pcie-armada8k.c| 39 +--- drivers/pci/dwc/pcie-artpec6.c |7 +-- drivers/pci/dwc/pcie-designware-host.c | 20 + drivers/pci/dwc/pcie-designware.c | 76 ++-- drivers/pci/dwc/pcie-designware.h | 10 +++-- drivers/pci/dwc/pcie-hisi.c| 17 --- 10 files changed, 152 insertions(+), 114 deletions(-) diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c index 07c45ec..3708bd6 100644 --- a/drivers/pci/dwc/pci-dra7xx.c +++ b/drivers/pci/dwc/pci-dra7xx.c @@ -495,12 +495,13 @@ static int dra7xx_pcie_suspend(struct device *dev) { struct dra7xx_pcie *dra7xx = dev_get_drvdata(dev); struct dw_pcie *pci = dra7xx->pci; + void __iomem *base = pci->dbi_base; u32 val; /* clear MSE */ - val = dw_pcie_readl_dbi(pci, PCI_COMMAND); + val = dw_pcie_readl_dbi(pci, base, PCI_COMMAND); val &= ~PCI_COMMAND_MEMORY; - dw_pcie_writel_dbi(pci, PCI_COMMAND, val); + dw_pcie_writel_dbi(pci, base, PCI_COMMAND, val); return 0; } @@ -509,12 +510,13 @@ static int dra7xx_pcie_resume(struct device *dev) { struct dra7xx_pcie *dra7xx = dev_get_drvdata(dev); struct dw_pcie *pci = dra7xx->pci; + void __iomem *base = pci->dbi_base; u32 val; /* set MSE */ - val = dw_pcie_readl_dbi(pci, PCI_COMMAND); + val = dw_pcie_readl_dbi(pci, base, PCI_COMMAND); val |= PCI_COMMAND_MEMORY; - dw_pcie_writel_dbi(pci, PCI_COMMAND, val); + dw_pcie_writel_dbi(pci, base, PCI_COMMAND, val); return 0; } diff --git a/drivers/pci/dwc/pci-exynos.c b/drivers/pci/dwc/pci-exynos.c index 993b650..a0d40f7 100644 --- a/drivers/pci/dwc/pci-exynos.c +++ b/drivers/pci/dwc/pci-exynos.c @@ -521,23 +521,25 @@ static void exynos_pcie_enable_interrupts(struct exynos_pcie *ep) exynos_pcie_msi_init(ep); } -static u32 exynos_pcie_readl_dbi(struct dw_pcie *pci, u32 reg) +static u32 exynos_pcie_readl_dbi(struct dw_pcie *pci, void __iomem *base, +u32 reg) { struct exynos_pcie *ep = to_exynos_pcie(pci); u32 val; exynos_pcie_sideband_dbi_r_mode(ep, true); - val = readl(pci->dbi_base + reg); + val = readl(base + reg); exynos_pcie_sideband_dbi_r_mode(ep, false); return val; } -static void exynos_pcie_writel_dbi(struct dw_pcie *pci, u32 reg, u32 val) +static void exynos_pcie_writel_dbi(struct dw_pcie *pci, void __iomem *base, + u32 reg, u32 val) { struct exynos_pcie *ep = to_exynos_pcie(pci); exynos_pcie_sideband_dbi_w_mode(ep, true); - writel(val, pci->dbi_base + reg); + writel(val, base + reg); exynos_pcie_sideband_dbi_w_mode(ep, false); } diff --git a/drivers/pci/dwc/pci-imx6.c b/drivers/pci/dwc/pci-imx6.c index 801e46c..85dd901 100644 --- a/drivers/pci/dwc/pci-imx6.c +++ b/drivers/pci/dwc/pci-imx6.c @@ -98,12 +98,13 @@ struct imx6_pcie { static int pcie_phy_poll_ack(struct imx6_pcie *imx6_pcie, int exp_val) { struct dw_pcie *pci = imx6_pcie->pci; + void __iomem *base = pci->dbi_base; u32 val; u32 max_iterations = 10; u32 wait_counter = 0; do { - val = dw_pcie_readl_dbi(pci, PCIE_PHY_STAT); + val = dw_pcie_readl_dbi(pci, base, PCIE_PHY_STAT); val = (val >> PCIE_PHY_STAT_ACK_LOC) & 0x1; wait_counter++; @@ -119,21 +120,22 @@ static int pcie_phy_poll_ack(struct imx6_pcie *imx6_pcie, int exp_val) static int pcie_phy_wait_ack(struct imx6_pcie *imx6_pcie, int addr) { struct dw_pcie *pci = imx6_pcie->pci; + void __iomem *base = pci->dbi_base; u32 val; int ret; val = addr << PCIE_PHY_CTRL_DATA_LOC; - dw_pcie_writel_dbi(pci,
[RESEND PATCH v3 4/7] PCI: dwc: all: Modify dbi accessors to take dbi_base as argument
dwc has 2 dbi address space labeled dbics and dbics2. The existing helper to access dbi address space can access only dbics. However dbics2 has to be accessed for programming the BAR registers in the case of EP mode. This is in preparation for adding EP mode support to dwc driver. Cc: Jingoo Han Cc: Richard Zhu Cc: Lucas Stach Cc: Murali Karicheri Cc: Thomas Petazzoni Cc: Niklas Cassel Cc: Jesper Nilsson Cc: Joao Pinto Cc: Zhou Wang Cc: Gabriele Paoloni Acked-by: Joao Pinto Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pci-dra7xx.c | 10 +++-- drivers/pci/dwc/pci-exynos.c | 10 +++-- drivers/pci/dwc/pci-imx6.c | 62 +++--- drivers/pci/dwc/pci-keystone-dw.c | 15 --- drivers/pci/dwc/pcie-armada8k.c| 39 +--- drivers/pci/dwc/pcie-artpec6.c |7 +-- drivers/pci/dwc/pcie-designware-host.c | 20 + drivers/pci/dwc/pcie-designware.c | 76 ++-- drivers/pci/dwc/pcie-designware.h | 10 +++-- drivers/pci/dwc/pcie-hisi.c| 17 --- 10 files changed, 152 insertions(+), 114 deletions(-) diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c index 07c45ec..3708bd6 100644 --- a/drivers/pci/dwc/pci-dra7xx.c +++ b/drivers/pci/dwc/pci-dra7xx.c @@ -495,12 +495,13 @@ static int dra7xx_pcie_suspend(struct device *dev) { struct dra7xx_pcie *dra7xx = dev_get_drvdata(dev); struct dw_pcie *pci = dra7xx->pci; + void __iomem *base = pci->dbi_base; u32 val; /* clear MSE */ - val = dw_pcie_readl_dbi(pci, PCI_COMMAND); + val = dw_pcie_readl_dbi(pci, base, PCI_COMMAND); val &= ~PCI_COMMAND_MEMORY; - dw_pcie_writel_dbi(pci, PCI_COMMAND, val); + dw_pcie_writel_dbi(pci, base, PCI_COMMAND, val); return 0; } @@ -509,12 +510,13 @@ static int dra7xx_pcie_resume(struct device *dev) { struct dra7xx_pcie *dra7xx = dev_get_drvdata(dev); struct dw_pcie *pci = dra7xx->pci; + void __iomem *base = pci->dbi_base; u32 val; /* set MSE */ - val = dw_pcie_readl_dbi(pci, PCI_COMMAND); + val = dw_pcie_readl_dbi(pci, base, PCI_COMMAND); val |= PCI_COMMAND_MEMORY; - dw_pcie_writel_dbi(pci, PCI_COMMAND, val); + dw_pcie_writel_dbi(pci, base, PCI_COMMAND, val); return 0; } diff --git a/drivers/pci/dwc/pci-exynos.c b/drivers/pci/dwc/pci-exynos.c index 993b650..a0d40f7 100644 --- a/drivers/pci/dwc/pci-exynos.c +++ b/drivers/pci/dwc/pci-exynos.c @@ -521,23 +521,25 @@ static void exynos_pcie_enable_interrupts(struct exynos_pcie *ep) exynos_pcie_msi_init(ep); } -static u32 exynos_pcie_readl_dbi(struct dw_pcie *pci, u32 reg) +static u32 exynos_pcie_readl_dbi(struct dw_pcie *pci, void __iomem *base, +u32 reg) { struct exynos_pcie *ep = to_exynos_pcie(pci); u32 val; exynos_pcie_sideband_dbi_r_mode(ep, true); - val = readl(pci->dbi_base + reg); + val = readl(base + reg); exynos_pcie_sideband_dbi_r_mode(ep, false); return val; } -static void exynos_pcie_writel_dbi(struct dw_pcie *pci, u32 reg, u32 val) +static void exynos_pcie_writel_dbi(struct dw_pcie *pci, void __iomem *base, + u32 reg, u32 val) { struct exynos_pcie *ep = to_exynos_pcie(pci); exynos_pcie_sideband_dbi_w_mode(ep, true); - writel(val, pci->dbi_base + reg); + writel(val, base + reg); exynos_pcie_sideband_dbi_w_mode(ep, false); } diff --git a/drivers/pci/dwc/pci-imx6.c b/drivers/pci/dwc/pci-imx6.c index 801e46c..85dd901 100644 --- a/drivers/pci/dwc/pci-imx6.c +++ b/drivers/pci/dwc/pci-imx6.c @@ -98,12 +98,13 @@ struct imx6_pcie { static int pcie_phy_poll_ack(struct imx6_pcie *imx6_pcie, int exp_val) { struct dw_pcie *pci = imx6_pcie->pci; + void __iomem *base = pci->dbi_base; u32 val; u32 max_iterations = 10; u32 wait_counter = 0; do { - val = dw_pcie_readl_dbi(pci, PCIE_PHY_STAT); + val = dw_pcie_readl_dbi(pci, base, PCIE_PHY_STAT); val = (val >> PCIE_PHY_STAT_ACK_LOC) & 0x1; wait_counter++; @@ -119,21 +120,22 @@ static int pcie_phy_poll_ack(struct imx6_pcie *imx6_pcie, int exp_val) static int pcie_phy_wait_ack(struct imx6_pcie *imx6_pcie, int addr) { struct dw_pcie *pci = imx6_pcie->pci; + void __iomem *base = pci->dbi_base; u32 val; int ret; val = addr << PCIE_PHY_CTRL_DATA_LOC; - dw_pcie_writel_dbi(pci, PCIE_PHY_CTRL, val); + dw_pcie_writel_dbi(pci, base, PCIE_PHY_CTRL, val); val |= (0x1 << PCIE_PHY_CTRL_CAP_ADR_LOC); - dw_pcie_writel_dbi(pci, PCIE_PHY_CTRL, val); + dw_pcie_writel_dbi(pci, base, PCIE_PHY_CTRL, val); ret = pcie_phy_poll_ack(imx6_pcie,
[RESEND PATCH v3 1/7] PCI: dwc: designware: Add new *ops* for cpu addr fixup
Some platforms (like dra7xx) require only the least 28 bits of the corresponding 32 bit CPU address to be programmed in the address translation unit. This modified address is stored in io_base/mem_base/ cfg0_base/cfg1_base in dra7xx_pcie_host_init. While this is okay for host mode where the address range is fixed, device mode requires different addresses to be programmed based on the host buffer address. Add a new ops to get the least 28 bits of the corresponding 32 bit CPU address and invoke it before programming the address translation unit. Acked-by: Joao PintoSigned-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pcie-designware.c |3 +++ drivers/pci/dwc/pcie-designware.h |1 + 2 files changed, 4 insertions(+) diff --git a/drivers/pci/dwc/pcie-designware.c b/drivers/pci/dwc/pcie-designware.c index 7e1fb7d..14ee7a3 100644 --- a/drivers/pci/dwc/pcie-designware.c +++ b/drivers/pci/dwc/pcie-designware.c @@ -97,6 +97,9 @@ void dw_pcie_prog_outbound_atu(struct dw_pcie *pci, int index, int type, { u32 retries, val; + if (pp->ops->cpu_addr_fixup) + cpu_addr = pp->ops->cpu_addr_fixup(cpu_addr); + if (pci->iatu_unroll_enabled) { dw_pcie_writel_unroll(pci, index, PCIE_ATU_UNR_LOWER_BASE, lower_32_bits(cpu_addr)); diff --git a/drivers/pci/dwc/pcie-designware.h b/drivers/pci/dwc/pcie-designware.h index cd3b871..8f3dcb2 100644 --- a/drivers/pci/dwc/pcie-designware.h +++ b/drivers/pci/dwc/pcie-designware.h @@ -143,6 +143,7 @@ struct pcie_port { }; struct dw_pcie_ops { + u64 (*cpu_addr_fixup)(u64 cpu_addr); u32 (*readl_dbi)(struct dw_pcie *pcie, u32 reg); void(*writel_dbi)(struct dw_pcie *pcie, u32 reg, u32 val); int (*link_up)(struct dw_pcie *pcie); -- 1.7.9.5
[RESEND PATCH v3 3/7] PCI: dwc: artpec6: Populate cpu_addr_fixup ops
Populate cpu_addr_fixup ops to extract the least 28 bits of the corresponding cpu address. Cc: Niklas CasselAcked-by: Joao Pinto Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pcie-artpec6.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/drivers/pci/dwc/pcie-artpec6.c b/drivers/pci/dwc/pcie-artpec6.c index fcd3ef8..5b3b3af 100644 --- a/drivers/pci/dwc/pcie-artpec6.c +++ b/drivers/pci/dwc/pcie-artpec6.c @@ -78,6 +78,11 @@ static void artpec6_pcie_writel(struct artpec6_pcie *artpec6_pcie, u32 offset, u regmap_write(artpec6_pcie->regmap, offset, val); } +static u64 artpec6_pcie_cpu_addr_fixup(u64 pci_addr) +{ + return pci_addr & ARTPEC6_CPU_TO_BUS_ADDR; +} + static int artpec6_pcie_establish_link(struct artpec6_pcie *artpec6_pcie) { struct dw_pcie *pci = artpec6_pcie->pci; @@ -142,11 +147,6 @@ static int artpec6_pcie_establish_link(struct artpec6_pcie *artpec6_pcie) */ dw_pcie_writel_dbi(pci, MISC_CONTROL_1_OFF, DBI_RO_WR_EN); - pp->io_base &= ARTPEC6_CPU_TO_BUS_ADDR; - pp->mem_base &= ARTPEC6_CPU_TO_BUS_ADDR; - pp->cfg0_base &= ARTPEC6_CPU_TO_BUS_ADDR; - pp->cfg1_base &= ARTPEC6_CPU_TO_BUS_ADDR; - /* setup root complex */ dw_pcie_setup_rc(pp); @@ -234,6 +234,10 @@ static int artpec6_add_pcie_port(struct artpec6_pcie *artpec6_pcie, return 0; } +static const struct dw_pcie_ops dw_pcie_ops = { + .cpu_addr_fixup = artpec6_pcie_cpu_addr_fixup, +}; + static int artpec6_pcie_probe(struct platform_device *pdev) { struct device *dev = >dev; @@ -252,6 +256,7 @@ static int artpec6_pcie_probe(struct platform_device *pdev) return -ENOMEM; pci->dev = dev; + pci->ops = _pcie_ops; artpec6_pcie->pci = pci; -- 1.7.9.5
[RESEND PATCH v3 1/7] PCI: dwc: designware: Add new *ops* for cpu addr fixup
Some platforms (like dra7xx) require only the least 28 bits of the corresponding 32 bit CPU address to be programmed in the address translation unit. This modified address is stored in io_base/mem_base/ cfg0_base/cfg1_base in dra7xx_pcie_host_init. While this is okay for host mode where the address range is fixed, device mode requires different addresses to be programmed based on the host buffer address. Add a new ops to get the least 28 bits of the corresponding 32 bit CPU address and invoke it before programming the address translation unit. Acked-by: Joao Pinto Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pcie-designware.c |3 +++ drivers/pci/dwc/pcie-designware.h |1 + 2 files changed, 4 insertions(+) diff --git a/drivers/pci/dwc/pcie-designware.c b/drivers/pci/dwc/pcie-designware.c index 7e1fb7d..14ee7a3 100644 --- a/drivers/pci/dwc/pcie-designware.c +++ b/drivers/pci/dwc/pcie-designware.c @@ -97,6 +97,9 @@ void dw_pcie_prog_outbound_atu(struct dw_pcie *pci, int index, int type, { u32 retries, val; + if (pp->ops->cpu_addr_fixup) + cpu_addr = pp->ops->cpu_addr_fixup(cpu_addr); + if (pci->iatu_unroll_enabled) { dw_pcie_writel_unroll(pci, index, PCIE_ATU_UNR_LOWER_BASE, lower_32_bits(cpu_addr)); diff --git a/drivers/pci/dwc/pcie-designware.h b/drivers/pci/dwc/pcie-designware.h index cd3b871..8f3dcb2 100644 --- a/drivers/pci/dwc/pcie-designware.h +++ b/drivers/pci/dwc/pcie-designware.h @@ -143,6 +143,7 @@ struct pcie_port { }; struct dw_pcie_ops { + u64 (*cpu_addr_fixup)(u64 cpu_addr); u32 (*readl_dbi)(struct dw_pcie *pcie, u32 reg); void(*writel_dbi)(struct dw_pcie *pcie, u32 reg, u32 val); int (*link_up)(struct dw_pcie *pcie); -- 1.7.9.5
[RESEND PATCH v3 3/7] PCI: dwc: artpec6: Populate cpu_addr_fixup ops
Populate cpu_addr_fixup ops to extract the least 28 bits of the corresponding cpu address. Cc: Niklas Cassel Acked-by: Joao Pinto Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pcie-artpec6.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/drivers/pci/dwc/pcie-artpec6.c b/drivers/pci/dwc/pcie-artpec6.c index fcd3ef8..5b3b3af 100644 --- a/drivers/pci/dwc/pcie-artpec6.c +++ b/drivers/pci/dwc/pcie-artpec6.c @@ -78,6 +78,11 @@ static void artpec6_pcie_writel(struct artpec6_pcie *artpec6_pcie, u32 offset, u regmap_write(artpec6_pcie->regmap, offset, val); } +static u64 artpec6_pcie_cpu_addr_fixup(u64 pci_addr) +{ + return pci_addr & ARTPEC6_CPU_TO_BUS_ADDR; +} + static int artpec6_pcie_establish_link(struct artpec6_pcie *artpec6_pcie) { struct dw_pcie *pci = artpec6_pcie->pci; @@ -142,11 +147,6 @@ static int artpec6_pcie_establish_link(struct artpec6_pcie *artpec6_pcie) */ dw_pcie_writel_dbi(pci, MISC_CONTROL_1_OFF, DBI_RO_WR_EN); - pp->io_base &= ARTPEC6_CPU_TO_BUS_ADDR; - pp->mem_base &= ARTPEC6_CPU_TO_BUS_ADDR; - pp->cfg0_base &= ARTPEC6_CPU_TO_BUS_ADDR; - pp->cfg1_base &= ARTPEC6_CPU_TO_BUS_ADDR; - /* setup root complex */ dw_pcie_setup_rc(pp); @@ -234,6 +234,10 @@ static int artpec6_add_pcie_port(struct artpec6_pcie *artpec6_pcie, return 0; } +static const struct dw_pcie_ops dw_pcie_ops = { + .cpu_addr_fixup = artpec6_pcie_cpu_addr_fixup, +}; + static int artpec6_pcie_probe(struct platform_device *pdev) { struct device *dev = >dev; @@ -252,6 +256,7 @@ static int artpec6_pcie_probe(struct platform_device *pdev) return -ENOMEM; pci->dev = dev; + pci->ops = _pcie_ops; artpec6_pcie->pci = pci; -- 1.7.9.5
[RESEND PATCH v3 5/7] PCI: dwc: all: Modify dbi accessors to access data of 4/2/1 bytes
Previously dbi accessors can be used to access data of size 4 bytes. But there might be situations (like accessing MSI_MESSAGE_CONTROL in order to set/get the number of required MSI interrupts in EP mode) where dbi accessors must be used to access data of size 2. This is in preparation for adding endpoint mode support to designware driver. Cc: Jingoo HanCc: Richard Zhu Cc: Lucas Stach Cc: Murali Karicheri Cc: Thomas Petazzoni Cc: Niklas Cassel Cc: Jesper Nilsson Cc: Joao Pinto Cc: Zhou Wang Cc: Gabriele Paoloni Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pci-dra7xx.c |8 ++-- drivers/pci/dwc/pci-exynos.c | 16 +++ drivers/pci/dwc/pci-imx6.c | 54 +++--- drivers/pci/dwc/pci-keystone-dw.c | 13 +++--- drivers/pci/dwc/pcie-armada8k.c| 38 drivers/pci/dwc/pcie-artpec6.c |6 +-- drivers/pci/dwc/pcie-designware-host.c | 18 drivers/pci/dwc/pcie-designware.c | 77 +++- drivers/pci/dwc/pcie-designware.h | 14 +++--- drivers/pci/dwc/pcie-hisi.c| 14 +++--- 10 files changed, 138 insertions(+), 120 deletions(-) diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c index 3708bd6..c6fef0a 100644 --- a/drivers/pci/dwc/pci-dra7xx.c +++ b/drivers/pci/dwc/pci-dra7xx.c @@ -499,9 +499,9 @@ static int dra7xx_pcie_suspend(struct device *dev) u32 val; /* clear MSE */ - val = dw_pcie_readl_dbi(pci, base, PCI_COMMAND); + val = dw_pcie_read_dbi(pci, base, PCI_COMMAND, 0x4); val &= ~PCI_COMMAND_MEMORY; - dw_pcie_writel_dbi(pci, base, PCI_COMMAND, val); + dw_pcie_write_dbi(pci, base, PCI_COMMAND, 0x4, val); return 0; } @@ -514,9 +514,9 @@ static int dra7xx_pcie_resume(struct device *dev) u32 val; /* set MSE */ - val = dw_pcie_readl_dbi(pci, base, PCI_COMMAND); + val = dw_pcie_read_dbi(pci, base, PCI_COMMAND, 0x4); val |= PCI_COMMAND_MEMORY; - dw_pcie_writel_dbi(pci, base, PCI_COMMAND, val); + dw_pcie_write_dbi(pci, base, PCI_COMMAND, 0x4, val); return 0; } diff --git a/drivers/pci/dwc/pci-exynos.c b/drivers/pci/dwc/pci-exynos.c index a0d40f7..37d6d2b 100644 --- a/drivers/pci/dwc/pci-exynos.c +++ b/drivers/pci/dwc/pci-exynos.c @@ -521,25 +521,25 @@ static void exynos_pcie_enable_interrupts(struct exynos_pcie *ep) exynos_pcie_msi_init(ep); } -static u32 exynos_pcie_readl_dbi(struct dw_pcie *pci, void __iomem *base, -u32 reg) +static u32 exynos_pcie_read_dbi(struct dw_pcie *pci, void __iomem *base, + u32 reg, size_t size) { struct exynos_pcie *ep = to_exynos_pcie(pci); u32 val; exynos_pcie_sideband_dbi_r_mode(ep, true); - val = readl(base + reg); + dw_pcie_read(base + reg, size, ); exynos_pcie_sideband_dbi_r_mode(ep, false); return val; } -static void exynos_pcie_writel_dbi(struct dw_pcie *pci, void __iomem *base, - u32 reg, u32 val) +static void exynos_pcie_write_dbi(struct dw_pcie *pci, void __iomem *base, + u32 reg, size_t size, u32 val) { struct exynos_pcie *ep = to_exynos_pcie(pci); exynos_pcie_sideband_dbi_w_mode(ep, true); - writel(val, base + reg); + dw_pcie_write(base + reg, size, val); exynos_pcie_sideband_dbi_w_mode(ep, false); } @@ -646,8 +646,8 @@ static int __init exynos_add_pcie_port(struct exynos_pcie *ep, } static const struct dw_pcie_ops dw_pcie_ops = { - .readl_dbi = exynos_pcie_readl_dbi, - .writel_dbi = exynos_pcie_writel_dbi, + .read_dbi = exynos_pcie_read_dbi, + .write_dbi = exynos_pcie_write_dbi, .link_up = exynos_pcie_link_up, }; diff --git a/drivers/pci/dwc/pci-imx6.c b/drivers/pci/dwc/pci-imx6.c index 85dd901..e58ca7a 100644 --- a/drivers/pci/dwc/pci-imx6.c +++ b/drivers/pci/dwc/pci-imx6.c @@ -104,7 +104,7 @@ static int pcie_phy_poll_ack(struct imx6_pcie *imx6_pcie, int exp_val) u32 wait_counter = 0; do { - val = dw_pcie_readl_dbi(pci, base, PCIE_PHY_STAT); + val = dw_pcie_read_dbi(pci, base, PCIE_PHY_STAT, 0x4); val = (val >> PCIE_PHY_STAT_ACK_LOC) & 0x1; wait_counter++; @@ -125,17 +125,17 @@ static int pcie_phy_wait_ack(struct imx6_pcie *imx6_pcie, int addr) int ret; val = addr << PCIE_PHY_CTRL_DATA_LOC; - dw_pcie_writel_dbi(pci, base, PCIE_PHY_CTRL, val); + dw_pcie_write_dbi(pci, base, PCIE_PHY_CTRL, 0x4, val);
[RESEND PATCH v3 5/7] PCI: dwc: all: Modify dbi accessors to access data of 4/2/1 bytes
Previously dbi accessors can be used to access data of size 4 bytes. But there might be situations (like accessing MSI_MESSAGE_CONTROL in order to set/get the number of required MSI interrupts in EP mode) where dbi accessors must be used to access data of size 2. This is in preparation for adding endpoint mode support to designware driver. Cc: Jingoo Han Cc: Richard Zhu Cc: Lucas Stach Cc: Murali Karicheri Cc: Thomas Petazzoni Cc: Niklas Cassel Cc: Jesper Nilsson Cc: Joao Pinto Cc: Zhou Wang Cc: Gabriele Paoloni Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pci-dra7xx.c |8 ++-- drivers/pci/dwc/pci-exynos.c | 16 +++ drivers/pci/dwc/pci-imx6.c | 54 +++--- drivers/pci/dwc/pci-keystone-dw.c | 13 +++--- drivers/pci/dwc/pcie-armada8k.c| 38 drivers/pci/dwc/pcie-artpec6.c |6 +-- drivers/pci/dwc/pcie-designware-host.c | 18 drivers/pci/dwc/pcie-designware.c | 77 +++- drivers/pci/dwc/pcie-designware.h | 14 +++--- drivers/pci/dwc/pcie-hisi.c| 14 +++--- 10 files changed, 138 insertions(+), 120 deletions(-) diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c index 3708bd6..c6fef0a 100644 --- a/drivers/pci/dwc/pci-dra7xx.c +++ b/drivers/pci/dwc/pci-dra7xx.c @@ -499,9 +499,9 @@ static int dra7xx_pcie_suspend(struct device *dev) u32 val; /* clear MSE */ - val = dw_pcie_readl_dbi(pci, base, PCI_COMMAND); + val = dw_pcie_read_dbi(pci, base, PCI_COMMAND, 0x4); val &= ~PCI_COMMAND_MEMORY; - dw_pcie_writel_dbi(pci, base, PCI_COMMAND, val); + dw_pcie_write_dbi(pci, base, PCI_COMMAND, 0x4, val); return 0; } @@ -514,9 +514,9 @@ static int dra7xx_pcie_resume(struct device *dev) u32 val; /* set MSE */ - val = dw_pcie_readl_dbi(pci, base, PCI_COMMAND); + val = dw_pcie_read_dbi(pci, base, PCI_COMMAND, 0x4); val |= PCI_COMMAND_MEMORY; - dw_pcie_writel_dbi(pci, base, PCI_COMMAND, val); + dw_pcie_write_dbi(pci, base, PCI_COMMAND, 0x4, val); return 0; } diff --git a/drivers/pci/dwc/pci-exynos.c b/drivers/pci/dwc/pci-exynos.c index a0d40f7..37d6d2b 100644 --- a/drivers/pci/dwc/pci-exynos.c +++ b/drivers/pci/dwc/pci-exynos.c @@ -521,25 +521,25 @@ static void exynos_pcie_enable_interrupts(struct exynos_pcie *ep) exynos_pcie_msi_init(ep); } -static u32 exynos_pcie_readl_dbi(struct dw_pcie *pci, void __iomem *base, -u32 reg) +static u32 exynos_pcie_read_dbi(struct dw_pcie *pci, void __iomem *base, + u32 reg, size_t size) { struct exynos_pcie *ep = to_exynos_pcie(pci); u32 val; exynos_pcie_sideband_dbi_r_mode(ep, true); - val = readl(base + reg); + dw_pcie_read(base + reg, size, ); exynos_pcie_sideband_dbi_r_mode(ep, false); return val; } -static void exynos_pcie_writel_dbi(struct dw_pcie *pci, void __iomem *base, - u32 reg, u32 val) +static void exynos_pcie_write_dbi(struct dw_pcie *pci, void __iomem *base, + u32 reg, size_t size, u32 val) { struct exynos_pcie *ep = to_exynos_pcie(pci); exynos_pcie_sideband_dbi_w_mode(ep, true); - writel(val, base + reg); + dw_pcie_write(base + reg, size, val); exynos_pcie_sideband_dbi_w_mode(ep, false); } @@ -646,8 +646,8 @@ static int __init exynos_add_pcie_port(struct exynos_pcie *ep, } static const struct dw_pcie_ops dw_pcie_ops = { - .readl_dbi = exynos_pcie_readl_dbi, - .writel_dbi = exynos_pcie_writel_dbi, + .read_dbi = exynos_pcie_read_dbi, + .write_dbi = exynos_pcie_write_dbi, .link_up = exynos_pcie_link_up, }; diff --git a/drivers/pci/dwc/pci-imx6.c b/drivers/pci/dwc/pci-imx6.c index 85dd901..e58ca7a 100644 --- a/drivers/pci/dwc/pci-imx6.c +++ b/drivers/pci/dwc/pci-imx6.c @@ -104,7 +104,7 @@ static int pcie_phy_poll_ack(struct imx6_pcie *imx6_pcie, int exp_val) u32 wait_counter = 0; do { - val = dw_pcie_readl_dbi(pci, base, PCIE_PHY_STAT); + val = dw_pcie_read_dbi(pci, base, PCIE_PHY_STAT, 0x4); val = (val >> PCIE_PHY_STAT_ACK_LOC) & 0x1; wait_counter++; @@ -125,17 +125,17 @@ static int pcie_phy_wait_ack(struct imx6_pcie *imx6_pcie, int addr) int ret; val = addr << PCIE_PHY_CTRL_DATA_LOC; - dw_pcie_writel_dbi(pci, base, PCIE_PHY_CTRL, val); + dw_pcie_write_dbi(pci, base, PCIE_PHY_CTRL, 0x4, val); val |= (0x1 << PCIE_PHY_CTRL_CAP_ADR_LOC); - dw_pcie_writel_dbi(pci, base, PCIE_PHY_CTRL, val); + dw_pcie_write_dbi(pci, base, PCIE_PHY_CTRL, 0x4, val); ret = pcie_phy_poll_ack(imx6_pcie, 1); if (ret) return ret;
[RESEND PATCH v3 7/7] PCI: dwc: dra7xx: Push request_irq call to the bottom of probe
From: KeerthyCurrently devm_request_irq is being called before base, pci fields of dra7xx_pcie structure are populated. It is called even before pm_runtime_enable and pm_runtime_get_sync are called. This will lead to exceptions if in case an interrupt is triggered before the all of the above are done. Hence push the devm_request_irq call to the end of the probe. Signed-off-by: Keerthy Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pci-dra7xx.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c index c6fef0a..8c53233 100644 --- a/drivers/pci/dwc/pci-dra7xx.c +++ b/drivers/pci/dwc/pci-dra7xx.c @@ -410,13 +410,6 @@ static int __init dra7xx_pcie_probe(struct platform_device *pdev) return -EINVAL; } - ret = devm_request_irq(dev, irq, dra7xx_pcie_irq_handler, - IRQF_SHARED, "dra7xx-pcie-main", dra7xx); - if (ret) { - dev_err(dev, "failed to request irq\n"); - return ret; - } - res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "ti_conf"); base = devm_ioremap_nocache(dev, res->start, resource_size(res)); if (!base) @@ -478,6 +471,13 @@ static int __init dra7xx_pcie_probe(struct platform_device *pdev) if (ret < 0) goto err_gpio; + ret = devm_request_irq(dev, irq, dra7xx_pcie_irq_handler, + IRQF_SHARED, "dra7xx-pcie-main", dra7xx); + if (ret) { + dev_err(dev, "failed to request irq\n"); + goto err_gpio; + } + return 0; err_gpio: -- 1.7.9.5
[RESEND PATCH v3 7/7] PCI: dwc: dra7xx: Push request_irq call to the bottom of probe
From: Keerthy Currently devm_request_irq is being called before base, pci fields of dra7xx_pcie structure are populated. It is called even before pm_runtime_enable and pm_runtime_get_sync are called. This will lead to exceptions if in case an interrupt is triggered before the all of the above are done. Hence push the devm_request_irq call to the end of the probe. Signed-off-by: Keerthy Signed-off-by: Kishon Vijay Abraham I --- drivers/pci/dwc/pci-dra7xx.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c index c6fef0a..8c53233 100644 --- a/drivers/pci/dwc/pci-dra7xx.c +++ b/drivers/pci/dwc/pci-dra7xx.c @@ -410,13 +410,6 @@ static int __init dra7xx_pcie_probe(struct platform_device *pdev) return -EINVAL; } - ret = devm_request_irq(dev, irq, dra7xx_pcie_irq_handler, - IRQF_SHARED, "dra7xx-pcie-main", dra7xx); - if (ret) { - dev_err(dev, "failed to request irq\n"); - return ret; - } - res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "ti_conf"); base = devm_ioremap_nocache(dev, res->start, resource_size(res)); if (!base) @@ -478,6 +471,13 @@ static int __init dra7xx_pcie_probe(struct platform_device *pdev) if (ret < 0) goto err_gpio; + ret = devm_request_irq(dev, irq, dra7xx_pcie_irq_handler, + IRQF_SHARED, "dra7xx-pcie-main", dra7xx); + if (ret) { + dev_err(dev, "failed to request irq\n"); + goto err_gpio; + } + return 0; err_gpio: -- 1.7.9.5
Re: kexec regression since 4.9 caused by efi
Add efi/kexec list. On 03/08/17 at 12:16pm, Omar Sandoval wrote: > Hi, everyone, > > Since 4.9, kexec results in the following panic on some of our servers: > > [0.001000] general protection fault: [#1] SMP > [0.001000] Modules linked in: > [0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1 #53 > [0.001000] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05 > 09/30/2016 > [0.001000] task: 81e0e4c0 task.stack: 81e0 > [0.001000] RIP: 0010:virt_efi_set_variable+0x85/0x1a0 > [0.001000] RSP: :81e03e18 EFLAGS: 00010202 > [0.001000] RAX: afafafafafafafaf RBX: 81e3a4e0 RCX: > 0007 > [0.001000] RDX: 81e03e70 RSI: 81e3a4e0 RDI: > 88407f8c2de0 > [0.001000] RBP: 81e03e60 R08: R09: > > [0.001000] R10: R11: R12: > 81e03e70 > [0.001000] R13: 0007 R14: R15: > > [0.001000] FS: () GS:881fff60() > knlGS: > [0.001000] CS: 0010 DS: ES: CR0: 80050033 > [0.001000] CR2: 88407f30f000 CR3: 001fff102000 CR4: > 000406b0 > [0.001000] DR0: DR1: DR2: > > [0.001000] DR3: DR6: fffe0ff0 DR7: > 0400 > [0.001000] Call Trace: > [0.001000] efi_delete_dummy_variable+0x7a/0x80 > [0.001000] efi_enter_virtual_mode+0x3e2/0x494 > [0.001000] start_kernel+0x392/0x418 > [0.001000] ? set_init_arg+0x55/0x55 > [0.001000] x86_64_start_reservations+0x2a/0x2c > [0.001000] x86_64_start_kernel+0xea/0xed > [0.001000] start_cpu+0x14/0x14 > [0.001000] Code: 42 25 8d ff 80 3d 43 77 95 00 00 75 68 9c 8f 04 24 48 8b > 05 3e 7d 7e 00 48 89 de 4d 89 f9 4d 89 f0 44 89 e9 4c 89 e2 48 8b 40 58 <48> > 8b 78 58 31 c0 e8 90 e4 92 ff 48 8b 3c 24 48 c7 c6 2b 0a ca > [0.001000] RIP: virt_efi_set_variable+0x85/0x1a0 RSP: 81e03e18 > [0.001000] ---[ end trace 0bd213e540e9b19f ]--- > [0.001000] Kernel panic - not syncing: Fatal exception > [0.001000] ---[ end Kernel panic - not syncing: Fatal exception > > Booting normally (i.e., not kexec) still works. > > The decoded code is: > > >0: 42 25 8d ff 80 3d rex.X and $0x3d80ff8d,%eax >6: 43 77 95rex.XB ja 0xff9e >9: 00 00 add%al,(%rax) >b: 75 68 jne0x75 >d: 9c pushfq >e: 8f 04 24popq (%rsp) > 11: 48 8b 05 3e 7d 7e 00mov0x7e7d3e(%rip),%rax# 0x7e7d56 > 18: 48 89 demov%rbx,%rsi > 1b: 4d 89 f9mov%r15,%r9 > 1e: 4d 89 f0mov%r14,%r8 > 21: 44 89 e9mov%r13d,%ecx > 24: 4c 89 e2mov%r12,%rdx > 27: 48 8b 40 58 mov0x58(%rax),%rax > 2b:* 48 8b 78 58 mov0x58(%rax),%rdi <-- trapping > instruction > 2f: 31 c0 xor%eax,%eax > 31: e8 90 e4 92 ff callq 0xff92e4c6 > 36: 48 8b 3c 24 mov(%rsp),%rdi > 3a: 48 rex.W > 3b: c7 .byte 0xc7 > 3c: c6 (bad) > 3d: 2b 0a sub(%rdx),%ecx > 3f: ca .byte 0xca > > If I'm reading this correctly, efi.systab->runtime == 0xafafafafafafafaf, I have no more clue yet from your provided log, but the runtime value is odd to me. It is set in below code: arch/x86/platform/efi/efi.c: efi_systab_init() efi_systab.runtime = data ? (void *)(unsigned long)data->runtime : (void *)(unsigne long)systab64->runtime; Here data is the setup_data passed by kexec-tools from normal kernel to kexec kernel, efi_setup_data structure is like below: struct efi_setup_data { u64 fw_vendor; u64 runtime; u64 tables; u64 smbios; u64 reserved[8]; }; kexec-tools get the runtime address from /sys/firmware/efi/runtime So can you do some debuggin on your side, eg. see the sysfs runtime value is correct or not. And add some printk in efi init path etc. > and we're crashing when we try to dereference that. > > Here is the output of efi=debug from before the crash: > > [0.00] Linux version 4.11.0-rc1 (osan...@devbig561.prn1.facebook.com) > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #53 SMP Wed Mar 8 > 12:07:16 PST 2017 > [0.00] Command line: BOOT_IMAGE=/vmlinuz-4.6.7-34_fbk7_2504_g8275185 > ro root=LABEL=/ ipv6.autoconf=0 erst_disable biosdevname=0 net.ifnames=0 > fsck.repair=yes pcie_pme=nomsi >
Re: kexec regression since 4.9 caused by efi
Add efi/kexec list. On 03/08/17 at 12:16pm, Omar Sandoval wrote: > Hi, everyone, > > Since 4.9, kexec results in the following panic on some of our servers: > > [0.001000] general protection fault: [#1] SMP > [0.001000] Modules linked in: > [0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1 #53 > [0.001000] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05 > 09/30/2016 > [0.001000] task: 81e0e4c0 task.stack: 81e0 > [0.001000] RIP: 0010:virt_efi_set_variable+0x85/0x1a0 > [0.001000] RSP: :81e03e18 EFLAGS: 00010202 > [0.001000] RAX: afafafafafafafaf RBX: 81e3a4e0 RCX: > 0007 > [0.001000] RDX: 81e03e70 RSI: 81e3a4e0 RDI: > 88407f8c2de0 > [0.001000] RBP: 81e03e60 R08: R09: > > [0.001000] R10: R11: R12: > 81e03e70 > [0.001000] R13: 0007 R14: R15: > > [0.001000] FS: () GS:881fff60() > knlGS: > [0.001000] CS: 0010 DS: ES: CR0: 80050033 > [0.001000] CR2: 88407f30f000 CR3: 001fff102000 CR4: > 000406b0 > [0.001000] DR0: DR1: DR2: > > [0.001000] DR3: DR6: fffe0ff0 DR7: > 0400 > [0.001000] Call Trace: > [0.001000] efi_delete_dummy_variable+0x7a/0x80 > [0.001000] efi_enter_virtual_mode+0x3e2/0x494 > [0.001000] start_kernel+0x392/0x418 > [0.001000] ? set_init_arg+0x55/0x55 > [0.001000] x86_64_start_reservations+0x2a/0x2c > [0.001000] x86_64_start_kernel+0xea/0xed > [0.001000] start_cpu+0x14/0x14 > [0.001000] Code: 42 25 8d ff 80 3d 43 77 95 00 00 75 68 9c 8f 04 24 48 8b > 05 3e 7d 7e 00 48 89 de 4d 89 f9 4d 89 f0 44 89 e9 4c 89 e2 48 8b 40 58 <48> > 8b 78 58 31 c0 e8 90 e4 92 ff 48 8b 3c 24 48 c7 c6 2b 0a ca > [0.001000] RIP: virt_efi_set_variable+0x85/0x1a0 RSP: 81e03e18 > [0.001000] ---[ end trace 0bd213e540e9b19f ]--- > [0.001000] Kernel panic - not syncing: Fatal exception > [0.001000] ---[ end Kernel panic - not syncing: Fatal exception > > Booting normally (i.e., not kexec) still works. > > The decoded code is: > > >0: 42 25 8d ff 80 3d rex.X and $0x3d80ff8d,%eax >6: 43 77 95rex.XB ja 0xff9e >9: 00 00 add%al,(%rax) >b: 75 68 jne0x75 >d: 9c pushfq >e: 8f 04 24popq (%rsp) > 11: 48 8b 05 3e 7d 7e 00mov0x7e7d3e(%rip),%rax# 0x7e7d56 > 18: 48 89 demov%rbx,%rsi > 1b: 4d 89 f9mov%r15,%r9 > 1e: 4d 89 f0mov%r14,%r8 > 21: 44 89 e9mov%r13d,%ecx > 24: 4c 89 e2mov%r12,%rdx > 27: 48 8b 40 58 mov0x58(%rax),%rax > 2b:* 48 8b 78 58 mov0x58(%rax),%rdi <-- trapping > instruction > 2f: 31 c0 xor%eax,%eax > 31: e8 90 e4 92 ff callq 0xff92e4c6 > 36: 48 8b 3c 24 mov(%rsp),%rdi > 3a: 48 rex.W > 3b: c7 .byte 0xc7 > 3c: c6 (bad) > 3d: 2b 0a sub(%rdx),%ecx > 3f: ca .byte 0xca > > If I'm reading this correctly, efi.systab->runtime == 0xafafafafafafafaf, I have no more clue yet from your provided log, but the runtime value is odd to me. It is set in below code: arch/x86/platform/efi/efi.c: efi_systab_init() efi_systab.runtime = data ? (void *)(unsigned long)data->runtime : (void *)(unsigne long)systab64->runtime; Here data is the setup_data passed by kexec-tools from normal kernel to kexec kernel, efi_setup_data structure is like below: struct efi_setup_data { u64 fw_vendor; u64 runtime; u64 tables; u64 smbios; u64 reserved[8]; }; kexec-tools get the runtime address from /sys/firmware/efi/runtime So can you do some debuggin on your side, eg. see the sysfs runtime value is correct or not. And add some printk in efi init path etc. > and we're crashing when we try to dereference that. > > Here is the output of efi=debug from before the crash: > > [0.00] Linux version 4.11.0-rc1 (osan...@devbig561.prn1.facebook.com) > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #53 SMP Wed Mar 8 > 12:07:16 PST 2017 > [0.00] Command line: BOOT_IMAGE=/vmlinuz-4.6.7-34_fbk7_2504_g8275185 > ro root=LABEL=/ ipv6.autoconf=0 erst_disable biosdevname=0 net.ifnames=0 > fsck.repair=yes pcie_pme=nomsi >
Re: [PATCH v5 2/5] powerpc: kretprobes: override default function entry offset
"Naveen N. Rao"writes: > On 2017/03/08 11:29AM, Arnaldo Carvalho de Melo wrote: >> > I wasn't sure if you were planning on picking up KPROBES_ON_FTRACE for >> > v4.11. If so, it would be good to take this patch through the powerpc >> > tree. Otherwise, this can go via Ingo's tree. >> >> If you guys convince Ingo that this should go _now_, then just cherry >> pick what was merged into tip/perf/core that is needed for the arch >> specific stuff and go from there. > > Ok, in hindsight, I think Michael's concern was actually for v4.12 Yes I was talking about 4.12, sorry I thought that was implied :) > itself, in which case this particular patch can go via powerpc tree, > while the rest of the patches in this series can go via your tree. > > Michael? Yeah I think that's the easiest option. The function will be temporarily unused until the two trees are merged, but I think that's fine. cheers
Re: [PATCH v5 2/5] powerpc: kretprobes: override default function entry offset
"Naveen N. Rao" writes: > On 2017/03/08 11:29AM, Arnaldo Carvalho de Melo wrote: >> > I wasn't sure if you were planning on picking up KPROBES_ON_FTRACE for >> > v4.11. If so, it would be good to take this patch through the powerpc >> > tree. Otherwise, this can go via Ingo's tree. >> >> If you guys convince Ingo that this should go _now_, then just cherry >> pick what was merged into tip/perf/core that is needed for the arch >> specific stuff and go from there. > > Ok, in hindsight, I think Michael's concern was actually for v4.12 Yes I was talking about 4.12, sorry I thought that was implied :) > itself, in which case this particular patch can go via powerpc tree, > while the rest of the patches in this series can go via your tree. > > Michael? Yeah I think that's the easiest option. The function will be temporarily unused until the two trees are merged, but I think that's fine. cheers
Re: [f2fs-dev] [PATCH] f2fs: allocate a bio for discarding when actually issuing it
Hi Jaegeuk, On 2017/3/8 10:33, Jaegeuk Kim wrote: > Let's allocate a bio when issuing discard commands later. > > Signed-off-by: Jaegeuk Kim> --- > fs/f2fs/f2fs.h| 4 +- > fs/f2fs/segment.c | 113 > -- > 2 files changed, 62 insertions(+), 55 deletions(-) > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index a58c2e43bd2a..870bb4d9bc65 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -197,10 +197,12 @@ enum { > struct discard_cmd { > struct list_head list; /* command list */ > struct completion wait; /* compleation */ > + struct block_device *bdev; /* bdev */ > block_t lstart; /* logical start address */ > + block_t start; /* actual start address in dev */ > block_t len;/* length */ > - struct bio *bio;/* bio */ > int state; /* state */ > + int error; /* bio error */ > }; > > struct discard_cmd_control { > diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c > index 50c65cc4645a..d8f9e6c895cd 100644 > --- a/fs/f2fs/segment.c > +++ b/fs/f2fs/segment.c > @@ -636,7 +636,8 @@ static void locate_dirty_segment(struct f2fs_sb_info > *sbi, unsigned int segno) > } > > static void __add_discard_cmd(struct f2fs_sb_info *sbi, > - struct bio *bio, block_t lstart, block_t len) > + struct block_device *bdev, block_t lstart, > + block_t start, block_t len) > { > struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; > struct list_head *cmd_list = &(dcc->discard_cmd_list); > @@ -644,9 +645,9 @@ static void __add_discard_cmd(struct f2fs_sb_info *sbi, > > dc = f2fs_kmem_cache_alloc(discard_cmd_slab, GFP_NOFS); > INIT_LIST_HEAD(>list); > - dc->bio = bio; > - bio->bi_private = dc; > + dc->bdev = bdev; > dc->lstart = lstart; > + dc->start = start; > dc->len = len; > dc->state = D_PREP; > init_completion(>wait); > @@ -658,22 +659,66 @@ static void __add_discard_cmd(struct f2fs_sb_info *sbi, > > static void __remove_discard_cmd(struct f2fs_sb_info *sbi, struct > discard_cmd *dc) > { > - int err = dc->bio->bi_error; > - > if (dc->state == D_DONE) > atomic_dec(&(SM_I(sbi)->dcc_info->submit_discard)); > > - if (err == -EOPNOTSUPP) > - err = 0; > + if (dc->error == -EOPNOTSUPP) > + dc->error = 0; > > - if (err) > + if (dc->error) > f2fs_msg(sbi->sb, KERN_INFO, > - "Issue discard failed, ret: %d", err); > - bio_put(dc->bio); > + "Issue discard failed, ret: %d", dc->error); > list_del(>list); > kmem_cache_free(discard_cmd_slab, dc); > } > > +static void f2fs_submit_discard_endio(struct bio *bio) > +{ > + struct discard_cmd *dc = (struct discard_cmd *)bio->bi_private; > + > + complete(>wait); > + dc->error = bio->bi_error; > + dc->state = D_DONE; > + bio_put(bio); > +} > + > +/* this function is copied from blkdev_issue_discard from block/blk-lib.c */ > +static int __submit_discard_cmd(struct discard_cmd *dc) > +{ > + struct bio *bio = NULL; > + int err; > + > + err = __blkdev_issue_discard(dc->bdev, > + SECTOR_FROM_BLOCK(dc->start), > + SECTOR_FROM_BLOCK(dc->len), > + GFP_NOFS, 0, ); > + if (!err && bio) { > + bio->bi_private = dc; > + bio->bi_end_io = f2fs_submit_discard_endio; > + bio->bi_opf |= REQ_SYNC; > + submit_bio(bio); Should set flag only if __blkdev_issue_discard is successful? dc->state = D_SUBMIT; And how about moving atomic_inc(>submit_discard); here? Thanks, > + } > + dc->state = D_SUBMIT; > + return err; > +} > + > +static int __queue_discard_cmd(struct f2fs_sb_info *sbi, > + struct block_device *bdev, block_t blkstart, block_t blklen) > +{ > + block_t lblkstart = blkstart; > + > + trace_f2fs_issue_discard(bdev, blkstart, blklen); > + > + if (sbi->s_ndevs) { > + int devi = f2fs_target_device_index(sbi, blkstart); > + > + blkstart -= FDEV(devi).start_blk; > + } > + __add_discard_cmd(sbi, bdev, lblkstart, blkstart, blklen); > + wake_up(_I(sbi)->dcc_info->discard_wait_queue); > + return 0; > +} > + > /* This should be covered by global mutex, _i->sentry_lock */ > void f2fs_wait_discard_bio(struct f2fs_sb_info *sbi, block_t blkaddr) > { > @@ -690,8 +735,7 @@ void f2fs_wait_discard_bio(struct f2fs_sb_info *sbi, > block_t blkaddr) > > if (blkaddr == NULL_ADDR) { > if (dc->state == D_PREP) { > - dc->state = D_SUBMIT; > - submit_bio(dc->bio);
Re: [RFC 08/11] mm: make ttu's return boolean
Hi John, On Tue, Mar 07, 2017 at 11:13:26PM -0800, John Hubbard wrote: > On 03/01/2017 10:39 PM, Minchan Kim wrote: > >try_to_unmap returns SWAP_SUCCESS or SWAP_FAIL so it's suitable for > >boolean return. This patch changes it. > > Hi Minchan, > > So, up until this patch, I definitely like the cleanup, because as you > observed, the return values didn't need so many different values. However, > at this point, I think you should stop, and keep the SWAP_SUCCESS and > SWAP_FAIL (or maybe even rename them to UNMAP_* or TTU_RESULT_*, to match > their functions' names better), because removing them makes the code > considerably less readable. > > And since this is billed as a cleanup, we care here, even though this is a > minor point. :) > > Bool return values are sometimes perfect, such as when asking a question: > >bool mode_changed = needs_modeset(crtc_state); > > The above is very nice. However, for returning success or failure, bools are > not as nice, because *usually* success == true, except when you use the > errno-based system, in which success == 0 (which would translate to false, > if you mistakenly treated it as a bool). That leads to the reader having to > remember which system is in use, usually with no visual cues to help. I think it's the matter of taste. if (try_to_unmap(xxx)) something else something It's perfectly understandable to me. IOW, if try_to_unmap returns true, it means it did unmap successfully. Otherwise, failed. IMHO, SWAP_SUCCESS or TTU_RESULT_* seems to be an over-engineering. If the user want it, user can do it by introducing right variable name in his context. See below. > > > > [...] > > if (PageSwapCache(p)) { > >@@ -971,7 +971,7 @@ static int hwpoison_user_mappings(struct page *p, > >unsigned long pfn, > > collect_procs(hpage, , flags & MF_ACTION_REQUIRED); > > > > ret = try_to_unmap(hpage, ttu); > >-if (ret != SWAP_SUCCESS) > >+if (!ret) > > pr_err("Memory failure: %#lx: failed to unmap page > > (mapcount=%d)\n", > >pfn, page_mapcount(hpage)); > > > >@@ -986,8 +986,7 @@ static int hwpoison_user_mappings(struct page *p, > >unsigned long pfn, > > * any accesses to the poisoned memory. > > */ > > forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL); > >-kill_procs(, forcekill, trapno, > >- ret != SWAP_SUCCESS, p, pfn, flags); > >+kill_procs(, forcekill, trapno, !ret , p, pfn, flags); > > The kill_procs() invocation was a little more readable before. Indeed but I think it's not a problem of try_to_unmap but ret variable name isn't good any more. How about this? bool unmap_success; unmap_success = try_to_unmap(hpage, ttu); .. kill_procs(, forcekill, trapno, !unmap_success , p, pfn, flags); .. return unmap_success; My point is user can introduce whatever variable name depends on his context. No need to make return variable complicated, IMHO. > > > > [...] > >diff --git a/mm/vmscan.c b/mm/vmscan.c > >index 170c61f..e4b74f1 100644 > >--- a/mm/vmscan.c > >+++ b/mm/vmscan.c > >@@ -966,7 +966,6 @@ static unsigned long shrink_page_list(struct list_head > >*page_list, > > int may_enter_fs; > > enum page_references references = PAGEREF_RECLAIM_CLEAN; > > bool dirty, writeback; > >-int ret = SWAP_SUCCESS; > > > > cond_resched(); > > > >@@ -1139,13 +1138,9 @@ static unsigned long shrink_page_list(struct > >list_head *page_list, > > * processes. Try to unmap it here. > > */ > > if (page_mapped(page)) { > >-switch (ret = try_to_unmap(page, > >-ttu_flags | TTU_BATCH_FLUSH)) { > >-case SWAP_FAIL: > > Again: the SWAP_FAIL makes it crystal clear which case we're in. To me, I don't feel it. To me, below is perfectly understandable. if (try_to_unmap()) do something That's why I think it's matter of taste. Okay, I admit I might be biased, too so I will consider what you suggested if others votes it. Thanks. > > I also wonder if UNMAP_FAIL or TTU_RESULT_FAIL is a better name? > > thanks, > John Hubbard > NVIDIA > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: mailto:"d...@kvack.org;> em...@kvack.org
Re: [f2fs-dev] [PATCH] f2fs: allocate a bio for discarding when actually issuing it
Hi Jaegeuk, On 2017/3/8 10:33, Jaegeuk Kim wrote: > Let's allocate a bio when issuing discard commands later. > > Signed-off-by: Jaegeuk Kim > --- > fs/f2fs/f2fs.h| 4 +- > fs/f2fs/segment.c | 113 > -- > 2 files changed, 62 insertions(+), 55 deletions(-) > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index a58c2e43bd2a..870bb4d9bc65 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -197,10 +197,12 @@ enum { > struct discard_cmd { > struct list_head list; /* command list */ > struct completion wait; /* compleation */ > + struct block_device *bdev; /* bdev */ > block_t lstart; /* logical start address */ > + block_t start; /* actual start address in dev */ > block_t len;/* length */ > - struct bio *bio;/* bio */ > int state; /* state */ > + int error; /* bio error */ > }; > > struct discard_cmd_control { > diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c > index 50c65cc4645a..d8f9e6c895cd 100644 > --- a/fs/f2fs/segment.c > +++ b/fs/f2fs/segment.c > @@ -636,7 +636,8 @@ static void locate_dirty_segment(struct f2fs_sb_info > *sbi, unsigned int segno) > } > > static void __add_discard_cmd(struct f2fs_sb_info *sbi, > - struct bio *bio, block_t lstart, block_t len) > + struct block_device *bdev, block_t lstart, > + block_t start, block_t len) > { > struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; > struct list_head *cmd_list = &(dcc->discard_cmd_list); > @@ -644,9 +645,9 @@ static void __add_discard_cmd(struct f2fs_sb_info *sbi, > > dc = f2fs_kmem_cache_alloc(discard_cmd_slab, GFP_NOFS); > INIT_LIST_HEAD(>list); > - dc->bio = bio; > - bio->bi_private = dc; > + dc->bdev = bdev; > dc->lstart = lstart; > + dc->start = start; > dc->len = len; > dc->state = D_PREP; > init_completion(>wait); > @@ -658,22 +659,66 @@ static void __add_discard_cmd(struct f2fs_sb_info *sbi, > > static void __remove_discard_cmd(struct f2fs_sb_info *sbi, struct > discard_cmd *dc) > { > - int err = dc->bio->bi_error; > - > if (dc->state == D_DONE) > atomic_dec(&(SM_I(sbi)->dcc_info->submit_discard)); > > - if (err == -EOPNOTSUPP) > - err = 0; > + if (dc->error == -EOPNOTSUPP) > + dc->error = 0; > > - if (err) > + if (dc->error) > f2fs_msg(sbi->sb, KERN_INFO, > - "Issue discard failed, ret: %d", err); > - bio_put(dc->bio); > + "Issue discard failed, ret: %d", dc->error); > list_del(>list); > kmem_cache_free(discard_cmd_slab, dc); > } > > +static void f2fs_submit_discard_endio(struct bio *bio) > +{ > + struct discard_cmd *dc = (struct discard_cmd *)bio->bi_private; > + > + complete(>wait); > + dc->error = bio->bi_error; > + dc->state = D_DONE; > + bio_put(bio); > +} > + > +/* this function is copied from blkdev_issue_discard from block/blk-lib.c */ > +static int __submit_discard_cmd(struct discard_cmd *dc) > +{ > + struct bio *bio = NULL; > + int err; > + > + err = __blkdev_issue_discard(dc->bdev, > + SECTOR_FROM_BLOCK(dc->start), > + SECTOR_FROM_BLOCK(dc->len), > + GFP_NOFS, 0, ); > + if (!err && bio) { > + bio->bi_private = dc; > + bio->bi_end_io = f2fs_submit_discard_endio; > + bio->bi_opf |= REQ_SYNC; > + submit_bio(bio); Should set flag only if __blkdev_issue_discard is successful? dc->state = D_SUBMIT; And how about moving atomic_inc(>submit_discard); here? Thanks, > + } > + dc->state = D_SUBMIT; > + return err; > +} > + > +static int __queue_discard_cmd(struct f2fs_sb_info *sbi, > + struct block_device *bdev, block_t blkstart, block_t blklen) > +{ > + block_t lblkstart = blkstart; > + > + trace_f2fs_issue_discard(bdev, blkstart, blklen); > + > + if (sbi->s_ndevs) { > + int devi = f2fs_target_device_index(sbi, blkstart); > + > + blkstart -= FDEV(devi).start_blk; > + } > + __add_discard_cmd(sbi, bdev, lblkstart, blkstart, blklen); > + wake_up(_I(sbi)->dcc_info->discard_wait_queue); > + return 0; > +} > + > /* This should be covered by global mutex, _i->sentry_lock */ > void f2fs_wait_discard_bio(struct f2fs_sb_info *sbi, block_t blkaddr) > { > @@ -690,8 +735,7 @@ void f2fs_wait_discard_bio(struct f2fs_sb_info *sbi, > block_t blkaddr) > > if (blkaddr == NULL_ADDR) { > if (dc->state == D_PREP) { > - dc->state = D_SUBMIT; > - submit_bio(dc->bio); > +
Re: [RFC 08/11] mm: make ttu's return boolean
Hi John, On Tue, Mar 07, 2017 at 11:13:26PM -0800, John Hubbard wrote: > On 03/01/2017 10:39 PM, Minchan Kim wrote: > >try_to_unmap returns SWAP_SUCCESS or SWAP_FAIL so it's suitable for > >boolean return. This patch changes it. > > Hi Minchan, > > So, up until this patch, I definitely like the cleanup, because as you > observed, the return values didn't need so many different values. However, > at this point, I think you should stop, and keep the SWAP_SUCCESS and > SWAP_FAIL (or maybe even rename them to UNMAP_* or TTU_RESULT_*, to match > their functions' names better), because removing them makes the code > considerably less readable. > > And since this is billed as a cleanup, we care here, even though this is a > minor point. :) > > Bool return values are sometimes perfect, such as when asking a question: > >bool mode_changed = needs_modeset(crtc_state); > > The above is very nice. However, for returning success or failure, bools are > not as nice, because *usually* success == true, except when you use the > errno-based system, in which success == 0 (which would translate to false, > if you mistakenly treated it as a bool). That leads to the reader having to > remember which system is in use, usually with no visual cues to help. I think it's the matter of taste. if (try_to_unmap(xxx)) something else something It's perfectly understandable to me. IOW, if try_to_unmap returns true, it means it did unmap successfully. Otherwise, failed. IMHO, SWAP_SUCCESS or TTU_RESULT_* seems to be an over-engineering. If the user want it, user can do it by introducing right variable name in his context. See below. > > > > [...] > > if (PageSwapCache(p)) { > >@@ -971,7 +971,7 @@ static int hwpoison_user_mappings(struct page *p, > >unsigned long pfn, > > collect_procs(hpage, , flags & MF_ACTION_REQUIRED); > > > > ret = try_to_unmap(hpage, ttu); > >-if (ret != SWAP_SUCCESS) > >+if (!ret) > > pr_err("Memory failure: %#lx: failed to unmap page > > (mapcount=%d)\n", > >pfn, page_mapcount(hpage)); > > > >@@ -986,8 +986,7 @@ static int hwpoison_user_mappings(struct page *p, > >unsigned long pfn, > > * any accesses to the poisoned memory. > > */ > > forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL); > >-kill_procs(, forcekill, trapno, > >- ret != SWAP_SUCCESS, p, pfn, flags); > >+kill_procs(, forcekill, trapno, !ret , p, pfn, flags); > > The kill_procs() invocation was a little more readable before. Indeed but I think it's not a problem of try_to_unmap but ret variable name isn't good any more. How about this? bool unmap_success; unmap_success = try_to_unmap(hpage, ttu); .. kill_procs(, forcekill, trapno, !unmap_success , p, pfn, flags); .. return unmap_success; My point is user can introduce whatever variable name depends on his context. No need to make return variable complicated, IMHO. > > > > [...] > >diff --git a/mm/vmscan.c b/mm/vmscan.c > >index 170c61f..e4b74f1 100644 > >--- a/mm/vmscan.c > >+++ b/mm/vmscan.c > >@@ -966,7 +966,6 @@ static unsigned long shrink_page_list(struct list_head > >*page_list, > > int may_enter_fs; > > enum page_references references = PAGEREF_RECLAIM_CLEAN; > > bool dirty, writeback; > >-int ret = SWAP_SUCCESS; > > > > cond_resched(); > > > >@@ -1139,13 +1138,9 @@ static unsigned long shrink_page_list(struct > >list_head *page_list, > > * processes. Try to unmap it here. > > */ > > if (page_mapped(page)) { > >-switch (ret = try_to_unmap(page, > >-ttu_flags | TTU_BATCH_FLUSH)) { > >-case SWAP_FAIL: > > Again: the SWAP_FAIL makes it crystal clear which case we're in. To me, I don't feel it. To me, below is perfectly understandable. if (try_to_unmap()) do something That's why I think it's matter of taste. Okay, I admit I might be biased, too so I will consider what you suggested if others votes it. Thanks. > > I also wonder if UNMAP_FAIL or TTU_RESULT_FAIL is a better name? > > thanks, > John Hubbard > NVIDIA > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: mailto:"d...@kvack.org;> em...@kvack.org
Re: [PATCH] mm, add_memory_resource: hold device_hotplug lock over mem_hotplug_{begin, done}
On Mon, Mar 6, 2017 at 12:22 AM, Heiko Carstenswrote: > Hello Dan, > >> > If you look at commit 5e33bc4165f3 ("driver core / ACPI: Avoid device hot >> > remove locking issues") then lock_device_hotplug_sysfs() was introduced to >> > avoid a different subtle deadlock, but it also sleeps uninterruptible, but >> > not for more than 5ms ;) >> > >> > However I'm not sure if the device hotplug lock should also be used to fix >> > an unrelated bug that was introduced with the get_online_mems() / >> > put_online_mems() interface. Should it? >> >> No, I don't think it should. >> >> I like your proposed direction of creating a new lock internal to >> mem_hotplug_begin() to protect active_writer, and stop relying on >> lock_device_hotplug to serve this purpose. >> >> > If so, we need to sprinkle around a couple of lock_device_hotplug() calls >> > near mem_hotplug_begin() calls, like Sebastian already started, and give it >> > additional semantics (protecting mem_hotplug.active_writer), and hope it >> > doesn't lead to deadlocks anywhere. >> >> I'll put your proposed patch through some testing. > > On s390 it _seems_ to work. Did it pass your testing too? > If so I would send a patch with proper patch description for inclusion. Looks ok here. No lockdep warnings running it through it paces with the persistent memory use case.
Re: [PATCH] mm, add_memory_resource: hold device_hotplug lock over mem_hotplug_{begin, done}
On Mon, Mar 6, 2017 at 12:22 AM, Heiko Carstens wrote: > Hello Dan, > >> > If you look at commit 5e33bc4165f3 ("driver core / ACPI: Avoid device hot >> > remove locking issues") then lock_device_hotplug_sysfs() was introduced to >> > avoid a different subtle deadlock, but it also sleeps uninterruptible, but >> > not for more than 5ms ;) >> > >> > However I'm not sure if the device hotplug lock should also be used to fix >> > an unrelated bug that was introduced with the get_online_mems() / >> > put_online_mems() interface. Should it? >> >> No, I don't think it should. >> >> I like your proposed direction of creating a new lock internal to >> mem_hotplug_begin() to protect active_writer, and stop relying on >> lock_device_hotplug to serve this purpose. >> >> > If so, we need to sprinkle around a couple of lock_device_hotplug() calls >> > near mem_hotplug_begin() calls, like Sebastian already started, and give it >> > additional semantics (protecting mem_hotplug.active_writer), and hope it >> > doesn't lead to deadlocks anywhere. >> >> I'll put your proposed patch through some testing. > > On s390 it _seems_ to work. Did it pass your testing too? > If so I would send a patch with proper patch description for inclusion. Looks ok here. No lockdep warnings running it through it paces with the persistent memory use case.
Re: Race condition in ext4 (was Re: 4.11-rc1 acpi stomping ext4 slabs)
On 9.03.2017 03:58, Theodore Ts'o wrote: > On Tue, Mar 07, 2017 at 10:40:53PM +0200, Nikolay Borisov wrote: >> So this is wrong, the reason why the issues seemed fix is because I >> switched my compiler to version 5.4.0. So this manifests only if I'm >> using gcc 4.7.4. With the pr_info added here is the output of a boot. So >> there are multiple invocations of ext4_ext_map_blocks and the freeing, >> including with the address being used in subsequent kasan reports : >> 88006ae8fdb0 > > Can you help bisect this, then? I'm using Debian Testing, and the > default gcc is gcc 6.3.0. I'm currently forcing the use of gcc 5.4.1 > because I was running into problems with gcc 6.x a while back. (TBH, > I was thinking about trying to see if gcc 6.3 was stable for kernel > compiles when I had some spare time.) But I don't have access to > *any* gcc 4.x on my development system, and I don't think I've tried > using gcc 4.x in a long, Long, LONG time. > > I'm currently kicking off a test run using 5.4.1 with KASAN enabled to > see if I can trigger it myself. Can you send me a copy of your > .config so I can see what else might be interesting with your config? > (e.g., SLAB vs SLUB, etc.) Attached the config. FUrther debugging and talking with the kasan developers I think this actually might be a kasan problem when used with an old compiler. I bisected this all the way to 1771c6e1a567ea0ba2, which is the commit introducing the user access instrumentation. Here is a mail thread where I confirmed that this might be a kasan issue : https://lkml.org/lkml/2017/3/8/69 What I believe is happening is that the manual checks inserted in user access code misses some context information due to instrumentation not inserted by the compiler. Kasan gets confused as a result, hence the warnings. > > Thanks, > > - Ted > # # Automatically generated file; DO NOT EDIT. # Linux/x86 4.7.0 Kernel Configuration # CONFIG_64BIT=y CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_MMU=y CONFIG_ARCH_MMAP_RND_BITS_MIN=28 CONFIG_ARCH_MMAP_RND_BITS_MAX=32 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16 CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y CONFIG_ARCH_WANT_GENERAL_HUGETLB=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_X86_64_SMP=y CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_DEBUG_RODATA=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="-nbor" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set # CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y # CONFIG_POSIX_MQUEUE is not set CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y # CONFIG_USELIB is not set # CONFIG_AUDIT is not set CONFIG_HAVE_ARCH_AUDITSYSCALL=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y # CONFIG_IRQ_DOMAIN_DEBUG is not set CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set # CONFIG_NO_HZ is not set CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y #
Re: Race condition in ext4 (was Re: 4.11-rc1 acpi stomping ext4 slabs)
On 9.03.2017 03:58, Theodore Ts'o wrote: > On Tue, Mar 07, 2017 at 10:40:53PM +0200, Nikolay Borisov wrote: >> So this is wrong, the reason why the issues seemed fix is because I >> switched my compiler to version 5.4.0. So this manifests only if I'm >> using gcc 4.7.4. With the pr_info added here is the output of a boot. So >> there are multiple invocations of ext4_ext_map_blocks and the freeing, >> including with the address being used in subsequent kasan reports : >> 88006ae8fdb0 > > Can you help bisect this, then? I'm using Debian Testing, and the > default gcc is gcc 6.3.0. I'm currently forcing the use of gcc 5.4.1 > because I was running into problems with gcc 6.x a while back. (TBH, > I was thinking about trying to see if gcc 6.3 was stable for kernel > compiles when I had some spare time.) But I don't have access to > *any* gcc 4.x on my development system, and I don't think I've tried > using gcc 4.x in a long, Long, LONG time. > > I'm currently kicking off a test run using 5.4.1 with KASAN enabled to > see if I can trigger it myself. Can you send me a copy of your > .config so I can see what else might be interesting with your config? > (e.g., SLAB vs SLUB, etc.) Attached the config. FUrther debugging and talking with the kasan developers I think this actually might be a kasan problem when used with an old compiler. I bisected this all the way to 1771c6e1a567ea0ba2, which is the commit introducing the user access instrumentation. Here is a mail thread where I confirmed that this might be a kasan issue : https://lkml.org/lkml/2017/3/8/69 What I believe is happening is that the manual checks inserted in user access code misses some context information due to instrumentation not inserted by the compiler. Kasan gets confused as a result, hence the warnings. > > Thanks, > > - Ted > # # Automatically generated file; DO NOT EDIT. # Linux/x86 4.7.0 Kernel Configuration # CONFIG_64BIT=y CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_MMU=y CONFIG_ARCH_MMAP_RND_BITS_MIN=28 CONFIG_ARCH_MMAP_RND_BITS_MAX=32 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16 CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y CONFIG_ARCH_WANT_GENERAL_HUGETLB=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_X86_64_SMP=y CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_DEBUG_RODATA=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="-nbor" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set # CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y # CONFIG_POSIX_MQUEUE is not set CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y # CONFIG_USELIB is not set # CONFIG_AUDIT is not set CONFIG_HAVE_ARCH_AUDITSYSCALL=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y # CONFIG_IRQ_DOMAIN_DEBUG is not set CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set # CONFIG_NO_HZ is not set CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y #
[PATCH 1/6] mm/migrate: Add new mode parameter to migrate_page_copy() function
From: Zi YanThis is a prerequisite change required to make page migration framewok copy in different modes like the default single threaded or the new multi threaded one yet to be introduced in follow up patches. This does not change any existing functionality. Only migrate_page_copy() and copy_huge_page() function's signatures are affected. Signed-off-by: Zi Yan Signed-off-by: Anshuman Khandual --- * Updated include/linux/migrate_mode.h comment as per Naoya fs/aio.c | 2 +- fs/f2fs/data.c | 2 +- fs/hugetlbfs/inode.c | 2 +- fs/ubifs/file.c | 2 +- include/linux/migrate.h | 6 -- include/linux/migrate_mode.h | 2 ++ mm/migrate.c | 14 -- 7 files changed, 18 insertions(+), 12 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 873b4ca..ba3f6eb 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -418,7 +418,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, * events from being lost. */ spin_lock_irqsave(>completion_lock, flags); - migrate_page_copy(new, old); + migrate_page_copy(new, old, MIGRATE_ST); BUG_ON(ctx->ring_pages[idx] != old); ctx->ring_pages[idx] = new; spin_unlock_irqrestore(>completion_lock, flags); diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index 9ac2625..ad41356 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -1997,7 +1997,7 @@ int f2fs_migrate_page(struct address_space *mapping, SetPagePrivate(newpage); set_page_private(newpage, page_private(page)); - migrate_page_copy(newpage, page); + migrate_page_copy(newpage, page, MIGRATE_ST); return MIGRATEPAGE_SUCCESS; } diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 54de77e..0e16512f 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -850,7 +850,7 @@ static int hugetlbfs_migrate_page(struct address_space *mapping, rc = migrate_huge_page_move_mapping(mapping, newpage, page); if (rc != MIGRATEPAGE_SUCCESS) return rc; - migrate_page_copy(newpage, page); + migrate_page_copy(newpage, page, MIGRATE_ST); return MIGRATEPAGE_SUCCESS; } diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c index b0d7837..293616f 100644 --- a/fs/ubifs/file.c +++ b/fs/ubifs/file.c @@ -1482,7 +1482,7 @@ static int ubifs_migrate_page(struct address_space *mapping, SetPagePrivate(newpage); } - migrate_page_copy(newpage, page); + migrate_page_copy(newpage, page, MIGRATE_ST); return MIGRATEPAGE_SUCCESS; } #endif diff --git a/include/linux/migrate.h b/include/linux/migrate.h index ae8d475..d843b8f 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -42,7 +42,8 @@ extern void putback_movable_page(struct page *page); extern int migrate_prep(void); extern int migrate_prep_local(void); -extern void migrate_page_copy(struct page *newpage, struct page *page); +extern void migrate_page_copy(struct page *newpage, struct page *page, + enum migrate_mode mode); extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); extern int migrate_page_move_mapping(struct address_space *mapping, @@ -61,7 +62,8 @@ static inline int migrate_prep(void) { return -ENOSYS; } static inline int migrate_prep_local(void) { return -ENOSYS; } static inline void migrate_page_copy(struct page *newpage, -struct page *page) {} +struct page *page, +enum migrate_mode mode) {} static inline int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page) diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h index ebf3d89..deaeba5 100644 --- a/include/linux/migrate_mode.h +++ b/include/linux/migrate_mode.h @@ -6,11 +6,13 @@ * on most operations but not ->writepage as the potential stall time * is too significant * MIGRATE_SYNC will block when migrating pages + * MIGRATE_ST will use single thread when migrating pages */ enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC, + MIGRATE_ST }; #endif /* MIGRATE_MODE_H_INCLUDED */ diff --git a/mm/migrate.c b/mm/migrate.c index e8ae11a..5ef4aa4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -631,7 +631,8 @@ static void __copy_gigantic_page(struct page *dst, struct page *src, } } -static void copy_huge_page(struct page *dst, struct page *src) +static void copy_huge_page(struct page *dst, struct page *src, + enum migrate_mode mode) { int i; int nr_pages; @@ -660,12 +661,13 @@
[PATCH 1/6] mm/migrate: Add new mode parameter to migrate_page_copy() function
From: Zi Yan This is a prerequisite change required to make page migration framewok copy in different modes like the default single threaded or the new multi threaded one yet to be introduced in follow up patches. This does not change any existing functionality. Only migrate_page_copy() and copy_huge_page() function's signatures are affected. Signed-off-by: Zi Yan Signed-off-by: Anshuman Khandual --- * Updated include/linux/migrate_mode.h comment as per Naoya fs/aio.c | 2 +- fs/f2fs/data.c | 2 +- fs/hugetlbfs/inode.c | 2 +- fs/ubifs/file.c | 2 +- include/linux/migrate.h | 6 -- include/linux/migrate_mode.h | 2 ++ mm/migrate.c | 14 -- 7 files changed, 18 insertions(+), 12 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 873b4ca..ba3f6eb 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -418,7 +418,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, * events from being lost. */ spin_lock_irqsave(>completion_lock, flags); - migrate_page_copy(new, old); + migrate_page_copy(new, old, MIGRATE_ST); BUG_ON(ctx->ring_pages[idx] != old); ctx->ring_pages[idx] = new; spin_unlock_irqrestore(>completion_lock, flags); diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index 9ac2625..ad41356 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -1997,7 +1997,7 @@ int f2fs_migrate_page(struct address_space *mapping, SetPagePrivate(newpage); set_page_private(newpage, page_private(page)); - migrate_page_copy(newpage, page); + migrate_page_copy(newpage, page, MIGRATE_ST); return MIGRATEPAGE_SUCCESS; } diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 54de77e..0e16512f 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -850,7 +850,7 @@ static int hugetlbfs_migrate_page(struct address_space *mapping, rc = migrate_huge_page_move_mapping(mapping, newpage, page); if (rc != MIGRATEPAGE_SUCCESS) return rc; - migrate_page_copy(newpage, page); + migrate_page_copy(newpage, page, MIGRATE_ST); return MIGRATEPAGE_SUCCESS; } diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c index b0d7837..293616f 100644 --- a/fs/ubifs/file.c +++ b/fs/ubifs/file.c @@ -1482,7 +1482,7 @@ static int ubifs_migrate_page(struct address_space *mapping, SetPagePrivate(newpage); } - migrate_page_copy(newpage, page); + migrate_page_copy(newpage, page, MIGRATE_ST); return MIGRATEPAGE_SUCCESS; } #endif diff --git a/include/linux/migrate.h b/include/linux/migrate.h index ae8d475..d843b8f 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -42,7 +42,8 @@ extern void putback_movable_page(struct page *page); extern int migrate_prep(void); extern int migrate_prep_local(void); -extern void migrate_page_copy(struct page *newpage, struct page *page); +extern void migrate_page_copy(struct page *newpage, struct page *page, + enum migrate_mode mode); extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); extern int migrate_page_move_mapping(struct address_space *mapping, @@ -61,7 +62,8 @@ static inline int migrate_prep(void) { return -ENOSYS; } static inline int migrate_prep_local(void) { return -ENOSYS; } static inline void migrate_page_copy(struct page *newpage, -struct page *page) {} +struct page *page, +enum migrate_mode mode) {} static inline int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page) diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h index ebf3d89..deaeba5 100644 --- a/include/linux/migrate_mode.h +++ b/include/linux/migrate_mode.h @@ -6,11 +6,13 @@ * on most operations but not ->writepage as the potential stall time * is too significant * MIGRATE_SYNC will block when migrating pages + * MIGRATE_ST will use single thread when migrating pages */ enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC, + MIGRATE_ST }; #endif /* MIGRATE_MODE_H_INCLUDED */ diff --git a/mm/migrate.c b/mm/migrate.c index e8ae11a..5ef4aa4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -631,7 +631,8 @@ static void __copy_gigantic_page(struct page *dst, struct page *src, } } -static void copy_huge_page(struct page *dst, struct page *src) +static void copy_huge_page(struct page *dst, struct page *src, + enum migrate_mode mode) { int i; int nr_pages; @@ -660,12 +661,13 @@ static void copy_huge_page(struct page *dst, struct page *src) /* *
[GIT PULL] xen features and fixes for 4.11 rc1
Linus, Please git pull the following tag: git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git for-linus-4.11-rc1-tag features and fixes for 4.11 rc1 It contains one fix for MSIX handling under Xen and a trivial cleanup patch. Thanks. Juergen arch/x86/pci/xen.c | 23 +++ drivers/xen/xenbus/xenbus_dev_frontend.c | 1 - 2 files changed, 7 insertions(+), 17 deletions(-) Dan Streetman (1): xen: do not re-use pirq number cached in pci device msi msg data Masanari Iida (1): xenbus: Remove duplicate inclusion of linux/init.h
[GIT PULL] xen features and fixes for 4.11 rc1
Linus, Please git pull the following tag: git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git for-linus-4.11-rc1-tag features and fixes for 4.11 rc1 It contains one fix for MSIX handling under Xen and a trivial cleanup patch. Thanks. Juergen arch/x86/pci/xen.c | 23 +++ drivers/xen/xenbus/xenbus_dev_frontend.c | 1 - 2 files changed, 7 insertions(+), 17 deletions(-) Dan Streetman (1): xen: do not re-use pirq number cached in pci device msi msg data Masanari Iida (1): xenbus: Remove duplicate inclusion of linux/init.h
Re: [PATCH 3/4] phy: rockchip-typec: support DP phy switch
Hi, On Thu, Mar 09, 2017 at 02:02:54AM +0100, Heiko Stuebner wrote: > Am Mittwoch, 8. März 2017, 16:39:23 CET schrieb Brian Norris: > > On Fri, Feb 10, 2017 at 03:44:13PM +0800, Chris Zhong wrote: > > > There are 2 Type-c PHYs in RK3399, but only one DP controller. Hence > > > only one PHY can connect to DP controller at one time, the other should > > > be disconnected. The GRF_SOC_CON26 register has a switch bit to do it, > > > set this bit means enable PHY 1, clear this bit means enable PHY 0. > > > > > > Signed-off-by: Chris Zhong> > > --- > > > > > > drivers/phy/phy-rockchip-typec.c | 9 + > > > 1 file changed, 9 insertions(+) > > > > > > diff --git a/drivers/phy/phy-rockchip-typec.c > > > b/drivers/phy/phy-rockchip-typec.c index 7cfb0f8..1604aaa 100644 > > > --- a/drivers/phy/phy-rockchip-typec.c > > > +++ b/drivers/phy/phy-rockchip-typec.c ... > > > @@ -869,6 +873,11 @@ static int tcphy_parse_dt(struct rockchip_typec_phy > > > *tcphy,> > > > if (ret) > > > > > > return ret; > > > > > > + ret = tcphy_get_param(dev, >uphy_dp_sel, > > > + "rockchip,uphy-dp-sel"); > > > + if (ret) > > > + return ret; > > > > What about existing device trees? You're essentially adding this > > new property and requiring it at the same time. > > > > Or are we considering no RK3399 DP stable at the moment? I guess we > > haven't actually merged any device trees that support this yet, no? > > An interesting situation we're in here. On the one hand, you're right this > breaks "backwards compatiblity". > > But on the other hand, the type-c phy is currently very much unused. The only > current board rk3399-evb.dts does not enable them (so they're disabled > everywhere) and we have neither dwc3 nor dp nodes in any rk3399 devicetrees > so > far. Also Rob was ok with the binding change :-) . > > So from my pov, I'd say it _should_ be ok, as nothing is using the phys at > all > yet and thus there is nothing that could get broken. Yeah, I guess it's OK... but BTW out-of-tree DTs are perfectly legit, once the bindings are accepted. Another random point of contention (not worth too much, as the pattern is already set), but why do these deserve DT properties at all? The device already has a "rk3399" compatible property, so can't we derive GRF offsets from that? Brian
Re: [PATCH 3/4] phy: rockchip-typec: support DP phy switch
Hi, On Thu, Mar 09, 2017 at 02:02:54AM +0100, Heiko Stuebner wrote: > Am Mittwoch, 8. März 2017, 16:39:23 CET schrieb Brian Norris: > > On Fri, Feb 10, 2017 at 03:44:13PM +0800, Chris Zhong wrote: > > > There are 2 Type-c PHYs in RK3399, but only one DP controller. Hence > > > only one PHY can connect to DP controller at one time, the other should > > > be disconnected. The GRF_SOC_CON26 register has a switch bit to do it, > > > set this bit means enable PHY 1, clear this bit means enable PHY 0. > > > > > > Signed-off-by: Chris Zhong > > > --- > > > > > > drivers/phy/phy-rockchip-typec.c | 9 + > > > 1 file changed, 9 insertions(+) > > > > > > diff --git a/drivers/phy/phy-rockchip-typec.c > > > b/drivers/phy/phy-rockchip-typec.c index 7cfb0f8..1604aaa 100644 > > > --- a/drivers/phy/phy-rockchip-typec.c > > > +++ b/drivers/phy/phy-rockchip-typec.c ... > > > @@ -869,6 +873,11 @@ static int tcphy_parse_dt(struct rockchip_typec_phy > > > *tcphy,> > > > if (ret) > > > > > > return ret; > > > > > > + ret = tcphy_get_param(dev, >uphy_dp_sel, > > > + "rockchip,uphy-dp-sel"); > > > + if (ret) > > > + return ret; > > > > What about existing device trees? You're essentially adding this > > new property and requiring it at the same time. > > > > Or are we considering no RK3399 DP stable at the moment? I guess we > > haven't actually merged any device trees that support this yet, no? > > An interesting situation we're in here. On the one hand, you're right this > breaks "backwards compatiblity". > > But on the other hand, the type-c phy is currently very much unused. The only > current board rk3399-evb.dts does not enable them (so they're disabled > everywhere) and we have neither dwc3 nor dp nodes in any rk3399 devicetrees > so > far. Also Rob was ok with the binding change :-) . > > So from my pov, I'd say it _should_ be ok, as nothing is using the phys at > all > yet and thus there is nothing that could get broken. Yeah, I guess it's OK... but BTW out-of-tree DTs are perfectly legit, once the bindings are accepted. Another random point of contention (not worth too much, as the pattern is already set), but why do these deserve DT properties at all? The device already has a "rk3399" compatible property, so can't we derive GRF offsets from that? Brian
Re: [PATCH net] team: use ETH_MAX_MTU as max mtu
From: Jarod WilsonDate: Mon, 6 Mar 2017 08:48:58 -0500 > This restores the ability to set a team device's mtu to anything higher > than 1500. Similar to the reported issue with bonding, the team driver > calls ether_setup(), which sets an initial max_mtu of 1500, while the > underlying hardware can handle something much larger. Just set it to > ETH_MAX_MTU to support all possible values, and the limitations of the > underlying devices will prevent setting anything too large. > > Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra") > CC: Cong Wang > CC: Jiri Pirko > CC: net...@vger.kernel.org > Signed-off-by: Jarod Wilson Applied and queued up for -stable, thanks.
Re: [PATCH net] team: use ETH_MAX_MTU as max mtu
From: Jarod Wilson Date: Mon, 6 Mar 2017 08:48:58 -0500 > This restores the ability to set a team device's mtu to anything higher > than 1500. Similar to the reported issue with bonding, the team driver > calls ether_setup(), which sets an initial max_mtu of 1500, while the > underlying hardware can handle something much larger. Just set it to > ETH_MAX_MTU to support all possible values, and the limitations of the > underlying devices will prevent setting anything too large. > > Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra") > CC: Cong Wang > CC: Jiri Pirko > CC: net...@vger.kernel.org > Signed-off-by: Jarod Wilson Applied and queued up for -stable, thanks.
Re: [PATCH] net: toshiba: spider_net: use new api ethtool_{get|set}_link_ksettings
From: Philippe ReynesDate: Sun, 5 Mar 2017 23:46:00 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: toshiba: spider_net: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Sun, 5 Mar 2017 23:46:00 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: toshiba: ps3_genic_net: use new api ethtool_{get|set}_link_ksettings
From: Philippe ReynesDate: Sun, 5 Mar 2017 23:21:06 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: [PATCH] net: toshiba: ps3_genic_net: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes Date: Sun, 5 Mar 2017 23:21:06 +0100 > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > As I don't have the hardware, I'd be very pleased if > someone may test this patch. > > Signed-off-by: Philippe Reynes Applied.
Re: blk: improve order of bio handling in generic_make_request()
On Wed, Mar 08 2017, Mikulas Patocka wrote: > On Wed, 8 Mar 2017, NeilBrown wrote: >> >> I don't think this will fix the DM snapshot deadlock by itself. >> Rather, it make it possible for some internal changes to DM to fix it. >> The DM change might be something vaguely like: >> >> diff --git a/drivers/md/dm.c b/drivers/md/dm.c >> index 3086da5664f3..06ee0960e415 100644 >> --- a/drivers/md/dm.c >> +++ b/drivers/md/dm.c >> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct >> clone_info *ci) >> >> len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count); >> >> +if (len < ci->sector_count) { >> +struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set); > > fs_bio_set is a shared bio set, so it is prone to deadlocks. For this > change, we would need two bio sets per dm device, one for the split bio > and one for the outgoing bio. (this also means having one more kernel > thread per dm device) Yes, two local bio_sets would be best. But we don't really need those extra kernel threads. I'll start working on patches to make them optional, and then to start removing them. Thanks, NeilBrown signature.asc Description: PGP signature
Re: blk: improve order of bio handling in generic_make_request()
On Wed, Mar 08 2017, Mikulas Patocka wrote: > On Wed, 8 Mar 2017, NeilBrown wrote: >> >> I don't think this will fix the DM snapshot deadlock by itself. >> Rather, it make it possible for some internal changes to DM to fix it. >> The DM change might be something vaguely like: >> >> diff --git a/drivers/md/dm.c b/drivers/md/dm.c >> index 3086da5664f3..06ee0960e415 100644 >> --- a/drivers/md/dm.c >> +++ b/drivers/md/dm.c >> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct >> clone_info *ci) >> >> len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count); >> >> +if (len < ci->sector_count) { >> +struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set); > > fs_bio_set is a shared bio set, so it is prone to deadlocks. For this > change, we would need two bio sets per dm device, one for the split bio > and one for the outgoing bio. (this also means having one more kernel > thread per dm device) Yes, two local bio_sets would be best. But we don't really need those extra kernel threads. I'll start working on patches to make them optional, and then to start removing them. Thanks, NeilBrown signature.asc Description: PGP signature