On 2014-11-14 15:59, Meelis Roos wrote:
The second oops is in blk_mq_map_queue() which is a trivial
two level cpu lookup.  I wonder if there's something odd about
cpu numbers on these big old sparc systems?

CPU numbers are sparse - they are determined by hardware slot number and
some models only fill every other mainboard slot, and first slots can be
free. I have first board offline and currently have CPUs numbered
10,11,14,15 online.
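
As an illustration only (a user-space sketch, not the blk-mq code), this shows why sparse numbering matters for any table indexed directly by CPU id: the table needs "highest id + 1" slots, not one slot per online CPU. The helpers map_size_for() and build_map() are hypothetical names made up for this example, and the single-queue mapping to 0 is an assumption taken from the discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: a table indexed by CPU id needs (highest id + 1)
 * slots, which can be far more than the number of online CPUs.
 */
static size_t map_size_for(const unsigned int *cpu_ids, size_t n)
{
	size_t i, size = 0;

	for (i = 0; i < n; i++)
		if ((size_t)cpu_ids[i] + 1 > size)
			size = (size_t)cpu_ids[i] + 1;
	return size;
}

/*
 * Hypothetical helper: build a cpu -> queue table. calloc() zeroes every
 * slot, so the holes left by absent CPU ids also map to queue 0 (assumed
 * single hw queue). Returns NULL on allocation failure; caller frees.
 */
static unsigned int *build_map(const unsigned int *cpu_ids, size_t n,
			       size_t *size_out)
{
	size_t size = map_size_for(cpu_ids, n);

	assert(size_out != NULL);
	*size_out = size;
	return calloc(size ? size : 1, sizeof(unsigned int));
}
```

With the ids 10, 11, 14 and 15, map_size_for() asks for 16 slots; a table sized by the count of online CPUs (4) would be indexed out of bounds for every one of them.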

Here is debug with Jens's patch:

[  133.971050] CPU 11: synchronized TICK with master CPU (last diff -1 cycles, maxerr 516 cycles)
[  133.975491] CPU 14: synchronized TICK with master CPU (last diff -3 cycles, maxerr 531 cycles)
[  133.979943] CPU 15: synchronized TICK with master CPU (last diff -3 cycles, maxerr 531 cycles)
[  133.980146] Brought up 4 CPUs

So this looks like this might be the issue. On a scsi-mq disabled boot,
you have 4 CPUs, but how are they numbered?

The numbers are always the same.

I would hope so, my question was really about which CPU numbers you see. But I guess that's 10, 11, 14, and 15?

But everything seems to be mapped to queue 0?

As it should, scsi-mq only supports a single hw queue for now.

We might need Christoph's debug patch on top of this to fully know...

Applied it too, dmesg is below. Yes it does spam the log a lot, and over
a 9600bps console it's somewhat slow :)

There is another detail to note - this server contains a faulty disk as
sdc that times out on spinup. I left it in the server because it helped to
pinpoint and fix a previous error in the esp scsi driver. This can be a
factor here too - the error handling details.

It could be. So we have tons of mappings from CPU10 to queue 0, but then we see this:

[  256.236742] cpu: 10
[  256.236749] queue: 809119744

and it turns to crap. This is pretty weird. Try with this debug patch - get rid of the other ones first. It should reduce your noise level too.
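
For illustration, a bounds-checked version of the two-level lookup could flag a value like 809119744 instead of using it as an array index. This is a user-space sketch under stated assumptions, not the kernel code: struct cpu_map and lookup_queue() are hypothetical names, and the sketch says nothing about why the index went bad in the first place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the cpu -> hw queue index table. */
struct cpu_map {
	const unsigned int *map;	/* indexed by CPU id */
	size_t nr_cpu_slots;		/* number of entries in map */
	unsigned int nr_queues;		/* number of valid hw queues */
};

/*
 * Return the queue index for cpu, or -1 if either level of the lookup
 * would go out of range (CPU id beyond the table, or a stored index
 * that does not name an existing queue).
 */
static int lookup_queue(const struct cpu_map *m, unsigned int cpu)
{
	assert(m != NULL && m->map != NULL);

	if (cpu >= m->nr_cpu_slots)
		return -1;
	if (m->map[cpu] >= m->nr_queues)
		return -1;
	return (int)m->map[cpu];
}
```

With a single hw queue, any stored index other than 0 is rejected, which is the same condition the debug patch below reports once.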

--
Jens Axboe

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 1065d7c65fa1..9200e2aee746 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -81,6 +81,9 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
 			map[i] = map[first_sibling];
 	}
 
+	for (i = 0; i < queue; i++)
+		printk(KERN_ERR "cpumap %d -> %d\n", i, map[i]);
+
 	free_cpumask_var(cpus);
 	return 0;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 68929bad9a6a..1678da3505ea 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1265,12 +1265,25 @@ run_queue:
 	blk_mq_put_ctx(data.ctx);
 }
 
+static int did_warn;
+
 /*
  * Default mapping to a software queue, since we use one per CPU.
  */
 struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q, const int cpu)
 {
-	return q->queue_hw_ctx[q->mq_map[cpu]];
+	int i;
+
+	i = q->mq_map[cpu];
+	if (!i || did_warn)
+		return q->queue_hw_ctx[0];
+
+	printk(KERN_ERR "blk-mq: cpu %u got queue %u\n", cpu, i);
+	for_each_online_cpu(i)
+		printk(KERN_ERR "  cpu%d -> queue index %u\n", i, q->mq_map[i]);
+
+	did_warn = 1;
+	return q->queue_hw_ctx[0];
 }
 EXPORT_SYMBOL(blk_mq_map_queue);
 
