Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-08-11 Thread Evgenii Shatokhin

On 11.08.2016 05:10, Bob Liu wrote:


On 08/10/2016 10:54 PM, Evgenii Shatokhin wrote:

On 10.08.2016 15:49, Bob Liu wrote:


On 08/10/2016 08:33 PM, Evgenii Shatokhin wrote:

On 14.07.2016 15:04, Bob Liu wrote:


On 07/14/2016 07:49 PM, Evgenii Shatokhin wrote:

On 11.07.2016 15:04, Bob Liu wrote:



On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:

On 06.06.2016 11:42, Dario Faggioli wrote:

Just Cc-ing some Linux, block, and Xen on CentOS people...



Ping.

Any suggestions on how to debug this, or on what might cause the problem?

Obviously, we cannot control Xen on Amazon's servers. But perhaps there is
something we can do on the kernel side?


On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:

(Resending this bug report because the message I sent last week did not
make it to the mailing list somehow.)

Hi,

One of our users gets kernel panics from time to time when he tries to
use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
happens within minutes from the moment the instance starts. The problem
does not show up every time, however.

The user first observed the problem with a custom kernel, but it was
found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
CentOS7 was affected as well.


Please try this patch:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc

Regards,
Bob



Unfortunately, it did not help. The same BUG_ON() in blkfront_setup_indirect() 
still triggers in our kernel based on RHEL's 3.10.0-327.18.2, where I added the 
patch.

As far as I can see, the patch makes sure the indirect pages are added to the list 
only if (!info->feature_persistent) holds. I suppose it holds in our case and 
the pages are added to the list because the triggered BUG_ON() is here:

   if (!info->feature_persistent && info->max_indirect_segments) {
   <...>
   BUG_ON(!list_empty(&info->indirect_pages));
   <...>
   }



That's odd.
Could you please try to reproduce this issue with a recent upstream kernel?

Thanks,
Bob


No luck with the upstream kernel 4.7.0 so far due to unrelated issues (bad 
initrd, I suppose, so the system does not even boot).

However, the problem was reproduced with the stable upstream kernel 3.14.74. After
the system booted the second time with this kernel, that BUG_ON triggered:
   kernel BUG at drivers/block/xen-blkfront.c:1701



Could you please provide more detail on how to reproduce this bug? I'd like to 
have a test.

Thanks!
Bob


As the user says, he uses an Amazon EC2 instance. Namely: HVM CentOS7 AMI on a 
c3.large instance with EBS magnetic storage.



Oh, then it would be difficult to debug this issue.
xen-blkfront communicates with xen-blkback (in dom0 or a driver domain), but
that part is a black box on Amazon EC2.
We can't see the source code of the backend side!


Yes, and another problem is, I am still unable to reproduce the issue in 
my EC2 instance. However, the problem shows up rather often in the 
user's instance.




Can this bug be reproduced in your own environment (Xen + dom0)?


I haven't tried this yet.




At least 2 LVM partitions are needed:
* /, 20-30 GB should be enough, ext4
* /vz, 5-10 GB should be enough, ext4

Kernel 3.14.74 I was talking about: 
https://www.dropbox.com/s/bhus3mubza87z86/kernel-3.14.74-1.test.x86_64.rpm?dl=1

Not sure if it is relevant, but the user may have installed additional packages 
from https://download.openvz.org/virtuozzo/releases/7.0-rtm/x86_64/os/ 
repository. Namely: vzctl, vzmigrate, vzprocps, vztt-lib, vzctcalc, ploop, 
prlctl, centos-7-x86_64-ez.

After the kernel and the other mentioned packages have been installed,
the user rebooted the instance to run that kernel 3.14.74.

Then start the instance, wait 5 minutes, stop the instance, and repeat. 2-20 such
iterations were usually enough to reproduce the problem. This can be automated
with the help of Amazon's API.
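The start/stop cycle above can be scripted with the AWS CLI; here is a sketch. The instance ID is a placeholder, and dry-run mode is on by default, so the script only prints the commands it would run instead of calling AWS (clear DRY_RUN and raise ITERATIONS for a real run).

```shell
#!/bin/sh
# Sketch of the start/stop reproduction loop using the AWS CLI.
# INSTANCE_ID is a placeholder; DRY_RUN is on by default so the
# script only prints commands instead of invoking AWS.
INSTANCE_ID="${INSTANCE_ID:-i-0123456789abcdef0}"
ITERATIONS="${ITERATIONS:-2}"   # the report suggests up to ~20
DRY_RUN="${DRY_RUN:-1}"         # set DRY_RUN= (empty) to really run

run() {
    if [ -n "$DRY_RUN" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

i=1
while [ "$i" -le "$ITERATIONS" ]; do
    echo "iteration $i"
    run aws ec2 start-instances --instance-ids "$INSTANCE_ID"
    run aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"
    run sleep 300    # give the guest ~5 minutes to boot and hit the bug
    run aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
    run aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"
    i=$((i + 1))
done
```

Since the BUG_ON fires early in boot, checking the instance's console output (`aws ec2 get-console-output`) after each stop would be the natural way to detect a hit.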

BTW, before the BUG_ON triggered this time, there was the following in dmesg. 
Not sure if it is related but still:



Attaching the full dmesg would be better.


Well, there is not much in the part the user was able to retrieve 
besides what I have sent and the BUG_ON() splat. But here it is, anyway.


Regards,
Evgenii



Regards,
Bob


--
[2.835034] scsi0 : ata_piix
[2.840317] scsi1 : ata_piix
[2.842267] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc100 irq 14
[2.845861] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc108 irq 15
[2.853840] AVX version of gcm_enc/dec engaged.
[2.859963] xen_netfront: Initialising Xen virtual ethernet driver
[2.867156] alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
[2.885861] blkfront: xvda: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
[2.889046] alg: No test for crc32 (crc32-pclmul)
[2.899290]  xvda: xvda1
[2.997751] blkfront: xvdc: flush diskcache: 


Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-08-10 Thread Evgenii Shatokhin

BTW, before the BUG_ON triggered this time, there was the following in
dmesg. Not sure if it is related, but still:
--
[2.835034] scsi0 : ata_piix
[2.840317] scsi1 : ata_piix
[2.842267] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc100 irq 14
[2.845861] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc108 irq 15
[2.853840] AVX version of gcm_enc/dec engaged.
[2.859963] xen_netfront: Initialising Xen virtual ethernet driver
[2.867156] alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
[2.885861] blkfront: xvda: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
[2.889046] alg: No test for crc32 (crc32-pclmul)
[2.899290]  xvda: xvda1
[2.997751] blkfront: xvdc: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
[3.007401]  xvdc: unknown partition table
[3.010465] Setting capacity to 31992832
[3.012922] xvdc: detected capacity change from 0 to 16380329984
[3.017408] blkfront: xvdd: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
[3.023861]  xvdd: unknown partition table
[3.026481] Setting capacity to 31992832
[3.029051] xvdd: detected capacity change from 0 to 16380329984
[3.033320] blkfront: xvdf: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
[3.040712] random: nonblocking pool is initialized
[3.057432]  xvdf: unknown partition table
[3.060807] Setting capacity to 41943040
[3.063194] xvdf: detected


Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-14 Thread Evgenii Shatokhin

On 14.07.2016 15:04, Bob Liu wrote:
That's odd.
Could you please try to reproduce this issue with a recent upstream kernel?


Yes, will try to.



Thanks,
Bob


So the problem is still out there somewhere, it seems.

Regards,
Evgenii



The part of the system log he was able to retrieve is attached. Here is
the bug info, for convenience:


[2.246912] kernel BUG at drivers/block/xen-blkfront.c:1711!
[2.246912] invalid opcode:  [#1] SMP
[2.246912] Modules linked in: ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel xen_netfront xen_blkfront(+) aesni_intel lrw ata_piix gf128mul glue_helper ablk_helper cryptd libata serio_raw floppy sunrpc dm_mirror dm_region_hash dm_log dm_mod scsi_transport_iscsi
[2.246912] CPU: 1 PID: 50 Comm: xenwatch Not tainted 3.10.0-327.18.2.el7.x86_64 #1
[2.246912] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
[2.246912] task: 8800e9fcb980 ti: 8800e98bc000 task.ti: 8800e98bc000
[2.246912] RIP: 0010:[]  [] blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
[2.246912] RSP: 0018:8800e98bfcd0  EFLAGS: 00010283
[2.246912] RAX: 8800353e15c0 RBX: 8800e98c52c8 RCX: 0020
[2.246912] RDX: 8800353e15b0 RSI: 8800e98c52b8 RDI: 8800353e15d0
[2.246912] RBP: 8800e98bfd20 R08: 8800353e15b0 R09: 8800eb403c00
[2.246912] R10: a0155532 R11: ffe8 R12: 8800e98c4000
[2.246912] R13: 8800e98c52b8 R14: 0020 R15: 8800353e15c0
[2.246912] FS:  () GS:8800efc2() knlGS:
[2.246912] CS:  0010 DS:  ES:  CR0: 80050033
[2.246912] CR2: 7f1b615ef000 CR3: e2b44000 CR4: 001406e0
[2.246912] DR0:  DR1:  DR2: 
[2.246912] DR3:  DR6: 0ff0 DR7: 0400
[2.246912] Stack:
[2.246912]  0020 0001 0020a0157217 0100e98bfdbc
[2.246912]  27efa3ef 8800e98bfdbc 8800e98ce000 8800e98c4000
[2.246912]  8800e98ce040 0001 8800e98bfe08 a0155d4c
[2.246912] Call Trace:
[2.246912]  [] blkback_changed+0x4ec/0xfc8 [xen_blkfront]
[2.246912]  [] ? xenbus_gather+0x170/0x190
[2.246912]  [] ? __slab_free+0x10e/0x277
[2.246912]  [] xenbus_otherend_changed+0xad/0x110
[2.246912]  [] ? xenwatch_thread+0x77/0x180
[2.246912]  [] backend_changed+0x13/0x20
[2.246912]  [] xenwatch_thread+0x66/0x180
[2.246912]  [] ? wake_up_atomic_t+0x30/0x30
[2.246912]  [] ? unregister_xenbus_watch+0x1f0/0x1f0
[2.246912]  [] kthread+0xcf/0xe0
[2.246912]  [] ? kthread_create_on_node+0x140/0x140
[2.246912]  [] ret_from_fork+0x58/0x90
[2.246912]  [] ? kthread_create_on_node+0x140/0x140
[2.246912] Code: e1 48 85 c0 75 ce 49 8d 84 24 40 01 00 00 48 89 45 b8 e9 91 fd ff ff 4c 89 ff e8 8d ae 06 e1 e9 f2 fc ff ff 31 c0 e9 2e fe ff ff <0f> 0b e8 9a 57 f2 e0 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[2.246912] RIP  [] blkfront_setup_indirect+0x41f/0x430 [xen_blkf


Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-14 Thread Evgenii Shatokhin

On 11.07.2016 15:04, Bob Liu wrote:



On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:

On 06.06.2016 11:42, Dario Faggioli wrote:

Just Cc-ing some Linux, block, and Xen on CentOS people...



Ping.

Any suggestions on how to debug this, or what might cause the problem?

Obviously, we cannot control Xen on Amazon's servers. But perhaps there is
something we can do on the kernel side, isn't there?


On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:

(Resending this bug report because the message I sent last week did
not
make it to the mailing list somehow.)

Hi,

One of our users gets kernel panics from time to time when he tries
to
use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
happens within minutes from the moment the instance starts. The
problem
does not show up every time, however.

The user first observed the problem with a custom kernel, but it was
found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
CentOS7 was affected as well.


Please try this patch:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc

Regards,
Bob



Unfortunately, it did not help. The same BUG_ON() in 
blkfront_setup_indirect() still triggers in our kernel based on RHEL's 
3.10.0-327.18.2, where I added the patch.


As far as I can see, the patch makes sure the indirect pages are added 
to the list only if (!info->feature_persistent) holds. I suppose that 
condition holds in our case and the pages do get added to the list, 
because the BUG_ON() that triggers is this one:


if (!info->feature_persistent && info->max_indirect_segments) {
	<...>
	BUG_ON(!list_empty(&info->indirect_pages));
	<...>
}

So the problem is still out there somewhere, it seems.

Regards,
Evgenii




Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-11 Thread Evgenii Shatokhin

On 11.07.2016 13:37, George Dunlap wrote:

On Mon, Jul 11, 2016 at 9:50 AM, Evgenii Shatokhin
 wrote:

On 06.06.2016 11:42, Dario Faggioli wrote:


Just Cc-ing some Linux, block, and Xen on CentOS people...



Ping.

Any suggestions on how to debug this, or what might cause the problem?

Obviously, we cannot control Xen on Amazon's servers. But perhaps there
is something we can do on the kernel side, isn't there?


I think part of the problem is that your report has confused people so
that everyone cc'd thinks it's someone else's problem. :-)

To wit..


One of our users gets kernel panics from time to time when he tries
to
use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
happens within minutes from the moment the instance starts. The
problem
does not show up every time, however.

The user first observed the problem with a custom kernel, but it was
found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
CentOS7 was affected as well.


...by mentioning the exact CentOS kernel version, but not the version
of the "custom kernel" you used, I suspect the people familiar with
netfront filtered this out as something to be taken care of by the
CentOS / RHEL system.


The custom kernel is based on the same RHEL kernel 
3.10.0-327.18.2.el7.x86_64 as the one in CentOS 7. We did not change 
the Xen-related parts.




If you can reproduce this with a relatively recent stock kernel,
please post the kernel version and the debug information.

If you can't, then it's likely to be an issue that RH needs to take
care of by backporting whatever change fixed the issue.


Thanks for the suggestion.

Yes, if the patch Bob Liu has proposed does not help, I will try to 
reproduce the problem on a recent mainline kernel.


The difficulty is that the problem is rather hard to reproduce, but that 
is another story.




Thanks,
  -George



Regards,
Evgenii

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-11 Thread Evgenii Shatokhin

On 11.07.2016 15:04, Bob Liu wrote:



On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:

On 06.06.2016 11:42, Dario Faggioli wrote:

Just Cc-ing some Linux, block, and Xen on CentOS people...



Ping.

Any suggestions on how to debug this, or what might cause the problem?

Obviously, we cannot control Xen on Amazon's servers. But perhaps there is 
something we can do on the kernel side, isn't there?


On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:

(Resending this bug report because the message I sent last week did
not
make it to the mailing list somehow.)

Hi,

One of our users gets kernel panics from time to time when he tries
to
use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
happens within minutes from the moment the instance starts. The
problem
does not show up every time, however.

The user first observed the problem with a custom kernel, but it was
found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
CentOS7 was affected as well.


Please try this patch:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc


Thanks! I have rebuilt our kernel (based on RHEL's 
3.10.0-327.18.2.el7.x86_64) with that patch added and asked that user to 
try it. Let us see if it helps.


Regards,
Evgenii



Regards,
Bob




Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-11 Thread Bob Liu


On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:
> On 06.06.2016 11:42, Dario Faggioli wrote:
>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>
> 
> Ping.
> 
> Any suggestions on how to debug this, or what might cause the problem?
> 
> Obviously, we cannot control Xen on Amazon's servers. But perhaps there 
> is something we can do on the kernel side, isn't there?
> 
>> On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
>>> (Resending this bug report because the message I sent last week did
>>> not
>>> make it to the mailing list somehow.)
>>>
>>> Hi,
>>>
>>> One of our users gets kernel panics from time to time when he tries
>>> to
>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
>>> happens within minutes from the moment the instance starts. The
>>> problem
>>> does not show up every time, however.
>>>
>>> The user first observed the problem with a custom kernel, but it was
>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>> CentOS7 was affected as well.

Please try this patch:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc

Regards,
Bob


Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-11 Thread George Dunlap
On Mon, Jul 11, 2016 at 9:50 AM, Evgenii Shatokhin
 wrote:
> On 06.06.2016 11:42, Dario Faggioli wrote:
>>
>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>
>
> Ping.
>
> Any suggestions on how to debug this, or what might cause the problem?
>
> Obviously, we cannot control Xen on Amazon's servers. But perhaps there
> is something we can do on the kernel side, isn't there?

I think part of the problem is that your report has confused people so
that everyone cc'd thinks it's someone else's problem. :-)

To wit..

>>> One of our users gets kernel panics from time to time when he tries
>>> to
>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
>>> happens within minutes from the moment the instance starts. The
>>> problem
>>> does not show up every time, however.
>>>
>>> The user first observed the problem with a custom kernel, but it was
>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>> CentOS7 was affected as well.

...by mentioning the exact CentOS kernel version, but not the version
of the "custom kernel" you used, I suspect the people familiar with
netfront filtered this out as something to be taken care of by the
CentOS / RHEL system.

If you can reproduce this with a relatively recent stock kernel,
please post the kernel version and the debug information.

If you can't, then it's likely to be an issue that RH needs to take
care of by backporting whatever change fixed the issue.

Thanks,
 -George



Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-07-11 Thread Evgenii Shatokhin

On 06.06.2016 11:42, Dario Faggioli wrote:

Just Cc-ing some Linux, block, and Xen on CentOS people...



Ping.

Any suggestions on how to debug this, or what might cause the problem?

Obviously, we cannot control Xen on Amazon's servers. But perhaps 
there is something we can do on the kernel side, isn't there?



On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:

(Resending this bug report because the message I sent last week did
not
make it to the mailing list somehow.)

Hi,

One of our users gets kernel panics from time to time when he tries
to
use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic
happens within minutes from the moment the instance starts. The
problem
does not show up every time, however.

The user first observed the problem with a custom kernel, but it was
found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
CentOS7 was affected as well.

The part of the system log he was able to retrieve is attached. Here
is
the bug info, for convenience:


[2.246912] kernel BUG at drivers/block/xen-blkfront.c:1711!
[2.246912] invalid opcode:  [#1] SMP
[2.246912] Modules linked in: ata_generic pata_acpi
crct10dif_pclmul
crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel
xen_netfront xen_blkfront(+) aesni_intel lrw ata_piix gf128mul
glue_helper ablk_helper cryptd libata serio_raw floppy sunrpc
dm_mirror
dm_region_hash dm_log dm_mod scsi_transport_iscsi
[2.246912] CPU: 1 PID: 50 Comm: xenwatch Not tainted
3.10.0-327.18.2.el7.x86_64 #1
[2.246912] Hardware name: Xen HVM domU, BIOS 4.2.amazon
12/07/2015
[2.246912] task: 8800e9fcb980 ti: 8800e98bc000 task.ti:
8800e98bc000
[2.246912] RIP: 0010:[]  []
blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
[2.246912] RSP: 0018:8800e98bfcd0  EFLAGS: 00010283
[2.246912] RAX: 8800353e15c0 RBX: 8800e98c52c8 RCX:
0020
[2.246912] RDX: 8800353e15b0 RSI: 8800e98c52b8 RDI:
8800353e15d0
[2.246912] RBP: 8800e98bfd20 R08: 8800353e15b0 R09:
8800eb403c00
[2.246912] R10: a0155532 R11: ffe8 R12:
8800e98c4000
[2.246912] R13: 8800e98c52b8 R14: 0020 R15:
8800353e15c0
[2.246912] FS:  () GS:8800efc2()
knlGS:
[2.246912] CS:  0010 DS:  ES:  CR0: 80050033
[2.246912] CR2: 7f1b615ef000 CR3: e2b44000 CR4:
001406e0
[2.246912] DR0:  DR1:  DR2:

[2.246912] DR3:  DR6: 0ff0 DR7:
0400
[2.246912] Stack:
[2.246912]  0020 0001 0020a0157217
0100e98bfdbc
[2.246912]  27efa3ef 8800e98bfdbc 8800e98ce000
8800e98c4000
[2.246912]  8800e98ce040 0001 8800e98bfe08
a0155d4c
[2.246912] Call Trace:
[2.246912]  [] blkback_changed+0x4ec/0xfc8
[xen_blkfront]
[2.246912]  [] ? xenbus_gather+0x170/0x190
[2.246912]  [] ? __slab_free+0x10e/0x277
[2.246912]  []
xenbus_otherend_changed+0xad/0x110
[2.246912]  [] ? xenwatch_thread+0x77/0x180
[2.246912]  [] backend_changed+0x13/0x20
[2.246912]  [] xenwatch_thread+0x66/0x180
[2.246912]  [] ? wake_up_atomic_t+0x30/0x30
[2.246912]  [] ?
unregister_xenbus_watch+0x1f0/0x1f0
[2.246912]  [] kthread+0xcf/0xe0
[2.246912]  [] ?
kthread_create_on_node+0x140/0x140
[2.246912]  [] ret_from_fork+0x58/0x90
[2.246912]  [] ?
kthread_create_on_node+0x140/0x140
[2.246912] Code: e1 48 85 c0 75 ce 49 8d 84 24 40 01 00 00 48 89
45
b8 e9 91 fd ff ff 4c 89 ff e8 8d ae 06 e1 e9 f2 fc ff ff 31 c0 e9 2e
fe
ff ff <0f> 0b e8 9a 57 f2 e0 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44
00
[2.246912] RIP  []
blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
[2.246912]  RSP 
[2.491574] ---[ end trace 8a9b992812627c71 ]---
[2.495618] Kernel panic - not syncing: Fatal exception


Xen version 4.2.

EC2 instance type: c3.large with EBS magnetic storage, if that
matters.

Here is the code where the BUG_ON triggers (drivers/block/xen-blkfront.c):

if (!info->feature_persistent && info->max_indirect_segments) {
	/*
	 * We are using indirect descriptors but not persistent
	 * grants, we need to allocate a set of pages that can be
	 * used for mapping indirect grefs
	 */
	int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE;

	BUG_ON(!list_empty(&info->indirect_pages)); /* << This one hits. */
	for (i = 0; i < num; i++) {
		struct page *indirect_page = alloc_page(GFP_NOIO);
		if (!indirect_page)
			goto out_of_memory;
		list_add(&indirect_page->lru, &info->indirect_pages);
	}
}


As we checked, 'info->indirect_pages' list indeed contained ar

Re: [Xen-devel] [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711

2016-06-06 Thread Dario Faggioli
Just Cc-ing some Linux, block, and Xen on CentOS people...

On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
> (Resending this bug report because the message I sent last week did
> not 
> make it to the mailing list somehow.)
> 
> Hi,
> 
> One of our users gets kernel panics from time to time when he tries
> to 
> use his Amazon EC2 instance with CentOS7 x64 in it [1]. Kernel panic 
> happens within minutes from the moment the instance starts. The
> problem 
> does not show up every time, however.
> 
> The user first observed the problem with a custom kernel, but it was 
> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from 
> CentOS7 was affected as well.