Re: Kernel version numbers after 4.9.255 and 4.4.255
David Laight wrote...
> A full wrap might catch checks for less than (say) 4.4.2 which
> might be present to avoid very early versions.
> So sticking at 255 or wrapping onto (say) 128 to 255 might be better.

Hitting such version checks still might happen, though. Also, any wrapping introduces a real risk that package managers will see version numbers running backwards and will therefore refrain from installing an actually newer version.

For scripts/package/builddeb (I don't use that, though), you could work around this by setting an epoch, i.e. (untested)

-$sourcename ($packageversion) $distribution; urgency=low
+$sourcename (1:$packageversion) $distribution; urgency=low

but every packaging mechanism in-tree and outside would have to adopt such a change, if that is even possible. Which is why this feels bad.

Possibly I am missing something: What's the reason not to use EXTRAVERSION as back in the old 2.6.x.y days, i.e. change to 4.4.255.1 and so on? Well, unless there are still installations that treat 4.4.255 as 2.6.64.255.

Christoph
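The wrap problem can be made concrete with a small sketch. It assumes the classic encoding of the KERNEL_VERSION(a, b, c) macro, (a << 16) + (b << 8) + c, modelled here in Python. The function names and the tuple comparison standing in for a package manager are illustrative assumptions, not taken from any real tool.

```python
def kernel_version(a, b, c):
    """Model of the classic KERNEL_VERSION(a, b, c) macro: 8 bits per field."""
    return (a << 16) + (b << 8) + c


def kernel_version_clamped(a, b, c):
    """Hypothetical variant that sticks at 255 instead of overflowing."""
    return (a << 16) + (b << 8) + min(c, 255)


# A sublevel of 256 no longer fits into 8 bits and collides with the next minor.
assert kernel_version(4, 4, 256) == kernel_version(4, 5, 0)

# Sticking at 255 avoids the collision, but 4.4.256 and 4.4.255 become
# indistinguishable for code that only looks at the version code.
assert kernel_version_clamped(4, 4, 256) == kernel_version(4, 4, 255)

# Simplified package-manager view: plain component-wise comparison.
wrapped = (4, 4, 0)       # sublevel wrapped around after 255
extra = (4, 4, 255, 1)    # EXTRAVERSION-style fourth component

assert wrapped < (4, 4, 255)   # the newer release looks older
assert extra > (4, 4, 255)     # a fourth component keeps the ordering intact
```

Real package managers use more elaborate comparison rules than a tuple comparison, so this only shows the direction of the problem, not the behaviour of any particular tool.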
Re: [PATCH 4.20 41/65] Revert "powerpc/tm: Unset MSR[TS] if not recheckpointing"
Greg Kroah-Hartman wrote...
> 4.20-stable review patch. If anyone has any objections, please let me know.
>
> --
>
> From: Greg Kroah-Hartman
>
> This reverts commit d412deb85a4aada382352a8202beb7af8921cd53 which is
> commit 6f5b9f018f4c7686fd944d920209d1382d320e4e upstream.
>
> It breaks the powerpc build, so drop it from the tree until a fix goes
> upstream.

Is this necessary on 4.20? The build failures I reported were on 4.19 only. The 4.20.2-rc1 kernel for my Powermac G5 builds with and without that patch, both boot fine, no visible differences.

Again, however, Breno is authoritative here.

As an aside, I also checked 4.19.15-rc1: it builds and runs without any noticeable problems.

Christoph
Re: [PATCH 4.19 000/170] 4.19.14-stable review
Greg Kroah-Hartman wrote...
> Ok, Breno and Christoph, what should I do here? Should I drop this
> patch in the tree or add this new one? Ideally I can fix this soon as I
> don't like having broken trees out there...

Whatever Breno says - I don't feel competent enough to decide what's right here.

Christoph
Re: [PATCH 4.19 000/170] 4.19.14-stable review
Christoph Biedl wrote...
> Sorry for not getting back to you earlier. Building yesterday's
> release (v4.19.14) *failed*, bisect led to
>
> | commit a9935a12768851762089fda8e5a9daaf0231808e (HEAD)
> | Author: Breno Leitao
> | Date: Mon Nov 26 18:12:00 2018 -0200
> |
> | powerpc/tm: Unset MSR[TS] if not recheckpointing
>
> Reverting that commit seems to be sufficient, build passes then.
>
> Additionally, neither 4.20 nor 5.0-rc1 show this problem. The
> | commit 6f5b9f018f4c7686fd944d920209d1382d320e4e upstream.
> builds as well, so I'll try to find the missing prerequisite next.

Cherry-picking

| commit 5c784c8414fba11b62e12439f11e109fb5751f38
| Author: Breno Leitao
| Date: Thu Aug 16 14:21:07 2018 -0300
|
| powerpc/tm: Remove msr_tm_active()

makes the build pass. Breno, does this make sense?

Christoph
Re: [PATCH 4.19 000/170] 4.19.14-stable review
Greg Kroah-Hartman wrote...
> I dropped e1c3743e1a20 ("powerpc/tm: Set MSR[TS] just prior to recheckpoint")
> from the stable trees, which is what I was told was the commit that was
> causing the problems by Christoph and Breno (on to: now).
>
> Was that not the offending commit? If so, what one was?
>
> totally confused,

Sorry for not getting back to you earlier. Building yesterday's release (v4.19.14) *failed*, bisect led to

| commit a9935a12768851762089fda8e5a9daaf0231808e (HEAD)
| Author: Breno Leitao
| Date: Mon Nov 26 18:12:00 2018 -0200
|
| powerpc/tm: Unset MSR[TS] if not recheckpointing

Reverting that commit seems to be sufficient, build passes then.

Additionally, neither 4.20 nor 5.0-rc1 show this problem. The

| commit 6f5b9f018f4c7686fd944d920209d1382d320e4e upstream.

builds as well, so I'll try to find the missing prerequisite next.

An example .config is attached.

Christoph

#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 4.19.14 Kernel Configuration
#

#
# Compiler: powerpc64-linux-gnu-gcc (Debian 8.2.0-11) 8.2.0
#
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=80200
CONFIG_CLANG_VERSION=0
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_XZ=y
# CONFIG_KERNEL_GZIP is not set
CONFIG_KERNEL_XZ=y
CONFIG_DEFAULT_HOSTNAME=""
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_USELIB is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_SHOW_LEVEL=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_IRQ_DOMAIN=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_ARCH_HAS_TICK_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
# CONFIG_MEMCG_SWAP is not set
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_CGROUP_PIDS=y
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_SCHED_AUTOGROUP=y
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
# CONFIG_RD_BZIP2 is not set
# CONFIG_RD_LZMA is not set
CONFIG_RD_XZ=y
# CONFIG_RD_LZO is not set
# CONFIG_RD_LZ4 is not set
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y
# CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_BPF=y
CONFIG_EXPERT=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_PRINTK_NMI=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_BPF_SYSCALL=y
# CONFIG_USERFAULTFD is not set
CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS=y
CONFIG_RSEQ=y
# CONFIG_DEBUG_RSEQ is not set
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
CONFIG_VM_EVENT_COUNTERS=y
#
Re: [PATCH 4.14 024/110] btrfs: use proper endianness accessors for super_copy
Greg Kroah-Hartman wrote...
> On Thu, Mar 15, 2018 at 07:55:42PM +0100, Christoph Biedl wrote:
> > > commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream.
> > On big-endian systems, this change introduces severe corruption,
> > resulting in complete loss of the data on the used block device.
> That sucks. Can you test Linus's tree to verify the problem is there?
> I'll gladly revert this if Linus's tree also gets the revert, I don't
> want you to hit this when you upgrade to a newer kernel.

Confirmed: The problem is, err ... was in Linus' tree as well. The rather recent commit 8f5fd927c3a7 reverted the change; after that, everything works as expected again.

Looking at the original commit, I don't have a clue why things go wrong so horribly. Apart from that, don't worry about my data: I took this as a chance to verify my data recovery procedure, with success.

Christoph
Re: [PATCH 4.14 024/110] btrfs: use proper endianness accessors for super_copy
Greg Kroah-Hartman wrote...
> 4.14-stable review patch. If anyone has any objections, please let me know.

> commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream.

(...)

> If the filesystem is always used on a same endian host, this will not
> be a problem.

From my observations I cannot quite subscribe to that. On big-endian systems, this change introduces severe corruption, resulting in complete loss of the data on the used block device.

Steps to reproduce (tested on ppc/powerpc and parisc/hppa):

# mkfs.btrfs $DEV
# mount $DEV /mnt/tmp/
# umount /mnt/tmp/

This simple umount corrupts the file system:

# mount $DEV /mnt/tmp/
mount: /mnt/tmp: wrong fs type, bad option, bad superblock on $DEV, missing codepage or helper program, or other error.

# dmesg:
BTRFS critical (device ): unable to find logical 4294967296 length 4096
BTRFS critical (device ): unable to find logical 4294967296 length 4096
BTRFS critical (device ): unable to find logical 18102363734671360 length 16384
BTRFS error (device ): failed to read chunk root
BTRFS error (device ): open_ctree failed

Also fsck is of no help:

# btrfsck $DEV
Couldn't map the block 18102363734671360
No mapping for 18102363734671360-18102363734687744
Couldn't map the block 18102363734671360
bytenr mismatch, want=18102363734671360, have=0
ERROR: cannot read chunk root
ERROR: cannot open file system

Trying mount or fsck on a little-endian system does not help either. So I consider the data on that device lost - luckily I use btrfs only for files for which a backup exists at all times.

Reverting that change restored the previous error-free behaviour.

I didn't check HEAD, i.e. v4.16-rc5, since the upstream commit was the last one that affected these files. Still, I could give this a try if anybody wishes.

Cheers,
Christoph
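As an illustration of how an endianness mix-up could produce addresses like the ones in the log, the following sketch byte-swaps two of the reported 64-bit values. That these numbers are byte-swapped on-disk fields is an assumption suggested by the commit subject, not something verified against the btrfs code; the helper name swap64 is made up for this example.

```python
def swap64(value):
    """Reinterpret a 64-bit unsigned value with its byte order reversed."""
    return int.from_bytes(value.to_bytes(8, "big"), "little")


# Logical addresses taken from the dmesg and btrfsck output.
for reported in (18102363734671360, 4294967296):
    swapped = swap64(reported)
    print(f"{reported:#018x} -> {swapped:#018x} ({swapped})")
```

The swapped values come out as 22036480 and 16777216. Both are small, 4 KiB-aligned numbers, which would be far more plausible on a freshly created filesystem than the addresses in the log.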
Re: [PATCH] crypto: sun4i-ss: add missing statesize
Maxime Ripard wrote...
> > > > This patch specifiy statesize for sha1 and md5.
> > > >
> > > > Signed-off-by: LABBE Corentin
> > > > Cc: sta...@vger.kernel.org
> > >
> > > Please also add a Fixes tag (and the stable version it applies to).
> >
> > I don't see the point for a fixes tag as it would simply refer
> > to the original patch-set that added the driver.
>
> What's the problem with that?

In my opinion, Fixes: should rather point to the commit that caused the breakage, which did so intentionally:

| commit 8996eafdcbad149ac0f772fb1649fbb75c482a6a
| Author: Russell King
| Date: Fri Oct 9 20:43:33 2015 +0100
|
| crypto: ahash - ensure statesize is non-zero
(...)
+ This patch adds a check to prevent these drivers from registering
+ ahash algorithms until they are fixed.

Another crypto driver (mv_cesa) suffers from the same problem. I have a patch ready but would prefer a consensus on these formalities before submitting.

Aside from this, if you really need another Tested-by:, add me. And also

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107281

Christoph
Re: Soft lockup issue in Linux 4.1.9
Eric Dumazet wrote...

[ commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af ]

> It definitely should help !

Yesterday, I experienced issues somewhat similar to this, but I'm not entirely sure: Four of five systems running 4.1.9 stopped working. No reaction on network, keyboard, serial console. In one case, the stack trace below made it to the loghost.

Two things are quite different. First, the systems had a reasonable uptime, about a week. And second, the scary part: All incidents happened within a rather short time span of three minutes at most, beginning after 16:41:28 and before 16:41:54 UTC. So I assumed a brownout first - until I realized the systems faded away at slightly different times, and one is at a different location, while other systems using different kernel versions continued to operate on both sites.

So, I'd be glad for answers to

- Is this the same issue or should I be even more afraid?
- What might be the reason for this temporal coincidence? I have no plausible idea.

Confused,
Christoph

INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
rcu_sched kthread starved for 5999 jiffies!
Task dump for CPU 3:
swapper/3 R running task 0 0 1 0x0008
81e396c0 88042dcc3b20 810807da 0003
81e396c0 88042dcc3b40 81083b78 88042dcc3b80
0003 88042dcc3b70 810a945c 88042dcd5740
Call Trace:
[] sched_show_task+0xaa/0x110
[] dump_cpu_task+0x38/0x40
[] rcu_dump_cpu_stacks+0x8c/0xc0
[] rcu_check_callbacks+0x3b1/0x680
[] ? acct_account_cputime+0x17/0x20
[] ? account_system_time+0x8e/0x180
[] update_process_times+0x33/0x60
[] tick_sched_handle.isra.14+0x30/0x40
[] tick_sched_timer+0x43/0x80
[] __run_hrtimer.isra.32+0x4a/0xd0
[] hrtimer_interrupt+0xd5/0x1f0
[] local_apic_timer_interrupt+0x34/0x60
INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
rcu_sched kthread starved for 5999 jiffies!
Task dump for CPU 3:
swapper/3 R running task 0 0 1 0x0008
81e396c0 88042dcc3b20 810807da 0003
81e396c0 88042dcc3b40 81083b78 88042dcc3b80
0003 88042dcc3b70 810a945c 88042dcd5740
Call Trace:
[] sched_show_task+0xaa/0x110
[] dump_cpu_task+0x38/0x40
[] smp_apic_timer_interrupt+0x3c/0x60
[] apic_timer_interrupt+0x6b/0x70
[] ? _raw_spin_unlock_irqrestore+0x9/0x10
[] try_to_del_timer_sync+0x48/0x60
[] ? del_timer_sync+0x42/0x60
[] del_timer_sync+0x4a/0x60
[] inet_csk_reqsk_queue_drop+0x7a/0x1f0
[] reqsk_timer_handler+0x12f/0x290
[] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
[] call_timer_fn.isra.26+0x26/0x80
[] rcu_dump_cpu_stacks+0x8c/0xc0
[] rcu_check_callbacks+0x3b1/0x680
[] ? acct_account_cputime+0x17/0x20
[] ? account_system_time+0x8e/0x180
[] update_process_times+0x33/0x60
[] tick_sched_handle.isra.14+0x30/0x40
[] tick_sched_timer+0x43/0x80
[] __run_hrtimer.isra.32+0x4a/0xd0
[] hrtimer_interrupt+0xd5/0x1f0
[] local_apic_timer_interrupt+0x34/0x60
[] run_timer_softirq+0x18e/0x220
[] __do_softirq+0xda/0x1f0
[] irq_exit+0x76/0xa0
[] smp_apic_timer_interrupt+0x45/0x60
[] apic_timer_interrupt+0x6b/0x70
[] ? sched_clock_cpu+0x9e/0xb0
[] ? amd_e400_idle+0x35/0xd0
[] ? amd_e400_idle+0x33/0xd0
[] arch_cpu_idle+0xa/0x10
[] cpu_startup_entry+0x2c3/0x330
[] smp_apic_timer_interrupt+0x3c/0x60
[] apic_timer_interrupt+0x6b/0x70
[] ? _raw_spin_unlock_irqrestore+0x9/0x10
[] try_to_del_timer_sync+0x48/0x60
[] ? del_timer_sync+0x42/0x60
[] del_timer_sync+0x4a/0x60
[] inet_csk_reqsk_queue_drop+0x7a/0x1f0
[] reqsk_timer_handler+0x12f/0x290
[] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
[] call_timer_fn.isra.26+0x26/0x80
[] start_secondary+0x17c/0x1a0
[] run_timer_softirq+0x18e/0x220
[] __do_softirq+0xda/0x1f0
[] irq_exit+0x76/0xa0
[] smp_apic_timer_interrupt+0x45/0x60
[] apic_timer_interrupt+0x6b/0x70
[] ? sched_clock_cpu+0x9e/0xb0
[] ? amd_e400_idle+0x35/0xd0
[] ? amd_e400_idle+0x33/0xd0
[] arch_cpu_idle+0xa/0x10
[] cpu_startup_entry+0x2c3/0x330
[] start_secondary+0x17c/0x1a0
Re: [ 030/143] proc connector: fix info leaks
Willy Tarreau wrote...
> Initialize event_data for all possible message types to prevent leaking
> kernel stack contents to userland (up to 20 bytes). Also set the flags
> member of the connector message to 0 to prevent leaking two more stack
> bytes this way.

There are build errors as shown below, and I guess this patch is the culprit. I can do detailed checks tonight, I'm a bit in a hurry right now.

(Using gcc-4.7 as provided by Debian wheezy)

Christoph

drivers/connector/cn_proc.c:286:9: error: expected declaration specifiers or '...' before '&' token
drivers/connector/cn_proc.c:286:26: error: expected declaration specifiers or '...' before numeric constant
drivers/connector/cn_proc.c:286:29: error: expected declaration specifiers or '...' before 'sizeof'
drivers/connector/cn_proc.c:287:5: error: expected '=', ',', ';', 'asm' or '__attribute__' before '->' token
make[5]: *** [drivers/connector/cn_proc.o] Error 1
make[4]: *** [drivers/connector] Error 2
make[4]: *** Waiting for unfinished jobs
Re: [ 00/13] 3.0.99-stable review
Khalid Aziz wrote...
> Better yet, just pull this patch from stable from now. I will redo
> the patch and send another one for the next round.

FYI, after patching mm/swap.c accordingly, all the 3.0 and 3.4 configurations I use do build. Some boot tests will follow, I'll follow up only if I see unusual behaviour.

Christoph
Re: [ 00/13] 3.0.99-stable review
Khalid Aziz wrote...
> Thanks for tracking this down. I had not tried a configuration with
> CONFIG_HUGETLB_PAGE not set. In my config, I was getting many
> multiple definition errors for bunch of other defines from
> linux/hugetlb.h. I will look at my config again but chances are I
> had something else screwed up in my build since you did not see
> those errors. Did you compile with CONFIG_HUGETLB_PAGE set after
> including linux/hugetlb.h? If you did, including linux/hugetlb.h
> instead of importing just the definition of PageHuge in mm/swap.c
> would be the right thing to do.

Yes, one of my configurations has CONFIG_HUGETLB_PAGE, also CONFIG_NUMA=y, and the kernel built. Could not test it, though.

There still might be other configuration settings that caused the error messages you've seen. Manually picking both PageHuge definitions from linux/hugetlb.h should be a safe alternative then, but that's ugly.

Christoph
Re: [ 00/13] 3.0.99-stable review
Guenter Roeck wrote...
> On 10/02/2013 09:04 PM, Greg Kroah-Hartman wrote:
> >This is the start of the stable review cycle for the 3.0.99 release.

> Heads up: I am getting lots of build failures in 3.0 and 3.4 builds.
>
> mm/built-in.o: In function `__put_compound_page':
> slab.c:(.text+0xaa3c): undefined reference to `PageHuge'
> mm/built-in.o: In function `put_compound_page':
> slab.c:(.text+0xaab0): undefined reference to `PageHuge'
> mm/built-in.o: In function `__get_page_tail':
> slab.c:(.text+0xb178): undefined reference to `PageHuge'
> make: *** [.tmp_vmlinux1] Error 1

This is obviously due to

| [ 11/13] mm: fix aio performance regression for database caused by THP

and happens if CONFIG_HUGETLB_PAGE is not set.

Looking closer, upstream commit 7cb2ef56 included linux/hugetlb.h while the backport for 3.0 just declares PageHuge. Reverting that as in the patch below causes the build to complete, and the resulting kernel shows no anomalies here.

Whoever did that backport, why was it done that way? Or did I miss an important point?

Christoph

--- a/mm/swap.c
+++ b/mm/swap.c
@@ -31,6 +31,7 @@
 #include <linux/backing-dev.h>
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
+#include <linux/hugetlb.h>

 #include "internal.h"

@@ -41,8 +42,6 @@ static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);

-int PageHuge(struct page *page);
-
 /*
  * This path almost never happens for VM activity - pages are normally
  * freed via pagevecs. But it gets used by networking.
Re: [ 00/13] 3.0.99-stable review
Guenter Roeck wrote... On 10/02/2013 09:04 PM, Greg Kroah-Hartman wrote: This is the start of the stable review cycle for the 3.0.99 release. Heads up: I am getting lots of build failures in 3.0 and 3.4 builds. mm/built-in.o: In function `__put_compound_page': slab.c:(.text+0xaa3c): undefined reference to `PageHuge' mm/built-in.o: In function `put_compound_page': slab.c:(.text+0xaab0): undefined reference to `PageHuge' mm/built-in.o: In function `__get_page_tail': slab.c:(.text+0xb178): undefined reference to `PageHuge' make: *** [.tmp_vmlinux1] Error 1 This is obviously due to | [ 11/13] mm: fix aio performance regression for database caused by THP and happens if CONFIG_HUGETLB_PAGE is not set. Looking closer, upstream commit 7cb2ef56 included linux/hugetlb.h while the backport for 3.0 just defines PageHuge. Reverting that like in the patch below causes the build to complete, and the resulting kernel shows no anomalies here. However did that backport, why was it done that way? Or did I miss an important point? Christoph --- a/mm/swap.c +++ b/mm/swap.c @@ -31,6 +31,7 @@ #include linux/backing-dev.h #include linux/memcontrol.h #include linux/gfp.h +#include linux/hugetlb.h #include internal.h @@ -41,8 +42,6 @@ static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs); static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs); -int PageHuge(struct page *page); - /* * This path almost never happens for VM activity - pages are normally * freed via pagevecs. But it gets used by networking. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ 00/13] 3.0.99-stable review
Khalid Aziz wrote...

> Thanks for tracking this down. I had not tried a configuration with
> CONFIG_HUGETLB_PAGE not set. In my config, I was getting many multiple
> definition errors for bunch of other defines from linux/hugetlb.h. I
> will look at my config again but chances are I had something else
> screwed up in my build since you did not see those errors. Did you
> compile with CONFIG_HUGETLB_PAGE set after including linux/hugetlb.h?
> If you did, including linux/hugetlb.h instead of importing just the
> definition of PageHuge in mm/swap.c would be the right thing to do.

Yes, one of my configurations has CONFIG_HUGETLB_PAGE, also CONFIG_NUMA=y, and the kernel built. Could not test it, though.

There still might be other configuration settings that caused the error messages you've seen. Manually picking both PageHuge definitions from linux/hugetlb.h should be a safe alternative then, but that's ugly.

    Christoph
Re: [ 00/13] 3.0.99-stable review
Khalid Aziz wrote...

> Better yet, just pull this patch from stable from now. I will redo the
> patch and send another one for the next round.

FYI, after patching mm/swap.c accordingly, all the 3.0 and 3.4 configurations I use do build. Some boot tests will follow, I'll follow up only if I see unusual behaviour.

    Christoph
Re: Linux 2.6.32.61 - x86/ptrace/gcc 4.7 build error
Willy Tarreau wrote...

> I'm attaching the two patches here to be appled on top of 2.6.32.61, I would
> like it if you could try in your environment to confirm that they correctly
> fix the issue.

Confirmation: Kernel builds and runs for both Debian squeeze and wheezy (gcc 4.4 and gcc 4.7) on i386.

There are still other issues that need investigation, but they might be older and/or related to changes on my end. virtio-net doesn't seem to work at all (but does so in the Debian squeeze 2.6.32 kernel), and the virtualbox guest module (4.1.18) fails to load (a known issue on i386 if built using gcc 4.7, but now this also happens with gcc 4.4).

Unfortunately my time resources are very limited at the moment, and there's also something in 3.4.49 which has higher priority. Stay tuned.

    Christoph
Re: [ 81/83] ACPI video: Ignore errors after _DOD evaluation.
Greg Kroah-Hartman wrote...

> 3.6-stable review patch. If anyone has any objections, please let me know.

Only a small set of the (guessed) 300 e-mails arrived here, they were:

- 21/11 Greg Kroah-Hartman ┬─>[ 81/83] ACPI video: Ignore errors after _DOD evaluation.
- 21/11 Greg Kroah-Hartman └─>[ 29/83] ASoC: core: Double control update err for snd_soc_put_volsw_sx
- 21/11 Greg Kroah-Hartman [ 05/38] ptp: update adjfreq callback description
- 21/11 Greg Kroah-Hartman ┬─>[ 105/171] libceph: fully initialize connection in con_init()
- 21/11 Greg Kroah-Hartman ├─>[ 001/171] mm: bugfix: set current->reclaim_state to NULL while returning from kswapd()
- 21/11 Greg Kroah-Hartman ├─>[ 075/171] ceph: ensure auth ops are defined before use
- 21/11 Greg Kroah-Hartman ├─>[ 086/171] libceph: flush msgr queue during mon_client shutdown
- 21/11 Greg Kroah-Hartman ├─>[ 005/171] mac80211: call skb_dequeue/ieee80211_free_txskb instead of __skb_queue_purge
- 21/11 Greg Kroah-Hartman ├─>[ 065/171] libceph: dont reset kvec in prepare_write_banner()
- 21/11 Greg Kroah-Hartman ├─>[ 049/171] drm/i915: fix overlay on i830M
- 21/11 Greg Kroah-Hartman └─>[ 039/171] r8169: Fix WoL on RTL8168d/8111d.

And they had a huge delay of eight hours at kernel.org:

| Received: from mail.kernel.org ([198.145.19.201]:49019 "EHLO mail.kernel.org"
|         rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
|         id S1753592Ab2KVSdp (ORCPT <rfc822;sta...@vger.kernel.org>);
|         Thu, 22 Nov 2012 13:33:45 -0500
| Received: from mail.kernel.org (localhost [127.0.0.1])
|         by mail.kernel.org (Postfix) with ESMTP id 87EA220434;
|         Thu, 22 Nov 2012 00:46:51 +0000 (UTC)

Since there was nothing else in the two hours since then, care to check what went wrong?

Nevertheless, thanks to the predictable patch download URLs, I could start my tests, everything looks good so far.

    Christoph
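The two Received: stamps above carry different UTC offsets, so they cannot be subtracted by eye. A minimal sketch for computing the time between the two hops, using Python's standard `email.utils.parsedate_to_datetime`; the helper name `hop_delay` and the constant names are my own, and the offset of the second stamp is assumed to be +0000 as its "(UTC)" comment indicates.

```python
from email.utils import parsedate_to_datetime

# Date stamps copied from the two Received: headers quoted above.
# Assumption: the kernel.org stamp is +0000, as its "(UTC)" comment says;
# the comment itself is left out of the string here.
QUEUED_AT_KERNEL_ORG = "Thu, 22 Nov 2012 00:46:51 +0000"
ARRIVED_AT_VGER = "Thu, 22 Nov 2012 13:33:45 -0500"


def hop_delay(earlier: str, later: str):
    """Return the time elapsed between two RFC 2822 date stamps."""
    # Both results are timezone-aware datetimes, so the subtraction
    # takes the differing UTC offsets into account.
    return parsedate_to_datetime(later) - parsedate_to_datetime(earlier)


if __name__ == "__main__":
    print(hop_delay(QUEUED_AT_KERNEL_ORG, ARRIVED_AT_VGER))
```

Normalised to UTC, the vger stamp is 18:33:45, so for these two stamps the script prints 17:46:54.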
Re: [PATCH 00/11] 3.2-stable: Fix for leapsecond caused hrtimer/futex issue
John Stultz wrote...

> Attached is the test case I used to reproduce and test the solution
> to the hard-hang deadlock.

I was wondering whether anybody managed to crash a virtualbox guest using your program. To no avail here, using version 4.1.18 on the host and several 3.0.x (x < 38) guest kernels on both x32 and x64; the guest utilities were stopped.

Rather a fun fact I guess but I wanted to let you know. All real hardware tested, including a dockstar on armel, crashed as predicted, while 3.0.38-rc1 was immune.

    Christoph
it821x trouble since 2.6.18
Hi,

There are ITE 8212-based controllers installed in some of my computers. I have always skipped the built-in RAID things and used them as plain controllers.

1. Beginning with 2.6.18 there was some trouble, basically slowed data transfer, sometimes even a totally stalled system (I cannot reproduce the latter though). Diffing dmesg of 2.6.17.x and 2.6.18.x (same for 2.6.19):

 Probing IDE interface ide2...
 hde: SAMSUNG VA34324A, ATA DISK drive
 hde: Performing identify fixups.
 ide2 at 0xd000-0xd007,0xd402 on irq 10
 hde: max request size: 128KiB
-hde: 8446032 sectors (4324 MB) w/478KiB Cache, CHS=14896/9/63, BUG
+hde: 8446032 sectors (4324 MB) w/478KiB Cache, CHS=14896/9/63, BUG DMA OFF
  hde:hde: recal_intr: status=0x51 { DriveReady SeekComplete Error }
 hde: recal_intr: error=0x04 { DriveStatusError }
 ide: failed opcode was: unknown
+hde: irq timeout: status=0xd0 { Busy }
+ide: failed opcode was: unknown
+ide2: reset: master: ECC circuitry error
+hde: recal_intr: status=0x51 { DriveReady SeekComplete Error }
+hde: recal_intr: error=0x04 { DriveStatusError }
+ide: failed opcode was: unknown
  hde1 hde2 < hde5 hde6 >

OK, there was "BUG" up to and including 2.6.17 but the disk(s) worked without any problem.

Not surprisingly the disk throughput was affected by that. Comparing "hdparm -tT /dev/hde":

2.6.17

/dev/hde:
 Timing cached reads: 88 MB in 2.01 seconds = 43.72 MB/sec
 Timing buffered disk reads: 26 MB in 3.21 seconds = 8.10 MB/sec

2.6.18

/dev/hde:
 Timing cached reads: 90 MB in 2.04 seconds = 44.01 MB/sec
 Timing buffered disk reads: 16 MB in 3.14 seconds = 5.10 MB/sec

2. Trying to understand what has happened I found the main difference is not in the driver but in ide-dma.c:

--- linux-2.6.17.14/drivers/ide/ide-dma.c 2006-06-18 01:49:35.0 +
+++ linux-2.6.18.6/drivers/ide/ide-dma.c 2006-09-20 03:42:06.0 +
@@ -752,7 +750,7 @@
 			goto bug_dma_off;
 		printk(", DMA");
 	} else if (id->field_valid & 1) {
-		printk(", BUG");
+		goto bug_dma_off;
 	}
 	return;
 bug_dma_off:

and reverting that change returns the old transfer rates. But that is probably not a good idea.

3. To increase confusion: I had learned the ite8212 chip is close to the cmd/sil680, and patching siimage.c had been a solution to get support for 8212 before it showed up in the kernel. So I tried this again:

--- ORG/linux-2.6.19.2/drivers/ide/pci/siimage.c 2006-11-29 21:57:37.0 +
+++ NEW/linux-2.6.19.2/drivers/ide/pci/siimage.c 2007-01-14 17:59:03.0 +
@@ -52,6 +52,7 @@
 		case PCI_DEVICE_ID_SII_1210SA:
 			return 1;
 		case PCI_DEVICE_ID_SII_680:
+		case PCI_DEVICE_ID_ITE_8212:
 			return 0;
 	}
 	BUG();
@@ -1082,6 +1083,7 @@
 
 static struct pci_device_id siimage_pci_tbl[] = {
 	{ PCI_VENDOR_ID_CMD, PCI_DEVICE_ID_SII_680, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},
+	{ PCI_VENDOR_ID_ITE, PCI_DEVICE_ID_ITE_8212, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},
 #ifdef CONFIG_BLK_DEV_IDE_SATA
 	{ PCI_VENDOR_ID_CMD, PCI_DEVICE_ID_SII_3112, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 1},
 	{ PCI_VENDOR_ID_CMD, PCI_DEVICE_ID_SII_1210SA, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 2},

and disabled CONFIG_BLK_DEV_IT821X - and now the disk is even faster than before:

/dev/hde:
 Timing cached reads: 88 MB in 2.01 seconds = 43.71 MB/sec
 Timing buffered disk reads: 34 MB in 3.02 seconds = 11.25 MB/sec

So it seems the problem is it821x.c, at least for my controllers.

4. Now there is 2.6.19 and the new ata driver model. So I tried that one, too: Booting with

| CONFIG_PATA_IT821X=m

stalls the system upon module load. The last kernel messages were

ata1.00: tag 0 cmd 0xc4 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port failed to respond (30 secs, Status 0xd0)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: tag 0 cmd 0xc4 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port failed to respond (30 secs, Status 0xd0)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: tag 0 cmd 0xc4 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port failed to respond (30 secs, Status 0xd0)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: tag 0 cmd 0xc4 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port failed to respond (30 secs, Status 0xd0)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: tag 0 cmd 0xc4 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port failed to respond (30 secs, Status 0xd0)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: tag 0 cmd 0xc4 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port failed to respond (30 secs, Status 0xd0)
Buffer I/O error on device sda, logical block 0

Patching the sil driver again:

--- ORG/linux-2.6.19.2/drivers/ata/pata_sil680.c 2006-11-29 22:57:37.0 +0100