Re: [PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio
On 4/22/24 22:06, Michael S. Tsirkin wrote: > On Tue, Apr 09, 2024 at 09:48:08AM +0800, Hou Tao wrote: >> Hi, >> >> On 4/8/2024 3:45 PM, Michael S. Tsirkin wrote: >>> On Wed, Feb 28, 2024 at 10:41:20PM +0800, Hou Tao wrote: From: Hou Tao Hi, The patch set aims to fix the warning related to an abnormal size parameter of kmalloc() in virtiofs. The warning occurred when attempting to insert a 10MB sized kernel module kept in a virtiofs with cache disabled. As analyzed in patch #1, the root cause is that the length of the read buffer is no limited, and the read buffer is passed directly to virtiofs through out_args[0].value. Therefore patch #1 limits the length of the read buffer passed to virtiofs by using max_pages. However it is not enough, because now the maximal value of max_pages is 256. Consequently, when reading a 10MB-sized kernel module, the length of the bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will try to allocate 2MB from memory subsystem. The request for 2MB of physically contiguous memory significantly stress the memory subsystem and may fail indefinitely on hosts with fragmented memory. To address this, patch #2~#5 use scattered pages in a bio_vec to replace the kmalloc-allocated bounce buffer when the length of the bounce buffer for KVEC_ITER dio is larger than PAGE_SIZE. The final issue with the allocation of the bounce buffer and sg array in virtiofs is that GFP_ATOMIC is used even when the allocation occurs in a kworker context. Therefore the last patch uses GFP_NOFS for the allocation of both sg array and bounce buffer when initiated by the kworker. For more details, please check the individual patches. As usual, comments are always welcome. Change Log: >>> Bernd should I just merge the patchset as is? >>> It seems to fix a real problem and no one has the >>> time to work on a better fix WDYT? >> >> Sorry for the long delay. I am just start to prepare for v3. In v3, I >> plan to avoid the unnecessary memory copy between fuse args and bio_vec. >> Will post it before next week. > > Didn't happen before this week apparently. Hi Michael, sorry for my later reply, I had been totally busy for the last weeks as well. Also I can't decide to merge it - I'm not the official fuse maintainer... >From my point of view, patch 1 is just missing to set the actual limit and then would be clear and easy back-portable bug fix. Not promised, I will try it out if I find a bit time tomorrow. Bernd
Re: [PATCH v2 1/6] fuse: limit the length of ITER_KVEC dio by max_pages
On 3/9/24 05:26, Hou Tao wrote: > Hi, > > On 3/1/2024 9:42 PM, Miklos Szeredi wrote: >> On Wed, 28 Feb 2024 at 15:40, Hou Tao wrote: >> >>> So instead of limiting both the values of max_read and max_write in >>> kernel, capping the maximal length of kvec iter IO by using max_pages in >>> fuse_direct_io() just like it does for ubuf/iovec iter IO. Now the max >>> value for max_pages is 256, so on host with 4KB page size, the maximal >>> size passed to kmalloc() in copy_args_to_argbuf() is about 1MB+40B. The >>> allocation of 2MB of physically contiguous memory will still incur >>> significant stress on the memory subsystem, but the warning is fixed. >>> Additionally, the requirement for huge physically contiguous memory will >>> be removed in the following patch. >> So the issue will be fixed properly by following patches? >> >> In that case this patch could be omitted, right? > > Sorry for the late reply. Being busy with off-site workshop these days. > > No, this patch is still necessary and it is used to limit the number of > scatterlist used for fuse request and reply in virtio-fs. If the length > of out_args[0].size is not limited, the number of scatterlist used to > map the fuse request may be greater than the queue size of virtio-queue > and the fuse request may hang forever. I'm currently also totally busy and didn't carefully check, but isn't there something missing that limits fc->max_write/fc->max_read? Thanks, Bernd
Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw
On 1/10/24 02:16, Hou Tao wrote: Hi, On 1/9/2024 9:11 PM, Bernd Schubert wrote: On 1/3/24 11:59, Hou Tao wrote: From: Hou Tao When trying to insert a 10MB kernel module kept in a virtiofs with cache disabled, the following warning was reported: [ cut here ] WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 .. Modules linked in: CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), .. RIP: 0010:__alloc_pages+0x2c4/0x360 .. Call Trace: ? __warn+0x8f/0x150 ? __alloc_pages+0x2c4/0x360 __kmalloc_large_node+0x86/0x160 __kmalloc+0xcd/0x140 virtio_fs_enqueue_req+0x240/0x6d0 virtio_fs_wake_pending_and_unlock+0x7f/0x190 queue_request_and_unlock+0x58/0x70 fuse_simple_request+0x18b/0x2e0 fuse_direct_io+0x58a/0x850 fuse_file_read_iter+0xdb/0x130 __kernel_read+0xf3/0x260 kernel_read+0x45/0x60 kernel_read_file+0x1ad/0x2b0 init_module_from_file+0x6a/0xe0 idempotent_init_module+0x179/0x230 __x64_sys_finit_module+0x5d/0xb0 do_syscall_64+0x36/0xb0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 .. ---[ end trace ]--- The warning happened as follow. In copy_args_to_argbuf(), virtiofs uses kmalloc-ed memory as bound buffer for fuse args, but fuse_get_user_pages() only limits the length of fuse arg by max_read or max_write for IOV_KVEC io (e.g., kernel_read_file from finit_module()). For virtiofs, max_read is UINT_MAX, so a big read request which is about I find this part of the explanation a bit confusing. I guess you wanted to write something like fuse_direct_io() -> fuse_get_user_pages() is limited by fc->max_write/fc->max_read and fc->max_pages. For virtiofs max_pages does not apply as ITER_KVEC is used. As virtiofs sets fc->max_read to UINT_MAX basically no limit is applied at all. Yes, what you said is just as expected but it is not the root cause of the warning. The culprit of the warning is kmalloc() in copy_args_to_argbuf() just as said in commit message. vmalloc() is also not acceptable, because the physical memory needs to be contiguous. For the problem, because there is no page involved, so there will be extra sg available, maybe we can use these sg to break the big read/write request into page. Hmm ok, I was hoping that contiguous memory is not needed. I see that ENOMEM is handled, but how that that perform (or even complete) on a really badly fragmented system? I guess splitting into smaller pages or at least adding some reserve kmem_cache (or even mempool) would make sense? I also wonder if it wouldn't it make sense to set a sensible limit in virtio_fs_ctx_set_defaults() instead of introducing a new variable? As said in the commit message: A feasible solution is to limit the value of max_read for virtiofs, so the length passed to kmalloc() will be limited. However it will affects the max read size for ITER_IOVEC io and the value of max_write also needs limitation. It is a bit hard to set a reasonable value for both max_read and max_write to handle both normal ITER_IOVEC io and ITER_KVEC io. And considering ITER_KVEC io + dio case is uncommon, I think using a new limitation is more reasonable. For ITER_IOVEC max_pages applies - which is limited to FUSE_MAX_MAX_PAGES - why can't this be used in virtio_fs_ctx_set_defaults? @Miklos, is there a reason why there is no upper fc->max_{read,write} limit in process_init_reply()? Shouldn't both be limited to (FUSE_MAX_MAX_PAGES * PAGE_SIZE). Or any other reasonable limit? Thanks, Bernd Also, I guess the issue is kmalloc_array() in virtio_fs_enqueue_req? Wouldn't it make sense to use kvm_alloc_array/kvfree in that function? Thanks, Bernd 10MB is passed to copy_args_to_argbuf(), kmalloc() is called in turn with len=10MB, and triggers the warning in __alloc_pages(): WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp)). A feasible solution is to limit the value of max_read for virtiofs, so the length passed to kmalloc() will be limited. However it will affects the max read size for ITER_IOVEC io and the value of max_write also needs limitation. So instead of limiting the values of max_read and max_write, introducing max_nopage_rw to cap both the values of max_read and max_write when the fuse dio read/write request is initiated from kernel. Considering that fuse read/write request from kernel is uncommon and to decrease the demand for large contiguous pages, set max_nopage_rw as 256KB instead of KMALLOC_MAX_SIZE - 4096 or similar. Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem") Signed-off-by: Hou Tao --- fs/fuse/file.c | 12 +++- fs/fuse/fuse_i.h | 3 +++ fs/fuse/inode.c | 1 + fs/fuse/virtio_fs.c | 6 ++ 4 files changed, 21 insertions(+), 1 deletion(-) diff --git a/fs/fuse/file.c b/fs/
Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw
On 1/3/24 11:59, Hou Tao wrote: From: Hou Tao When trying to insert a 10MB kernel module kept in a virtiofs with cache disabled, the following warning was reported: [ cut here ] WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 .. Modules linked in: CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), .. RIP: 0010:__alloc_pages+0x2c4/0x360 .. Call Trace: ? __warn+0x8f/0x150 ? __alloc_pages+0x2c4/0x360 __kmalloc_large_node+0x86/0x160 __kmalloc+0xcd/0x140 virtio_fs_enqueue_req+0x240/0x6d0 virtio_fs_wake_pending_and_unlock+0x7f/0x190 queue_request_and_unlock+0x58/0x70 fuse_simple_request+0x18b/0x2e0 fuse_direct_io+0x58a/0x850 fuse_file_read_iter+0xdb/0x130 __kernel_read+0xf3/0x260 kernel_read+0x45/0x60 kernel_read_file+0x1ad/0x2b0 init_module_from_file+0x6a/0xe0 idempotent_init_module+0x179/0x230 __x64_sys_finit_module+0x5d/0xb0 do_syscall_64+0x36/0xb0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 .. ---[ end trace ]--- The warning happened as follow. In copy_args_to_argbuf(), virtiofs uses kmalloc-ed memory as bound buffer for fuse args, but fuse_get_user_pages() only limits the length of fuse arg by max_read or max_write for IOV_KVEC io (e.g., kernel_read_file from finit_module()). For virtiofs, max_read is UINT_MAX, so a big read request which is about I find this part of the explanation a bit confusing. I guess you wanted to write something like fuse_direct_io() -> fuse_get_user_pages() is limited by fc->max_write/fc->max_read and fc->max_pages. For virtiofs max_pages does not apply as ITER_KVEC is used. As virtiofs sets fc->max_read to UINT_MAX basically no limit is applied at all. I also wonder if it wouldn't it make sense to set a sensible limit in virtio_fs_ctx_set_defaults() instead of introducing a new variable? Also, I guess the issue is kmalloc_array() in virtio_fs_enqueue_req? Wouldn't it make sense to use kvm_alloc_array/kvfree in that function? Thanks, Bernd 10MB is passed to copy_args_to_argbuf(), kmalloc() is called in turn with len=10MB, and triggers the warning in __alloc_pages(): WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp)). A feasible solution is to limit the value of max_read for virtiofs, so the length passed to kmalloc() will be limited. However it will affects the max read size for ITER_IOVEC io and the value of max_write also needs limitation. So instead of limiting the values of max_read and max_write, introducing max_nopage_rw to cap both the values of max_read and max_write when the fuse dio read/write request is initiated from kernel. Considering that fuse read/write request from kernel is uncommon and to decrease the demand for large contiguous pages, set max_nopage_rw as 256KB instead of KMALLOC_MAX_SIZE - 4096 or similar. Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem") Signed-off-by: Hou Tao --- fs/fuse/file.c | 12 +++- fs/fuse/fuse_i.h| 3 +++ fs/fuse/inode.c | 1 + fs/fuse/virtio_fs.c | 6 ++ 4 files changed, 21 insertions(+), 1 deletion(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a660f1f21540..f1beb7c0b782 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1422,6 +1422,16 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii, return ret < 0 ? ret : 0; } +static size_t fuse_max_dio_rw_size(const struct fuse_conn *fc, + const struct iov_iter *iter, int write) +{ + unsigned int nmax = write ? fc->max_write : fc->max_read; + + if (iov_iter_is_kvec(iter)) + nmax = min(nmax, fc->max_nopage_rw); + return nmax; +} + ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter, loff_t *ppos, int flags) { @@ -1432,7 +1442,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter, struct inode *inode = mapping->host; struct fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fm->fc; - size_t nmax = write ? fc->max_write : fc->max_read; + size_t nmax = fuse_max_dio_rw_size(fc, iter, write); loff_t pos = *ppos; size_t count = iov_iter_count(iter); pgoff_t idx_from = pos >> PAGE_SHIFT; diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 1df83eebda92..fc753cd34211 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -594,6 +594,9 @@ struct fuse_conn { /** Constrain ->max_pages to this value during feature negotiation */ unsigned int max_pages_limit; + /** Maximum read/write size when there is no page in request */ + unsigned int max_nopage_rw; + /** Input queue */ struct fuse_iqueue iq; diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 2a6d44f91729..4cbbcb4a4b71 100644 --- a/fs/fuse/inode.c +++ b/fs/fu
Re: md: Combine two kmalloc() calls into one in sb_equal()
On 09.12.2016 22:58, SF Markus Elfring wrote: Irrelevant, the variable is not used before checking it. * Will it be more appropriate to attempt another memory allocation only if the previous one succeeded already? * Can it be a bit more efficient to duplicate only the required data in a single function call before? How many memory allocations do you expect to fail?
Re: md: Combine two kmalloc() calls into one in sb_equal()
On 09.12.2016 20:54, SF Markus Elfring wrote: So where did you get the idea from that it is not checked immediately? Is another variable assignment performed so far before the return value is checked from a previous function call? Irrelevant, the variable is not used before checking it.
Re: [PATCH] md: Combine two kmalloc() calls into one in sb_equal()
On 09.12.2016 19:30, SF Markus Elfring wrote: From: Markus Elfring Date: Fri, 9 Dec 2016 19:09:13 +0100 The function "kmalloc" was called in one case by the function "sb_equal" without checking immediately if it failed. Err, your patch actually *replaces* the check. So where did you get the idea from that it is not checked immediately? [...] - tmp1 = kmalloc(sizeof(*tmp1),GFP_KERNEL); - tmp2 = kmalloc(sizeof(*tmp2),GFP_KERNEL); - - if (!tmp1 || !tmp2) { - ret = 0; - goto abort; - } This is not immediately? Bernd
Re: [PATCH v4] fuse: Add support for passthrough read/write
On 01/21/2016 01:16 AM, Nikhilesh Reddy wrote: > Add support for filesystem passthrough read/write of files > when enabled in userspace through the option FUSE_PASSTHROUGH. > > There are many FUSE based filesystems that perform checks or > enforce policy or perform some kind of decision making in certain > functions like the "open" call but simply act as a "passthrough" > when performing operations such as read or write. > > When FUSE_PASSTHROUGH is enabled all the reads and writes > to the fuse mount point go directly to the passthrough filesystem > i.e a native filesystem that actually hosts the files rather than > through the fuse daemon. All requests that aren't read/write still > go thought the userspace code. > > This allows for significantly better performance on read and writes. > The difference in performance between fuse and the native lower > filesystem is negligible. > > There is also a significant cpu/power savings that is achieved which > is really important on embedded systems that use fuse for I/O. > > Signed-off-by: Nikhilesh Reddy I think it is common style to add a change log between patch set versions in the patch description. Bernd
Re: [PATCH] sysctl: Add a feature to drop caches selectively
On 06/27/2014 04:55 AM, Dave Chinner wrote: On Thu, Jun 26, 2014 at 02:10:28PM +0200, Bernd Schubert wrote: On 06/26/2014 01:57 PM, Lukáš Czerner wrote: On Thu, 26 Jun 2014, Artem Bityutskiy wrote: On Thu, 2014-06-26 at 12:36 +0200, Bernd Schubert wrote: On 06/26/2014 08:13 AM, Artem Bityutskiy wrote: On Thu, 2014-06-26 at 11:06 +1000, Dave Chinner wrote: Your particular use case can be handled by directing your benchmark at a filesystem mount point and unmounting the filesystem in between benchmark runs. There is no ned to adding kernel functionality for somethign that can be so easily acheived by other means, especially in benchmark environments where *everything* is tightly controlled. If I was a benchmark writer, I would not be willing running it as root to be able to mount/unmount, I would not be willing to require the customer creating special dedicated partitions for the benchmark, because this is too user-unfriendly. Or do I make incorrect assumptions? But why a sysctl then? And also don't see a point for that at all, why can't the benchmark use posix_fadvise(POSIX_FADV_DONTNEED)? The latter question was answered - people want a way to drop caches for a file. They need a method which guarantees that the caches are dropped. They do not need an advisory method which does not give any guarantees. I'm not sure if a benchmark really needs that so much that FADV_DONTNEED isn't sufficient. Personally I would also like to know if FADV_DONTNEED succeeded. I.e. 'ql-fstest' is to check if the written pattern went to the block device and currently it does not know if data really had been dropped from the page cache. As it reads files several times this is not critical, but only would be a nice to have - nothing worth to add a new syscall. ql-test is not a benchmark, it's a data integrity test. The re-read verification problem is easily solved by using direct IO to read the files directly without going through the page cache. Indeed, direct IO will invalidate cached pages over the range it reads before it does the read, so the guarantee that you are after - no cached pages when the read is done - is also fulfilled by the direct IO read... I really don't understand why people keep trying to make cached IO behave like uncached IO when we already have uncached IO interfaces Firstly, direct IO has an entirely different IO pattern, usually much simpler than buffered through the page cache. Secondly, going through the page cache ensures that page cache buffering is also tested. I'm not at all opposed to open files randomly with direct IO to also test that path and I'm going to add that soon, but only using direct IO would limit the use case of ql-fstest. Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sysctl: Add a feature to drop caches selectively
On 06/26/2014 01:57 PM, Lukáš Czerner wrote: On Thu, 26 Jun 2014, Artem Bityutskiy wrote: On Thu, 2014-06-26 at 12:36 +0200, Bernd Schubert wrote: On 06/26/2014 08:13 AM, Artem Bityutskiy wrote: On Thu, 2014-06-26 at 11:06 +1000, Dave Chinner wrote: Your particular use case can be handled by directing your benchmark at a filesystem mount point and unmounting the filesystem in between benchmark runs. There is no ned to adding kernel functionality for somethign that can be so easily acheived by other means, especially in benchmark environments where *everything* is tightly controlled. If I was a benchmark writer, I would not be willing running it as root to be able to mount/unmount, I would not be willing to require the customer creating special dedicated partitions for the benchmark, because this is too user-unfriendly. Or do I make incorrect assumptions? But why a sysctl then? And also don't see a point for that at all, why can't the benchmark use posix_fadvise(POSIX_FADV_DONTNEED)? The latter question was answered - people want a way to drop caches for a file. They need a method which guarantees that the caches are dropped. They do not need an advisory method which does not give any guarantees. I'm not sure if a benchmark really needs that so much that FADV_DONTNEED isn't sufficient. Personally I would also like to know if FADV_DONTNEED succeeded. I.e. 'ql-fstest' is to check if the written pattern went to the block device and currently it does not know if data really had been dropped from the page cache. As it reads files several times this is not critical, but only would be a nice to have - nothing worth to add a new syscall. Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sysctl: Add a feature to drop caches selectively
On 06/26/2014 08:13 AM, Artem Bityutskiy wrote: On Thu, 2014-06-26 at 11:06 +1000, Dave Chinner wrote: Your particular use case can be handled by directing your benchmark at a filesystem mount point and unmounting the filesystem in between benchmark runs. There is no ned to adding kernel functionality for somethign that can be so easily acheived by other means, especially in benchmark environments where *everything* is tightly controlled. If I was a benchmark writer, I would not be willing running it as root to be able to mount/unmount, I would not be willing to require the customer creating special dedicated partitions for the benchmark, because this is too user-unfriendly. Or do I make incorrect assumptions? But why a sysctl then? And also don't see a point for that at all, why can't the benchmark use posix_fadvise(POSIX_FADV_DONTNEED)? Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 02/11][SCSI]mpt2sas: Added new driver module Parameter disable_eedp to Disable EEDP Support
u8 scsi_io_cb_idx; diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c b/drivers/scsi/mpt2sas/mpt2sas_scsih.c index 7f0af4f..d502728 100644 --- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c +++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c @@ -127,6 +127,11 @@ static int disable_discovery = -1; module_param(disable_discovery, int, 0); MODULE_PARM_DESC(disable_discovery, " disable discovery "); +/* Enable or disable EEDP support */ +static int disable_eedp; +module_param(disable_eedp, uint, 0); +MODULE_PARM_DESC(disable_eedp, " disable EEDP support: (default=0)"); Wouldn't it make sense to exlain what EEDP means? Something like MODULE_PARM_DESC(disable_eedp, " disable end-to-end data protection support (DIF): " "default=0)"); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
kmemleak or crc32_le bug?
I'm frequently getting UG: unable to handle kernel paging request at 880f87550dc0 IP: [] crc32_le+0x30/0x110 called from kmemleak, see bottom of the message. schubert@wheezy@fsdevel2 linux-stable>addr2line -e vmlinux -i -a 813016d0 0x813016d0 /home/schubert/src/linux/linux-stable/lib/crc32.c:129 /home/schubert/src/linux/linux-stable/lib/crc32.c:247 /home/schubert/src/linux/linux-stable/lib/crc32.c:265 129: unlikely, refers to "u32 q" in crc32_body 247: crc = crc32_body(crc, p, len, tab); Also doesn't seem to be very likely. 265: u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len) { return crc32_le_generic(crc, p, len, (const u32 (*)[256])crc32table_le, CRCPOLY_LE); } Doesn't seem anything could fail here either. schubert@fsdevel2 linux-stable>addr2line -e vmlinux -i -a 811cdff9 0x811cdff9 /home/schubert/src/linux/linux-stable/mm/kmemleak.c:1350 kmemleak_scan() +1350 list_for_each_entry_rcu(object, &object_list, object_list) { spin_lock_irqsave(&object->lock, flags); if (color_white(object) && (object->flags & OBJECT_ALLOCATED) 1350: && update_checksum(object) && get_object(object)) { With the "Cannot allocate a kmemleak_object structure" messages, somehow looks like object is not proper initialized, but update_checksum() checks for that. Hmm, I'm not sure about kmemcheck_shadow_lookup(), especially about > if (!virt_addr_valid(address)) > return NULL; So is the test > shadow = kmemcheck_shadow_lookup(addr); > if (!shadow) > return true; right here? Shouldn't that be 'return false'? Thanks, Bernd kmemleak: Cannot allocate a kmemleak_object structure kmemleak: Kernel memory leak detector disabled kmemleak: Cannot allocate a kmemleak_object structure BUG: unable to handle kernel paging request at 880f87550dc0 IP: [] crc32_le+0x30/0x110 PGD 103f370067 PUD 10350e7067 PMD 10350ac067 PTE 800f87550060 Oops: [#1] SMP DEBUG_PAGEALLOC Modules linked in: fhgfs(O) fhgfs_client_opentk(O) parport_pc ppdev lp parport uinput nfsd auth_rpcgss dm_mod mlx4_ib ib_umad rdma_ucm rdma_cm ib_addr iw_cm ib_uverbs ib_ipoib ib_cm ib_sa ib_mad ib_core iTCO_wdt gpio_ich iTCO_vendor_support dcdbas mgag200 snd_pcm snd_page_alloc ttm snd_timer drm_kms_helper syscopyarea snd sysfillrect ipmi_si soundcore sysimgblt ipmi_msghandler pcspkr sb_edac edac_core joydev shpchp lpc_ich wmi acpi_power_meter ipv6 fuse nfsv4 nfsv3 nfs_acl nfs lockd sunrpc fscache sg sd_mod crc_t10dif crct10dif_common ahci libahci mlx4_core tg3 mpt2sas hwmon raid_class ptp scsi_transport_sas pps_core [last unloaded: fhgfs_client_opentk] CPU: 24 PID: 230 Comm: kmemleak Tainted: G O 3.13.1-dbg-1-gf9a023f #66 Hardware name: Dell Inc. PowerEdge R720/08RW36, BIOS 2.1.3 11/20/2013 task: 8807db75a790 ti: 8807d2f76000 task.ti: 8807d2f76000 RIP: 0010:[] [] crc32_le+0x30/0x110 RSP: 0018:8807d2f77db0 EFLAGS: 00010046 RAX: RBX: 880f833cb408 RCX: 0001 RDX: 0046 RSI: 880f87550dc0 RDI: 880f87550dbc RBP: 8807d2f77db8 R08: R09: 0001 R10: R11: 880f87550dbc R12: 0286 R13: R14: 0104 R15: 0400 FS: () GS:88081e60() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 880f87550dc0 CR3: 00103dc0c000 CR4: 001407e0 Stack: 880f833cb408 8807d2f77e18 811cdff9 811cdf51 81a3d984 88070009 880f833cb458 000927c0 000927c0 811ce5a0 Call Trace: [] kmemleak_scan+0x399/0x590 [] ? kmemleak_scan+0x2f1/0x590 [] ? kmemleak_write+0x3b0/0x3b0 [] kmemleak_scan_thread+0x63/0xd0 [] kthread+0xf6/0x110 [] ? kthread_create_on_node+0x250/0x250 [] ret_from_fork+0x7c/0xb0 [] ? kthread_create_on_node+0x250/0x250 Code: 89 f8 48 89 e5 53 0f 85 cd 00 00 00 49 89 d2 48 c1 ea 03 4c 8d 5e fc 41 83 e2 07 48 85 d2 0f 84 81 00 00 00 4c 89 df 45 31 c0 90 <8b> 5f 04 48 83 c7 08 49 83 c0 01 8b 0f 31 c3 89 d8 44 0f b6 cb RIP [] crc32_le+0x30/0x110 RSP CR2: 880f87550dc0 ---[ end trace 71bec186f2a04a6f ]--- BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:20 in_atomic(): 1, irqs_disabled(): 1, pid: 230, name: kmemleak INFO: lockdep is turned off. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject: [v3.8][v3.11][Regression] [SCSI] sd: Update WRITE SAME heuristics
Hello Joseph, On 10/29/2013 08:21 PM, Joseph Salisbury wrote: Hi Martin, A bug was opened against the Ubuntu kernel[0]. After a kernel bisect, it was found that reverting the following commit resolved this bug: commit 66c28f97120e8a621afd5aa7a31c4b85c547d33d Author: Martin K. Petersen Date: Thu Jun 6 22:15:55 2013 -0400 [SCSI] sd: Update WRITE SAME heuristics The regression was introduced as of v3.11-rc1, but it also made it's way into the stable trees. I see that you are the author of this patch, so I wanted to run this by you. I was thinking of requesting a revert for v3.12, but I wanted to get your feedback first. James queued this up for 3.13 http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?id=735e39e680256a13e7be3492acfb4d9721287a42 Maybe we should try to convince James to take it into 3.12? Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] extending splice for copy offloading
On 09/30/2013 09:34 PM, Myklebust, Trond wrote: On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote: On 09/30/2013 08:02 PM, Myklebust, Trond wrote: On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote: On 09/30/2013 07:44 PM, Myklebust, Trond wrote: On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote: It would be nice if there would be way if the file system would get a hint that the target file is supposed to be copy of another file. That way distributed file systems could also create the target-file with the correct meta-information (same storage targets as in-file has). Well, if we cannot agree on that, file system with a custom protocol at least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not sure if this would work for pNFS, though. splice() does not create new files. What you appear to be asking for lies way outside the scope of that system call interface. Sorry I know, definitely outside the scope of splice, but in the context of offloaded file copies. So the question is, what is the best way to address/discuss that? Why does it need to be addressed in the first place? An offloaded copy is still not efficient if different storage servers/targets used by from-file and to-file. So? mds1: orig-file oss1/target1: orig-chunk1 mds1: target-file ossN/targetN: target-chunk1 clientN: Performs the copy Ideally, orig-chunk1 and target-chunk1 are on the same server and same target. Copy offload then even could done from the underlying fs, similiar as local splice. If different ossN servers are used copies still have to be done over network by these storage servers, although the client only would need to initiate the copy. Still faster, but also not ideal. What is preventing an application from retrieving and setting this information using standard libc functions such as fstat()+open(), and supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd where appropriate? At a minimum this requires network and metadata overhead. And while I'm working on FhGFS now, I still wonder what other file system need to do - for example Lustre pre-allocates storage-target files on creating a file, so file layout changes mean even more overhead there. The problem you are describing is limited to a narrow set of storage architectures. If copy offload using splice() doesn't make sense for those architectures, then don't implement it for them. But it _does_ make sense. The file system just needs a hint that a splice copy is going to come up. You might be able to provide ioctls() to do these special hinted file creations for those filesystems that need it, but the vast majority don't, and you shouldn't enforce it on them. And exactly for that we need a standard - it does not make sense if each and every distributed file system implements its own ioctl/libattr/libacl interface for that. Anyway, if we could agree on to use libattr or libacl to teach the file system about the upcoming splice call I would be fine. libattr and libacl are generic libraries that exist to manipulate xattrs and acls. They do not need to contain Lustre-specific code. pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own interface? And userspace needs to address all of them differently? I'm just asking for something like a vfs ioctl SPLICE_META_COPY (sorry, didn't find a better name yet), which would take in-file-path and out-file-path and allow the file system to create out-file-path with the same meta-layout as in-file-path. And it would need some flags, such as AUTO (file system decides if it makes sense to do a local copy) and FORCE (always try a local copy). Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] extending splice for copy offloading
On 09/30/2013 08:02 PM, Myklebust, Trond wrote: On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote: On 09/30/2013 07:44 PM, Myklebust, Trond wrote: On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote: It would be nice if there would be way if the file system would get a hint that the target file is supposed to be copy of another file. That way distributed file systems could also create the target-file with the correct meta-information (same storage targets as in-file has). Well, if we cannot agree on that, file system with a custom protocol at least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not sure if this would work for pNFS, though. splice() does not create new files. What you appear to be asking for lies way outside the scope of that system call interface. Sorry I know, definitely outside the scope of splice, but in the context of offloaded file copies. So the question is, what is the best way to address/discuss that? Why does it need to be addressed in the first place? An offloaded copy is still not efficient if different storage servers/targets used by from-file and to-file. What is preventing an application from retrieving and setting this information using standard libc functions such as fstat()+open(), and supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd where appropriate? At a minimum this requires network and metadata overhead. And while I'm working on FhGFS now, I still wonder what other file system need to do - for example Lustre pre-allocates storage-target files on creating a file, so file layout changes mean even more overhead there. Anyway, if we could agree on to use libattr or libacl to teach the file system about the upcoming splice call I would be fine. Metadata overhead is probably negligible for large files. Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] extending splice for copy offloading
On 09/30/2013 07:44 PM, Myklebust, Trond wrote: On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote: It would be nice if there would be way if the file system would get a hint that the target file is supposed to be copy of another file. That way distributed file systems could also create the target-file with the correct meta-information (same storage targets as in-file has). Well, if we cannot agree on that, file system with a custom protocol at least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not sure if this would work for pNFS, though. splice() does not create new files. What you appear to be asking for lies way outside the scope of that system call interface. Sorry I know, definitely outside the scope of splice, but in the context of offloaded file copies. So the question is, what is the best way to address/discuss that? Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] extending splice for copy offloading
On 09/30/2013 06:31 PM, Miklos Szeredi wrote: Here's an example "cp" app using direct splice (and without fallback to non-splice, which is obviously required unless the kernel is known to support direct splice). Untested, but trivial enough... The important part is, I think, that the app must not assume that the kernel can complete the request in one go. Thanks, Miklos #define _GNU_SOURCE #include #include #include #include #include #include #ifndef SPLICE_F_DIRECT #define SPLICE_F_DIRECT(0x10) /* neither splice fd is a pipe */ #endif int main(int argc, char *argv[]) { struct stat stbuf; int in_fd; int out_fd; int res; off_t off; off_t off = 0; if (argc != 3) errx(1, "usage: %s from to", argv[0]); in_fd = open(argv[1], O_RDONLY); if (in_fd == -1) err(1, "opening %s", argv[1]); res = fstat(in_fd, &stbuf); if (res == -1) err(1, "fstat"); out_fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, stbuf.st_mode); if (out_fd == -1) err(1, "opening %s", argv[2]); do { off_t in_off = off, out_off = off; ssize_t rres; rres = splice(in_fd, &in_off, out_fd, &out_off, SSIZE_MAX, SPLICE_F_DIRECT); if (rres == -1) err(1, "splice"); if (rres == 0) break; off += rres; } while (off < stbuf.st_size); res = close(in_fd); if (res == -1) err(1, "close"); res = fsync(out_fd); if (res == -1) err(1, "fsync"); res = close(out_fd); if (res == -1) err(1, "close"); return 0; } It would be nice if there would be way if the file system would get a hint that the target file is supposed to be copy of another file. That way distributed file systems could also create the target-file with the correct meta-information (same storage targets as in-file has). Well, if we cannot agree on that, file system with a custom protocol at least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not sure if this would work for pNFS, though. Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Drivers: scsi: FLUSH timeout
On 09/24/2013 02:35 PM, KY Srinivasan wrote: -Original Message- From: Jack Wang [mailto:xjtu...@gmail.com] Sent: Tuesday, September 24, 2013 5:08 AM To: KY Srinivasan Cc: Greg KH; linux-kernel@vger.kernel.org; de...@linuxdriverproject.org; oher...@suse.com; jbottom...@parallels.com; h...@infradead.org; linux- s...@vger.kernel.org; Mike Christie Subject: Re: Drivers: scsi: FLUSH timeout On 09/21/2013 07:24 AM, KY Srinivasan wrote: -Original Message- From: Greg KH [mailto:gre...@linuxfoundation.org] Sent: Friday, September 20, 2013 1:32 PM To: KY Srinivasan Cc: linux-kernel@vger.kernel.org; de...@linuxdriverproject.org; oher...@suse.com; jbottom...@parallels.com; h...@infradead.org; linux- s...@vger.kernel.org Subject: Re: Drivers: scsi: FLUSH timeout On Fri, Sep 20, 2013 at 12:32:27PM -0700, K. Y. Srinivasan wrote: The SD_FLUSH_TIMEOUT value is currently hardcoded. Hardcoded where? Please, more context. This is defined in scsi/sd.h: #define SD_FLUSH_TIMEOUT(60 * HZ) On our cloud, we sometimes hit this timeout. I was wondering if we could make this a module parameter. If this is acceptable, I can send you a patch for this. A module parameter don't make sense for a per-device value, does it? Currently, the 60 second timeout is applied across devices. Ideally, I want to be able to control the FLUSH TIMEOUT as we currently do I/O timeout. If this is acceptable, I can work on a patch for that as well. Regards, K. Y greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Hi, Back to 2010, Mike(cc-ed) try to add a flush time out interface, similar to what you want here, no idea why it's just ignored? http://www.spinics.net/lists/linux-scsi/msg45017.html Thanks Jack. Mike, do you know what the concerns were as to why this patch was not accepted? See also this discussion: http://marc.info/?l=linux-scsi&m=127167679221319&w=2 And retries have been added by commit c213e1407be6b04b144794399a91472e0ef92aec Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] scsi disk: Use its own buffer for the vpd request
On 08/31/2013 09:48 PM, Nix wrote: > On 31 Aug 2013, Greg KH said: >> On Fri, Aug 30, 2013 at 11:01:56AM +0100, Nix wrote: >>> On 1 Aug 2013, Bernd Schubert said: >>> >>>> Once I noticed that scsi_get_vpd_page() works fine from other function >>>> calls and that it is not 0x89, but already 0x0 that fails fixing it became >>>> easy. >>>> >>>> Nix, any chance you could verify it also works for you? >>> >>> As an aside, this commit does indeed fix the bug I reported, but it >>> doesn't seem to have gone anywhere, not even into -stable. >>> >>> Is it held up somehow? >>> >>> (stable has >>> >>> commit 0ac10bd036f0f3b8ce7ac2390446eab9531c72eb >>> Author: Martin K. Petersen >>> Date: Tue Jul 30 22:58:34 2013 -0400 >>> >>> SCSI: Don't attempt to send extended INQUIRY command if skip_vpd_pages >>> is set >>> >>> which IIRC was eventually found not to be necessary, because this fix >>> works fine instead?) >>> >>> Possibly I'm misremembering the order of month-old events and Martin's >>> fix was eventually considered better... in which case, sorry for the noise. >> >> Is that other patch even needed anymore, now that Martin's patch is in >> the tree? > > My understanding is that this patch is rather better, since Martin's > patch prevents sending of the extended INQUIRY command at all: this one > just uses a reduced buffer size, but can still issue the command. (But I > may be misunderstanding everything.) Hmm, I wonder if 7562523e84ddc742fe1f9db8bd76b01acca89f6b (linus tree) / 0ac10bd036f0f3b8ce7ac2390446eab9531c72eb (stable-tree) always works . It tests if sdev->skip_vpd_pages is set, but as far as I can see this only gets set for Seagate drives via BLIST_SKIP_VPD_PAGES. So if anything else than a Seagate drive is connected to an Areca controller with older firmware it will still fail. Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] scsi disk: Use its own buffer for the vpd request
Martin, sorry for my late reply, I entirely lost track of this (customer issues, vacation, lots of main work, ...). On 08/02/2013 05:00 AM, Martin K. Petersen wrote: >>>>>> "Bernd" == Bernd Schubert writes: > > Bernd, > > Bernd> Once I noticed that scsi_get_vpd_page() works fine from other > Bernd> function calls and that it is not 0x89, but already 0x0 that > Bernd> fails fixing it became easy. > > Bernd> Nix, any chance you could verify it also works for you? > > Do we get an appropriate error back when we try to issue WRITE SAME > 10/16? If so, I'm OK with this fix. > > And thanks for looking into this! > Is testing with sg_write_same sufficient? With F/W V1.49: > (squeeze)fslab2:~# lsscsi | grep sda > [2:0:0:0]diskATA HDS724040KLSA80 KFAO /dev/sda > (squeeze)fslab2:~# strace -f sg_write_same --10 -v --num=0 --lba=0 /dev/sda > ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[10]=[41, 00, 00, 00, 00, 00, 00, > 00, 00, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=6, > flags=0, > data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...], > status=02, masked_status=01, sb[18]=[70, 00, 05, 00, 00, 00, 00, 0a, 00, 00, > 00, 00, 20, 00, 00, 00, 00, 00], host_status=0, driver_status=0x8, resid=0, > duration=0, info=0x1}) = 0 > write(2, "Write same: Fixed format, curre"..., 114Write same: Fixed format, > current; Sense key: Illegal Request > Additional sense: Invalid command operation code > ) = 114 > write(2, "Write same(10) command not suppo"..., 37Write same(10) command not > supported > ) = 37 > (squeeze)fslab2:~# strace -f sg_write_same --16 -v --num=0 --lba=0 /dev/sda > ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[16]=[93, 00, 00, 00, 00, 00, 00, > 00, 00, 00, 00, 00, 00, 00, 00, 00], mx_sb_len=32, iovec_count=0, > dxfer_len=512, timeout=6, flags=0, > data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...], > status=02, masked_status=01, sb[18]=[70, 00, 05, 00, 00, 00, 00, 0a, 00, 00, > 00, 00, 24, 00, 00, 00, 00, 00], host_status=0, driver_status=0x8, resid=0, > duration=0, info=0x1}) = 0 > write(2, "Write same: Fixed format, curre"..., 104Write same: Fixed format, > current; Sense key: Illegal Request > Additional sense: Invalid field in cdb > ) = 104 > write(2, "bad field in Write same(16) cdb,"..., 63bad field in Write same(16) > cdb, option probably not supported > ) = 63 Now with F/W V1.46 > (squeeze)fslab2:~# lsscsi | grep sdk > [10:0:1:2] diskHitachi HDS724040KLSA80 R001 /dev/sdk > (squeeze)fslab2:~# cat /sys/class/scsi_host/host10/host_fw_model > ARC-1260 > (squeeze)fslab2:~# strace -f sg_write_same --10 -v --num=0 --lba=0 /dev/sdk > execve("/usr/bin/sg_write_same", ["sg_write_same", "--10", "-v", "--num=0", > "--lba=0", "/dev/sdk"], [/* 26 vars */]) = 0 > ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[10]=[41, 00, 00, 00, 00, 00, 00, > 00, 00, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=6, > flags=0, > data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...], > status=00, masked_status=00, sb[19]=[f0, 00, 05, 00, 00, 00, 00, 0b, 00, 00, > 00, 00, 20, 00, 00, 00, 02, 00, 00], host_status=0, driver_status=0x8, > resid=0, duration=0, info=0x1}) = 0 > write(2, "Write same: Fixed format, curre"..., 134Write same: Fixed format, > current; Sense key: Illegal Request > Additional sense: Invalid command operation code > Info fld=0x0 [0] > ) = 134 > write(2, "Write same(10) command not suppo"..., 37Write same(10) command not > supported > ) = 37 > (squeeze)fslab2:~# strace -f sg_write_same --16 -v --num=0 --lba=0 /dev/sdk > execve("/usr/bin/sg_write_same", ["sg_write_same", "--16", "-v", "--num=0", > "--lba=0", "/dev/sdk"], [/* 26 vars */]) = 0 > ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[16]=[93, 00, 00, 00, 00, 00, 00, > 00, 00, 00, 00, 00, 00, 00, 00, 00], mx_sb_len=32, iovec_count=0, > dxfer_len=512, timeout=6, flags=0, > data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...], > status=00, masked_status=00, sb[19]=[f0, 00, 05, 00, 00, 00, 00, 0b, 00, 00, > 00, 00, 20, 00, 00, 00, 02, 00, 00], host_status=0, driver_status=0x8, > resid=0, duration=0, info=0x1}) = 0 > write(2, "Write same: Fixed format, curre"..., 134Write same: Fixed format, > current; Sense key: Illegal Request > Additional sense: Invalid command operation code > Info fld=0x0 [0] > ) = 134 > write(2, "Write same(16) command not suppo"..., 37Write same(16) command not > supported > ) = 37 Is this sufficient, or do you need something else? Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 08/01/2013 06:04 PM, Nix wrote: On 1 Aug 2013, Bernd Schubert verbalised: On 07/30/2013 11:20 PM, Nix wrote: On 30 Jul 2013, Bernd Schubert told this: On 07/30/2013 02:56 AM, Nix wrote: On 30 Jul 2013, Douglas Gilbert outgrape: Please supply the information that Martin Petersen asked for. Did it in private IRC (the advantage of working for the same division of the same company!) I didn't realise the original fix was actually implemented to allow Bernd, with a different Areca controller, to boot... obviously, in that situation, reversion is wrong, since that would just replace one won't- boot situation with another. Unless there is very simple fix the commit should reverted, imho. It would better then to remove write-same support from the md-layer. I'm not using md on that machine, just LVM. Our suspicion is that ext4 is doing a WRITE SAME for some reason. I didn't check yet for other cases, mkfs.ext4 does WRITE SAME and with lazy init it also will happen after mounting the file system, while lazy init is running (inode zeroing). Well, it'll happen the first few times you mount the fs. If your fs is years old (as mine are) the inode tables will probably have been initialized by now! I'm frequently doing tests with millions of files and reformating is ways faster than deleting the all these files. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 07/30/2013 11:20 PM, Nix wrote: On 30 Jul 2013, Bernd Schubert told this: On 07/30/2013 02:56 AM, Nix wrote: On 30 Jul 2013, Douglas Gilbert outgrape: Please supply the information that Martin Petersen asked for. Did it in private IRC (the advantage of working for the same division of the same company!) I didn't realise the original fix was actually implemented to allow Bernd, with a different Areca controller, to boot... obviously, in that situation, reversion is wrong, since that would just replace one won't- boot situation with another. Unless there is very simple fix the commit should reverted, imho. It would better then to remove write-same support from the md-layer. I'm not using md on that machine, just LVM. Our suspicion is that ext4 is doing a WRITE SAME for some reason. I didn't check yet for other cases, mkfs.ext4 does WRITE SAME and with lazy init it also will happen after mounting the file system, while lazy init is running (inode zeroing). Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] scsi disk: Use its own buffer for the vpd request
Whoops, the title is wrong, it should have been: [PATCH] scsi disk: Limit get_vpd_page buf size On 08/01/2013 04:34 PM, Bernd Schubert wrote: Once I noticed that scsi_get_vpd_page() works fine from other function calls and that it is not 0x89, but already 0x0 that fails fixing it became easy. Nix, any chance you could verify it also works for you? From: Bernd Schubert Somehow older areca firmware versions have issues with scsi_get_vpd_page() and a large buffer. Even scsi_get_vpd_page(, page=0,) failed in sd_read_write_same(), while a similar request from sd_read_block_limits() worked fine. Limiting the buf-size to 64-bytes fixes the issue with F/W V1.46. Fixes a regression with areca controllers and older firmware versions introduced by commit: 66c28f97120e8a621afd5aa7a31c4b85c547d33d Reported-by: Nix Signed-off-by: Bernd Schubert CC: sta...@vger.kernel.org --- drivers/scsi/sd.c |5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 80f39b8..02e50ae 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -2651,13 +2651,16 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer) struct scsi_device *sdev = sdkp->device; if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) { + /* too large values might cause issues with arcmsr */ + int vpd_buf_len = 64; + sdev->no_report_opcodes = 1; /* Disable WRITE SAME if REPORT SUPPORTED OPERATION * CODES is unsupported and the device has an ATA * Information VPD page (SAT). */ - if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE)) + if (!scsi_get_vpd_page(sdev, 0x89, buffer, vpd_buf_len)) sdev->no_write_same = 1; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] scsi disk: Use its own buffer for the vpd request
Once I noticed that scsi_get_vpd_page() works fine from other function calls and that it is not 0x89, but already 0x0 that fails fixing it became easy. Nix, any chance you could verify it also works for you? From: Bernd Schubert Somehow older areca firmware versions have issues with scsi_get_vpd_page() and a large buffer. Even scsi_get_vpd_page(, page=0,) failed in sd_read_write_same(), while a similar request from sd_read_block_limits() worked fine. Limiting the buf-size to 64-bytes fixes the issue with F/W V1.46. Fixes a regression with areca controllers and older firmware versions introduced by commit: 66c28f97120e8a621afd5aa7a31c4b85c547d33d Reported-by: Nix Signed-off-by: Bernd Schubert CC: sta...@vger.kernel.org --- drivers/scsi/sd.c |5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 80f39b8..02e50ae 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -2651,13 +2651,16 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer) struct scsi_device *sdev = sdkp->device; if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) { + /* too large values might cause issues with arcmsr */ + int vpd_buf_len = 64; + sdev->no_report_opcodes = 1; /* Disable WRITE SAME if REPORT SUPPORTED OPERATION * CODES is unsupported and the device has an ATA * Information VPD page (SAT). */ - if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE)) + if (!scsi_get_vpd_page(sdev, 0x89, buffer, vpd_buf_len)) sdev->no_write_same = 1; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 07/31/2013 05:15 AM, Martin K. Petersen wrote: >>>>>> "Bernd" == Bernd Schubert writes: > > Bernd, > >>> Product revision level: R001 > > It's clearly not verbatim passthrough... > > Bernd> Besides the firmware, the difference might be that I'm exporting > Bernd> single disks without any areca-raidset in between. I can try to > Bernd> confirm that tomorrow, I just need the system as it is till > Bernd> tomorrow noon. > > That would be a great data point. I don't have any Areca boards. > Just tested it, areca-raidset does not make a difference, but the firmware version does. After downgrading to 1.46 I have the same issue. It is getting a bit late for me, but as this a pure development system, which is also booted over nfs, I can investigate it tomorrow. Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 07/30/2013 02:56 AM, Nix wrote: On 30 Jul 2013, Douglas Gilbert outgrape: Please supply the information that Martin Petersen asked for. Did it in private IRC (the advantage of working for the same division of the same company!) I didn't realise the original fix was actually implemented to allow Bernd, with a different Areca controller, to boot... obviously, in that situation, reversion is wrong, since that would just replace one won't- boot situation with another. Unless there is very simple fix the commit should reverted, imho. It would better then to remove write-same support from the md-layer. Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 07/30/2013 01:34 AM, Martin K. Petersen wrote: "Nix" == Nix writes: Bernd, Nix> I can now confirm that reverting this commit causes this problem to Nix> go away, and my machine boots fine again. Can you please send me the output of sq_inq with your 1.49 firmware? I made a tweak that allowed Nix to boot but we're trying to find a good blacklist trigger. And that's tricky given that Areca allows you manually specify the SCSI model string for each volume... Sorry it got a bit late today. Here it is. (wheezy)fslab1:~# sg_inq -v /dev/sdc inquiry cdb: 12 00 00 00 24 00 standard INQUIRY: inquiry cdb: 12 00 00 00 60 00 PQual=0 Device_type=0 RMB=0 version=0x05 [SPC-3] [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=0 Resp_data_format=2 SCCS=0 ACC=0 TPGS=0 3PC=0 Protect=0 BQue=0 EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1 [RelAdr=0] WBus16=1 Sync=0 Linked=0 [TranDis=0] CmdQue=1 [SPI: Clocking=0x3 QAS=0 IUS=0] length=96 (0x60) Peripheral device type: disk Vendor identification: Hitachi Product identification: HDS724040KLSA80 Product revision level: R001 inquiry cdb: 12 01 00 00 fc 00 inquiry cdb: 12 01 80 00 fc 00 Unit serial number: KRFS2CRAHXJZVD Besides the firmware, the difference might be that I'm exporting single disks without any areca-raidset in between. I can try to confirm that tomorrow, I just need the system as it is till tomorrow noon. Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 07/29/2013 03:05 PM, Nix wrote: On 29 Jul 2013, Bernd Schubert said: Hi Nick, On 07/29/2013 12:10 PM, Nick Alcock wrote: arcmsr0: abort device command of scsi id = 0 lun = 1 arcmsr0: abort device command of scsi id = 0 lun = 0 arcmsr: executing bus reset eh.num_resets=0, num_[...] arcmsr0: wait 'abort all outstanding command' timeout arcmsr0: executing hw bus reset arcmsr0: waiting for hw bus reset return, retry=0 arcmsr0: waiting for hw bus reset return, retry=1 Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210 arcmsr: scsi bus reset eh returns with success [and back to the top of the error messages again, apparently forever, not that the machine would be much use without its RAID array even if this loop terminated at some point, so I only gave it a couple of minutes] The failure happens precisely at the moment we transition to early userspace, so presumably userspace I/O is failing (or something related to raw device access, perhaps, since the first thing it does is a vgscan). I haven't bisected yet (sorry, I have work to do which means this machine must be running right now), but nothing has changed in the arcmsr controller, nor in SCSI-land excepting commit 98dcc2946adbe4349ef1ef9b99873b912831edd4 Author: Martin K. Petersen Date: Thu Jun 6 22:15:55 2013 -0400 [...] Obviously, at this point, this machine has no modules loaded (it has almost none loaded even when fully operational) I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this patch is only in 3.10.3, but not yet in 3.10.1. ... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried 3.10.2.) Hmm, indeed that points to this commit. I just don't see what could fail there. Could you try to run these commands with 3.10.1? # # check if reporting opcodes works # sg_opcodes -v -n /dev/sdX # check ata information page # sg_vpd --page=0x89 /dev/sdX And I don't think this commit can cause your issue at all, a failing heuristics would enable WRITE SAME and would cause issues with linux-md, but there shouldn't happen anything directly in the scsi-layer. Which was your last working kernel version? 3.10.1. :) Whoops, sorry, I missed that in your first sentence. No changes to arcmsr between those versions... I suspect I'll have to bisect, which will be a complete pig because every failure means a hard powerdown of this box. Always-on servers rarely appreciate hard powerdowns :( Maybe just revert this commit? Helpful would be some scsi logging to see which command actually fails. I guess you don't have a serial console? Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
Hi Nick, On 07/29/2013 12:10 PM, Nick Alcock wrote: My server's ARC-1210 has been working fine for years, but when I upgraded from 3.10.1, it started failing: Instead of [0.784044] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210 [0.804028] scsi0 : Areca SATA Host Adapter RAID Controller Driver Version 1.20.00.15 2010/08/05 [...] [4.111770] sd 7:0:0:1: [sdd] Assuming drive cache: write through [4.115399] sd 7:0:0:1: [sdd] No Caching mode page present [4.115401] sd 7:0:0:1: [sdd] Assuming drive cache: write through [4.118081] sdd: sdd1 [4.124363] sd 7:0:0:1: [sdd] No Caching mode page present [4.124601] sd 7:0:0:1: [sdd] Assuming drive cache: write through [4.124867] sd 7:0:0:1: [sdd] Attached SCSI removable disk I now see (timestamps and some of the right edge chopped off because not captured on my camera, no netconsole as this machine has all my storage and is my loghost, and with this bug it can't get at any of that storage). sd 7:0:0:1: [sdd] Assuming drive cache: write through sd 7:0:0:1: [sdd] No Caching mode page present sd 7:0:0:1: [sdd] Assuming drive cache: write through sdd: sdd1 sd 7:0:0:1: [sdd] No Caching mode page present sd 7:0:0:1: [sdd] Assuming drive cache: write through sd 7:0:0:1: [sdd] Attached SCSI removable disk arcmsr0: abort device command of scsi id = 0 lun = 1 arcmsr0: abort device command of scsi id = 0 lun = 0 arcmsr: executing bus reset eh.num_resets=0, num_[...] arcmsr0: wait 'abort all outstanding command' timeout arcmsr0: executing hw bus reset arcmsr0: waiting for hw bus reset return, retry=0 arcmsr0: waiting for hw bus reset return, retry=1 Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210 arcmsr: scsi bus reset eh returns with success [and back to the top of the error messages again, apparently forever, not that the machine would be much use without its RAID array even if this loop terminated at some point, so I only gave it a couple of minutes] The failure happens precisely at the moment we transition to early userspace, so presumably userspace I/O is failing (or something related to raw device access, perhaps, since the first thing it does is a vgscan). I haven't bisected yet (sorry, I have work to do which means this machine must be running right now), but nothing has changed in the arcmsr controller, nor in SCSI-land excepting commit 98dcc2946adbe4349ef1ef9b99873b912831edd4 Author: Martin K. Petersen Date: Thu Jun 6 22:15:55 2013 -0400 SCSI: sd: Update WRITE SAME heuristics so my, admittedly largely baseless, suspicions currently fall there. Obviously, at this point, this machine has no modules loaded (it has almost none loaded even when fully operational) I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this patch is only in 3.10.3, but not yet in 3.10.1. And I don't think this commit can cause your issue at all, a failing heuristics would enable WRITE SAME and would cause issues with linux-md, but there shouldn't happen anything directly in the scsi-layer. Which was your last working kernel version? Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] coccicheck: Allow the user to give a VERBOSE= argument
Hello Nicolas, thanks for your review! Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2 2/2] coccicheck: Allow to show the executed command line
On my system one of the tests failed with "Fatal error: exception Failure("No OCaml compiler found! Install either ocamlopt or ocamlopt.opt")". Investigating such issues is easier if the executed command line is being shown. Signed-off-by: Bernd Schubert CC: Julia Lawall CC: Nicolas Palix CC: co...@systeme.lip6.fr CC: Michal Marek --- scripts/coccicheck | 28 +--- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/scripts/coccicheck b/scripts/coccicheck index f8f15a2..85d3189 100755 --- a/scripts/coccicheck +++ b/scripts/coccicheck @@ -55,6 +55,14 @@ if [ "$ONLINE" = "0" ] ; then echo '' fi +run_cmd() { + if [ $VERBOSE -ne 0 ] ; then + echo "Running: $@" + fi + eval $@ +} + + coccinelle () { COCCI="$1" @@ -100,15 +108,21 @@ coccinelle () { fi if [ "$MODE" = "chain" ] ; then - $SPATCH -D patch $FLAGS -sp_file $COCCI $OPT $OPTIONS || \ - $SPATCH -D report $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || \ - $SPATCH -D context $FLAGS -sp_file $COCCI $OPT $OPTIONS || \ - $SPATCH -D org $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || exit 1 + run_cmd $SPATCH -D patch \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS || \ + run_cmd $SPATCH -D report \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || \ + run_cmd $SPATCH -D context \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS || \ + run_cmd $SPATCH -D org \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || exit 1 elif [ "$MODE" = "rep+ctxt" ] ; then - $SPATCH -D report $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff && \ - $SPATCH -D context $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1 + run_cmd $SPATCH -D report \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff && \ + run_cmd $SPATCH -D context \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1 else - $SPATCH -D $MODE $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1 + run_cmd $SPATCH -D $MODE $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1 fi } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2 1/2] coccicheck: Allow the user to give a V= (verbose) argument
Do not run with verbosity on/off depending on the ONLINE variable, which gets set with C=1 or C=2, but allow the user to set the verbosity using kernel default make V= paramemter. Verbosity is off by default now. Signed-off-by: Bernd Schubert CC: Julia Lawall CC: Nicolas Palix CC: co...@systeme.lip6.fr CC: Michal Marek --- Documentation/coccinelle.txt |4 scripts/coccicheck | 11 ++- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/Documentation/coccinelle.txt b/Documentation/coccinelle.txt index cf44eb6..dffa2d6 100644 --- a/Documentation/coccinelle.txt +++ b/Documentation/coccinelle.txt @@ -87,6 +87,10 @@ As any static code analyzer, Coccinelle produces false positives. Thus, reports must be carefully checked, and patches reviewed. +To enable verbose messages set the V= variable, for example: + + make coccicheck MODE=report V=1 + Using Coccinelle with a single semantic patch ~~~ diff --git a/scripts/coccicheck b/scripts/coccicheck index 1a49d1c..f8f15a2 100755 --- a/scripts/coccicheck +++ b/scripts/coccicheck @@ -2,6 +2,15 @@ SPATCH="`which ${SPATCH:=spatch}`" +# The verbosity may be set by the environmental parameter V= +# as for example with 'make V=1 coccicheck' + +if [ -n "$V" -a "$V" != "0" ]; then + VERBOSE=1 +else + VERBOSE=0 +fi + if [ "$C" = "1" -o "$C" = "2" ]; then ONLINE=1 @@ -55,7 +64,7 @@ coccinelle () { # #$SPATCH -D $MODE $FLAGS -parse_cocci $COCCI $OPT > /dev/null -if [ "$ONLINE" = "0" ] ; then +if [ $VERBOSE -ne 0 ] ; then FILE=`echo $COCCI | sed "s|$srctree/||"` -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] coccicheck: Allow the user to give a VERBOSE= argument
Hello Nicolas, On 01/22/2013 03:31 PM, Nicolas Palix wrote: Hi, Thank you Bernd for your proposition. I added Michal in CC, who is responsible for the integration. Oh, sorry, I CCed everyone, but forgot Michal. I was wondering if the V variable which already exists would not be better than introducing a new variable. Bernd, is there any reason to not use V ? I'm fine using 'V' either. Your patch also remove the check of the ONLINE variable. In doing so, I think that your patch will badly interfere with the online checking performed with the C variable. Am I missing something ? Hmm, I probably should have told in the patch description that verbosity defaults to 0 now. Shall I revert or make an extra patch for that? With the current patch and ONLINE != 0 nothing will change. Cheers, Bernd Regards, On Tue, Jan 22, 2013 at 2:34 PM, Bernd Schubert wrote: Simply running "make coccicheck" returns very verbose output and warnings might not be noticed. Allow the user to set the verbosity level. Signed-off-by: Bernd Schubert CC: Julia Lawall CC: Nicolas Palix CC: co...@systeme.lip6.fr --- scripts/coccicheck |8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/scripts/coccicheck b/scripts/coccicheck index 1a49d1c..eab0b00 100755 --- a/scripts/coccicheck +++ b/scripts/coccicheck @@ -2,6 +2,12 @@ SPATCH="`which ${SPATCH:=spatch}`" +if [ -z "$VERBOSE" ] ; then + RUN_VERBOSE=0 +else + RUN_VERBOSE=$VERBOSE +fi + if [ "$C" = "1" -o "$C" = "2" ]; then ONLINE=1 @@ -55,7 +61,7 @@ coccinelle () { # #$SPATCH -D $MODE $FLAGS -parse_cocci $COCCI $OPT > /dev/null -if [ "$ONLINE" = "0" ] ; then +if [ "$RUN_VERBOSE" != "0" ] ; then FILE=`echo $COCCI | sed "s|$srctree/||"` -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] coccicheck: Allow to show the executed command line
On my system one of the tests failed with "Fatal error: exception Failure("No OCaml compiler found! Install either ocamlopt or ocamlopt.opt")". Investigating such issues is easier if the executed command line is being shown. Signed-off-by: Bernd Schubert CC: Julia Lawall CC: Nicolas Palix CC: co...@systeme.lip6.fr --- scripts/coccicheck | 28 +--- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/scripts/coccicheck b/scripts/coccicheck index eab0b00..fb98534 100755 --- a/scripts/coccicheck +++ b/scripts/coccicheck @@ -52,6 +52,14 @@ if [ "$ONLINE" = "0" ] ; then echo '' fi +run_cmd() { + if [ "$RUN_VERBOSE" != "0" ] ; then + echo "Running: $@" + fi + eval $@ +} + + coccinelle () { COCCI="$1" @@ -97,15 +105,21 @@ coccinelle () { fi if [ "$MODE" = "chain" ] ; then - $SPATCH -D patch $FLAGS -sp_file $COCCI $OPT $OPTIONS || \ - $SPATCH -D report $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || \ - $SPATCH -D context $FLAGS -sp_file $COCCI $OPT $OPTIONS || \ - $SPATCH -D org $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || exit 1 + run_cmd $SPATCH -D patch \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS || \ + run_cmd $SPATCH -D report \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || \ + run_cmd $SPATCH -D context \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS || \ + run_cmd $SPATCH -D org \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || exit 1 elif [ "$MODE" = "rep+ctxt" ] ; then - $SPATCH -D report $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff && \ - $SPATCH -D context $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1 + run_cmd $SPATCH -D report \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff && \ + run_cmd $SPATCH -D context \ + $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1 else - $SPATCH -D $MODE $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1 + run_cmd $SPATCH -D $MODE $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1 fi } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] coccicheck: Allow the user to give a VERBOSE= argument
Simply running "make coccicheck" returns very verbose output and warnings might not be noticed. Allow the user to set the verbosity level. Signed-off-by: Bernd Schubert CC: Julia Lawall CC: Nicolas Palix CC: co...@systeme.lip6.fr --- scripts/coccicheck |8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/scripts/coccicheck b/scripts/coccicheck index 1a49d1c..eab0b00 100755 --- a/scripts/coccicheck +++ b/scripts/coccicheck @@ -2,6 +2,12 @@ SPATCH="`which ${SPATCH:=spatch}`" +if [ -z "$VERBOSE" ] ; then + RUN_VERBOSE=0 +else + RUN_VERBOSE=$VERBOSE +fi + if [ "$C" = "1" -o "$C" = "2" ]; then ONLINE=1 @@ -55,7 +61,7 @@ coccinelle () { # #$SPATCH -D $MODE $FLAGS -parse_cocci $COCCI $OPT > /dev/null -if [ "$ONLINE" = "0" ] ; then +if [ "$RUN_VERBOSE" != "0" ] ; then FILE=`echo $COCCI | sed "s|$srctree/||"` -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] 3.7 ends in APIC panic
On 12/17/2012 11:00 AM, Borislav Petkov wrote: + Suresh. On Mon, Dec 17, 2012 at 10:34:46AM +0100, Bernd Schubert wrote: On 12/16/2012 09:39 PM, Borislav Petkov wrote: On Sun, Dec 16, 2012 at 08:46:06PM +0100, Bernd Schubert wrote: Hmm, I read it the other way around - x2apic depends on interrupt remapping, but interrupt remapping can be used without x2apic. Ok, you're right. X2APIC should depend on IRQ_REMAP: https://lwn.net/Articles/289881/ The help text of CONFIG_IRQ_REMAP also says "x2APIC enhancements or to support platforms with CPU's having > 8 bit APIC ID, say Y." I guess may CPU has the latter? I think it is what Yinghai said - you obviously need x2apic kernel support if you have IRQ_REMAP on. Can the kernel panic a bit improved to help user to understand what needs to be enabled? Well, your kernel enables IRQ_REMAP properly: [0.031115] Enabled IRQ remapping in x2apic mode I guess at that stage we could probably check for x2apic support and scream loudly if it is not present... IMHO. Hmm, I think that would the wrong place, It has to be the right place because this "Enabled IRQ..." printk above is from the IRQ remapping code which detects an x2apic mode in your case. as the initial 3.7.0 configuration didn't have IRQ_REMAP enabled. Huh, so why do I see the above message in your dmesg output in http://marc.info/?l=linux-kernel&m=135540103415652 then? Oh huh, that is the dmesg from 3.4.7, which booted fine. I just sent it hoping it would help to see where the issue comes from. When I run "make oldconfig" for 3.7.0 I accidentally unset CONFIG_IRQ_REMAP, wich also unset CONFIG_X86_X2APIC :( Ok, let's sort things out here. Your .config has # CONFIG_IRQ_REMAP is not set but in the original dmesg you sent, the printk above comes from intel_irq_remapping.c which gets enabled by CONFIG_IRQ_REMAP. So, can you try enabling only CONFIG_IRQ_REMAP and leave CONFIG_X86_X2APIC off to confirm the original observation? No need, as I said above, the printk comes from a different kernel with a different config. With either CONFIG_IRQ_REMAP=false or CONFIG_X86_X2APIC=false the booting ends in a kernel panic. Also, I'd guess your machine can boot with both options off? No, only partly with "noapic", but that makes ahci to fail to detect disks later on. And that was the reason why x2apic got disabled during the "make oldconfig" process... Is this message an indication for missing x2apic? "smpboot: weird, boot (#255) not listed by the BIOS" It's an indication that something is fishy with the apic IDs. Thanks. Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] 3.7 ends in APIC panic
On 12/16/2012 09:39 PM, Borislav Petkov wrote: On Sun, Dec 16, 2012 at 08:46:06PM +0100, Bernd Schubert wrote: Hmm, I read it the other way around - x2apic depends on interrupt remapping, but interrupt remapping can be used without x2apic. Ok, you're right. X2APIC should depend on IRQ_REMAP: https://lwn.net/Articles/289881/ The help text of CONFIG_IRQ_REMAP also says "x2APIC enhancements or to support platforms with CPU's having > 8 bit APIC ID, say Y." I guess may CPU has the latter? I think it is what Yinghai said - you obviously need x2apic kernel support if you have IRQ_REMAP on. Can the kernel panic a bit improved to help user to understand what needs to be enabled? Well, your kernel enables IRQ_REMAP properly: [0.031115] Enabled IRQ remapping in x2apic mode I guess at that stage we could probably check for x2apic support and scream loudly if it is not present... IMHO. Hmm, I think that would the wrong place, as the initial 3.7.0 configuration didn't have IRQ_REMAP enabled. And that was the reason why x2apic got disabled during the "make oldconfig" process... Is this message an indication for missing x2apic? "smpboot: weird, boot (#255) not listed by the BIOS" Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] 3.7 ends in APIC panic
On 12/16/2012 07:07 PM, Yinghai Lu wrote: On Sun, Dec 16, 2012 at 10:01 AM, Bernd Schubert wrote: can you post your .config for v3.7 ? wonder if you have x2apic in .config Which setting is it? Config is attached. your config does not have CONFIG_X86_X2APIC=y set. please enable that. your BIOS pre-enable x2apic somehow, so you must have x2apic enabled in kernel. it x2apic really can not be re-enabled by kernel, kernel would disable x2apic automatically. The system boots fine with x2apic enabled. Thanks a bunch for your help, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] 3.7 ends in APIC panic
On 12/16/2012 08:13 PM, Borislav Petkov wrote: > On Sun, Dec 16, 2012 at 07:28:59PM +0100, Bernd Schubert wrote: >> CONFIG_X86_X2APIC depends on CONFIG_IRQ_REMAP, which I disabled as it >> is marked as experimental... > > You shouldn't pay too much attention to CONFIG_EXPERIMENTAL because it > is on its way out from the kernel tree. I usually don't too much, if I understand what it is about and what are the consequences. > > But if you don't want to have interrupt remapping on your system, > you can disable it nevertheless. Wait, you can't: according to > d0b03bd1c6725a3463290d7f9626e4b583518a5a, you can use x2apic without > interrupt remapping but interrupt remapping needs to be enabled before > x2apic. Hmm, I read it the other way around - x2apic depends on interrupt remapping, but interrupt remapping can be used without x2apic. The help text of CONFIG_IRQ_REMAP also says "x2APIC enhancements or to support platforms with CPU's having > 8 bit APIC ID, say Y." I guess may CPU has the latter? Can the kernel panic a bit improved to help user to understand what needs to be enabled? Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] 3.7 ends in APIC panic
On 12/16/2012 07:07 PM, Yinghai Lu wrote: > On Sun, Dec 16, 2012 at 10:01 AM, Bernd Schubert > wrote: >>> can you post your .config for v3.7 ? >>> >>> wonder if you have x2apic in .config >> >> Which setting is it? Config is attached. > > your config does not have > > CONFIG_X86_X2APIC=y > > set. > > please enable that. > > your BIOS pre-enable x2apic somehow, so you must have x2apic enabled in > kernel. > > it x2apic really can not be re-enabled by kernel, kernel would disable > x2apic automatically. Thanks! CONFIG_X86_X2APIC depends on CONFIG_IRQ_REMAP, which I disabled as it is marked as experimental and as this is my desktop system. So CONFIG_X86_X2APIC also got unset automatically. Will test a new kernel build in the morning. Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] 3.7 ends in APIC panic
On 12/16/2012 12:35 PM, Ingo Molnar wrote: > > * Bernd Schubert wrote: > >> On 12/13/2012 01:16 PM, Bernd Schubert wrote: >>> Hello, >>> >>> I just tried to boot 3.7 and it ends in an APIC panic. I >>> tried to use the recommended "apic=debug", but that does not >>> change anything in the output, at least not in the visible >>> part. The last known kernel to boot was 3.5. If it matters I >>> can try to boot 3.6. >> >> So linux-3.6 also boots. Any idea what is going on or do I >> really need to bisect? > > Yeah, it's hard to tell based on that info alone - would be nice > to send in a log/screen capture of the crash and of course > bisecting would be useful as well, if the crash is > deterministic. I already sent a screen capture in my first mail, but now uploaded two pictures here: http://www.aakef.fastmail.fm/linux/ I think the 2nd one is with apic=debug. Unfortunately the system does not have ipmi, which would allow me to do the bisecting now. It is also my desktop system at work and I don't know yet when (and if at all) I will find the time to do the reboot cycles. Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] 3.7 ends in APIC panic
On 12/13/2012 01:16 PM, Bernd Schubert wrote: Hello, I just tried to boot 3.7 and it ends in an APIC panic. I tried to use the recommended "apic=debug", but that does not change anything in the output, at least not in the visible part. The last known kernel to boot was 3.5. If it matters I can try to boot 3.6. So linux-3.6 also boots. Any idea what is going on or do I really need to bisect? Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
memory allocation: smap large "Size", but unused
Hello, I'm just investigating why a user space program has a rather large VmSize, but small VmRSS size. Looking into /proc/$pid/smaps I notice several areas with an size of about 64MB, but otherwise that area is unused. So far I did not find a way how to reproduce that with malloc() calls. 7ffd34021000-7ffd3800 ---p 00:00 0 Size: 65404 kB Rss: 0 kB Pss: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced:0 kB Anonymous: 0 kB AnonHugePages: 0 kB Swap: 0 kB KernelPageSize:4 kB MMUPageSize: 4 kB Any idea how to do such an allocation from user space? Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH]:dir.c patch
On 08/25/2012 10:37 PM, Christopher Sacchi wrote: > Here is a non-style issue dir.c-patch, and as far as I can see from > the lines of code, the compilation errors weren't about what I put in. > This patch fixes a "break" statement inside an "if" statement, as > obviously not correct. Why should that not be correct? It breaks the while(1) loop? > Here's the patch for the kernel version v3.6.0rc3: > > -- > Signed-off-by: Christopher P. Sacchi > --- dir.c 2012-08-25 15:47:24.260443900 -0400 > +++ dir.c 2012-08-25 16:02:05.458845600 -0400 > @@ -580,7 +580,6 @@ static int ext4_dx_readdir(struct file * > return ret; > if (ret == 0) { > filp->f_pos = ext4_get_htree_eof(filp); > - break; So ext4_htree_fill_tree() did not return more entries and the while(1) loop shall be stopped? Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Performance problems with 3ware 9500S-4LP and 2.6.25-rc3
Hello Andre, On Tuesday 26 February 2008 18:43:14 Andre Noll wrote: > Hi > > we are experiencing massive performance problems with two of our > Linux servers that contain 3ware controllers on a Tyan mainboard and > a couple of 1T disks. > > During the daily cron job that uses rsync to sync a 500G file system > from another machine to the raid on the 3ware controller the load > jumps up, and the machine becomes sluggish as hell. For example, an > ssh login to that machine takes minutes to complete and ldap becomes > unreliable while the rsync job is running. Even Nagios complains > about the machine being down while rsync is running. do you have the write-back cache of the controller enabled for your disks? When you disable this cache, the controller will also disable the disks, cause a write-performance between 3 to 8MB/s per disks. Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: sleep before boot panic
Hello Ingo, On Sunday 06 January 2008, Ingo Oeser wrote: > Hi Bernd, > > On Sunday 06 January 2008, you wrote: > > Index: zd1211rw.git.beno/init/do_mounts.c > > === > > --- zd1211rw.git.beno.orig/init/do_mounts.c 2008-01-06 18:44:23.0 > > +0100 > > +++ zd1211rw.git.beno/init/do_mounts.c 2008-01-06 18:45:44.0 > > +0100 @@ -330,6 +330,7 @@ > > printk("Please append a correct \"root=\" boot option; here are > > the > > available partitions:\n"); > > > > printk_all_partitions(); > > + msleep(60 * 1000); > > ssleep(60); feel free to replace it replace it :) > > > panic("VFS: Unable to mount root fs on %s", b); > > } > > Better would be for this and similiar panic()s > (fatal user/admin errors on boot) to NOT print a stack trace+registers, > since it is useless and actually hides useful information. There is no dump_stack() here, but disc detection is relatively early in boot process and on all these information are already scrolled off screen when the panic is done. For this and any other panic it would be optimal if scrolling still would work, but scrolling also requires kernel code, so I see there's a reason not to this for all panics. However, for this boot problem I tend to say there's no need to panic at all... Btw, not all stack straces are useless, *most* of them are actually very useful. Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
sleep before boot panic
Hi, I just switched to libata (pata) on my laptop and the immediate panic made it impossible to figure out why my boot partition wasn't available. After applying this little patch I could check boot printk output and then saw everything was properly recognized and only scsi-disk support was missing. Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]> Index: zd1211rw.git.beno/init/do_mounts.c === --- zd1211rw.git.beno.orig/init/do_mounts.c 2008-01-06 18:44:23.0 +0100 +++ zd1211rw.git.beno/init/do_mounts.c 2008-01-06 18:45:44.0 +0100 @@ -330,6 +330,7 @@ printk("Please append a correct \"root=\" boot option; here are the available partitions:\n"); printk_all_partitions(); + msleep(60 * 1000); panic("VFS: Unable to mount root fs on %s", b); } Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: everything in wait_for_completion, what is my system doing?
Hello Andrew, thanks for your help! On Friday 07 December 2007 02:09:11 Andrew Morton wrote: > On Wed, 5 Dec 2007 21:44:54 +0100 > > Bernd Schubert <[EMAIL PROTECTED]> wrote: > > after scsi-recovery a system here went into some kind lock-up, everything > > seems to be in wait_for_completion(). Please see the attached > > blocked_states.txt and all_states.txt files. > > This is 2.6.22.12, I can easily find out the line numbers if required. > > > > Any help is highly appreciated. > > Please cc linux-scsi on scsi-related reports. Sorry, I these traces confused me a bit. I had absolutely no idea about a possible reason. > > > [blocked_states.txt text/plain (20.5KB)] > > [generate break] > > [ 1818.566436] SysRq : Show Blocked State > > [ 1818.570260] > > [ 1818.570261] free > > sibling [ 1818.579253] task PCstack pid > > father child younger older [ 1818.586987] events/7 D > > 0155dd642280 026 2 (L-TLB) [ 1818.593747] > > 81012b529ac0 0046 810128280d18 [ > > 1818.601321] 8100ba2376f8 81012b689630 81012aff76b0 > > 00078023e215 [ 1818.608870] 00010003ca14 > > 810001065400 000780430c13 [ 1818.616222] Call Trace: > > [ 1818.618925] [] io_schedule+0x28/0x36 > > [ 1818.624207] [] get_request_wait+0x104/0x158 > > [ 1818.630112] [] blk_get_request+0x36/0x6b > > [ 1818.635755] [] scsi_execute+0x51/0x129 > > [ 1818.641240] [] > > :scsi_transport_spi:spi_execute+0x87/0xf8 [ 1818.648271] > > [] > > :scsi_transport_spi:spi_dv_device_echo_buffer+0x181/0x27d [ 1818.656739] > > [] :scsi_transport_spi:spi_dv_retrain+0x4e/0x240 [ > > 1818.664139] [] > > :scsi_transport_spi:spi_dv_device+0x615/0x69c [ 1818.671542] > > [] :mptspi:mptspi_dv_device+0xb3/0x14b [ 1818.678042] > > [] :mptspi:mptspi_dv_renegotiate_work+0xcb/0xef [ > > 1818.685348] [] run_workqueue+0x8e/0x120 > > [ 1818.690905] [] worker_thread+0x106/0x117 > > [ 1818.696540] [] kthread+0x4b/0x82 > > [ 1818.701474] [] child_rip+0xa/0x12 > > [ 1818.706495] > > [ 1818.708022] unionfs-fuse- D 01a76ef63463 0 1119 1 > > (NOTLB) [ 1818.714764] 810129765988 0082 > > 80337e22 [ 1818.722329] 8101297658c8 > > 81012b652f20 810129eec810 0006 [ 1818.729895] > > 00010005204e 81000105c400 000680337c3e [ > > 1818.737249] Call Trace: > > [ 1818.739953] [] schedule_timeout+0x8a/0xb6 > > [ 1818.745673] [] io_schedule_timeout+0x28/0x36 > > [ 1818.751664] [] congestion_wait+0x9d/0xc2 > > [ 1818.757300] [] > > balance_dirty_pages_ratelimited_nr+0x196/0x22f [ 1818.764781] > > [] generic_file_buffered_write+0x52a/0x60d [ > > 1818.771641] [] > > __generic_file_aio_write_nolock+0x45a/0x491 [ 1818.778852] > > [] generic_file_aio_write+0x61/0xc1 [ 1818.785101] > > [] nfs_file_write+0x138/0x1b7 > > [ 1818.790822] [] do_sync_write+0xcc/0x112 > > [ 1818.796372] [] vfs_write+0xc3/0x165 > > [ 1818.801575] [] sys_pwrite64+0x68/0x96 > > [ 1818.806959] [] system_call+0x7e/0x83 > > [ 1818.812250] [<2b4eeec3ea73>] > > > > [snippage] > > Possibly your device driver had conniptions and stopped generating > completion interrupts. > > Which driver is in use? This is this time easily visible from the traces (mptspi_dv_device) ;) So its the mpt driver, we are using LSI22320 cards (I CC'ed Eric). > > I don't suppose it is repeatable. Thats a clear "yes and no". Exactly this state we have got two or three times during an exhausting hardware stress test over the last weeks (with real and with simulated errors), but its not easily reproducible. Furthermore, the hardware will go into production soon and I don't have the chance to simulate further errors. However, we can easily get a similar state just on a raid6-rebuild (with high end hardware though. (You probably never won't run into into it with normal disks, we are doing software-raid over a bunch of several hardware raid systems). In the raid6-rebuild case the system is not completely locked up, just mostly. Somehow raid6-rebuild is still working, we can see this by the io usage status of the hardware-raids, but the system is completely blocked otherwise. Only pings and sysrq's are working. Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] faster workaround
Hello Tejun, On Tuesday 23 October 2007 10:08:01 Tejun Heo wrote: > Jeff Garzik wrote: > > Alan Cox wrote: > >>> 2) Once we identified, over time, the set of drives affected by this > >>> 3112 quirk (aka drives that didn't fully comply to SATA spec), the > >>> debugging of corruption cases largely shifted to the standard > >>> routine: update the BIOS, replace the > >>> cables/RAM/power/mainboard/slot/etc. to be certain of problem location. > >> > >> Except for the continued series of later SI + Nvidia chipset (mostly) > >> pattern which seems unanswered but also being later chips I assume > >> unrelated to this problem. > > > > The SIL_FLAG_MOD15WRITE flag is set in sil_port_info[] is set according > > to the best info we have from SiI, which indicates that 3114 and 3512 do > > not have the same problem as the 3112. > > I don't think this data corruption problem w/ sil3114 is related to > m15w. m15w workaround slows down things quite a bit and is likely to > hide problems on PCI bus side. There are reports of data corruption > with 3114 on nvidia (most common), via and now amd chipsets. There's > one on intel too but IIRC wasn't too definite. > > According to a user, freebsd didn't have data corruption problem on the > same hardware. I copied PCI FIFO setup code (ours is broken BTW) but it > didn't fix the problem. > > I'll try to reproduce the problem locally and hunt it down. thanks for your help and please tell me, if I can do anything. We have this problem on a production system, but the node in question will be rebooted in Thursday (ups needs to be replaced). If there are some tests/reboots/whatever I could do, it would be best to do it shortly after the scheduled reboot. Actually I now would have attempted to port your mod15 patch (http://home-tj.org/wiki/index.php/Sil_m15w#Patches) to 2.6.23, hoping it would solve Soerens problem and ours as well (ours magically already went away using the mod15 fix). Well, maybe I port it anyway to 2.6.23 to see if it also solves our problem. Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions
On Monday 22 October 2007 12:36:32 Soeren Sonnenburg wrote: > > but as much as it fits onto the disk. On reading back this file, the > > filesystem will report errors somewhere between 50GB and 230GB (disk size > > is 250GB). > > Wow, I really see lots of corruptions (well every 1-2 GB a couple of > bytes are corrupted). Are you getting similiarly many in the 50G - 230G > region? I never tested what is corrupted. Well, a diff over 250GB would take quite a lot of time... -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions
On Monday 22 October 2007 12:36:32 Soeren Sonnenburg wrote: > On Mon, 2007-10-22 at 11:48 +0200, Bernd Schubert wrote: > > Hello, > > > > On Monday 22 October 2007 04:12:44 Tejun Heo wrote: > > > Helo, > > > [...] > > > > > > > Now when I write large files of zeros to root(sda&sdb) and read the > > > > file back in it contains a few nonzero entries: > > > > > > > > # dd if=/dev/zero of=/foo bs=1M count=2000 > > > > # hexdump /foo > > > > 000 > > > > * > > > > 1GB random parts, within large blocks of zeroes> > > > > > > > > I can reliably trigger this on the md0 / devmapper-root setup when I > > > > write about 2GB of data (note that this machine has 1.5G of memory - > > > > and still 1GB is often enough to see this problem). Here it does not > > > > matter where in the filesystem I do these writes. > > > > Thats almost the same test as I'm always doing. Only I do not write only > > 2GB, > > Well when I read your mail I thought that I could be seeing exactly the > same bug... it still may be. However ``my'' problem does not go away > with the mod15fix ... Yeah, pity it did not fix it :( I will try to port Tejuns patch (http://home-tj.org/wiki/index.php/Sil_m15w#Patches) to 2.6.23 today or tomorrow. If you are testing anyway, could you then also try this? > > > but as much as it fits onto the disk. On reading back this file, the > > filesystem will report errors somewhere between 50GB and 230GB (disk size > > is 250GB). > > Wow, I really see lots of corruptions (well every 1-2 GB a couple of > bytes are corrupted). Are you getting similiarly many in the 50G - 230G > region? > > > > Thanks. I'll try to reproduce the problem here. What's your > > > motherboard? > > > > All tested S2882 boards here. > > I assume all equipped with lots of memory and mostly empty pci slots? Yes, all pci-slots are free and the systems to have between 4 and 16GB memory (ecc, monitored with edac). Well, those are cluster systems (actually tyan names those B2882). Do you think the configuration is related? Here it also happens with odirect, we tested this to minimize memory effects. Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions
Hello, On Monday 22 October 2007 04:12:44 Tejun Heo wrote: > Helo, > > Soeren Sonnenburg wrote: > > I finally managed to find a *reproducible* setup and way to trigger > > random corruptions using a sata sil 3114 controller connected to 4 > > seagate drives > > > > port 1: ST3400832AS sda > > port 2: ST3400620AS sdb > > port 3: ST3750640AS sdc > > port 4: ST3750640AS sdd > > > > sda & sdb form md0 via a raid1 setup followed by an additional > > devicemapper layer ( root ). sdc and sdb are separate and also have an > > additional device mapper layer ( public ) and ( backups ). > > > > Now when I write large files of zeros to root(sda&sdb) and read the file > > back in it contains a few nonzero entries: > > > > # dd if=/dev/zero of=/foo bs=1M count=2000 > > # hexdump /foo > > 000 > > * > > 1GB random parts, within large blocks of zeroes> > > > > I can reliably trigger this on the md0 / devmapper-root setup when I > > write about 2GB of data (note that this machine has 1.5G of memory - and > > still 1GB is often enough to see this problem). Here it does not matter > > where in the filesystem I do these writes. Thats almost the same test as I'm always doing. Only I do not write only 2GB, but as much as it fits onto the disk. On reading back this file, the filesystem will report errors somewhere between 50GB and 230GB (disk size is 250GB). > > Thanks. I'll try to reproduce the problem here. What's your motherboard? All tested S2882 boards here. Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] faster workaround
On Friday 12 October 2007 23:08:21 Jeff Garzik wrote: > Bernd Schubert wrote: > > a) 2.6.23 + sil-patch I posted, this is on a customer system (though my > > former group), I wouldn't like to use -mm there. > > > > b) .config is attached > > > > c) attached > > > > d) attached (don't get irritaded by those machine check events, thats > > "GART TLB errorr", harmless warnings, just not disabled in the bios). > > Any chance you could provide dmesg on 2.6.23 without the sil patch? Its attached. Bernd -- Bernd Schubert Q-Leap Networks GmbH [0.00] Linux version 2.6.23-l162 ([EMAIL PROTECTED]) (gcc version 3.4.6 (Ubuntu 3.4.6-5ubuntu1)) #7 SMP Mon Oct 15 11:50:28 CEST 2007 [0.00] Command line: root=/dev/ram0 ramdisk_size=110592 console=tty0 console=ttyS0,115200 [0.00] BIOS-provided physical RAM map: [0.00] BIOS-e820: - 0009f400 (usable) [0.00] BIOS-e820: 0009f400 - 000a (reserved) [0.00] BIOS-e820: 000e - 0010 (reserved) [0.00] BIOS-e820: 0010 - fbff (usable) [0.00] BIOS-e820: fbff - fbfff000 (ACPI data) [0.00] BIOS-e820: fbfff000 - fc00 (ACPI NVS) [0.00] BIOS-e820: ff78 - 0001 (reserved) [0.00] BIOS-e820: 0001 - 0004 (usable) [0.00] Entering add_active_range(0, 0, 159) 0 entries of 3200 used [0.00] Entering add_active_range(0, 256, 1032176) 1 entries of 3200 used [0.00] Entering add_active_range(0, 1048576, 4194304) 2 entries of 3200 used [0.00] end_pfn_map = 4194304 [0.00] DMI 2.3 present. [0.00] ACPI: RSDP 000F6F20, 0014 (r0 ACPIAM) [0.00] ACPI: RSDT FBFF, 0038 (r1 A M I OEMRSDT 7000626 MSFT 97) [0.00] ACPI: FACP FBFF0200, 0081 (r1 A M I OEMFACP 7000626 MSFT 97) [0.00] ACPI: DSDT FBFF0410, 3751 (r1 0 00000 INTL 2002026) [0.00] ACPI: FACS FBFFF000, 0040 [0.00] ACPI: APIC FBFF0380, 0084 (r1 A M I OEMAPIC 7000626 MSFT 97) [0.00] ACPI: OEMB FBFFF040, 0041 (r1 A M I OEMBIOS 7000626 MSFT 97) [0.00] ACPI: SRAT FBFF3B70, 0110 (r1 A M I OEMSRAT 7000626 MSFT 97) [0.00] ACPI: ASF! FBFF3C80, 0086 (r1 AMIASF AMDSTRET1 INTL 2002026) [0.00] SRAT: PXM 0 -> APIC 0 -> Node 0 [0.00] SRAT: PXM 1 -> APIC 1 -> Node 1 [0.00] SRAT: Node 0 PXM 0 10-fc00 [0.00] Entering add_active_range(0, 256, 1032176) 0 entries of 3200 used [0.00] SRAT: Node 1 PXM 1 2-4 [0.00] Entering add_active_range(1, 2097152, 4194304) 1 entries of 3200 used [0.00] SRAT: Node 0 PXM 0 10-2 [0.00] Entering add_active_range(0, 256, 1032176) 2 entries of 3200 used [0.00] Entering add_active_range(0, 1048576, 2097152) 2 entries of 3200 used [0.00] SRAT: Node 0 PXM 0 0-2 [0.00] Entering add_active_range(0, 0, 159) 3 entries of 3200 used [0.00] Entering add_active_range(0, 256, 1032176) 4 entries of 3200 used [0.00] Entering add_active_range(0, 1048576, 2097152) 4 entries of 3200 used [0.00] NUMA: Using 33 for the hash shift. [0.00] Bootmem setup node 0 -0002 [0.00] Bootmem setup node 1 0002-0004 [0.00] Zone PFN ranges: [0.00] DMA 0 -> 4096 [0.00] DMA324096 -> 1048576 [0.00] Normal1048576 -> 4194304 [0.00] Movable zone start PFN for each node [0.00] early_node_map[4] active PFN ranges [0.00] 0:0 -> 159 [0.00] 0: 256 -> 1032176 [0.00] 0: 1048576 -> 2097152 [0.00] 1: 2097152 -> 4194304 [0.00] On node 0 totalpages: 2080655 [0.00] DMA zone: 56 pages used for memmap [0.00] DMA zone: 1451 pages reserved [0.00] DMA zone: 2492 pages, LIFO batch:0 [0.00] DMA32 zone: 14280 pages used for memmap [0.00] DMA32 zone: 1013800 pages, LIFO batch:31 [0.00] Normal zone: 14336 pages used for memmap [0.00] Normal zone: 1034240 pages, LIFO batch:31 [0.00] Movable zone: 0 pages used for memmap [0.00] On node 1 totalpages: 2097152 [0.00] DMA zone: 0 pages used for memmap [0.00] DMA32 zone: 0 pages used for memmap [0.00] Normal zone: 28672 pages used for memmap [0.00] Normal zone: 2068480 pages, LIFO batch:31 [0.00] Movable zone: 0 pages used for memmap [0.00] ACPI: PM-Timer IO Port: 0x1008 [0.00] ACPI: Local APIC address 0xfee0 [0.00] ACPI: LAPIC (acp
Re: [PATCH 3/3] faster workaround
On Thursday 11 October 2007 17:04:45 Jeff Garzik wrote: > Bernd Schubert wrote: > > On Thursday 11 October 2007 16:19:37 Jeff Garzik wrote: > >> 1) Just about the only valid optimization is to ensure that only the > >> write path must be limited to small chunks, not both read- and > >> write-paths. Tejun had a patch to do this a long time ago, but it's an > >> open question whether the large amount of code is worth it for a rare > >> combination. > > > > How large? This patch is rather small? Where can I find it? > > http://home-tj.org/wiki/index.php/Sil_m15w Thanks, I will take a look later on. > > > The problem came up, when 200GB drives were replaced by *newer* 250GB > > drives (well maybe not the newest, no idea were they came from). > > > > Anyway, I'm testing for more than 24h already and didn't observe any data > > corruption as without the patch. I know this is only an obersavation and > > no definite prove... > > Also, this is with 3114, maybe this chip behaves a bit different than > > 3112? > > 3114 + new SATA drive is definitely a new one for us. > > It would help to (a) use the latest kernel, (b) post your .config with > the latest kernel, (c) post lspci booted into latest kernel, and (d) > post dmesg booted into latest kernel. a) 2.6.23 + sil-patch I posted, this is on a customer system (though my former group), I wouldn't like to use -mm there. b) .config is attached c) attached d) attached (don't get irritaded by those machine check events, thats "GART TLB errorr", harmless warnings, just not disabled in the bios). Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH # # Automatically generated make config: don't edit # Linux kernel version: 2.6.23 # Thu Oct 11 13:46:30 2007 # CONFIG_X86_64=y CONFIG_64BIT=y CONFIG_X86=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_ZONE_DMA32=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_CMPXCHG=y CONFIG_EARLY_PRINTK=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_DMI=y CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="-l162" CONFIG_LOCALVERSION_AUTO=y # CONFIG_SWAP is not set CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y # CONFIG_TASKSTATS is not set # CONFIG_USER_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=17 CONFIG_CPUSETS=y CONFIG_SYSFS_DEPRECATED=y CONFIG_RELAY=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLUB_DEBUG=y # CONFIG_SLAB is not set CONFIG_SLUB=y # CONFIG_SLOB is not set CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set CONFIG_MODVERSIONS=y # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_BLK_DEV_BSG is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y # CONFIG_IOSCHED_AS is not set CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set CONFIG_DEFAULT_DEADLINE=y # CONFIG_DEFAULT_CFQ is not set # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="deadline" # # Processor type and features # CONFIG_X86_PC=y # CONFIG_X86_VSMP is not set # CONFIG_MK8 is not set # CONFIG_MPSC is not set # CONFIG_MCORE2 is not set CONFIG_GENERIC_CPU=y CONFIG_X86_L1_CACHE_BYTES=128 CONFIG_X86_L1_CACHE_SHIFT=7 CONFIG_X86_INTERNODE_CACHE_BYTES=128 CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y # CONFIG_MICROCODE is not set # CONFIG_X86_MSR is not set # CONFIG_X86_CPUID is not set CONFIG_X86_HT=y CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y CONFIG_SMP=y CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y CONFIG_PREEMPT_NONE=y # CONFIG_PREEMPT_VOLUNTARY is not set # CONFIG_PREEMPT is not set CONFIG_PREEMPT_BKL=y CONFIG_NUMA=y CONFIG_K8_NUMA=y CONFIG_NODES_SHIFT=6 CONFIG_X86_64_ACPI_NUMA=
Re: [PATCH 3/3] faster workaround
On Thursday 11 October 2007 16:19:37 Jeff Garzik wrote: > 1) Just about the only valid optimization is to ensure that only the > write path must be limited to small chunks, not both read- and > write-paths. Tejun had a patch to do this a long time ago, but it's an > open question whether the large amount of code is worth it for a rare > combination. How large? This patch is rather small? Where can I find it? > > 2) Once we identified, over time, the set of drives affected by this > 3112 quirk (aka drives that didn't fully comply to SATA spec), the > debugging of corruption cases largely shifted to the standard routine: > update the BIOS, replace the cables/RAM/power/mainboard/slot/etc. to be > certain of problem location. Replace this disk or the sata controller maybe, but usually people don't want to replace a big cluster, even if it is already 3 years old, this has to wait at least another 3 years. The problem came up, when 200GB drives were replaced by *newer* 250GB drives (well maybe not the newest, no idea were they came from). Anyway, I'm testing for more than 24h already and didn't observe any data corruption as without the patch. I know this is only an obersavation and no definite prove... Also, this is with 3114, maybe this chip behaves a bit different than 3112? Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] faster workaround
This is based on a patch from Jeff from 2004, but backported to 2.6.23 and furthermore, it will use the 7.5kiB/512B splitoff for blacklisted drives only. Jeff, why did you replace ATA_SHT_USE_CLUSTERING and ATA_DMA_BOUNDARY? drivers/ata/libata-core.c |9 - drivers/ata/sata_sil.c| 58 ++-- include/linux/libata.h|6 +++ 3 files changed, 62 insertions(+), 11 deletions(-) Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]> Index: linux-2.6.23-rc9/drivers/ata/libata-core.c === --- linux-2.6.23-rc9.orig/drivers/ata/libata-core.c 2007-10-02 17:21:12.0 +0200 +++ linux-2.6.23-rc9/drivers/ata/libata-core.c 2007-10-11 10:46:18.0 +0200 @@ -4073,7 +4073,7 @@ void ata_sg_clean(struct ata_queued_cmd * spin_lock_irqsave(host lock) * */ -static void ata_fill_sg(struct ata_queued_cmd *qc) +void ata_fill_sg(struct ata_queued_cmd *qc) { struct ata_port *ap = qc->ap; struct scatterlist *sg; @@ -4217,10 +4217,15 @@ int ata_check_atapi_dma(struct ata_queue */ void ata_qc_prep(struct ata_queued_cmd *qc) { + struct ata_port *ap = qc->ap; + if (!(qc->flags & ATA_QCFLAG_DMAMAP)) return; - ata_fill_sg(qc); + if (ap->ops->fill_sg) + ap->ops->fill_sg(qc); + else + ata_fill_sg(qc); } /** Index: linux-2.6.23-rc9/drivers/ata/sata_sil.c === --- linux-2.6.23-rc9.orig/drivers/ata/sata_sil.c2007-10-11 10:45:08.0 +0200 +++ linux-2.6.23-rc9/drivers/ata/sata_sil.c 2007-10-11 10:57:51.0 +0200 @@ -120,6 +120,7 @@ static int sil_scr_write(struct ata_port static int sil_set_mode (struct ata_port *ap, struct ata_device **r_failed); static void sil_freeze(struct ata_port *ap); static void sil_thaw(struct ata_port *ap); +static void sil_fill_sg(struct ata_queued_cmd *qc); static const struct pci_device_id sil_pci_tbl[] = { @@ -174,12 +175,12 @@ static struct scsi_host_template sil_sht .queuecommand = ata_scsi_queuecmd, .can_queue = ATA_DEF_QUEUE, .this_id= ATA_SHT_THIS_ID, - .sg_tablesize = LIBATA_MAX_PRD, + .sg_tablesize = 120, /* max 15 kiB sectors ? */ .cmd_per_lun= ATA_SHT_CMD_PER_LUN, .emulated = ATA_SHT_EMULATED, - .use_clustering = ATA_SHT_USE_CLUSTERING, + .use_clustering = 1, .proc_name = DRV_NAME, - .dma_boundary = ATA_DMA_BOUNDARY, + .dma_boundary = 0x1fff, .slave_configure= ata_scsi_slave_config, .slave_destroy = ata_scsi_slave_destroy, .bios_param = ata_std_bios_param, @@ -187,6 +188,7 @@ static struct scsi_host_template sil_sht static const struct ata_port_operations sil_ops = { .port_disable = ata_port_disable, + .fill_sg= sil_fill_sg, .dev_config = sil_dev_config, .tf_load= ata_tf_load, .tf_read= ata_tf_read, @@ -278,9 +280,9 @@ MODULE_LICENSE("GPL"); MODULE_DEVICE_TABLE(pci, sil_pci_tbl); MODULE_VERSION(DRV_VERSION); -static int slow_down = 0; -module_param(slow_down, int, 0444); -MODULE_PARM_DESC(slow_down, "Sledgehammer used to work around random problems, by limiting commands to 15 sectors (0=off, 1=on)"); +static int mod15_quirk = 0; +module_param(mod15_quirk, int, 0444); +MODULE_PARM_DESC(mod15_quirk, "Some disks from Seagate need a mod15 workaround."); static unsigned char sil_get_device_cache_line(struct pci_dev *pdev) @@ -534,6 +536,44 @@ static void sil_thaw(struct ata_port *ap writel(tmp, mmio_base + SIL_SYSCFG); } +static void sil_fill_sg(struct ata_queued_cmd *qc) +{ + struct ata_port *ap = qc->ap; + u32 addr, len; + unsigned int idx; + + ata_fill_sg(qc); + + /* check if we need the MOD15 workaround */ + if (!(qc->dev->quirk & SIL_FLAG_MOD15WRITE)) + return; + + if (unlikely(qc->n_elem < 1)) + return; + + /* hardware S/G list may be longer (or shorter) than number of +* PCI-mapped S/G entries (qc->n_elem), due to splitting +* in ata_fill_sg(). Start at zero, and skip to end +* of list, if we're not already there. + */ + idx = 0; + while ((le32_to_cpu(ap->prd[idx].flags_len) & ATA_PRD_EOT) == 0) + idx++; + + /* Errata workaround: if last segment is exactly 8K, split +* into 7.5K and 512b pieces. +*/ + len = le32_to_cpu(ap->prd[idx].flags_len) & 0x; + if (len == 8192) { +
Re: [PATCH 2/3] Re: sil3114 data corruption
This will add the sil3114 back to the controllers with the mod15 bug. Without this patch no workaround for this controller is done and people might/will suffer from data corruption. Also rather trivial, though with a huge effect, the speed for the effected disks will go down from about 45-50MB/s to 20-25MB/s. But better safe than lost data or damaged filesystem. Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]> Index: linux-2.6.23-rc9/drivers/ata/sata_sil.c === --- linux-2.6.23-rc9.orig/drivers/ata/sata_sil.c2007-10-11 10:45:02.0 +0200 +++ linux-2.6.23-rc9/drivers/ata/sata_sil.c 2007-10-11 10:45:08.0 +0200 @@ -241,7 +241,8 @@ static const struct ata_port_info sil_po }, /* sil_3114 */ { - .flags = SIL_DFL_PORT_FLAGS | SIL_FLAG_RERR_ON_DMA_ACT, + .flags = SIL_DFL_PORT_FLAGS | SIL_FLAG_RERR_ON_DMA_ACT + | SIL_FLAG_MOD15WRITE, .pio_mask = 0x1f, /* pio0-4 */ .mwdma_mask = 0x07, /* mwdma0-2 */ .udma_mask = ATA_UDMA5, -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3] Re: sil3114 data corruption
This will add the Seagate ST3250820AS to the mod15 blacklist. I think this is rather trivial and should go into any any release as soon as possible, since there will be data corruption without it for this disk. Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]> Index: linux-2.6.23-rc9/drivers/ata/sata_sil.c === --- linux-2.6.23-rc9.orig/drivers/ata/sata_sil.c2007-10-11 10:44:57.0 +0200 +++ linux-2.6.23-rc9/drivers/ata/sata_sil.c 2007-10-11 10:45:02.0 +0200 @@ -151,6 +151,7 @@ static const struct sil_drivelist { { "ST380011ASL",SIL_QUIRK_MOD15WRITE }, { "ST3120022ASL", SIL_QUIRK_MOD15WRITE }, { "ST3160021ASL", SIL_QUIRK_MOD15WRITE }, + { "ST3250820AS",SIL_QUIRK_MOD15WRITE }, { "Maxtor 4D060H3", SIL_QUIRK_UDMA5MAX }, { } }; -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHES] Re: sil3114 data corruption
On Wednesday 10 October 2007 11:12:20 Bernd Schubert wrote: > On Monday 08 October 2007 17:09:17 Bernd Schubert wrote: > > [sorry for sending twice, but after I read the sil sources, I see the > > mail address had been wrong] > > > > Hi, > > > > somehow the sil3114 causes data corruption with some (newer?) disks. > > Simply filling the filesystem with zeros and reading the these data will > > make the kernel to report filesystem corruption. > > This is definitely not an issue of memory, since the systems (several > > tested) do have ECC memory and the memory is monitored with EDAC. > > > > kernel versions tested: 2.6.15-2.6.20 > > Update: Setting sata_sil.slow_down=1 fill fix the problem, seems there are > some drives missing in the quirk table. > > Jeff, I found an old patch/workaround from you > (http://uwsg.indiana.edu/hypermail/linux/kernel/0403.1/1957.html), can you > give me any further information why this never went into the driver? > I will send 3 mails with patches to fix this corruption. -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 regression: do_nanosleep will not return
On Monday 08 October 2007 16:32:52 Rik van Riel wrote: > On Mon, 08 Oct 2007 15:20:26 +0200 > > Bernd Schubert <[EMAIL PROTECTED]> wrote: > > Bernd Schubert wrote: > > > we have a system here were e.g. "sleep 1" will never finish. This > > > is an issue of 2.6.23, on all older kernel versions it did work > > > fine. > > > > > > Seems to hang in do_nanosleep() > > > > Update: Enabling hpet in the bios and setting clocksource=hpet as > > command line parameter will fix it, but still its not nice that > > something that worked without a problem in 2.6.22 and below suddenly > > doesn't work in 2.6.23. > > Which timer source is in use when the system hangs? Well, not the systems hangs, only processes running nanosleep. Well, since the system is booted diskless, one of the very first commands is to run "/etc/init.d/portmap start", which has a sleep call in its script and so it will halt the boot process. The problematic timer source is acpi_pm. Its also interesting that setting the timer source via /sys/devices/system/clocksource/clocksource0/current_clocksource won't fix that problem. Only the boot option clocksource={other than acpi_pm} does help. Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 regression: do_nanosleep will not return
Bernd Schubert wrote: > Hi, > > we have a system here were e.g. "sleep 1" will never finish. This is an > issue of 2.6.23, on all older kernel versions it did work fine. > > Seems to hang in do_nanosleep() > Update: Enabling hpet in the bios and setting clocksource=hpet as command line parameter will fix it, but still its not nice that something that worked without a problem in 2.6.22 and below suddenly doesn't work in 2.6.23. Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.23 regression: do_nanosleep will not return
Hi, we have a system here were e.g. "sleep 1" will never finish. This is an issue of 2.6.23, on all older kernel versions it did work fine. Seems to hang in do_nanosleep() [ 153.775792] sleep S 0 5372 5341 [ 153.782385] 81007f0a9ea8 0082 8efc [ 153.790635] 81007f0a9e48 802447b4 81007f0c3080 0003 [ 153.798938] 81007f0c39c8 81007f0c37c0 4001d908 [ 153.806991] Call Trace: [ 153.809937] [] do_nanosleep+0x42/0x75 [ 153.815727] [<0001>] [ 153.819383] [ 153.775792] sleep S 0 5372 5341 [ 330.669444] SysRq : Show Pending Timers [ 330.673552] Timer List Version: v0.3 [ 330.677326] HRTIMER_MAX_CLOCK_BASES: 2 [ 330.681282] now at 255011372633 nsecs [ 330.829981] active timers: [ 330.832859] #0: , hrtimer_wakeup, S:01 [ 330.838805] # expires at 260156346358 nsecs [in 5144973725 nsecs] [ 337.046189] now at 261387685432 nsecs [ 337.194966] active timers: [ 337.197834] #0: , hrtimer_wakeup, S:01 [ 337.203793] # expires at 260156346358 nsecs [in 18446744072478212542 nsecs] [ 330.669444] SysRq : Show Pending Timers Any ideas? Thanks, Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2][RESEND] improve generic_file_buffered_write()
No further response to our patches yet, so we are sending them again, re-diffed against 2.6.23-rc5 Hi, recently we discovered writing to a nfs-exported lustre filesystem is rather slow (20-40 MB/s writing, but over 200 MB/s reading). As I already explained on the nfs mailing list, this happens since there is an offset on the very first page due to the nfs header. http://sourceforge.net/mailarchive/forum.php?thread_name=200708312003.30446.bernd-schubert%40gmx.de&forum_name=nfs While this especially effects lustre, Olaf Kirch also noticed it on another filesystem before and wrote a nfs patch for it. This patch has two disadvantages - it requires to move all data within the pages, IMHO rather cpu time consuming, furthermore, it presently causes data corruption when more than one nfs thread is running. After thinking it over and over again we (Goswin and I) believe it would be best to improve generic_file_buffered_write(). If there is sufficient data now, as it is usual for aio writes, generic_file_buffered_write() will now fill each page as much as possible and only then prepare/commit it. Before generic_file_buffered_write() commited chunks of pages even though there were still more data. Some statistics: num_writes = 4669440, bytes_total = 20231249633, segs_total = 5738644, commit_loops = 7697604, commits_total = 6628750 commit_loops is the number commits without the patch and commits_total the number of commits we actually have now. This shows a saving of nearly 14% of prepare, commit, cond_sched calls. < 1: Write size =0, Num segs =0 < 2: Write size =20244, Num segs = 4455583 < 4: Write size = 6722, Num segs = 24 < 8: Write size =19653, Num segs = 213842 < 16: Write size =31778, Num segs =0 < 32: Write size =73395, Num segs =0 < 64: Write size = 148840, Num segs =0 < 128: Write size = 310178, Num segs =0 < 256: Write size =89027, Num segs =0 < 512: Write size = 111903, Num segs =0 <1024: Write size = 140509, Num segs =0 <2048: Write size = 244052, Num segs =0 <4096: Write size = 217164, Num segs =0 <8192: Write size = 2784875, Num segs =0 < 16384: Write size = 433506, Num segs =0 < 32768: Write size =11742, Num segs =0 < 65536: Write size =15783, Num segs =0 < 131072: Write size = 6851, Num segs =0 < 262144: Write size = 1562, Num segs =0 < 524288: Write size = 755, Num segs =0 < 1048576: Write size = 531, Num segs =0 < 2097152: Write size = 272, Num segs =0 < 4194304: Write size = 107, Num segs =0 < 8388608: Write size =0, Num segs =0 Write size shows the number of writes with the total size smaller than denoted in the first column. Num segs shows the number of writes with less segments than denoted in the first column. Most writes (~95%) only have one segment. However, no nfs activity has been done, which is actually the case we made the patches for. size\num1 2 3 4 5 6 7+ < 1: 0 24 0 0 0 0 0 < 2: 20244 0 0 0 0 641526 0 < 4: 67220 0 0 0 0 0 < 8: 19653 0 0 0 0 213842 0 < 16: 31778 0 0 0 0 213856 0 < 32: 73395 0 0 0 0 590 0 < 64: 147730 0 0 0 0 93626 0 < 128: 100888 0 0 0 0 119597 0 < 256: 85588 0 0 0 0 12 0 < 512: 111900 0 0 0 0 3 0 <1024: 140509 0 0 0 0 0 0 <2048: 244052 0 0 0 0 0 0 <4096: 217160 4 0 0 0 0 0 <8192: 2784855 20 0 0 0 0 0 < 16384: 433506 0 0 0 0 0 0 < 32768: 11742 0 0 0 0 0 0 < 65536: 15783 0 0 0 0 0 0 < 131072: 68510 0 0 0 0 0 < 262144: 15620 0 0 0 0 0 < 524288: 755 0 0 0 0 0 0 &l
Re: patch: improve generic_file_buffered_write() (2nd try 2/2)
I guess when aio was introduced this was probably forgotten. For small chunks or synchronous i/o the likehood is correct, but for big data chunks and aio the likehood is false. Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]> Signed-off-by: Goswin von Brederlow <[EMAIL PROTECTED]> Index: linux-2.6.20.3/mm/filemap.c === --- linux-2.6.20.3.orig/mm/filemap.c2007-09-05 18:51:59.0 +0200 +++ linux-2.6.20.3/mm/filemap.c 2007-09-05 18:53:12.0 +0200 @@ -2100,7 +2100,7 @@ /* * handle partial DIO write. Adjust cur_iov if needed. */ - if (likely(nr_segs == 1)) + if (nr_segs == 1) buf = iov->iov_base + written; else { filemap_set_next_iovec(&cur_iov, &iov_base, written); @@ -2167,7 +2167,7 @@ vmtruncate(inode, isize); } } - if (likely(nr_segs == 1)) + if (nr_segs == 1) copied = filemap_copy_from_user(page, offset, buf, bytes); else @@ -2213,7 +2213,7 @@ count -= copied; pos += copied; buf += copied; - if (unlikely(nr_segs > 1)) { + if (nr_segs > 1) { filemap_set_next_iovec(&cur_iov, &iov_base, copied); if (count) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: patch: improve generic_file_buffered_write() (2nd try 1/2)
Hello Randy, thanks for your review. On Wednesday 05 September 2007 17:35:29 Randy Dunlap wrote: > On Wed, 5 Sep 2007 15:45:36 +0200 Bernd Schubert wrote: > > Hi, > > meta-comments: > > filemap.c | 144 > > +- > > 1 file changed, 96 insertions(+), 48 deletions(-) > > Use "diffstat -p 1 -w 70" per Documentation/SubmittingPatches. Thanks, never would have thought this is documented. [...] > Use proper kernel-doc notation, per > Documentation/kernel-doc-nano-HOWTO.txt. Ouch, I really should have read these files before. Now I know why there are so few functions commented. Nobody wants to read the documentation. > > This comment block should be: > > /** > * generic_file_buffered_write - handle an iov > * @iocb: file operations > * @iov: vector of data to write > * @nr_segs: number of iov segments > * @pos: position in the file > * @ppos: position in the file after this function > * @count:number of bytes to write > * @written: offset in iov->base (data to skip on write) > * > * This function will do 3 main tasks for each iov: > * - prepare a write > * - copy the data from iov into a new page > * - commit this page Thanks, done. I also removed the FIXMEs and created a second patch. Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]> Signed-off-by: Goswin von Brederlow <[EMAIL PROTECTED]> mm/filemap.c | 142 + 1 file changed, 96 insertions(+), 46 deletions(-) Index: linux-2.6.20.3/mm/filemap.c === --- linux-2.6.20.3.orig/mm/filemap.c2007-09-05 14:04:18.0 +0200 +++ linux-2.6.20.3/mm/filemap.c 2007-09-05 18:50:26.0 +0200 @@ -2057,6 +2057,21 @@ } EXPORT_SYMBOL(generic_file_direct_write); +/** + * generic_file_buffered_write - handle iov'ectors + * @iob: file operations + * @iov: vector of data to write + * @nr_segs: number of iov segments + * @pos: position in the file + * @ppos: position in the file after this function + * @count: number of bytes to write + * written:offset in iov->base (data to skip on write) + * + * This function will do 3 main tasks for each iov: + * - prepare a write + * - copy the data from iov into a new page + * - commit this page + */ ssize_t generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos, loff_t *ppos, @@ -2074,6 +2089,11 @@ const struct iovec *cur_iov = iov; /* current iovec */ size_t iov_base = 0; /* offset in the current iovec */ char __user *buf; + unsigned long data_start = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */ + loff_t wpos = pos; /* the position in the file we will return */ + + /* position in file as index of pages */ + unsigned long index = pos >> PAGE_CACHE_SHIFT; pagevec_init(&lru_pvec, 0); @@ -2087,9 +2107,15 @@ buf = cur_iov->iov_base + iov_base; } + page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec); + if (!page) { + status = -ENOMEM; + goto out; + } + do { - unsigned long index; unsigned long offset; + unsigned long data_end; /* end of data within the page */ size_t copied; offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */ @@ -2106,6 +2132,8 @@ */ bytes = min(bytes, cur_iov->iov_len - iov_base); + data_end = offset + bytes; + /* * Bring in the user page that we will copy from _first_. * Otherwise there's a nasty deadlock on copying from the @@ -2114,34 +2142,30 @@ */ fault_in_pages_readable(buf, bytes); - page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec); - if (!page) { - status = -ENOMEM; - break; - } - if (unlikely(bytes == 0)) { status = 0; copied = 0; goto zero_length_segment; } - status = a_ops->prepare_write(file, page, offset, offset+bytes); - if (unlikely(status)) { - loff_t isize = i_size_read(inode); - - if (status != AOP_TRUNCATED_PAGE) - unlock_page(page); - page_cache_release(page); - if (status == AOP_TRUNCATED_PAGE) - contin
patch: improve generic_file_buffered_write()
Hi, recently we discovered writing to a nfs-exported lustre filesystem is rather slow (20-40 MB/s writing, but over 200 MB/s reading). As I already explained on the nfs mailing list, this happens since there is an offset on the very first page due to the nfs header. http://sourceforge.net/mailarchive/forum.php?thread_name=200708312003.30446.bernd-schubert%40gmx.de&forum_name=nfs While this especially effects lustre, Olaf Kirch also noticed it on another filesystem before and wrote a nfs patch for it. This patch has two disadvantages - it requires to move all data within the pages, IMHO rather cpu time consuming, furthermore, it presently causes data corruption when more than one nfs thread is running. After thinking it over and over again we (Goswin and I) believe it would be best to improve generic_file_buffered_write(). If there is sufficient data now, as it is usual for aio writes, generic_file_buffered_write() will now fill each page as much as possible and only then prepare/commit it. Before generic_file_buffered_write() commited chunks of pages even though there were still more data. The attached patch still has two FIXMEs, both for likely()/unlikely() conditions which IMHO don't reflect the likelyhood for the new aio data functions. filemap.c | 144 +- 1 file changed, 96 insertions(+), 48 deletions(-) Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]> Signed-off-by: Goswin von Brederlow <[EMAIL PROTECTED]> Cheers, Goswin and Bernd Index: linux-2.6.20.3/mm/filemap.c === --- linux-2.6.20.3.orig/mm/filemap.c2007-09-04 13:43:04.0 +0200 +++ linux-2.6.20.3/mm/filemap.c 2007-09-05 12:39:23.0 +0200 @@ -2057,6 +2057,19 @@ } EXPORT_SYMBOL(generic_file_direct_write); +/** + * This function will do 3 main tasks for each iov: + * - prepare a write + * - copy the data from iov into a new page + * - commit this page + * @iob: file operations + * @iov: vector of data to write + * @nr_segs: number of iov segments + * @pos: position in the file + * @ppos: position in the file after this function + * @count: number of bytes to write + * written:offset in iov->base (data to skip on write) + */ ssize_t generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos, loff_t *ppos, @@ -2074,6 +2087,11 @@ const struct iovec *cur_iov = iov; /* current iovec */ size_t iov_base = 0; /* offset in the current iovec */ char __user *buf; + unsigned long data_start = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */ + loff_t wpos = pos; /* the position in the file we will return */ + + /* position in file as index of pages */ + unsigned long index = pos >> PAGE_CACHE_SHIFT; pagevec_init(&lru_pvec, 0); @@ -2087,9 +2105,15 @@ buf = cur_iov->iov_base + iov_base; } + page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec); + if (!page) { + status = -ENOMEM; + goto out; + } + do { - unsigned long index; unsigned long offset; + unsigned long data_end; /* end of data within the page */ size_t copied; offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */ @@ -2106,6 +2130,8 @@ */ bytes = min(bytes, cur_iov->iov_len - iov_base); + data_end = offset + bytes; + /* * Bring in the user page that we will copy from _first_. * Otherwise there's a nasty deadlock on copying from the @@ -2114,95 +2140,117 @@ */ fault_in_pages_readable(buf, bytes); - page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec); - if (!page) { - status = -ENOMEM; - break; - } - if (unlikely(bytes == 0)) { status = 0; copied = 0; goto zero_length_segment; } - status = a_ops->prepare_write(file, page, offset, offset+bytes); - if (unlikely(status)) { - loff_t isize = i_size_read(inode); - - if (status != AOP_TRUNCATED_PAGE) - unlock_page(page); - page_cache_release(page); - if (status == AOP_TRUNCATED_PAGE) - continue; + if (data_end == PAGE_CACHE_SIZE || count == bytes) { /* -* prepare_write() may have instantiated a few
Re: API changes / 2.6.21 sysctl changes
On Monday 11 June 2007 17:46:27 Alexey Dobriyan wrote: > On Mon, Jun 11, 2007 at 03:13:12PM +0200, Bernd Schubert wrote: > > in 2.6.21 register_sysctl_table(), struct ctl_table and probably > > something else did change. Unfortunately so far I didn't figure out the > > "something else". > > Do you have a problem porting your sysctls to newer kernels? A little bit, yes. Well, I got it working, but I don't understand why I had to do that whay. I'm porting lustre to newer kernel versions and up to 2.6.20 the procfs/sysctl logic was 1.) register_sysctl_table() -> creates /proc/sys/lnet 2.) create_proc_entry() -> add additional files in /proc/sys/lnet With 2.6.21 creating additional entries in /proc/sys/lnet fails and I have to first call "proc_mkdir("sys/lnet", NULL)". I did this proc_mkdir() call even before the register_sysctl_table() call, hoping that its correct. My guess is that register_sysctl_table() doesn't create /proc/sys/lnet anymore, but I have now idea why. Either I did something wrong or its intended. Since I don't like guessing I ask for more documentation. However I think in general, each interface change should be documented. Its just such a waste of time of many people, just because one person doesn't want to spend additional 5 min to write what did change. Thanks - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
API changes / 2.6.21 sysctl changes
Hi, in 2.6.21 register_sysctl_table(), struct ctl_table and probably something else did change. Unfortunately so far I didn't figure out the "something else". Please, if generic interface modifications render all available documentation in the web invalid, is it so hard to also write kernel api documentation then (even if it so far does not exist in the Documentation/ dir)? I mean the time overhead of thousands of coders digging through git commits is huge, just because API changes are not properly documented. E.g.: Documentation/api/sysctl.txt Up to 2.6.20: struct ctl_table_header *register_sysctl_table(ctl_table * table, int insert_at_head); Beginning with 2.6.21-rcX: struct ctl_table_header *register_sysctl_table(ctl_table * table); struct ctl_table: removed entry struct proc_dir_entry *de added entry ctl_table *parent [Maybe also something like] Additionaly to different functions calls, programmers also need to change ... Thanks, Bernd Thanks, Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mkfs.ext2 triggered softlockup
On Wednesday 16 May 2007 18:49:57 Michal Piotrowski wrote: > Hi Bernd, > > On 16/05/07, Bernd Schubert <[EMAIL PROTECTED]> wrote: > > Maybe you still remember my report about an mkfs.ext2 triggered ram disk > > corruption? > > > > http://lkml.org/lkml/2007/5/4/272 > > > > Well, in principle I'm now doing the same stuff, only this time with > > another initrd, which mounts the root-fs over nfs. > > > > [ 1596.928552] BUG: soft lockup detected on CPU#2! > > [ 1596.933109] > > [ 1596.933110] Call Trace: > > [ 1596.933111][] softlockup_tick+0xd8/0xef > > [ 1596.933129] [] run_local_timers+0x13/0x15 > > [ 1596.933132] [] update_process_times+0x4a/0x77 > > [ 1596.933138] [] smp_local_timer_interrupt+0x34/0x54 > > [ 1596.933143] [] smp_apic_timer_interrupt+0x61/0x78 > > [ 1596.933147] [] apic_timer_interrupt+0x6b/0x70 > > [ 1596.933151][] free_buffer_head+0x24/0x3e > > [ 1596.933162] [] kmem_cache_free+0x1f4/0x201 > > [ 1596.933170] [] free_buffer_head+0x24/0x3e > > [ 1596.933175] [] try_to_free_buffers+0x88/0x9f > > [ 1596.933181] [] try_to_release_page+0x39/0x40 > > [ 1596.933188] [] invalidate_mapping_pages+0x9d/0x121 > > [ 1596.933196] [] invalidate_inode_pages+0xf/0x11 > > [ 1596.933200] [] invalidate_bdev+0x3b/0x3f > > [ 1596.933203] [] kill_bdev+0x13/0x29 > > [ 1596.933208] [] __blkdev_put+0x62/0x141 > > [ 1596.933213] [] blkdev_put+0xb/0xd > > [ 1596.933218] [] blkdev_close+0x2e/0x33 > > [ 1596.933222] [] __fput+0xc3/0x172 > > [ 1596.933228] [] fput+0x14/0x16 > > [ 1596.933233] [] filp_close+0x61/0x6d > > [ 1596.933238] [] sys_close+0x8c/0xce > > [ 1596.933244] [] system_call+0x7e/0x83 > > [ 1596.933250] > > Can you tell me which kernel version you are using? Sorry, forgot that. I think 2.6.20.6 or 2.6.20.7 (I always rename them to .3, for some reasons thats easier than to change our tftp-rembo config). The kernel is patches with lustre patches, hmm, one of them also adds a read-only test to the block device layer. Probably I should test a vanilla kernel. Going to do that now... Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
mkfs.ext2 triggered softlockup
Maybe you still remember my report about an mkfs.ext2 triggered ram disk corruption? http://lkml.org/lkml/2007/5/4/272 Well, in principle I'm now doing the same stuff, only this time with another initrd, which mounts the root-fs over nfs. [ 1596.928552] BUG: soft lockup detected on CPU#2! [ 1596.933109] [ 1596.933110] Call Trace: [ 1596.933111][] softlockup_tick+0xd8/0xef [ 1596.933129] [] run_local_timers+0x13/0x15 [ 1596.933132] [] update_process_times+0x4a/0x77 [ 1596.933138] [] smp_local_timer_interrupt+0x34/0x54 [ 1596.933143] [] smp_apic_timer_interrupt+0x61/0x78 [ 1596.933147] [] apic_timer_interrupt+0x6b/0x70 [ 1596.933151][] free_buffer_head+0x24/0x3e [ 1596.933162] [] kmem_cache_free+0x1f4/0x201 [ 1596.933170] [] free_buffer_head+0x24/0x3e [ 1596.933175] [] try_to_free_buffers+0x88/0x9f [ 1596.933181] [] try_to_release_page+0x39/0x40 [ 1596.933188] [] invalidate_mapping_pages+0x9d/0x121 [ 1596.933196] [] invalidate_inode_pages+0xf/0x11 [ 1596.933200] [] invalidate_bdev+0x3b/0x3f [ 1596.933203] [] kill_bdev+0x13/0x29 [ 1596.933208] [] __blkdev_put+0x62/0x141 [ 1596.933213] [] blkdev_put+0xb/0xd [ 1596.933218] [] blkdev_close+0x2e/0x33 [ 1596.933222] [] __fput+0xc3/0x172 [ 1596.933228] [] fput+0x14/0x16 [ 1596.933233] [] filp_close+0x61/0x6d [ 1596.933238] [] sys_close+0x8c/0xce [ 1596.933244] [] system_call+0x7e/0x83 [ 1596.933250] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mkfs.ext2 triggerd RAM corruption
On Sat, May 05, 2007 at 02:57:35PM -0400, Theodore Tso wrote: > On Sat, May 05, 2007 at 03:36:37AM +0200, Bernd Schubert wrote: > > distribution: modified debian sarge, in which aspect is the distribution > > important for this problem? mkfs2.ext2 is supposed to write to /dev/sdaX > > and not /dev/rd/0. Stracing it and grepping for open calls shows that > > only /dev/sdaX is opened in read-write mode. > > /dev/rd/0? What's this? Is this the partition where your root > partition is found? What is it? Is it a ramdisk? Or is it some kind > of persistent storage device? > > If it is a persistant storage device, do the corrupted files stay > corrupted when you reboot? (If it's a ramdisk which you load, then > obviously it's getting reloaded on reboot.) You didn't give enough > information to be sure exactly what's going on. Sorry, should have expressed myself more clearly, /dev/rd/0 is the devfs-style name of the first ram disk device (don't like those devfs names myself, but since I'm rather new in this group I couldn't convice my boss to switch to short names yet ;) ). However, its only the devfs-style of udev and not devfs itself. > > The next thing to ask is how the files are corrupted. Can you see > save a copy of the corrupted files to stable storage, so you can see > *how* they were corrupted. Were large swaths of zeros getting written > into it? Yes, many zeros. Binary files, hexdump and diff are here: http://www.q-leap.com/~bschubert/data-corruption > > Next question; if you don't use these mke2fs parameters, can you > reproduce the corruption? > > mkfs.ext2 -j -b 4096 -F -i 4096 -J size=400 -I 512 /dev/sda4 > > What if you change the it to: > > mkfs.ext2 -j -b 4096 /dev/sda4 > > Do you still see corruption problems? No, no observable corruption. > > > I already tested several partition types, e.g. something like this for a > > test on sda3 > > > > beo-05:~# sfdisk -d /dev/sda > > # partition table of /dev/sda > > unit: sectors > > > > /dev/sda1 : start= 63, size= 4208967, Id=83 > > /dev/sda2 : start= 4209030, size= 4209030, Id=83 > > /dev/sda3 : start= 8418060, size=313251435, Id=83 > > /dev/sda4 : start=0, size=0, Id= 0 > > What if the partition size is smaller; does that make the problem go > away? If so, can you do a binary search on the partition size where > the problem appears? Need to test this thouroughly, but will do it tomorrow, its too late here for this kind of tests. > > And what can you say about the SATA driver you were using; were all of > the machines that you tested this on using the same SATA controller > and same driver? As you can see from my previous reply ;) tested with at least two different controllers - intel and nvidia (will reboot on the 4th system on Monday to figure out its hardware, once the corruption happened, the system tend to stop working). > > Obviously if this were a generic kernel problem, we'd been hearing > about this from a lot more people. So there has to be something > unique to your setup, and we need to figure out what that might happen > to be. I also still have problems to believe its a generic problem... Thanks for your help, Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mkfs.ext2 triggerd RAM corruption
On Sat, May 05, 2007 at 09:12:02PM +0200, Jan Engelhardt wrote: > > On May 5 2007 14:57, Theodore Tso wrote: > >On Sat, May 05, 2007 at 03:36:37AM +0200, Bernd Schubert wrote: > >> distribution: modified debian sarge, in which aspect is the distribution > >> important for this problem? mkfs2.ext2 is supposed to write to /dev/sdaX > >> and not /dev/rd/0. Stracing it and grepping for open calls shows that > >> only /dev/sdaX is opened in read-write mode. > > > >/dev/rd/0? What's this? > > devfs (hint hint) naming for /dev/ram0. Yep, but udev knows devfs style ... (I already told I tested vanilla kernels, so no patches). Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mkfs.ext2 triggerd RAM corruption
Jan-Benedict Glaw wrote: > On Fri, 2007-05-04 16:59:51 +0200, Bernd Schubert <[EMAIL PROTECTED]> > wrote: >> To see whats going on, I copied the entire / (so the initrd) into a >> tmpfs >> root, chrooted into it, also bind mounted the main / into this chroot >> and >> compared several times /bin of chroot/bin and the bind-mounted /bin >> while >> the mkfs.ext2 command was running. >> >> beo-05:/# diff -r /bin /oldroot/bin/ >> beo-05:/# diff -r /bin /oldroot/bin/ >> beo-05:/# diff -r /bin /oldroot/bin/ >> Binary files /bin/sleep and /oldroot/bin/sleep differ >> beo-05:/# diff -r /bin /oldroot/bin/ >> Binary files /bin/bsd-csh and /oldroot/bin/bsd-csh differ >> Binary files /bin/cat and /oldroot/bin/cat differ >> ... >> >> Also tested different schedulers, at least happens with deadline and >> anticipatory. >> >> The corruption does NOT happen on running the mkfs command on >> /dev/sda1, >> but happens with sda2, sda3 and sda3. Also doesn't happen with >> extended >> partitions of sda1. > > Is sda2 the largest filesystem out of sda2, sda3 (and the logical > partitions within the extended sda1, if these get mkfs'ed, too)? I tested it that way: - test on sda1, no further partitions - test on sda2, sda1: ~2MB, everything else for sda2 - test on sda3, sda1: ~2MB, sda2: ~2MB, everything else for sda3 ... test on sda5: sda1: partition that has the extended partition, everything in sda5 > > I'm not too sure that this is a kernel bug, but probably a bad RAM > chip. Did you run memtest86 for a while? ...and can you reproduce this > problem on different machines? Reproducible on 4 test-systems (2 with identical hardware, but then the 2 + 1 + 1 with entirely different hardware combinations) with ECC memory, which is monitored by EDAC. Memory, CPU, etc. are already real life stress tested with several applications, e.g. linpack. Though I don't entirely agree, my colleagues in this group are always telling me, that their real life stress test shows more memory corruptions than memtest. As soon as I have physical access again, I can also do a memtest86 run (would like to do it over the weekend, but don't know how to convince stupid rembo how to boot memtest). Anyway, a memory corruption is more than unlikely on these systems for several reasons. Thanks, Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mkfs.ext2 triggerd RAM corruption
Theodore Tso wrote: > On Fri, May 04, 2007 at 04:59:51PM +0200, Bernd Schubert wrote: >> >> I'm presently rather puzzled, if this is really a kernel bug, its a >> big >> bug. >> >> Summary: The system ramdisk (initrd) gets corrupted while running >> mkfs.ext2 on a local sata disk partition. > > What distribution are you using? What's the hardware configuration, distribution: modified debian sarge, in which aspect is the distribution important for this problem? mkfs2.ext2 is supposed to write to /dev/sdaX and not /dev/rd/0. Stracing it and grepping for open calls shows that only /dev/sdaX is opened in read-write mode. hardware: beo-05 and beo-06: cpu: xeon, acpi shows S3000PTH board, memory 2GB (board too new for EDAC), piix sata controller beo-106: Dual Core AMD Opteron, no idea what kind of board, 4GB memory (k8_edac monitored), nforce sata controller beo-01: Presently can't connect to it, afaik another intel system (all system are running in x86_64 mode) > including amount of memory? What is the partition table look > like for /dev/sda? What filesystems are mounted? If you have any I already tested several partition types, e.g. something like this for a test on sda3 beo-05:~# sfdisk -d /dev/sda # partition table of /dev/sda unit: sectors /dev/sda1 : start= 63, size= 4208967, Id=83 /dev/sda2 : start= 4209030, size= 4209030, Id=83 /dev/sda3 : start= 8418060, size=313251435, Id=83 /dev/sda4 : start=0, size=0, Id= 0 For the tests nothing was mounted. > soft RAID partitions, are any of them using part of /dev/sda? What No raid during the tests on sda, of course. When sdaX was part of a raid testing the raid device, the corruption did NOT happen. > swap partitions are you using? And do any of the swap partitions Swap already entirely disabled. > overlap with /dev/sda? :-) Suspected this first too, but the tested partition was never used as swap partition (first always tested on sda4 and sda2 was used for swap), later I entirely disabled the swap. Thanks, Bernd PS: I took me about 10 hours of testing, before I wrote the first mail. Took me that time to believe that its really a kernel bug. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
mkfs.ext2 triggerd RAM corruption
Hi, I'm presently rather puzzled, if this is really a kernel bug, its a big bug. Summary: The system ramdisk (initrd) gets corrupted while running mkfs.ext2 on a local sata disk partition. Reproduced on kernel versions: vanilla 2.6.16 - 2.6.20 (<2.6.16 doesn't run on any of the systems I can do tests with). Please note: I could reproduce this on serveral systems, all of them use ECC memory and the memory of most of them the memory is monitored using EDAC. Details: 1.) Our systems boot from an initrd, all system services are running from the initrd/ramdisk. 2.) While setting up a lustre meta data storage server, lustre runs mkfs.ext2 -j -b 4096 -F -i 4096 -J size=400 -I 512 /dev/sda4 (Please note, I first observed this while using a lustre patched kernel, but I could reproduce this with vanilla kernels). While this mkfs.ext2 command was running, suddenly running commands such as ps, top, ls, etc. resulted in segmentation faults. To see whats going on, I copied the entire / (so the initrd) into a tmpfs root, chrooted into it, also bind mounted the main / into this chroot and compared several times /bin of chroot/bin and the bind-mounted /bin while the mkfs.ext2 command was running. beo-05:/# diff -r /bin /oldroot/bin/ beo-05:/# diff -r /bin /oldroot/bin/ beo-05:/# diff -r /bin /oldroot/bin/ Binary files /bin/sleep and /oldroot/bin/sleep differ beo-05:/# diff -r /bin /oldroot/bin/ Binary files /bin/bsd-csh and /oldroot/bin/bsd-csh differ Binary files /bin/cat and /oldroot/bin/cat differ ... Also tested different schedulers, at least happens with deadline and anticipatory. The corruption does NOT happen on running the mkfs command on /dev/sda1, but happens with sda2, sda3 and sda3. Also doesn't happen with extended partitions of sda1. Any idea whats going on? Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
aic79xx oops
Hello, this morning our server crashed without any log messages, nothing captured via serial cable and magic sysrqs also didn't work. Anyway during the reboot on oops of the aic79xx module occured, see below . This is a 2.6.11.12 kernel, patched with bluesmoke and the bio-clone fix. Furthermore, the drbd module is loaded. You may find a dmesg, lsmod and lspci information and the kernel config here: http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/aic79xx-oops/ Ooops: (none) login: ACPI: PCI interrupt :02:06.0[A] -> GSI 24 (level, low) -> IRQ 24 ACPI: PCI interrupt :02:06.1[B] -> GSI 25 (level, low) -> IRQ 25 scsi4 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 1.3.11 aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs scsi4:A:0:0: DV failed to configure device. Please file a bug report against this driver. (scsi4:A:0): 160.000MB/s transfers (80.000MHz DT, 16bit) Vendor: transtec Model: T5008 Rev: 0001 Type: Direct-Access ANSI SCSI revision: 03 scsi4:A:0:0: Tagged Queuing enabled. Depth 32 SCSI device sdc: 4101521408 512-byte hdwr sectors (2099979 MB) SCSI device sdc: drive cache: write back SCSI device sdc: 4101521408 512-byte hdwr sectors (2099979 MB) SCSI device sdc: drive cache: write back sdc: sdc1 sdc2 sdc3 < sdc5 sdc6 sdc7 sdc8 > Attached scsi disk sdc at scsi4, channel 0, id 0, lun 0 Attached scsi generic sg2 at scsi4, channel 0, id 0, lun 0, type 0 scsi: host 4 channel 0 id 0 lun 0x0200080c0400 has a LUN larger than currently supported. scsi: host 4 channel 0 id 0 lun1002486961 has a LUN larger than allowed by the host adapter scsi: host 4 channel 0 id 0 lun 0x0100407a27c0 has a LUN larger than currently supported. scsi: host 4 channel 0 id 0 lun 0x007a27c0d05d27c0 has a LUN larger than currently supported. scsi: host 4 channel 0 id 0 lun 0x305e27c0907b27c0 has a LUN larger than currently supported. scsi: host 4 channel 0 id 0 lun 0xf08227c0b08d27c0 has a LUN larger than currently supported. scsi: host 4 channel 0 id 0 lun 0x307827c0008527c0 has a LUN larger than currently supported. scsi: host 4 channel 0 id 0 lun 0xb06727c0 has a LUN larger than currently supported. scsi: host 4 channel 0 id 0 lun 0x306727c0706727c0 has a LUN larger than currently supported. Vendor: transtec Model: T5008 Rev: 0001 Type: Direct-Access ANSI SCSI revision: 03 Unable to handle kernel NULL pointer dereference at virtual address 0403 printing eip: f8a4de8e *pde = Oops: [#1] SMP Modules linked in: aic79xx CPU:0 EIP:0060:[]Not tainted VLI EFLAGS: 00010246 (2.6.11.12-tc2) EIP is at ahd_send_async+0xde/0x2a0 [aic79xx] eax: 000f ebx: 0042 ecx: f7f05d28 edx: esi: 0400 edi: f7f74000 ebp: esp: f7f05c64 ds: 007b es: 007b ss: 0068 Process modprobe (pid: 1081, threadinfo=f7f04000 task=f7ee1540) Stack: c0135aae 0006162f 0282 f7f05c88 0050 0001 c01230a0 0001 f7f05d00 c0107376 f7f05d00 c03fb620 f7f05d0a c038f3a0 2000 f7323536 2000 Call Trace: [] __do_IRQ+0x10e/0x160 [] do_timer+0xc0/0xd0 [] timer_interrupt+0xb6/0x130 [] __do_IRQ+0x10e/0x160 [] ahd_set_tags+0x55/0x70 [aic79xx] [] ahd_linux_device_queue_depth+0xa7/0xd0 [aic79xx] [] ahd_linux_free_target+0x112/0x160 [aic79xx] [] ahd_linux_slave_configure+0x72/0xe0 [aic79xx] [] ahd_linux_slave_configure+0x0/0xe0 [aic79xx] [] scsi_add_lun+0x2aa/0x300 [] scsi_probe_and_add_lun+0xd9/0x220 [] scsi_report_lun_scan+0x330/0x480 [] scsi_probe_and_add_lun+0xf2/0x220 [] scsi_scan_target+0x102/0x130 [] scsi_scan_channel+0x7a/0x90 [] scsi_scan_host_selected+0xb5/0xf0 [] scsi_scan_host+0x2f/0x40 [] ahd_linux_register_host+0x21e/0x270 [aic79xx] [] sysfs_add_file+0x58/0x80 [] sysfs_create_file+0x2e/0x50 [] pci_create_newid_file+0x27/0x30 [] pci_register_driver+0x8e/0x90 [] ahd_linux_detect+0x4c/0x70 [aic79xx] [] ahd_linux_init+0xf/0x13 [aic79xx] [] sys_init_module+0x167/0x200 [] syscall_call+0x7/0xb Code: c7 44 24 20 00 00 00 00 80 fb 42 0f b6 87 41 1d 00 00 8d 50 08 0f 44 c2 8d 54 ed 00 01 d2 03 94 87 6c 18 00 00 8d b2 00 04 00 00 <0f> b6 56 03 3a 56 09 0f 84 2b 01 00 00 a1 28 d1 a6 f8 85 c0 0f Even though there was an oops, the aic79xx module and the card still seem to work, we currently uncertain if we better should reboot again and/or try to use a newer kernel or can leave the system as it is without another reboot. Of course, we would also like to know the reason for the oops. Any help is appreciated. Thanks in advance, Bernd -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body
Re: 2.6.12.3 network slowdown?
On Wednesday 27 July 2005 12:30, Mihai Rusu wrote: > On Wed, Jul 27, 2005 at 01:44:43AM -0700, Howard Chu wrote: > > I just recently compiled the 2.6.12.3 kernel for my x86_64 machine > > (Asus A8V motherboard); was previously running a SuSE-compiled 2.6.8 > > kernel (SuSE 9.2 distro). I'm now seeing extremely slow throughput on > > the onboard Yukon (Marvell) ethernet interface, but only in certain > > conditions. Going back to the 2.6.8 kernel shows no slowdown. > > You might try the other SysKonnect driver as 2.6.12 ships with 2 > different drivers for this family of NICs. > No, AFAIK the rewritten driver is only in 2.6.13-rc or 2.6.12-mm (also already in previous -mm kernel versions). Bernd -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem with kernel 2.6.11
Hello Francois, > > > I have a problem with a program named Gaussian (http://www.gaussian.com) > > > (versions g98 or g03) and FC 4.0 (default kernel 2.6.11): I am used to > > > take > > > Gaussian binaries compiled on the RedHat 9.0 version, and used them on FC > > > 2.0 or FC 3.0. If I try to do so, on FC 4.0. (with the default kernel) > > > Gaussian stops (both g98 and g03 versions) with the following error > > > message: could you please tell me which compiler you used to compile Gaussian? Its rather probably pgf77 (PGI), but the version is also important. If it was 5.2, you just ran into bugs we already experienced some time ago. I also posted a warning about that to the CCL list. On the CCL list I also saw there were problems with PGI-6.0, but I never bothered to test this myself, as our gaussian-binaries compiled with PGI-5.1 seem to work fine. Also, binaries from the PGI compiler are to our experience rather sensible to the glibc version. I'm not absolutely sure whats causing that, but somehow I'm under the impression that the PGI-libraries, which all binaries created with the PGI compiler are linked with, do some odd optimizations. So to make sure that its really a kernel issue you should use the libc of the compiler system (via LD_LIBRARY_PATH) or compile Gaussian statically. > stat64("/home/fyd/0QM_SCR/Gau-3174.inp", 0xbf9db114) = -1 ENOENT (No such file I'm a bit tired now and maybe I'm interpreting it wrong, but I think you should use strace -f ... > rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0 > --- SIGCHLD (Child exited) @ 0 (0) --- Same here. Cheers, Bernd -- Bernd Schubert PCI / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.11.12: oops + panic
Hello, below is an oops that happend on one of our servers last week. This oops was still logged by syslog, but its seems this oops immediately followed another oops which made the kernel to panick. The last oops could not logged anymore by the syslogd and unfortunately capturing using a serial cable was also disabled. But I guess the second oops is only the result of the first oops, so anyone knows were it comes from or how to further debug it? Thanks, Bernd Unable to handle kernel NULL pointer dereference at virtual address 0004 printing eip: c013f958 *pde = Oops: 0002 [#1] SMP Modules linked in: quota_v2 drbd parport_pc lp parport ohci_hcd usbcore i2c_amd756 i2c_amd8111 dm_mod w83627hf eeprom lm85 i2c_sensor i2c_isa i2c_core sk98lin tg3 aic79xx CPU:0 EIP:0060:[free_block+72/208]Not tainted VLI EFLAGS: 00010016 (2.6.11.12) EIP is at free_block+0x48/0xd0 eax: ebx: f027cd80 ecx: cc5be600 edx: 00580a78 esi: c2b93ac0 edi: ebp: c2b93ae8 esp: c2991edc ds: 007b es: 007b ss: 0068 Process events/0 (pid: 6, threadinfo=c299 task=c2962a60) Stack: c0407b60 c011440a c2b93af8 c2a0abd0 c2a0abc0 0002 c2b93ac0 c014010a c2b93ac0 c2a0abd0 0002 c2b93a38 c2b93ac0 0005 c2b93a60 c01401d4 c2b93ac0 c2a0abc0 c2b93a38 c299 c2b93b50 0296 c2814ac0 Call Trace: [finish_task_switch+58/128] finish_task_switch+0x3a/0x80 [drain_array_locked+122/192] drain_array_locked+0x7a/0xc0 [cache_reap+132/496] cache_reap+0x84/0x1f0 [worker_thread+441/592] worker_thread+0x1b9/0x250 [cache_reap+0/496] cache_reap+0x0/0x1f0 [default_wake_function+0/32] default_wake_function+0x0/0x20 [default_wake_function+0/32] default_wake_function+0x0/0x20 [worker_thread+0/592] worker_thread+0x0/0x250 [kthread+183/192] kthread+0xb7/0xc0 [kthread+0/192] kthread+0x0/0xc0 [kernel_thread_helper+5/20] kernel_thread_helper+0x5/0x14 Code: 46 38 8d 6e 28 89 44 24 08 8b 44 24 24 8b 15 50 ea 4e c0 8b 0c b8 8d 81 00 00 00 40 c1 e8 0c c1 e0 05 8b 5c 02 1c 8b 53 04 8b 03 <89> 50 04 89 02 c7 43 04 00 02 20 00 2b 4b 0c c7 03 00 01 10 00 -- Bernd Schubert PCI / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: CDR read problems with 2.6.11?
[...] > > [EMAIL PROTECTED]:~[1009]# mount /mnt/cdrom > mount: wrong fs type, bad option, bad superblock on /dev/cdrom, >missing codepage or other error >In some cases useful info is found in syslog - try >dmesg | tail or so > [...] > The drive is a NEC DVD+RW ND-5100A > > Any suggestions on why I can't read (or burn correctly) the disks with 2.6.11? > I have seen exactly the same on my fathers computer and could solve this by not starting the udftools. Didn't have the time to digg further into this... Can you confirm thats really a udf problem? Just run "/etc/init.d/udftools stop" or the similar for your distribution and try mounting again. Cheers, Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: x86_64: 32bit emulation problems
On Thursday 03 March 2005 10:19, Andi Kleen wrote: > On Wed, Mar 02, 2005 at 08:53:07AM -0800, Trond Myklebust wrote: > > on den 02.03.2005 Klokka 12:33 (+0100) skreiv Bernd Schubert: > > > > I can see no good reason for truncating inode number values on > > > > platforms that actually do support 64-bit inode numbers, but I can > > > > see several > > > > > > Well, at least we would have a reason ;) > > > > A 32-bit emulation mode is clearly a "platform" which does NOT support > > 64-bit inode numbers, however there is (currently) no way for the kernel > > to detect that you are running that. Any extra truncation should > > therefore ideally be done by the emulation layer rather than the kernel > > itself. > > The problem here is that glibc uses stat64() which supports > 64bit inode numbers. But glibc does the overflow checking itself > and generates the EOVERFLOW in user space. Nothing we can do > about that. The 64bit inodes work under 32bit too, so your > code checking for 64bitness is totally bogus. > > The old stat interface doesn't check that case currently either > (will fix that), but that's not the problem here. > > But in general the emulation layer cannot do truncation because > it doesn't know if it is ok to do for the low level file system. > If anything this has to be done in the fs. > So what do you actually suggest? On the one hand you say even 32bit userspace supports 64bit inodes, if it wants. On the other hand you say the truncation needs to be done on file system level. To my mind this is contradicting, the first statement suggests to do the truncation in userspace, the second says it can only be done in the kernel? Cheers, Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: x86_64: 32bit emulation problems
On Wednesday 02 March 2005 17:53, Trond Myklebust wrote: > on den 02.03.2005 Klokka 12:33 (+0100) skreiv Bernd Schubert: > > > I can see no good reason for truncating inode number values on > > > platforms that actually do support 64-bit inode numbers, but I can see > > > several > > > > Well, at least we would have a reason ;) > > A 32-bit emulation mode is clearly a "platform" which does NOT support > 64-bit inode numbers, however there is (currently) no way for the kernel > to detect that you are running that. Any extra truncation should > therefore ideally be done by the emulation layer rather than the kernel > itself. > I already found the function in glibc and it looks as if it would be rather easy to do it there. I only hope the glibc maintainers will accept this kind of fixes (hope they won't say that nobody needs this). Cheers, Bernd PS: Also many thanks for fixing other bugs in the NFS client. Until 2.6.9 init somehow could not open /dev/console on a readonly mountpoint. With 2.6.11 this problem has disappeared, thanks a lot for fixing this and other problems. I never had the time to write a bugreport for that. -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: x86_64: 32bit emulation problems
On Wednesday 02 March 2005 10:13, Trond Myklebust wrote: > on den 02.03.2005 Klokka 09:18 (+0100) skreiv Andi Kleen: > > On Wed, Mar 02, 2005 at 12:46:23AM +0100, Andreas Schwab wrote: > > > Bernd Schubert <[EMAIL PROTECTED]> writes: > > > > Hmm, after compiling with -D_FILE_OFFSET_BITS=64 it works fine. But > > > > why does it work without this option on a 32bit kernel, but not on a > > > > 64bit kernel? > > > > > > See nfs_fileid_to_ino_t for why the inode number is different between > > > 32bit and 64bit kernels. > > > > Ok that explains it. Thanks. Many thanks also from me! > > > > Best would be probably to just do the shift unconditionally on 64bit > > kernels too. > > > > Trond, what do you think? > > Why would this be more appropriate than defining __kernel_ino_t on the > x86_64 platform to be of the size that you actually want the kernel to > support? > > I can see no good reason for truncating inode number values on platforms > that actually do support 64-bit inode numbers, but I can see several Well, at least we would have a reason ;) > reasons why you might want not to (utilities that need to detect hard > linked files for instance). Anyway, glibc already seems to have a condition for that, so IMHO glibc also could truncate the inode numbers if needed. And finally glibc probably knows best if its compiled as 32bit or 64bit. Will take a look into the glibc sources. Many, many thanks to all for their help! Best wishes, Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: x86_64: 32bit emulation problems
On Tuesday 01 March 2005 23:10, Andreas Schwab wrote: > Bernd Schubert <[EMAIL PROTECTED]> writes: > >> It is most likely some kind of user space problem. I would change > >> it to int err = stat(dir, &buf); > >> and then go through it with gdb and see what value err gets assigned. > >> > >> I cannot see any kernel problem. > > > > The err value will become -1 here. > > That's because there are some values in the stat64 buffer delivered by the > kernel which cannot be packed into the stat buffer that you pass to stat. > Use stat64 or _FILE_OFFSET_BITS=64. Hmm, after compiling with -D_FILE_OFFSET_BITS=64 it works fine. But why does it work without this option on a 32bit kernel, but not on a 64bit kernel? 32bit kernel, 32bit binary: always works 64bit kernel, 64bit binary: always works 64bit kernel, 32bit binary: - always works on knfsd mount points - always works with -D_FILE_OFFSET_BITS=64 - only works on unfs3 mount points with _FILE_OFFSET_BITS=64 Do I really have to write a bug report for every single debian package that access /etc and /var to make the maintainers recompile it with -D_FILE_OFFSET_BITS=64? Btw, whats about Suse, are there all packages compiled with this option? ;) Cheers, (a completely confused) Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: x86_64: 32bit emulation problems
> strace didn't say so, and normally it doesn't lie about things like this. Well, I show you the updated source code and strace output and if you still don't believe me, ask me for a login to our system ;) #include #include #include #include #include #include #include int main(int argc, char **argv) { char *dir; struct stat *buf; int err; dir = argv[1]; buf = malloc(sizeof(struct stat)); errno = 0; err = stat(dir, buf); if ( err ) { fprintf(stderr, "err = %i\n", err); fprintf(stderr, "stat for %s failed \n", dir); fprintf(stderr, "ernno: %i (%s)\n", errno, strerror(errno)); } else fprintf(stderr, "stat() works fine.\n"); return (0); } > > > > [EMAIL PROTECTED] tests>./test_stat32 /mnt/test/yp > > > stat for /mnt/test/yp failed > > > ernno: 75 (Value too large for defined data type) > > errno is undefined unless a system call returned -1 before or > you set it to 0 before. See above. > > > > But why does stat64() on a 64-bit kernel tries to fill in larger data > > > than > > A 64bit kernel has no stat64(). All stats are 64bit. [EMAIL PROTECTED] tests>strace32 ./test_stat32 /mnt/test/yp execve("./test_stat32", ["./test_stat32", "/mnt/test/yp"], [/* 43 vars */]) = 0 uname({sys="Linux", node="hitchcock", ...}) = 0 brk(0) = 0x80ad000 brk(0x80ce000) = 0x80ce000 stat64("/mnt/test/yp", {st_mode=S_IFDIR|0755, st_size=2704, ...}) = 0 write(2, "err = -1\n", 9err = -1 ) = 9 write(2, "stat for /mnt/test/yp failed \n", 30stat for /mnt/test/yp failed ) = 30 write(2, "ernno: 75 (Value too large for d"..., 50ernno: 75 (Value too large for defined data type) ) = 50 exit_group(0) = ? You certainly know much better than me, but I think strace shows that its calling stat64. > > > > on a 32-bit kernel and larger data also only for nfs-mount points? Hmm, > > > I will tomorrow compare the tcp-packges sent by the server. > > > > So I still think thats a kernel bug. > > Your data so far doesn't support this assertion. I have to admit that knfsd-mount moints are not affected, but on the other hand, I really cant't see anything in the ethereal captures. If someone should be interested, I have uploaded them: http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/nfs-stat/ Cheers, Bernd -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: x86_64: 32bit emulation problems
Hello Andi, sorry, due to some mail sending/refusing problems, I had to resend to the nfs-list, which prevented the answers there to be posted to the other CCs. > It is most likely some kind of user space problem. I would change > it to int err = stat(dir, &buf); > and then go through it with gdb and see what value err gets assigned. > > I cannot see any kernel problem. The err value will become -1 here. Trond Myklebust already suggested to look at the results of errno: On Tuesday 01 March 2005 00:43, Bernd Schubert wrote: > On Monday 28 February 2005 23:26, you wrote: > > Given that strace shows that both syscalls (stat64() and stat()) > > succeed, I expect the "problem" is probably just glibc setting an > > EOVERFLOW error in the 32-bit case. That's what it is supposed to do if > > a 64 bit value overflows the 32-bit buffers. > > Right, thanks. > > > Have you tried looking at errno? > > [EMAIL PROTECTED] tests>./test_stat32 /mnt/test/yp > stat for /mnt/test/yp failed > ernno: 75 (Value too large for defined data type) > > But why does stat64() on a 64-bit kernel tries to fill in larger data than > on a 32-bit kernel and larger data also only for nfs-mount points? Hmm, I > will tomorrow compare the tcp-packges sent by the server. So I still think thats a kernel bug. Thanks, Bernd -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: x86_64: 32bit emulation problems
> As usual we are using unfs3 for /etc and /var, but for me that looks like a > client problem. I'm even not sure if this is limited to NFS at all. Sorry, that was easy to test, of course. This problem doesn't seem to exist on a local disk. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
x86_64: 32bit emulation problems
Hi, I'm just looking into a very strange problem. Some of our systems have athlon64 CPUs. Due to our diskless nfs environment we currently still prefer a 32bit userspace environment, but would like to be able to use a 64-bit chroot environment. Well, currently there seems to be a stat64() NFS problem when a x86_64 kernel is booted and stat64() comes from a 32bit libc. Here's just an example: hitchcock:/home/bernd/src/tests# ./test_stat64 /mnt/test/yp stat() works fine. hitchcock:/home/bernd/src/tests# ./test_stat32 /mnt/test/yp stat for /mnt/test/yp failed The test program looks rather simple: #include #include #include #include #include #include #include int main(int argc, char **argv) { char *dir; struct stat buf; dir = argv[1]; if (stat (dir, &buf) == -1) fprintf(stderr, "stat for %s failed \n", dir); else fprintf(stderr, "stat() works fine.\n"); return (0); } Here are the strace outputs: = 32bit: -- hitchcock:/home/bernd/src/tests# strace32 ./test_stat32 /mnt/test/yp execve("./test_stat32", ["./test_stat32", "/mnt/test/yp"], [/* 39 vars */]) = 0 uname({sys="Linux", node="hitchcock", ...}) = 0 brk(0) = 0x80ad000 brk(0x80ce000) = 0x80ce000 stat64("/mnt/test/yp", {st_mode=S_IFDIR|0755, st_size=2704, ...}) = 0 write(2, "stat for /mnt/test/yp failed \n", 30stat for /mnt/test/yp failed ) = 30 exit_group(0) = ? 64bit: --- hitchcock:/home/bernd/src/tests# strace ./test_stat64 /mnt/test/yp execve("./test_stat64", ["./test_stat64", "/mnt/test/yp"], [/* 39 vars */]) = 0 uname({sys="Linux", node="hitchcock", ...}) = 0 brk(0) = 0x572000 brk(0x593000) = 0x593000 stat("/mnt/test/yp", {st_mode=S_IFDIR|0755, st_size=2704, ...}) = 0 write(2, "stat() works fine.\n", 19stat() works fine. )= 19 _exit(0)= ? Anyone having an idea whats going on? The ethereal capture also looks pretty normal. The kernel of this system is 2.6.9, but it also happens on another system with 2.6.11-rc5. As usual we are using unfs3 for /etc and /var, but for me that looks like a client problem. I'm even not sure if this is limited to NFS at all. Thanks in advance, Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swapper: page allocation failure. order:1, mode:0x20
Hello Benjamin, On Monday 28 February 2005 16:23, Benjamin L. Shi wrote: > We've seen these, by adding the following tueables resolved the problem. > More specifically, the lower zone protection made the difference. > > vm.vfs_cache_pressure=1000 > vm.lower_zone_protection=100 > vm.max_map_count = 32668 > vm.min_free_kbytes = 1 > many thanks, we will test this now and set those values on all of our 2.6. systems. Thanks a lot again, Bernd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
swapper: page allocation failure. order:1, mode:0x20
Oh no, not this page allocation problems again. In summer I already posted problems with page allocation errors with 2.6.7, but to me it seemed that nobody cared. That time we got those problems every morning during the cron jobs and our main file server always completely crashed. This time its our cluster master system and first happend after an uptime of 89 days, kernel is 2.6.9. Besides of those messages, the system still seems to run stable I really beg for help here, so please please please help me solving this probem. What can I do to solve it? First a (dumb) question, what does 'page allocation failure' really mean? Is it some out of memory case? Thanks a lot in advance for any help, Bernd Feb 28 10:04:45 hitchcock kernel: swapper: page allocation failure. order:1, mode:0x20 Feb 28 10:04:45 hitchcock kernel: Feb 28 10:04:45 hitchcock kernel: Call Trace: {__alloc_pages+878} {__get_free_pages+14} Feb 28 10:04:45 hitchcock kernel:{kmem_getpages+38} {ip_frag_create+26} Feb 28 10:04:45 hitchcock kernel:{cache_grow+190} {cache_alloc_refill+560} Feb 28 10:04:45 hitchcock kernel:{__kmalloc+195} {alloc_skb+64} Feb 28 10:04:45 hitchcock kernel: {tg3_alloc_rx_skb+222} {tg3_rx+371} Feb 28 10:04:45 hitchcock kernel:{tg3_poll+183} {net_rx_action+134} Feb 28 10:04:45 hitchcock kernel:{__do_softirq+123} {do_softirq+50} Feb 28 10:04:45 hitchcock kernel:{do_IRQ+347} {ret_from_intr+0} Feb 28 10:04:45 hitchcock kernel: {default_idle+0} {default_idle+36} Feb 28 10:04:45 hitchcock kernel:{cpu_idle+39} Feb 28 10:05:41 hitchcock rpc.mountd: authenticated unmount request from beo-04:666 for /lib64 (/lib64) Feb 28 10:04:45 hitchcock kernel: swapper: page allocation failure. order:1, mode:0x20 Feb 28 10:07:36 hitchcock kernel: Feb 28 10:07:36 hitchcock kernel: Call Trace: {__alloc_pages+878} {__get_free_pages+14} Feb 28 10:07:36 hitchcock kernel:{kmem_getpages+38} {cache_grow+190} Feb 28 10:07:36 hitchcock kernel: {cache_alloc_refill+560} {__kmalloc+195} Feb 28 10:07:36 hitchcock kernel:{alloc_skb+64} {tg3_alloc_rx_skb+222} Feb 28 10:07:36 hitchcock kernel:{tg3_rx+371} {tg3_poll+183} Feb 28 10:07:36 hitchcock kernel:{net_rx_action+134} {__do_softirq+123} Feb 28 10:07:36 hitchcock kernel:{do_softirq+50} {do_IRQ+347} Feb 28 10:07:36 hitchcock kernel:{ret_from_intr+0} {default_idle+0} Feb 28 10:07:36 hitchcock kernel:{default_idle+36} {cpu_idle+39} Feb 28 10:07:36 hitchcock kernel: Feb 28 10:07:36 hitchcock kernel: swapper: page allocation failure. order:1, mode:0x20 Feb 28 10:07:36 hitchcock kernel: Feb 28 10:07:36 hitchcock kernel: Call Trace: {__alloc_pages+878} {__get_free_pages+14} Feb 28 10:07:36 hitchcock kernel:{kmem_getpages+38} {cache_grow+190} Feb 28 10:07:36 hitchcock kernel: {cache_alloc_refill+560} {__kmalloc+195} Feb 28 10:07:36 hitchcock kernel:{alloc_skb+64} {tg3_alloc_rx_skb+222} Feb 28 10:07:36 hitchcock kernel:{tg3_rx+371} {tg3_poll+183} Feb 28 10:07:36 hitchcock kernel:{net_rx_action+134} {__do_softirq+123} Feb 28 10:07:36 hitchcock kernel:{do_softirq+50} {do_IRQ+347} Feb 28 10:07:36 hitchcock kernel:{ret_from_intr+0} {default_idle+0} Feb 28 10:07:36 hitchcock kernel:{default_idle+36} {cpu_idle+39} -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/