from:"Bernd Schubert"

Re: [PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio

2024-04-22 Thread Bernd Schubert

On 4/22/24 22:06, Michael S. Tsirkin wrote:
> On Tue, Apr 09, 2024 at 09:48:08AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 4/8/2024 3:45 PM, Michael S. Tsirkin wrote:
>>> On Wed, Feb 28, 2024 at 10:41:20PM +0800, Hou Tao wrote:
 From: Hou Tao 

 Hi,

 The patch set aims to fix the warning related to an abnormal size
 parameter of kmalloc() in virtiofs. The warning occurred when attempting
 to insert a 10MB sized kernel module kept in a virtiofs with cache
 disabled. As analyzed in patch #1, the root cause is that the length of
 the read buffer is not limited, and the read buffer is passed directly to
 virtiofs through out_args[0].value. Therefore patch #1 limits the
 length of the read buffer passed to virtiofs by using max_pages. However
 it is not enough, because now the maximal value of max_pages is 256.
 Consequently, when reading a 10MB-sized kernel module, the length of the
 bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will
 try to allocate 2MB from memory subsystem. The request for 2MB of
 physically contiguous memory significantly stresses the memory subsystem
 and may fail indefinitely on hosts with fragmented memory. To address
 this, patch #2~#5 use scattered pages in a bio_vec to replace the
 kmalloc-allocated bounce buffer when the length of the bounce buffer for
 ITER_KVEC dio is larger than PAGE_SIZE. The final issue with the
 allocation of the bounce buffer and sg array in virtiofs is that
 GFP_ATOMIC is used even when the allocation occurs in a kworker context.
 Therefore the last patch uses GFP_NOFS for the allocation of both sg
 array and bounce buffer when initiated by the kworker. For more details,
 please check the individual patches.

 As usual, comments are always welcome.

 Change Log:
>>> Bernd should I just merge the patchset as is?
>>> It seems to fix a real problem and no one has the
>>> time to work on a better fix  WDYT?
>>
>> Sorry for the long delay. I am just starting to prepare v3. In v3, I
>> plan to avoid the unnecessary memory copy between fuse args and bio_vec.
>> Will post it before next week.
> 
> Didn't happen before this week apparently.

Hi Michael,

sorry for my late reply, I have been totally busy for the last few weeks as
well. Also, I can't decide to merge it - I'm not the official fuse
maintainer...
From my point of view, patch 1 is just missing the code that sets the
actual limit; with that it would be a clear and easily back-portable bug fix.
No promises, but I will try it out if I find a bit of time tomorrow.

Bernd

Re: [PATCH v2 1/6] fuse: limit the length of ITER_KVEC dio by max_pages

2024-03-13 Thread Bernd Schubert




On 3/9/24 05:26, Hou Tao wrote:
> Hi,
> 
> On 3/1/2024 9:42 PM, Miklos Szeredi wrote:
>> On Wed, 28 Feb 2024 at 15:40, Hou Tao  wrote:
>>
>>> So instead of limiting both the values of max_read and max_write in
>>> kernel, capping the maximal length of kvec iter IO by using max_pages in
>>> fuse_direct_io() just like it does for ubuf/iovec iter IO. Now the max
>>> value for max_pages is 256, so on host with 4KB page size, the maximal
>>> size passed to kmalloc() in copy_args_to_argbuf() is about 1MB+40B. The
>>> allocation of 2MB of physically contiguous memory will still incur
>>> significant stress on the memory subsystem, but the warning is fixed.
>>> Additionally, the requirement for huge physically contiguous memory will
>>> be removed in the following patch.
>> So the issue will be fixed properly by following patches?
>>
>> In that case this patch could be omitted, right?
> 
> Sorry for the late reply. Being busy with off-site workshop these days.
> 
> No, this patch is still necessary and it is used to limit the number of
> scatterlist used for fuse request and reply in virtio-fs. If the length
> of out_args[0].size is not limited, the number of scatterlist used to
> map the fuse request may be greater than the queue size of virtio-queue
> and the fuse request may hang forever.

I'm currently also totally busy and didn't carefully check, but isn't
there something missing that limits fc->max_write/fc->max_read?


Thanks,
Bernd
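
The sizes discussed in this thread can be checked with a little arithmetic. The following is a minimal userspace sketch, not kernel code. It assumes 4KB pages and a maximum page order of 10; ASSUMED_MAX_ORDER is an illustrative stand-in for the kernel's MAX_ORDER, and order_for_size() and exceeds_max_order() are hypothetical helpers that only mimic the rounding a page allocator performs.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12                    /* assumption: 4KB pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)
#define ASSUMED_MAX_ORDER 10                    /* assumption: stand-in for MAX_ORDER */

/* Smallest order such that (page size << order) covers size bytes. */
static unsigned int order_for_size(size_t size)
{
	size_t pages = (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	unsigned int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}

/* Nonzero if a contiguous allocation of size bytes exceeds the assumed limit. */
static int exceeds_max_order(size_t size)
{
	return order_for_size(size) > ASSUMED_MAX_ORDER;
}
```

Under these assumptions, order_for_size(40 + 256 * 4096) is 9, i.e. 512 pages or 2MB, which matches the 2MB figure quoted above. A 10MB buffer needs order 12 and so exceeds the assumed limit.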

Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-01-10 Thread Bernd Schubert





On 1/10/24 02:16, Hou Tao wrote:

Hi,

On 1/9/2024 9:11 PM, Bernd Schubert wrote:



On 1/3/24 11:59, Hou Tao wrote:

From: Hou Tao 

When trying to insert a 10MB kernel module kept in a virtiofs with cache
disabled, the following warning was reported:

    [ cut here ]
    WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
    Modules linked in:
    CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
    RIP: 0010:__alloc_pages+0x2c4/0x360
    ..
    Call Trace:
     
     ? __warn+0x8f/0x150
     ? __alloc_pages+0x2c4/0x360
     __kmalloc_large_node+0x86/0x160
     __kmalloc+0xcd/0x140
     virtio_fs_enqueue_req+0x240/0x6d0
     virtio_fs_wake_pending_and_unlock+0x7f/0x190
     queue_request_and_unlock+0x58/0x70
     fuse_simple_request+0x18b/0x2e0
     fuse_direct_io+0x58a/0x850
     fuse_file_read_iter+0xdb/0x130
     __kernel_read+0xf3/0x260
     kernel_read+0x45/0x60
     kernel_read_file+0x1ad/0x2b0
     init_module_from_file+0x6a/0xe0
     idempotent_init_module+0x179/0x230
     __x64_sys_finit_module+0x5d/0xb0
     do_syscall_64+0x36/0xb0
     entry_SYSCALL_64_after_hwframe+0x6e/0x76
     ..
     
    ---[ end trace  ]---

The warning happened as follows. In copy_args_to_argbuf(), virtiofs uses
kmalloc-ed memory as bounce buffer for fuse args, but
fuse_get_user_pages() only limits the length of fuse arg by max_read or
max_write for ITER_KVEC io (e.g., kernel_read_file from finit_module()).
For virtiofs, max_read is UINT_MAX, so a big read request which is about



I find this part of the explanation a bit confusing. I guess you
wanted to write something like

fuse_direct_io() -> fuse_get_user_pages() is limited by
fc->max_write/fc->max_read and fc->max_pages. For virtiofs max_pages
does not apply as ITER_KVEC is used. As virtiofs sets fc->max_read to
UINT_MAX basically no limit is applied at all.


Yes, what you said is just as expected but it is not the root cause of
the warning. The culprit of the warning is kmalloc() in
copy_args_to_argbuf() just as said in commit message. vmalloc() is also
not acceptable, because the physical memory needs to be contiguous. For
the problem, because there is no page involved, so there will be extra
sg available, maybe we can use these sg to break the big read/write
request into page.


Hmm ok, I was hoping that contiguous memory is not needed.
I see that ENOMEM is handled, but how does that perform (or even
complete) on a really badly fragmented system? I guess splitting into
smaller pages or at least adding some reserve kmem_cache (or even
mempool) would make sense?




I also wonder if it wouldn't make sense to set a sensible limit in
virtio_fs_ctx_set_defaults() instead of introducing a new variable?


As said in the commit message:

A feasible solution is to limit the value of max_read for virtiofs, so
the length passed to kmalloc() will be limited. However, it will affect
the max read size for ITER_IOVEC io, and the value of max_write also needs
limitation.

It is a bit hard to set a reasonable value for both max_read and
max_write to handle both normal ITER_IOVEC io and ITER_KVEC io. And
considering ITER_KVEC io + dio case is uncommon, I think using a new
limitation is more reasonable.


For ITER_IOVEC max_pages applies - which is limited to 
FUSE_MAX_MAX_PAGES - why can't this be used in virtio_fs_ctx_set_defaults?


@Miklos, is there a reason why there is no upper fc->max_{read,write} 
limit in process_init_reply()? Shouldn't both be limited to

(FUSE_MAX_MAX_PAGES * PAGE_SIZE). Or any other reasonable limit?


Thanks,
Bernd
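
The cap asked about above can be sketched as a simple clamp. This is a userspace illustration only, not a patch to process_init_reply(): cap_rw_limit() is a hypothetical helper, and the constants assume 4KB pages and a FUSE_MAX_MAX_PAGES value of 256.

```c
#include <limits.h>

#define SKETCH_PAGE_SIZE          4096u  /* assumption: 4KB pages */
#define SKETCH_FUSE_MAX_MAX_PAGES 256u   /* assumption: value of FUSE_MAX_MAX_PAGES */

/* Clamp a negotiated max_read/max_write value to max_pages worth of bytes. */
static unsigned int cap_rw_limit(unsigned int negotiated)
{
	const unsigned int limit = SKETCH_FUSE_MAX_MAX_PAGES * SKETCH_PAGE_SIZE;

	return negotiated < limit ? negotiated : limit;
}
```

With such a clamp, a max_read of UINT_MAX would be reduced to 1MB, while smaller negotiated values would pass through unchanged.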





Also, I guess the issue is kmalloc_array() in virtio_fs_enqueue_req?
Wouldn't it make sense to use kvmalloc_array/kvfree in that function?


Thanks,
Bernd



10MB is passed to copy_args_to_argbuf(), kmalloc() is called in turn
with len=10MB, and triggers the warning in __alloc_pages():
WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp).

A feasible solution is to limit the value of max_read for virtiofs, so
the length passed to kmalloc() will be limited. However, it will affect
the max read size for ITER_IOVEC io, and the value of max_write also needs
limitation. So instead of limiting the values of max_read and max_write,
introduce max_nopage_rw to cap both the values of max_read and
max_write when the fuse dio read/write request is initiated from kernel.

Considering that fuse read/write request from kernel is uncommon and to
decrease the demand for large contiguous pages, set max_nopage_rw as
256KB instead of KMALLOC_MAX_SIZE - 4096 or similar.

Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Hou Tao 
---
   fs/fuse/file.c  | 12 +++-
   fs/fuse/fuse_i.h    |  3 +++
   fs/fuse/inode.c |  1 +
   fs/fuse/virtio_fs.c |  6 ++
   4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/

Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-01-09 Thread Bernd Schubert





On 1/3/24 11:59, Hou Tao wrote:

From: Hou Tao 

When trying to insert a 10MB kernel module kept in a virtiofs with cache
disabled, the following warning was reported:

   [ cut here ]
   WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
   Modules linked in:
   CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
   RIP: 0010:__alloc_pages+0x2c4/0x360
   ..
   Call Trace:

? __warn+0x8f/0x150
? __alloc_pages+0x2c4/0x360
__kmalloc_large_node+0x86/0x160
__kmalloc+0xcd/0x140
virtio_fs_enqueue_req+0x240/0x6d0
virtio_fs_wake_pending_and_unlock+0x7f/0x190
queue_request_and_unlock+0x58/0x70
fuse_simple_request+0x18b/0x2e0
fuse_direct_io+0x58a/0x850
fuse_file_read_iter+0xdb/0x130
__kernel_read+0xf3/0x260
kernel_read+0x45/0x60
kernel_read_file+0x1ad/0x2b0
init_module_from_file+0x6a/0xe0
idempotent_init_module+0x179/0x230
__x64_sys_finit_module+0x5d/0xb0
do_syscall_64+0x36/0xb0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
..

   ---[ end trace  ]---

The warning happened as follows. In copy_args_to_argbuf(), virtiofs uses
kmalloc-ed memory as bounce buffer for fuse args, but
fuse_get_user_pages() only limits the length of fuse arg by max_read or
max_write for ITER_KVEC io (e.g., kernel_read_file from finit_module()).
For virtiofs, max_read is UINT_MAX, so a big read request which is about



I find this part of the explanation a bit confusing. I guess you wanted 
to write something like


fuse_direct_io() -> fuse_get_user_pages() is limited by 
fc->max_write/fc->max_read and fc->max_pages. For virtiofs max_pages 
does not apply as ITER_KVEC is used. As virtiofs sets fc->max_read to 
UINT_MAX basically no limit is applied at all.


I also wonder if it wouldn't make sense to set a sensible limit in
virtio_fs_ctx_set_defaults() instead of introducing a new variable?

Also, I guess the issue is kmalloc_array() in virtio_fs_enqueue_req?
Wouldn't it make sense to use kvmalloc_array/kvfree in that function?



Thanks,
Bernd



10MB is passed to copy_args_to_argbuf(), kmalloc() is called in turn
with len=10MB, and triggers the warning in __alloc_pages():
WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp).

A feasible solution is to limit the value of max_read for virtiofs, so
the length passed to kmalloc() will be limited. However, it will affect
the max read size for ITER_IOVEC io, and the value of max_write also needs
limitation. So instead of limiting the values of max_read and max_write,
introduce max_nopage_rw to cap both the values of max_read and
max_write when the fuse dio read/write request is initiated from kernel.

Considering that fuse read/write request from kernel is uncommon and to
decrease the demand for large contiguous pages, set max_nopage_rw as
256KB instead of KMALLOC_MAX_SIZE - 4096 or similar.

Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Hou Tao 
---
  fs/fuse/file.c  | 12 +++-
  fs/fuse/fuse_i.h|  3 +++
  fs/fuse/inode.c |  1 +
  fs/fuse/virtio_fs.c |  6 ++
  4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a660f1f21540..f1beb7c0b782 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1422,6 +1422,16 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 	return ret < 0 ? ret : 0;
 }
 
+static size_t fuse_max_dio_rw_size(const struct fuse_conn *fc,
+				   const struct iov_iter *iter, int write)
+{
+	unsigned int nmax = write ? fc->max_write : fc->max_read;
+
+	if (iov_iter_is_kvec(iter))
+		nmax = min(nmax, fc->max_nopage_rw);
+	return nmax;
+}
+
 ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 		       loff_t *ppos, int flags)
 {
@@ -1432,7 +1442,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 	struct inode *inode = mapping->host;
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fm->fc;
-	size_t nmax = write ? fc->max_write : fc->max_read;
+	size_t nmax = fuse_max_dio_rw_size(fc, iter, write);
 	loff_t pos = *ppos;
 	size_t count = iov_iter_count(iter);
 	pgoff_t idx_from = pos >> PAGE_SHIFT;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 1df83eebda92..fc753cd34211 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -594,6 +594,9 @@ struct fuse_conn {
 	/** Constrain ->max_pages to this value during feature negotiation */
 	unsigned int max_pages_limit;
 
+	/** Maximum read/write size when there is no page in request */
+	unsigned int max_nopage_rw;
+
 	/** Input queue */
 	struct fuse_iqueue iq;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2a6d44f91729..4cbbcb4a4b71 100644
--- a/fs/fuse/inode.c
+++ b/fs/fu
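
The effect of the helper added in fs/fuse/file.c above can be modeled in userspace. This is a sketch under stated assumptions, not the kernel code: struct conn_limits is a hypothetical subset of struct fuse_conn, the is_kvec flag stands in for iov_iter_is_kvec(iter), and the values used in the note below follow the commit message (max_read of UINT_MAX, max_nopage_rw of 256KB).

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical subset of struct fuse_conn, for illustration only. */
struct conn_limits {
	unsigned int max_read;
	unsigned int max_write;
	unsigned int max_nopage_rw;
};

/* Mirrors the selection logic of fuse_max_dio_rw_size() in the patch. */
static size_t max_dio_rw_size(const struct conn_limits *fc, int is_kvec,
			      int write)
{
	unsigned int nmax = write ? fc->max_write : fc->max_read;

	/* Only kernel-initiated (kvec) I/O is capped by max_nopage_rw. */
	if (is_kvec && fc->max_nopage_rw < nmax)
		nmax = fc->max_nopage_rw;
	return nmax;
}
```

With max_read set to UINT_MAX and max_nopage_rw set to 256KB, a kvec read is capped at 256KB per request, while a user-space (iovec) read keeps the max_read limit.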

Re: md: Combine two kmalloc() calls into one in sb_equal()

2016-12-09 Thread Bernd Schubert




On 09.12.2016 22:58, SF Markus Elfring wrote:

Irrelevant, the variable is not used before checking it.


* Will it be more appropriate to attempt another memory allocation only if
  the previous one succeeded already?

* Can it be a bit more efficient to duplicate only the required data
  in a single function call before?


How many memory allocations do you expect to fail?

Re: md: Combine two kmalloc() calls into one in sb_equal()

2016-12-09 Thread Bernd Schubert




On 09.12.2016 20:54, SF Markus Elfring wrote:

So where did you get the idea from that it is not checked immediately?


Is another variable assignment performed so far before the return value
is checked from a previous function call?


Irrelevant, the variable is not used before checking it.

Re: [PATCH] md: Combine two kmalloc() calls into one in sb_equal()

2016-12-09 Thread Bernd Schubert

On 09.12.2016 19:30, SF Markus Elfring wrote:

From: Markus Elfring 
Date: Fri, 9 Dec 2016 19:09:13 +0100

The function "kmalloc" was called in one case by the function "sb_equal"
without checking immediately if it failed.

Err, your patch actually *replaces* the check. So where did you get the 
idea from that it is not checked immediately?

[...]

-   tmp1 = kmalloc(sizeof(*tmp1),GFP_KERNEL);
-   tmp2 = kmalloc(sizeof(*tmp2),GFP_KERNEL);
-
-   if (!tmp1 || !tmp2) {
-   ret = 0;
-   goto abort;
-   }

Is this not immediate?

Bernd
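
The pattern in the removed lines above can be modeled in userspace. This is a minimal sketch, not the md code: equal_via_copies() is a hypothetical helper, and malloc()/free() stand in for kmalloc()/kfree(). It shows that checking both results in one test is safe as long as neither pointer is dereferenced before the check.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: compare two buffers via temporary copies.
 * Returns 1 if equal, 0 if different or if an allocation failed.
 */
static int equal_via_copies(const void *a, const void *b, size_t len)
{
	void *tmp1 = malloc(len);
	void *tmp2 = malloc(len);
	int ret = 0;

	/* Both results are checked before either pointer is used. */
	if (!tmp1 || !tmp2)
		goto abort;

	memcpy(tmp1, a, len);
	memcpy(tmp2, b, len);
	ret = memcmp(tmp1, tmp2, len) == 0;
abort:
	free(tmp1);	/* free(NULL) is a no-op, as is kfree(NULL) */
	free(tmp2);
	return ret;
}
```

The second allocation is attempted even if the first one failed, but that costs nothing in the common case and the cleanup path handles both outcomes.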

Re: [PATCH v4] fuse: Add support for passthrough read/write

2016-01-21 Thread Bernd Schubert



On 01/21/2016 01:16 AM, Nikhilesh Reddy wrote:
> Add support for filesystem passthrough read/write of files
> when enabled in userspace through the option FUSE_PASSTHROUGH.
> 
> There are many FUSE based filesystems that perform checks or
> enforce policy or perform some kind of decision making in certain
> functions like the "open" call but simply act as a "passthrough"
> when performing operations such as read or write.
> 
> When FUSE_PASSTHROUGH is enabled all the reads and writes
> to the fuse mount point go directly to the passthrough filesystem
> i.e a native filesystem that actually hosts the files rather than
> through the fuse daemon. All requests that aren't read/write still
> go through the userspace code.
> 
> This allows for significantly better performance on read and writes.
> The difference in performance between fuse and the native lower
> filesystem is negligible.
> 
> There is also a significant cpu/power savings that is achieved which
> is really important on embedded systems that use fuse for I/O.
> 
> Signed-off-by: Nikhilesh Reddy 

I think it is common style to add a change log between patch set
versions in the patch description.


Bernd

Re: [PATCH] sysctl: Add a feature to drop caches selectively

2014-06-27 Thread Bernd Schubert


On 06/27/2014 04:55 AM, Dave Chinner wrote:

On Thu, Jun 26, 2014 at 02:10:28PM +0200, Bernd Schubert wrote:

On 06/26/2014 01:57 PM, Lukáš Czerner wrote:

On Thu, 26 Jun 2014, Artem Bityutskiy wrote:

On Thu, 2014-06-26 at 12:36 +0200, Bernd Schubert wrote:

On 06/26/2014 08:13 AM, Artem Bityutskiy wrote:

On Thu, 2014-06-26 at 11:06 +1000, Dave Chinner wrote:

Your particular use case can be handled by directing your benchmark
at a filesystem mount point and unmounting the filesystem in between
benchmark runs. There is no need to add kernel functionality for
something that can be so easily achieved by other means, especially
in benchmark environments where *everything* is tightly controlled.


If I was a benchmark writer, I would not be willing to run it as root
to be able to mount/unmount, I would not be willing to require the
customer to create special dedicated partitions for the benchmark,
because this is too user-unfriendly. Or do I make incorrect assumptions?


But why a sysctl then? I also don't see the point of that at all; why
can't the benchmark use posix_fadvise(POSIX_FADV_DONTNEED)?


The latter question was answered - people want a way to drop caches for
a file. They need a method which guarantees that the caches are dropped.
They do not need an advisory method which does not give any guarantees.


I'm not sure a benchmark really needs that so much that
FADV_DONTNEED isn't sufficient.
Personally I would also like to know if FADV_DONTNEED succeeded.
E.g. 'ql-fstest' checks whether the written pattern went to the
block device, and currently it does not know if the data really has been
dropped from the page cache. As it reads files several times this is
not critical; it would only be nice to have, and is not worth
adding a new syscall for.


ql-fstest is not a benchmark, it's a data integrity test. The re-read
verification problem is easily solved by using direct IO to read the
files directly without going through the page cache. Indeed, direct
IO will invalidate cached pages over the range it reads before it
does the read, so the guarantee that you are after - no cached pages
when the read is done - is also fulfilled by the direct IO read...

I really don't understand why people keep trying to make cached IO
behave like uncached IO when we already have uncached IO
interfaces



Firstly, direct IO has an entirely different IO pattern, usually much
simpler than buffered IO through the page cache. Secondly, going through
the page cache ensures that page cache buffering is also tested.
I'm not at all opposed to opening files randomly with direct IO to also
test that path and I'm going to add that soon, but only using direct IO
would limit the use case of ql-fstest.



Bernd

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sysctl: Add a feature to drop caches selectively

2014-06-26 Thread Bernd Schubert


On 06/26/2014 01:57 PM, Lukáš Czerner wrote:

On Thu, 26 Jun 2014, Artem Bityutskiy wrote:

On Thu, 2014-06-26 at 12:36 +0200, Bernd Schubert wrote:

On 06/26/2014 08:13 AM, Artem Bityutskiy wrote:

On Thu, 2014-06-26 at 11:06 +1000, Dave Chinner wrote:

Your particular use case can be handled by directing your benchmark
at a filesystem mount point and unmounting the filesystem in between
benchmark runs. There is no need to add kernel functionality for
something that can be so easily achieved by other means, especially
in benchmark environments where *everything* is tightly controlled.


If I was a benchmark writer, I would not be willing to run it as root
to be able to mount/unmount, I would not be willing to require the
customer to create special dedicated partitions for the benchmark,
because this is too user-unfriendly. Or do I make incorrect assumptions?


But why a sysctl then? I also don't see the point of that at all; why
can't the benchmark use posix_fadvise(POSIX_FADV_DONTNEED)?


The latter question was answered - people want a way to drop caches for
a file. They need a method which guarantees that the caches are dropped.
They do not need an advisory method which does not give any guarantees.


I'm not sure a benchmark really needs that so much that FADV_DONTNEED
isn't sufficient.
Personally I would also like to know if FADV_DONTNEED succeeded. E.g.
'ql-fstest' checks whether the written pattern went to the block device,
and currently it does not know if the data really has been dropped from the
page cache. As it reads files several times this is not critical; it
would only be nice to have, and is not worth adding a new syscall for.



Cheers,
Bernd
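
One way a test tool could find out whether pages really left the page cache is to ask the kernel which pages of a mapping are resident. The following is a minimal sketch assuming Linux, where mincore() on a shared file mapping reports page cache residency; resident_pages() is a hypothetical helper name and is not part of ql-fstest.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many pages of the first len bytes of fd are resident.
 * Returns the count, or -1 on error.
 */
static long resident_pages(int fd, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *vec;
	size_t npages, i;
	long count = 0;
	void *map;

	if (page <= 0 || len == 0)
		return -1;
	npages = (len + (size_t)page - 1) / (size_t)page;

	/* Mapping the file does not by itself read any data in. */
	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return -1;

	vec = malloc(npages);
	if (!vec) {
		munmap(map, len);
		return -1;
	}

	/* Bit 0 of each vec entry is set if that page is resident. */
	if (mincore(map, len, vec) != 0)
		count = -1;
	else
		for (i = 0; i < npages; i++)
			count += vec[i] & 1;

	free(vec);
	munmap(map, len);
	return count;
}
```

A tool could call this after posix_fadvise(POSIX_FADV_DONTNEED) and treat a non-zero result as "not dropped". The answer is only a snapshot, and some kernels restrict what mincore() reports for files the caller cannot write, so it is a hint rather than a guarantee.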

Re: [PATCH] sysctl: Add a feature to drop caches selectively

2014-06-26 Thread Bernd Schubert


On 06/26/2014 08:13 AM, Artem Bityutskiy wrote:

On Thu, 2014-06-26 at 11:06 +1000, Dave Chinner wrote:

Your particular use case can be handled by directing your benchmark
at a filesystem mount point and unmounting the filesystem in between
benchmark runs. There is no need to add kernel functionality for
something that can be so easily achieved by other means, especially
in benchmark environments where *everything* is tightly controlled.


If I was a benchmark writer, I would not be willing to run it as root
to be able to mount/unmount, I would not be willing to require the
customer to create special dedicated partitions for the benchmark,
because this is too user-unfriendly. Or do I make incorrect assumptions?


But why a sysctl then? I also don't see the point of that at all; why
can't the benchmark use posix_fadvise(POSIX_FADV_DONTNEED)?



Cheers,
Bernd



Re: [PATCH 02/11][SCSI]mpt2sas: Added new driver module Parameter disable_eedp to Disable EEDP Support

2014-03-20 Thread Bernd Schubert


u8  scsi_io_cb_idx;
diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c 
b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
index 7f0af4f..d502728 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
@@ -127,6 +127,11 @@ static int disable_discovery = -1;
  module_param(disable_discovery, int, 0);
  MODULE_PARM_DESC(disable_discovery, " disable discovery ");

+/* Enable or disable EEDP support */
+static int disable_eedp;
+module_param(disable_eedp, uint, 0);
+MODULE_PARM_DESC(disable_eedp, " disable EEDP support: (default=0)");


Wouldn't it make sense to explain what EEDP means? Something like

MODULE_PARM_DESC(disable_eedp,
	" disable end-to-end data protection support (DIF): "
	"(default=0)");




kmemleak or crc32_le bug?

2014-02-06 Thread Bernd Schubert

I'm frequently getting 

BUG: unable to handle kernel paging request at 880f87550dc0
IP: [] crc32_le+0x30/0x110

called from kmemleak, see bottom of the message.


schubert@wheezy@fsdevel2 linux-stable>addr2line -e vmlinux -i -a 813016d0
0x813016d0
/home/schubert/src/linux/linux-stable/lib/crc32.c:129
/home/schubert/src/linux/linux-stable/lib/crc32.c:247
/home/schubert/src/linux/linux-stable/lib/crc32.c:265

129: unlikely, refers to "u32 q" in crc32_body

247: crc = crc32_body(crc, p, len, tab);

Also doesn't seem to be very likely.

265:

u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
{
return crc32_le_generic(crc, p, len,
(const u32 (*)[256])crc32table_le, CRCPOLY_LE);
}

Doesn't seem anything could fail here either.

schubert@fsdevel2 linux-stable>addr2line -e vmlinux -i -a 811cdff9
0x811cdff9
/home/schubert/src/linux/linux-stable/mm/kmemleak.c:1350

kmemleak_scan() +1350

list_for_each_entry_rcu(object, &object_list, object_list) {
spin_lock_irqsave(&object->lock, flags);
if (color_white(object) && (object->flags & OBJECT_ALLOCATED)
1350:   && update_checksum(object) && get_object(object)) {


With the "Cannot allocate a kmemleak_object structure" messages, it
somehow looks like object is not properly initialized, but update_checksum()
checks for that. Hmm, I'm not sure about kmemcheck_shadow_lookup(),
especially about

>   if (!virt_addr_valid(address))
>   return NULL;

So is the test 

>   shadow = kmemcheck_shadow_lookup(addr);
>   if (!shadow)
>   return true;

right here? Shouldn't that be 'return false'?



Thanks,
Bernd


kmemleak: Cannot allocate a kmemleak_object structure
kmemleak: Kernel memory leak detector disabled
kmemleak: Cannot allocate a kmemleak_object structure
BUG: unable to handle kernel paging request at 880f87550dc0
IP: [] crc32_le+0x30/0x110
PGD 103f370067 PUD 10350e7067 PMD 10350ac067 PTE 800f87550060
Oops:  [#1] SMP DEBUG_PAGEALLOC
Modules linked in: fhgfs(O) fhgfs_client_opentk(O) parport_pc ppdev lp parport 
uinput nfsd auth_rpcgss dm_mod mlx4_ib ib_umad rdma_ucm rdma_cm ib_addr iw_cm 
ib_uverbs ib_ipoib ib_cm ib_sa ib_mad ib_core iTCO_wdt gpio_ich 
iTCO_vendor_support dcdbas mgag200 snd_pcm snd_page_alloc ttm snd_timer 
drm_kms_helper syscopyarea snd sysfillrect ipmi_si soundcore sysimgblt 
ipmi_msghandler pcspkr sb_edac edac_core joydev shpchp lpc_ich wmi 
acpi_power_meter ipv6 fuse nfsv4 nfsv3 nfs_acl nfs lockd sunrpc fscache sg 
sd_mod crc_t10dif crct10dif_common ahci libahci mlx4_core tg3 mpt2sas hwmon 
raid_class ptp scsi_transport_sas pps_core [last unloaded: fhgfs_client_opentk]
CPU: 24 PID: 230 Comm: kmemleak Tainted: G   O 
3.13.1-dbg-1-gf9a023f #66
Hardware name: Dell Inc. PowerEdge R720/08RW36, BIOS 2.1.3 11/20/2013
task: 8807db75a790 ti: 8807d2f76000 task.ti: 8807d2f76000
RIP: 0010:[]  [] crc32_le+0x30/0x110
RSP: 0018:8807d2f77db0  EFLAGS: 00010046
RAX:  RBX: 880f833cb408 RCX: 0001
RDX: 0046 RSI: 880f87550dc0 RDI: 880f87550dbc
RBP: 8807d2f77db8 R08:  R09: 0001
R10:  R11: 880f87550dbc R12: 0286
R13:  R14: 0104 R15: 0400
FS:  () GS:88081e60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 880f87550dc0 CR3: 00103dc0c000 CR4: 001407e0
Stack:
 880f833cb408 8807d2f77e18 811cdff9 811cdf51
 81a3d984 88070009 880f833cb458 000927c0
 000927c0  811ce5a0 
Call Trace:
 [] kmemleak_scan+0x399/0x590
 [] ? kmemleak_scan+0x2f1/0x590
 [] ? kmemleak_write+0x3b0/0x3b0
 [] kmemleak_scan_thread+0x63/0xd0
 [] kthread+0xf6/0x110
 [] ? kthread_create_on_node+0x250/0x250
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_create_on_node+0x250/0x250
Code: 89 f8 48 89 e5 53 0f 85 cd 00 00 00 49 89 d2 48 c1 ea 03 4c 8d 5e fc 41 
83 e2 07 48 85 d2 0f 84 81 00 00 00 4c 89 df 45 31 c0 90 <8b> 5f 04 48 83 c7 08 
49 83 c0 01 8b 0f 31 c3 89 d8 44 0f b6 cb 
RIP  [] crc32_le+0x30/0x110
 RSP 
CR2: 880f87550dc0
---[ end trace 71bec186f2a04a6f ]---
BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:20
in_atomic(): 1, irqs_disabled(): 1, pid: 230, name: kmemleak
INFO: lockdep is turned off.
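
To illustrate why the fault is unlikely to originate in the CRC code itself, here is a minimal bitwise little-endian CRC32 in userspace C. It is a sketch, not the kernel's lib/crc32.c, which is table-driven; crc32_le_bitwise() is a hypothetical name, and the polynomial value 0xedb88320 is the standard reflected CRC-32 polynomial, assumed here to correspond to CRCPOLY_LE.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CRCPOLY_LE 0xedb88320u	/* assumption: same value as CRCPOLY_LE */

/* Bitwise little-endian CRC32 without pre- or post-inversion. */
static uint32_t crc32_le_bitwise(uint32_t crc, const unsigned char *p,
				 size_t len)
{
	while (len--) {
		/* The only data access is the read of the caller's buffer. */
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? SKETCH_CRCPOLY_LE : 0);
	}
	return crc;
}
```

Apart from its lookup tables, such a routine only reads the len bytes starting at p, so a paging fault inside it suggests that the pointer or length handed in by the caller covers memory that is no longer mapped.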


Re: Subject: [v3.8][v3.11][Regression] [SCSI] sd: Update WRITE SAME heuristics

2013-10-29 Thread Bernd Schubert


Hello Joseph,

On 10/29/2013 08:21 PM, Joseph Salisbury wrote:

Hi Martin,

A bug was opened against the Ubuntu kernel[0].  After a kernel bisect,
it was found that reverting the following commit resolved this bug:

commit 66c28f97120e8a621afd5aa7a31c4b85c547d33d
Author: Martin K. Petersen 
Date:   Thu Jun 6 22:15:55 2013 -0400

 [SCSI] sd: Update WRITE SAME heuristics


The regression was introduced as of v3.11-rc1, but it also made its way
into the stable trees.

I see that you are the author of this patch, so I wanted to run this by
you.  I was thinking of requesting a revert for v3.12, but I wanted to
get your feedback first.



James queued this up for 3.13


http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?id=735e39e680256a13e7be3492acfb4d9721287a42


Maybe we should try to convince James to take it into 3.12?


Cheers,
Bernd

Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert


On 09/30/2013 09:34 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote:

On 09/30/2013 08:02 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:

On 09/30/2013 07:44 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:

It would be nice if there were a way for the file system to get a
hint that the target file is supposed to be a copy of another file. That
way distributed file systems could also create the target-file with the
correct meta-information (same storage targets as in-file has).
Well, if we cannot agree on that, a file system with a custom protocol can
at least detect a range from 0 to SSIZE_MAX and then reset metadata. I'm not
sure if this would work for pNFS, though.


splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.



Sorry I know, definitely outside the scope of splice, but in the context
of offloaded file copies. So the question is, what is the best way to
address/discuss that?


Why does it need to be addressed in the first place?


An offloaded copy is still not efficient if different storage
servers/targets are used by from-file and to-file.


So?


mds1: orig-file
oss1/target1: orig-chunk1

mds1: target-file
ossN/targetN: target-chunk1

clientN: Performs the copy

Ideally, orig-chunk1 and target-chunk1 are on the same server and same
target. Copy offload could then even be done by the underlying fs,
similar to a local splice.
If different ossN servers are used, copies still have to be done over the
network by these storage servers, although the client would only need to
initiate the copy. Still faster, but also not ideal.






What is preventing an application from retrieving and setting this
information using standard libc functions such as fstat()+open(), and
supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
where appropriate?



At a minimum this requires network and metadata overhead. And while I'm
working on FhGFS now, I still wonder what other file systems need to do -
for example Lustre pre-allocates storage-target files on creating a
file, so file layout changes mean even more overhead there.


The problem you are describing is limited to a narrow set of storage
architectures. If copy offload using splice() doesn't make sense for
those architectures, then don't implement it for them.


But it _does_ make sense. The file system just needs a hint that a 
splice copy is going to come up.



You might be able to provide ioctls() to do these special hinted file
creations for those filesystems that need it, but the vast majority
don't, and you shouldn't enforce it on them.


And exactly for that we need a standard - it does not make sense if each 
and every distributed file system implements its own 
ioctl/libattr/libacl interface for that.





Anyway, if we could agree to use libattr or libacl to teach the file
system about the upcoming splice call I would be fine.


libattr and libacl are generic libraries that exist to manipulate xattrs
and acls. They do not need to contain Lustre-specific code.



pNFS, FhGFS, Lustre, Ceph, etc.: should each of them implement its own
interface? And should userspace address each of them differently?


I'm just asking for something like a vfs ioctl SPLICE_META_COPY (sorry, 
didn't find a better name yet), which would take in-file-path and 
out-file-path and allow the file system to create out-file-path with the 
same meta-layout as in-file-path. And it would need some flags, such as 
AUTO (file system decides if it makes sense to do a local copy) and 
FORCE (always try a local copy).
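
To make the proposal a bit more concrete, here is one possible shape for the argument block of such an ioctl. Everything in it is hypothetical: the mail only gives the name SPLICE_META_COPY, the two paths and the AUTO/FORCE flags, so the struct layout, the flag values and the rule that exactly one of the two flags must be set are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values; the mail only names AUTO and FORCE. */
#define SPLICE_META_COPY_AUTO	(1U << 0)	/* fs decides if a local copy makes sense */
#define SPLICE_META_COPY_FORCE	(1U << 1)	/* always try a local copy */

/* Hypothetical argument block for the proposed SPLICE_META_COPY ioctl. */
struct splice_meta_copy_args {
	const char *in_path;	/* existing file whose meta-layout is copied */
	const char *out_path;	/* file to create with the same meta-layout */
	unsigned int flags;	/* assumed: exactly one of AUTO or FORCE */
};

/* Returns 0 if the request is well formed, -EINVAL otherwise. */
static int splice_meta_copy_check(const struct splice_meta_copy_args *args)
{
	const unsigned int known = SPLICE_META_COPY_AUTO | SPLICE_META_COPY_FORCE;

	assert(args != NULL);

	if (!args->in_path || !args->out_path)
		return -EINVAL;
	if (args->flags & ~known)
		return -EINVAL;		/* unknown flag bits */
	if (args->flags != SPLICE_META_COPY_AUTO &&
	    args->flags != SPLICE_META_COPY_FORCE)
		return -EINVAL;		/* none or both set */

	return 0;
}
```

A file system that does not care about layout could accept the request and simply create out_path as usual.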



Thanks,
Bernd
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert


On 09/30/2013 08:02 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:

On 09/30/2013 07:44 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:

It would be nice if there were a way for the file system to get a
hint that the target file is supposed to be a copy of another file. That
way distributed file systems could also create the target file with the
correct meta-information (same storage targets as the in-file has).
Well, if we cannot agree on that, a file system with a custom protocol
can at least detect a splice from 0 to SSIZE_MAX and then reset metadata.
I'm not sure if this would work for pNFS, though.


splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.



Sorry I know, definitely outside the scope of splice, but in the context
of offloaded file copies. So the question is, what is the best way to
address/discuss that?


Why does it need to be addressed in the first place?


An offloaded copy is still not efficient if different storage
servers/targets are used by from-file and to-file.




What is preventing an application from retrieving and setting this
information using standard libc functions such as fstat()+open(), and
supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
where appropriate?



At a minimum this requires network and metadata overhead. And while I'm
working on FhGFS now, I still wonder what other file systems need to do -
for example Lustre pre-allocates storage-target files on creating a
file, so file layout changes mean even more overhead there.
Anyway, if we could agree to use libattr or libacl to teach the file
system about the upcoming splice call I would be fine. Metadata overhead
is probably negligible for large files.





Thanks,
Bernd


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert


On 09/30/2013 07:44 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:

It would be nice if there were a way for the file system to get a
hint that the target file is supposed to be a copy of another file. That
way distributed file systems could also create the target file with the
correct meta-information (same storage targets as the in-file has).
Well, if we cannot agree on that, a file system with a custom protocol
can at least detect a splice from 0 to SSIZE_MAX and then reset metadata.
I'm not sure if this would work for pNFS, though.


splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.



Sorry I know, definitely outside the scope of splice, but in the context 
of offloaded file copies. So the question is, what is the best way to 
address/discuss that?


Thanks,
Bernd

Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert


On 09/30/2013 06:31 PM, Miklos Szeredi wrote:

Here's an example "cp" app using direct splice (and without fallback to
non-splice, which is obviously required unless the kernel is known to support
direct splice).

Untested, but trivial enough...

The important part is, I think, that the app must not assume that the kernel can
complete the request in one go.

Thanks,
Miklos


#define _GNU_SOURCE

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <sys/stat.h>
#include <err.h>

#ifndef SPLICE_F_DIRECT
#define SPLICE_F_DIRECT	(0x10)	/* neither splice fd is a pipe */
#endif

int main(int argc, char *argv[])
{
	struct stat stbuf;
	int in_fd;
	int out_fd;
	int res;
	off_t off = 0;	/* must start at 0 (Bernd's inline correction; originally uninitialized) */

	if (argc != 3)
		errx(1, "usage: %s from to", argv[0]);

	in_fd = open(argv[1], O_RDONLY);
	if (in_fd == -1)
		err(1, "opening %s", argv[1]);

	res = fstat(in_fd, &stbuf);
	if (res == -1)
		err(1, "fstat");

	out_fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, stbuf.st_mode);
	if (out_fd == -1)
		err(1, "opening %s", argv[2]);

	/* The kernel may copy less than requested, so loop until done. */
	do {
		off_t in_off = off, out_off = off;
		ssize_t rres;

		rres = splice(in_fd, &in_off, out_fd, &out_off, SSIZE_MAX,
			      SPLICE_F_DIRECT);
		if (rres == -1)
			err(1, "splice");
		if (rres == 0)
			break;

		off += rres;
	} while (off < stbuf.st_size);

	res = close(in_fd);
	if (res == -1)
		err(1, "close");

	res = fsync(out_fd);
	if (res == -1)
		err(1, "fsync");

	res = close(out_fd);
	if (res == -1)
		err(1, "close");

	return 0;
}
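
The example above deliberately leaves out the fallback to a non-splice copy. A minimal sketch of such a fallback, using plain read()/write(), could look like this. `copy_fallback()` is a hypothetical helper name, and when exactly a caller would switch to it (for instance, which error an unsupporting kernel returns for the proposed flag) is not specified in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Copy everything from in_fd to out_fd with read()/write().
 * Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t copy_fallback(int in_fd, int out_fd)
{
	char buf[65536];
	ssize_t total = 0;

	assert(in_fd >= 0 && out_fd >= 0);

	for (;;) {
		ssize_t nread = read(in_fd, buf, sizeof(buf));

		if (nread == -1) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (nread == 0)
			return total;	/* end of input */

		/* write() may be short too, so loop until the chunk is out. */
		for (ssize_t done = 0; done < nread; ) {
			ssize_t nwritten = write(out_fd, buf + done,
						 (size_t)(nread - done));

			if (nwritten == -1) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			done += nwritten;
		}
		total += nread;
	}
}
```

Like the splice loop, it does not assume that a single call moves all the data.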



It would be nice if there were a way for the file system to get a
hint that the target file is supposed to be a copy of another file. That
way distributed file systems could also create the target file with the
correct meta-information (same storage targets as the in-file has).
Well, if we cannot agree on that, a file system with a custom protocol
can at least detect a splice from 0 to SSIZE_MAX and then reset metadata.
I'm not sure if this would work for pNFS, though.



Bernd




Re: Drivers: scsi: FLUSH timeout

2013-09-24 Thread Bernd Schubert

On 09/24/2013 02:35 PM, KY Srinivasan wrote:

-----Original Message-----
From: Jack Wang [mailto:xjtu...@gmail.com]
Sent: Tuesday, September 24, 2013 5:08 AM
To: KY Srinivasan
Cc: Greg KH; linux-kernel@vger.kernel.org; de...@linuxdriverproject.org;
oher...@suse.com; jbottom...@parallels.com; h...@infradead.org; linux-
s...@vger.kernel.org; Mike Christie
Subject: Re: Drivers: scsi: FLUSH timeout

On 09/21/2013 07:24 AM, KY Srinivasan wrote:

-----Original Message-----
From: Greg KH [mailto:gre...@linuxfoundation.org]
Sent: Friday, September 20, 2013 1:32 PM
To: KY Srinivasan
Cc: linux-kernel@vger.kernel.org; de...@linuxdriverproject.org;
oher...@suse.com; jbottom...@parallels.com; h...@infradead.org; linux-
s...@vger.kernel.org
Subject: Re: Drivers: scsi: FLUSH timeout

On Fri, Sep 20, 2013 at 12:32:27PM -0700, K. Y. Srinivasan wrote:

The SD_FLUSH_TIMEOUT value is currently hardcoded.

Hardcoded where?  Please, more context.

This is defined in scsi/sd.h:

#define SD_FLUSH_TIMEOUT	(60 * HZ)

On our cloud, we sometimes hit this timeout. I was wondering if we
could make this a module parameter. If this is acceptable, I can send
you a patch for this.

A module parameter doesn't make sense for a per-device value, does it?

Currently, the 60 second timeout is applied across devices. Ideally, I want
to be able to control the FLUSH TIMEOUT as we currently do the I/O timeout.
If this is acceptable, I can work on a patch for that as well.

Regards,

K. Y

greg k-h

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hi,

Back in 2010, Mike (cc'ed) tried to add a flush timeout interface, similar
to what you want here; no idea why it was just ignored.
http://www.spinics.net/lists/linux-scsi/msg45017.html

Thanks Jack. Mike, do you know what the concerns were as to why this
patch was not accepted?

See also this discussion:

http://marc.info/?l=linux-scsi&m=127167679221319&w=2

And retries have been added by commit 
c213e1407be6b04b144794399a91472e0ef92aec

Cheers,
Bernd


Re: [PATCH] scsi disk: Use its own buffer for the vpd request

2013-09-01 Thread Bernd Schubert

On 08/31/2013 09:48 PM, Nix wrote:
> On 31 Aug 2013, Greg KH said:
>> On Fri, Aug 30, 2013 at 11:01:56AM +0100, Nix wrote:
>>> On 1 Aug 2013, Bernd Schubert said:
>>>
>>>> Once I noticed that scsi_get_vpd_page() works fine from other function
>>>> calls and that it is not 0x89, but already 0x0 that fails fixing it became
>>>> easy.
>>>>
>>>> Nix, any chance you could verify it also works for you?
>>>
>>> As an aside, this commit does indeed fix the bug I reported, but it
>>> doesn't seem to have gone anywhere, not even into -stable.
>>>
>>> Is it held up somehow?
>>>
>>> (stable has
>>>
>>> commit 0ac10bd036f0f3b8ce7ac2390446eab9531c72eb
>>> Author: Martin K. Petersen 
>>> Date:   Tue Jul 30 22:58:34 2013 -0400
>>>
>>> SCSI: Don't attempt to send extended INQUIRY command if skip_vpd_pages 
>>> is set
>>>
>>> which IIRC was eventually found not to be necessary, because this fix
>>> works fine instead?)
>>>
>>> Possibly I'm misremembering the order of month-old events and Martin's
>>> fix was eventually considered better... in which case, sorry for the noise.
>>
>> Is that other patch even needed anymore, now that Martin's patch is in
>> the tree?
> 
> My understanding is that this patch is rather better, since Martin's
> patch prevents sending of the extended INQUIRY command at all: this one
> just uses a reduced buffer size, but can still issue the command. (But I
> may be misunderstanding everything.)

Hmm, I wonder if 7562523e84ddc742fe1f9db8bd76b01acca89f6b (Linus' tree) /
0ac10bd036f0f3b8ce7ac2390446eab9531c72eb (stable tree) always works. It
tests if sdev->skip_vpd_pages is set, but as far as I can see this only
gets set for Seagate drives via BLIST_SKIP_VPD_PAGES.
So if anything other than a Seagate drive is connected to an Areca
controller with older firmware, it will still fail.


Cheers,
Bernd



Re: [PATCH] scsi disk: Use its own buffer for the vpd request

2013-08-26 Thread Bernd Schubert

Martin,

sorry for my late reply, I entirely lost track of this (customer issues,
vacation, lots of main work, ...).

On 08/02/2013 05:00 AM, Martin K. Petersen wrote:
>>>>>> "Bernd" == Bernd Schubert  writes:
> 
> Bernd,
> 
> Bernd> Once I noticed that scsi_get_vpd_page() works fine from other
> Bernd> function calls and that it is not 0x89, but already 0x0 that
> Bernd> fails fixing it became easy.
> 
> Bernd> Nix, any chance you could verify it also works for you?
> 
> Do we get an appropriate error back when we try to issue WRITE SAME
> 10/16? If so, I'm OK with this fix.
> 
> And thanks for looking into this!
> 


Is testing with sg_write_same sufficient?

With F/W V1.49:

> (squeeze)fslab2:~# lsscsi | grep sda
> [2:0:0:0]    disk    ATA      HDS724040KLSA80  KFAO  /dev/sda

> (squeeze)fslab2:~# strace -f sg_write_same  --10 -v --num=0 --lba=0 /dev/sda

> ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[10]=[41, 00, 00, 00, 00, 00, 00, 
> 00, 00, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=6, 
> flags=0, 
> data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...],
>  status=02, masked_status=01, sb[18]=[70, 00, 05, 00, 00, 00, 00, 0a, 00, 00, 
> 00, 00, 20, 00, 00, 00, 00, 00], host_status=0, driver_status=0x8, resid=0, 
> duration=0, info=0x1}) = 0
> write(2, "Write same:  Fixed format, curre"..., 114Write same:  Fixed format, 
> current;  Sense key: Illegal Request
>  Additional sense: Invalid command operation code
> ) = 114
> write(2, "Write same(10) command not suppo"..., 37Write same(10) command not 
> supported
> ) = 37


> (squeeze)fslab2:~# strace -f  sg_write_same  --16 -v --num=0 --lba=0 /dev/sda

> ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[16]=[93, 00, 00, 00, 00, 00, 00, 
> 00, 00, 00, 00, 00, 00, 00, 00, 00], mx_sb_len=32, iovec_count=0, 
> dxfer_len=512, timeout=6, flags=0, 
> data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...],
>  status=02, masked_status=01, sb[18]=[70, 00, 05, 00, 00, 00, 00, 0a, 00, 00, 
> 00, 00, 24, 00, 00, 00, 00, 00], host_status=0, driver_status=0x8, resid=0, 
> duration=0, info=0x1}) = 0
> write(2, "Write same:  Fixed format, curre"..., 104Write same:  Fixed format, 
> current;  Sense key: Illegal Request
>  Additional sense: Invalid field in cdb
> ) = 104
> write(2, "bad field in Write same(16) cdb,"..., 63bad field in Write same(16) 
> cdb, option probably not supported
> ) = 63



Now with F/W V1.46

> (squeeze)fslab2:~# lsscsi | grep sdk
> [10:0:1:2]   disk    Hitachi  HDS724040KLSA80  R001  /dev/sdk

> (squeeze)fslab2:~# cat /sys/class/scsi_host/host10/host_fw_model 
> ARC-1260


> (squeeze)fslab2:~# strace -f sg_write_same  --10 -v --num=0 --lba=0 /dev/sdk
> execve("/usr/bin/sg_write_same", ["sg_write_same", "--10", "-v", "--num=0", 
> "--lba=0", "/dev/sdk"], [/* 26 vars */]) = 0

> ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[10]=[41, 00, 00, 00, 00, 00, 00, 
> 00, 00, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=6, 
> flags=0, 
> data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...],
>  status=00, masked_status=00, sb[19]=[f0, 00, 05, 00, 00, 00, 00, 0b, 00, 00, 
> 00, 00, 20, 00, 00, 00, 02, 00, 00], host_status=0, driver_status=0x8, 
> resid=0, duration=0, info=0x1}) = 0
> write(2, "Write same:  Fixed format, curre"..., 134Write same:  Fixed format, 
> current;  Sense key: Illegal Request
>  Additional sense: Invalid command operation code
>   Info fld=0x0 [0] 
> ) = 134
> write(2, "Write same(10) command not suppo"..., 37Write same(10) command not 
> supported
> ) = 37

> (squeeze)fslab2:~# strace -f  sg_write_same  --16 -v --num=0 --lba=0 /dev/sdk
> execve("/usr/bin/sg_write_same", ["sg_write_same", "--16", "-v", "--num=0", 
> "--lba=0", "/dev/sdk"], [/* 26 vars */]) = 0

> ioctl(3, SG_IO, {'S', SG_DXFER_TO_DEV, cmd[16]=[93, 00, 00, 00, 00, 00, 00, 
> 00, 00, 00, 00, 00, 00, 00, 00, 00], mx_sb_len=32, iovec_count=0, 
> dxfer_len=512, timeout=6, flags=0, 
> data[512]=["\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...],
>  status=00, masked_status=00, sb[19]=[f0, 00, 05, 00, 00, 00, 00, 0b, 00, 00, 
> 00, 00, 20, 00, 00, 00, 02, 00, 00], host_status=0, driver_status=0x8, 
> resid=0, duration=0, info=0x1}) = 0
> write(2, "Write same:  Fixed format, curre"..., 134Write same:  Fixed format, 
> current;  Sense key: Illegal Request
>  Additional sense: Invalid command operation code
>   Info fld=0x0 [0] 
> ) = 134
> write(2, "Write same(16) command not suppo"..., 37Write same(16) command not 
> supported
> ) = 37
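
The sb[] arrays in the traces above are fixed-format sense data. As a reading aid, the sketch below pulls out the three fields that sg_write_same turns into text: the sense key (low nibble of byte 2), the additional sense code (byte 12) and its qualifier (byte 13). The helper and struct names are made up for this example; only the byte offsets follow the fixed sense data format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sense_info {
	uint8_t sense_key;	/* 0x5 = ILLEGAL REQUEST */
	uint8_t asc;		/* additional sense code */
	uint8_t ascq;		/* additional sense code qualifier */
};

/*
 * Decode fixed-format sense data (response code 0x70 or 0x71).
 * Returns 0 on success, -1 if the buffer is too short or not fixed format.
 */
static int decode_fixed_sense(const uint8_t *sb, size_t len,
			      struct sense_info *out)
{
	assert(sb && out);

	if (len < 14)
		return -1;
	/* Bit 7 of byte 0 is the VALID bit, not part of the response code. */
	if ((sb[0] & 0x7f) != 0x70 && (sb[0] & 0x7f) != 0x71)
		return -1;

	out->sense_key = sb[2] & 0x0f;
	out->asc = sb[12];
	out->ascq = sb[13];
	return 0;
}
```

Applied to the traces, V1.49 yields 5/0x20/0x00 (invalid command operation code) for WRITE SAME(10) and 5/0x24/0x00 (invalid field in CDB) for WRITE SAME(16), while V1.46 yields 5/0x20/0x00 for both.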


Is this sufficient, or do you need something else?


Thanks,
Bernd




Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-08-01 Thread Bernd Schubert


On 08/01/2013 06:04 PM, Nix wrote:

On 1 Aug 2013, Bernd Schubert verbalised:


On 07/30/2013 11:20 PM, Nix wrote:

On 30 Jul 2013, Bernd Schubert told this:


On 07/30/2013 02:56 AM, Nix wrote:

On 30 Jul 2013, Douglas Gilbert outgrape:


Please supply the information that Martin Petersen asked
for.


Did it in private IRC (the advantage of working for the same division of
the same company!)

I didn't realise the original fix was actually implemented to allow
Bernd, with a different Areca controller, to boot... obviously, in that
situation, reversion is wrong, since that would just replace one won't-
boot situation with another.


Unless there is a very simple fix, the commit should be reverted, imho. It
would then be better to remove write-same support from the md-layer.


I'm not using md on that machine, just LVM. Our suspicion is that ext4
is doing a WRITE SAME for some reason.


I didn't check yet for other cases, mkfs.ext4 does WRITE SAME and with
lazy init it also will happen after mounting the file system, while
lazy init is running (inode zeroing).


Well, it'll happen the first few times you mount the fs. If your fs is
years old (as mine are) the inode tables will probably have been
initialized by now!



I'm frequently doing tests with millions of files, and reformatting is
way faster than deleting all these files.


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-08-01 Thread Bernd Schubert


On 07/30/2013 11:20 PM, Nix wrote:

On 30 Jul 2013, Bernd Schubert told this:


On 07/30/2013 02:56 AM, Nix wrote:

On 30 Jul 2013, Douglas Gilbert outgrape:


Please supply the information that Martin Petersen asked
for.


Did it in private IRC (the advantage of working for the same division of
the same company!)

I didn't realise the original fix was actually implemented to allow
Bernd, with a different Areca controller, to boot... obviously, in that
situation, reversion is wrong, since that would just replace one won't-
boot situation with another.


Unless there is a very simple fix, the commit should be reverted, imho. It
would then be better to remove write-same support from the md-layer.


I'm not using md on that machine, just LVM. Our suspicion is that ext4
is doing a WRITE SAME for some reason.



I didn't check yet for other cases, mkfs.ext4 does WRITE SAME and with 
lazy init it also will happen after mounting the file system, while lazy 
init is running (inode zeroing).



Cheers,
Bernd

Re: [PATCH] scsi disk: Use its own buffer for the vpd request

2013-08-01 Thread Bernd Schubert


Whoops, the title is wrong, it should have been:

[PATCH] scsi disk: Limit get_vpd_page buf size

On 08/01/2013 04:34 PM, Bernd Schubert wrote:

Once I noticed that scsi_get_vpd_page() works fine from other function
calls and that it is not 0x89, but already 0x0 that fails fixing it became
easy.

Nix, any chance you could verify it also works for you?


From: Bernd Schubert 

Somehow older areca firmware versions have issues with
scsi_get_vpd_page() and a large buffer.
Even scsi_get_vpd_page(, page=0,)  failed in sd_read_write_same(),
while a similar request from sd_read_block_limits() worked fine.
Limiting the buf-size to 64-bytes fixes the issue with F/W V1.46.

Fixes a regression with areca controllers and older firmware versions
introduced by commit: 66c28f97120e8a621afd5aa7a31c4b85c547d33d

Reported-by: Nix 
Signed-off-by: Bernd Schubert 
CC: sta...@vger.kernel.org
---
  drivers/scsi/sd.c |    5 ++++-
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 80f39b8..02e50ae 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2651,13 +2651,16 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
 	struct scsi_device *sdev = sdkp->device;
 
 	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) {
+		/* too large values might cause issues with arcmsr */
+		int vpd_buf_len = 64;
+
 		sdev->no_report_opcodes = 1;
 
 		/* Disable WRITE SAME if REPORT SUPPORTED OPERATION
 		 * CODES is unsupported and the device has an ATA
 		 * Information VPD page (SAT).
 		 */
-		if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE))
+		if (!scsi_get_vpd_page(sdev, 0x89, buffer, vpd_buf_len))
 			sdev->no_write_same = 1;
 	}




[PATCH] scsi disk: Use its own buffer for the vpd request

2013-08-01 Thread Bernd Schubert

Once I noticed that scsi_get_vpd_page() works fine from other function
calls and that it is not 0x89, but already 0x0 that fails fixing it became
easy.

Nix, any chance you could verify it also works for you?


From: Bernd Schubert 

Somehow older areca firmware versions have issues with
scsi_get_vpd_page() and a large buffer.
Even scsi_get_vpd_page(, page=0,)  failed in sd_read_write_same(),
while a similar request from sd_read_block_limits() worked fine.
Limiting the buf-size to 64-bytes fixes the issue with F/W V1.46.

Fixes a regression with areca controllers and older firmware versions
introduced by commit: 66c28f97120e8a621afd5aa7a31c4b85c547d33d

Reported-by: Nix 
Signed-off-by: Bernd Schubert 
CC: sta...@vger.kernel.org
---
 drivers/scsi/sd.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 80f39b8..02e50ae 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2651,13 +2651,16 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
 	struct scsi_device *sdev = sdkp->device;
 
 	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) {
+		/* too large values might cause issues with arcmsr */
+		int vpd_buf_len = 64;
+
 		sdev->no_report_opcodes = 1;
 
 		/* Disable WRITE SAME if REPORT SUPPORTED OPERATION
 		 * CODES is unsupported and the device has an ATA
 		 * Information VPD page (SAT).
 		 */
-		if (!scsi_get_vpd_page(sdev, 0x89, buffer, SD_BUF_SIZE))
+		if (!scsi_get_vpd_page(sdev, 0x89, buffer, vpd_buf_len))
 			sdev->no_write_same = 1;
 	}
 


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-31 Thread Bernd Schubert

On 07/31/2013 05:15 AM, Martin K. Petersen wrote:
>>>>>> "Bernd" == Bernd Schubert  writes:
> 
> Bernd,
> 
>>> Product revision level: R001 
> 
> It's clearly not verbatim passthrough...
> 
> Bernd> Besides the firmware, the difference might be that I'm exporting
> Bernd> single disks without any areca-raidset in between.  I can try to
> Bernd> confirm that tomorrow, I just need the system as it is till
> Bernd> tomorrow noon.
> 
> That would be a great data point. I don't have any Areca boards.
> 

Just tested it: the areca-raidset does not make a difference, but the
firmware version does. After downgrading to 1.46 I have the same issue.

It is getting a bit late for me, but as this is a pure development system,
which is also booted over nfs, I can investigate it tomorrow.


Cheers,
Bernd

Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Bernd Schubert


On 07/30/2013 02:56 AM, Nix wrote:

On 30 Jul 2013, Douglas Gilbert outgrape:


Please supply the information that Martin Petersen asked
for.


Did it in private IRC (the advantage of working for the same division of
the same company!)

I didn't realise the original fix was actually implemented to allow
Bernd, with a different Areca controller, to boot... obviously, in that
situation, reversion is wrong, since that would just replace one won't-
boot situation with another.


Unless there is a very simple fix, the commit should be reverted, imho. It
would then be better to remove write-same support from the md-layer.



Cheers,
Bernd

Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Bernd Schubert


On 07/30/2013 01:34 AM, Martin K. Petersen wrote:

"Nix" == Nix   writes:


Bernd,

Nix> I can now confirm that reverting this commit causes this problem to
Nix> go away, and my machine boots fine again.

Can you please send me the output of sg_inq with your 1.49 firmware?

I made a tweak that allowed Nix to boot but we're trying to find a good
blacklist trigger. And that's tricky given that Areca allows you to
manually specify the SCSI model string for each volume...



Sorry it got a bit late today.

Here it is.


(wheezy)fslab1:~# sg_inq -v /dev/sdc
inquiry cdb: 12 00 00 00 24 00
standard INQUIRY:
inquiry cdb: 12 00 00 00 60 00
  PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  BQue=0
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=1
  [RelAdr=0]  WBus16=1  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x3  QAS=0  IUS=0]
length=96 (0x60)   Peripheral device type: disk
 Vendor identification: Hitachi
 Product identification: HDS724040KLSA80
 Product revision level: R001
inquiry cdb: 12 01 00 00 fc 00
inquiry cdb: 12 01 80 00 fc 00
 Unit serial number: KRFS2CRAHXJZVD


Besides the firmware, the difference might be that I'm exporting single 
disks without any areca-raidset in between.
I can try to confirm that tomorrow, I just need the system as it is till 
tomorrow noon.



Cheers,
Bernd

Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Bernd Schubert


On 07/29/2013 03:05 PM, Nix wrote:

On 29 Jul 2013, Bernd Schubert said:


Hi Nick,

On 07/29/2013 12:10 PM, Nick Alcock wrote:

arcmsr0: abort device command of scsi id = 0 lun = 1
arcmsr0: abort device command of scsi id = 0 lun = 0
arcmsr: executing bus reset eh.num_resets=0, num_[...]

arcmsr0: wait 'abort all outstanding command' timeout
arcmsr0: executing hw bus reset 
arcmsr0: waiting for hw bus reset return, retry=0
arcmsr0: waiting for hw bus reset return, retry=1
Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
arcmsr: scsi  bus reset eh returns with success
[and back to the top of the error messages again, apparently forever,
   not that the machine would be much use without its RAID array even
   if this loop terminated at some point, so I only gave it a couple
   of minutes]

The failure happens precisely at the moment we transition to early
userspace, so presumably userspace I/O is failing (or something related
to raw device access, perhaps, since the first thing it does is a
vgscan).

I haven't bisected yet (sorry, I have work to do which means this
machine must be running right now), but nothing has changed in the
arcmsr controller, nor in SCSI-land excepting

commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
Author: Martin K. Petersen 
Date:   Thu Jun 6 22:15:55 2013 -0400

[...]

Obviously, at this point, this machine has no modules loaded (it has
almost none loaded even when fully operational)


I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this
patch is only in 3.10.3, but not yet in 3.10.1.


... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried
3.10.2.)


Hmm, indeed that points to this commit. I just don't see what could fail 
there.


Could you try to run these commands with 3.10.1?

# check if reporting opcodes works
# sg_opcodes -v  -n /dev/sdX

# check ata information page
# sg_vpd --page=0x89 /dev/sdX




 And I don't think this
commit can cause your issue at all, a failing heuristics would enable
WRITE SAME and would cause issues with linux-md, but there shouldn't
happen anything directly in the scsi-layer. Which was your last
working kernel version?


3.10.1. :)


Whoops, sorry, I missed that in your first sentence.



No changes to arcmsr between those versions... I suspect I'll have to
bisect, which will be a complete pig because every failure means a hard
powerdown of this box. Always-on servers rarely appreciate hard
powerdowns :(



Maybe just revert this commit? Some scsi logging would be helpful to see
which command actually fails. I guess you don't have a serial console?



Thanks,
Bernd

Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Bernd Schubert


Hi Nick,

On 07/29/2013 12:10 PM, Nick Alcock wrote:

My server's ARC-1210 has been working fine for years, but when I
upgraded from 3.10.1, it started failing:

Instead of

[0.784044] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
[0.804028] scsi0 : Areca SATA Host Adapter RAID Controller
  Driver Version 1.20.00.15 2010/08/05
[...]

[4.111770] sd 7:0:0:1: [sdd] Assuming drive cache: write through
[4.115399] sd 7:0:0:1: [sdd] No Caching mode page present
[4.115401] sd 7:0:0:1: [sdd] Assuming drive cache: write through
[4.118081]  sdd: sdd1
[4.124363] sd 7:0:0:1: [sdd] No Caching mode page present
[4.124601] sd 7:0:0:1: [sdd] Assuming drive cache: write through
[4.124867] sd 7:0:0:1: [sdd] Attached SCSI removable disk

I now see (timestamps and some of the right edge chopped off because not
captured on my camera, no netconsole as this machine has all my storage
and is my loghost, and with this bug it can't get at any of that
storage).

sd 7:0:0:1: [sdd] Assuming drive cache: write through
sd 7:0:0:1: [sdd] No Caching mode page present
sd 7:0:0:1: [sdd] Assuming drive cache: write through
  sdd: sdd1
sd 7:0:0:1: [sdd] No Caching mode page present
sd 7:0:0:1: [sdd] Assuming drive cache: write through
sd 7:0:0:1: [sdd] Attached SCSI removable disk
arcmsr0: abort device command of scsi id = 0 lun = 1
arcmsr0: abort device command of scsi id = 0 lun = 0
arcmsr: executing bus reset eh.num_resets=0, num_[...]

arcmsr0: wait 'abort all outstanding command' timeout
arcmsr0: executing hw bus reset 
arcmsr0: waiting for hw bus reset return, retry=0
arcmsr0: waiting for hw bus reset return, retry=1
Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
arcmsr: scsi  bus reset eh returns with success
[and back to the top of the error messages again, apparently forever,
  not that the machine would be much use without its RAID array even
  if this loop terminated at some point, so I only gave it a couple
  of minutes]

The failure happens precisely at the moment we transition to early
userspace, so presumably userspace I/O is failing (or something related
to raw device access, perhaps, since the first thing it does is a
vgscan).

I haven't bisected yet (sorry, I have work to do which means this
machine must be running right now), but nothing has changed in the
arcmsr controller, nor in SCSI-land excepting

commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
Author: Martin K. Petersen 
Date:   Thu Jun 6 22:15:55 2013 -0400

 SCSI: sd: Update WRITE SAME heuristics

so my, admittedly largely baseless, suspicions currently fall there.


Obviously, at this point, this machine has no modules loaded (it has
almost none loaded even when fully operational)


I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this 
patch is only in 3.10.3, but not yet in 3.10.1. And I don't think this 
commit can cause your issue at all; a failing heuristic would enable 
WRITE SAME and would cause issues with linux-md, but nothing should 
happen directly in the SCSI layer.

Which was your last working kernel version?
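
One way to see what the WRITE SAME heuristics decided for each disk is to read the sd sysfs attribute max_write_same_blocks. The following is only a sketch: the helper name show_write_same and its optional sysfs-root argument are made up for illustration, and it assumes a kernel that exposes this attribute under /sys/class/scsi_disk. A value of 0 is assumed to mean that WRITE SAME is not used for that disk.

```shell
# List max_write_same_blocks for every SCSI disk.
# $1 is the sysfs root and defaults to /sys; it is a parameter only so
# that the function can be tried against a scratch directory.
show_write_same() {
    root=${1:-/sys}
    for f in "$root"/class/scsi_disk/*/max_write_same_blocks; do
        # Skip silently if the glob did not match or the file is unreadable.
        [ -r "$f" ] || continue
        dev=${f%/max_write_same_blocks}
        printf '%s: %s\n' "${dev##*/}" "$(cat "$f")"
    done
}

show_write_same
```

On a machine without SCSI disks, or on a kernel without the attribute, the loop prints nothing.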


Thanks,
Bernd


Re: [PATCH 1/2] coccicheck: Allow the user to give a VERBOSE= argument

2013-02-01 Thread Bernd Schubert


Hello Nicolas,

thanks for your review!


Cheers,
Bernd

[PATCH 2 2/2] coccicheck: Allow to show the executed command line

2013-01-29 Thread Bernd Schubert

On my system one of the tests failed with
"Fatal error: exception Failure("No OCaml compiler found! Install either 
ocamlopt or ocamlopt.opt")".

Investigating such issues is easier if the executed command line is being shown.

Signed-off-by: Bernd Schubert 
CC: Julia Lawall 
CC: Nicolas Palix 
CC: co...@systeme.lip6.fr
CC: Michal Marek 
---
 scripts/coccicheck |   28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/scripts/coccicheck b/scripts/coccicheck
index f8f15a2..85d3189 100755
--- a/scripts/coccicheck
+++ b/scripts/coccicheck
@@ -55,6 +55,14 @@ if [ "$ONLINE" = "0" ] ; then
 echo ''
 fi
 
+run_cmd() {
+   if [ $VERBOSE -ne 0 ] ; then
+   echo "Running: $@"
+   fi
+   eval $@
+}
+
+
 coccinelle () {
 COCCI="$1"
 
@@ -100,15 +108,21 @@ coccinelle () {
 fi
 
 if [ "$MODE" = "chain" ] ; then
-   $SPATCH -D patch   $FLAGS -sp_file $COCCI $OPT $OPTIONS   || \
-   $SPATCH -D report  $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || \
-   $SPATCH -D context $FLAGS -sp_file $COCCI $OPT $OPTIONS   || \
-   $SPATCH -D org $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || exit 1
+   run_cmd $SPATCH -D patch   \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS   || \
+   run_cmd $SPATCH -D report  \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || \
+   run_cmd $SPATCH -D context \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS   || \
+   run_cmd $SPATCH -D org \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || exit 1
 elif [ "$MODE" = "rep+ctxt" ] ; then
-   $SPATCH -D report  $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff && \
-   $SPATCH -D context $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1
+   run_cmd $SPATCH -D report  \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff && \
+   run_cmd $SPATCH -D context \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1
 else
-   $SPATCH -D $MODE   $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1
+   run_cmd $SPATCH -D $MODE   $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1
 fi
 
 }
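
As a standalone illustration of the run_cmd idea above (print the command line when verbose, then execute it), here is a minimal sketch. It is not the patch itself: it runs "$@" directly instead of going through eval, which keeps arguments containing spaces intact, and it uses printf rather than echo for the message. Whether coccicheck depends on the re-parsing that eval does is not something this sketch checks.

```shell
VERBOSE=1

# Print the command line when VERBOSE is non-zero, then run the command.
# Assumption: the caller passes the command as separate words, so no
# second round of shell parsing (eval) is wanted.
run_cmd() {
    if [ "$VERBOSE" -ne 0 ]; then
        printf 'Running: %s\n' "$*"
    fi
    "$@"
}

run_cmd echo "two words"
```

The exit status of run_cmd is that of the executed command, so chains like "run_cmd a || run_cmd b || exit 1" behave as they do without the wrapper.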


[PATCH 2 1/2] coccicheck: Allow the user to give a V= (verbose) argument

2013-01-29 Thread Bernd Schubert

Do not run with verbosity on/off depending on the ONLINE variable,
which gets set with C=1 or C=2, but allow the user to set the
verbosity using the kernel's standard make V= parameter.
Verbosity is off by default now.


Signed-off-by: Bernd Schubert 
CC: Julia Lawall 
CC: Nicolas Palix 
CC: co...@systeme.lip6.fr
CC: Michal Marek 
---
 Documentation/coccinelle.txt |    4 ++++
 scripts/coccicheck           |   11 ++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/coccinelle.txt b/Documentation/coccinelle.txt
index cf44eb6..dffa2d6 100644
--- a/Documentation/coccinelle.txt
+++ b/Documentation/coccinelle.txt
@@ -87,6 +87,10 @@ As any static code analyzer, Coccinelle produces false
 positives. Thus, reports must be carefully checked, and patches
 reviewed.
 
+To enable verbose messages set the V= variable, for example:
+
+   make coccicheck MODE=report V=1
+
 
  Using Coccinelle with a single semantic patch
 ~~~
diff --git a/scripts/coccicheck b/scripts/coccicheck
index 1a49d1c..f8f15a2 100755
--- a/scripts/coccicheck
+++ b/scripts/coccicheck
@@ -2,6 +2,15 @@
 
 SPATCH="`which ${SPATCH:=spatch}`"
 
+# The verbosity may be set by the environmental parameter V=
+# as for example with 'make V=1 coccicheck'
+
+if [ -n "$V" -a "$V" != "0" ]; then
+   VERBOSE=1
+else
+   VERBOSE=0
+fi
+
 if [ "$C" = "1" -o "$C" = "2" ]; then
 ONLINE=1
 
@@ -55,7 +64,7 @@ coccinelle () {
 #
 #$SPATCH -D $MODE $FLAGS -parse_cocci $COCCI $OPT > /dev/null
 
-if [ "$ONLINE" = "0" ] ; then
+if [ $VERBOSE -ne 0 ] ; then
 
FILE=`echo $COCCI | sed "s|$srctree/||"`
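
The V= handling in the patch above boils down to: empty or "0" means quiet, anything else means verbose. A small sketch of that mapping, written as a function so it can be tried outside of coccicheck; the name verbosity_from_v is made up for this example, and the test uses two [ ] commands joined with && instead of -a.

```shell
# Map a make-style V= value to 0 or 1.
# $1 is the value of V and may be empty.
verbosity_from_v() {
    if [ -n "$1" ] && [ "$1" != "0" ]; then
        echo 1
    else
        echo 0
    fi
}

# Show the result for a few typical values of V.
for v in "" 0 1 2; do
    printf "V='%s' -> VERBOSE=%s\n" "$v" "$(verbosity_from_v "$v")"
done
```

With this mapping "make coccicheck MODE=report" stays quiet, while adding V=1 (or any other non-zero value) turns the messages on.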
 


Re: [PATCH 1/2] coccicheck: Allow the user to give a VERBOSE= argument

2013-01-22 Thread Bernd Schubert


Hello Nicolas,

On 01/22/2013 03:31 PM, Nicolas Palix wrote:

Hi,

Thank you Bernd for your proposition.

I added Michal in CC, who is responsible for the integration.


Oh, sorry, I CCed everyone, but forgot Michal.



I was wondering if the V variable which already exists would not be better
than introducing a new variable. Bernd, is there any reason to not use V ?


I'm fine with using 'V' as well.



Your patch also removes the check of the ONLINE variable. In doing so,
I think that your patch will badly interfere with the online checking
performed with the C variable. Am I missing something ?


Hmm, I probably should have mentioned in the patch description that verbosity 
defaults to 0 now. Shall I revert or make an extra patch for that? With 
the current patch and ONLINE != 0 nothing will change.



Cheers,
Bernd




Regards,

On Tue, Jan 22, 2013 at 2:34 PM, Bernd Schubert
 wrote:

Simply running "make coccicheck" returns very verbose output and warnings
might not be noticed.  Allow the user to set the verbosity level.


Signed-off-by: Bernd Schubert 
CC: Julia Lawall 
CC: Nicolas Palix 
CC: co...@systeme.lip6.fr
---
  scripts/coccicheck |    8 +++++++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/scripts/coccicheck b/scripts/coccicheck
index 1a49d1c..eab0b00 100755
--- a/scripts/coccicheck
+++ b/scripts/coccicheck
@@ -2,6 +2,12 @@

  SPATCH="`which ${SPATCH:=spatch}`"

+if [ -z "$VERBOSE" ] ; then
+   RUN_VERBOSE=0
+else
+   RUN_VERBOSE=$VERBOSE
+fi
+
  if [ "$C" = "1" -o "$C" = "2" ]; then
  ONLINE=1

@@ -55,7 +61,7 @@ coccinelle () {
  #
  #$SPATCH -D $MODE $FLAGS -parse_cocci $COCCI $OPT > /dev/null

-if [ "$ONLINE" = "0" ] ; then
+if [ "$RUN_VERBOSE" != "0" ] ; then

 FILE=`echo $COCCI | sed "s|$srctree/||"`









[PATCH 2/2] coccicheck: Allow to show the executed command line

2013-01-22 Thread Bernd Schubert

On my system one of the tests failed with
"Fatal error: exception Failure("No OCaml compiler found! Install either 
ocamlopt or ocamlopt.opt")".

Investigating such issues is easier if the executed command line is being shown.

Signed-off-by: Bernd Schubert 
CC: Julia Lawall 
CC: Nicolas Palix 
CC: co...@systeme.lip6.fr
---
 scripts/coccicheck |   28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/scripts/coccicheck b/scripts/coccicheck
index eab0b00..fb98534 100755
--- a/scripts/coccicheck
+++ b/scripts/coccicheck
@@ -52,6 +52,14 @@ if [ "$ONLINE" = "0" ] ; then
 echo ''
 fi
 
+run_cmd() {
+   if [ "$RUN_VERBOSE" != "0" ] ; then
+   echo "Running: $@"
+   fi
+   eval $@
+}
+
+
 coccinelle () {
 COCCI="$1"
 
@@ -97,15 +105,21 @@ coccinelle () {
 fi
 
 if [ "$MODE" = "chain" ] ; then
-   $SPATCH -D patch   $FLAGS -sp_file $COCCI $OPT $OPTIONS   || \
-   $SPATCH -D report  $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || \
-   $SPATCH -D context $FLAGS -sp_file $COCCI $OPT $OPTIONS   || \
-   $SPATCH -D org $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || exit 1
+   run_cmd $SPATCH -D patch   \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS   || \
+   run_cmd $SPATCH -D report  \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || \
+   run_cmd $SPATCH -D context \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS   || \
+   run_cmd $SPATCH -D org \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff || exit 1
 elif [ "$MODE" = "rep+ctxt" ] ; then
-   $SPATCH -D report  $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff && \
-   $SPATCH -D context $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1
+   run_cmd $SPATCH -D report  \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS -no_show_diff && \
+   run_cmd $SPATCH -D context \
+   $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1
 else
-   $SPATCH -D $MODE   $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1
+   run_cmd $SPATCH -D $MODE   $FLAGS -sp_file $COCCI $OPT $OPTIONS || exit 1
 fi
 
 }


[PATCH 1/2] coccicheck: Allow the user to give a VERBOSE= argument

2013-01-22 Thread Bernd Schubert

Simply running "make coccicheck" returns very verbose output and warnings
might not be noticed.  Allow the user to set the verbosity level.


Signed-off-by: Bernd Schubert 
CC: Julia Lawall 
CC: Nicolas Palix 
CC: co...@systeme.lip6.fr
---
 scripts/coccicheck |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/scripts/coccicheck b/scripts/coccicheck
index 1a49d1c..eab0b00 100755
--- a/scripts/coccicheck
+++ b/scripts/coccicheck
@@ -2,6 +2,12 @@
 
 SPATCH="`which ${SPATCH:=spatch}`"
 
+if [ -z "$VERBOSE" ] ; then
+   RUN_VERBOSE=0
+else
+   RUN_VERBOSE=$VERBOSE
+fi
+
 if [ "$C" = "1" -o "$C" = "2" ]; then
 ONLINE=1
 
@@ -55,7 +61,7 @@ coccinelle () {
 #
 #$SPATCH -D $MODE $FLAGS -parse_cocci $COCCI $OPT > /dev/null
 
-if [ "$ONLINE" = "0" ] ; then
+if [ "$RUN_VERBOSE" != "0" ] ; then
 
FILE=`echo $COCCI | sed "s|$srctree/||"`
 


Re: [regression] 3.7 ends in APIC panic

2012-12-17 Thread Bernd Schubert


On 12/17/2012 11:00 AM, Borislav Petkov wrote:

+ Suresh.

On Mon, Dec 17, 2012 at 10:34:46AM +0100, Bernd Schubert wrote:

On 12/16/2012 09:39 PM, Borislav Petkov wrote:

On Sun, Dec 16, 2012 at 08:46:06PM +0100, Bernd Schubert wrote:

Hmm, I read it the other way around - x2apic depends on interrupt
remapping, but interrupt remapping can be used without x2apic.


Ok, you're right. X2APIC should depend on IRQ_REMAP:
https://lwn.net/Articles/289881/


The help text of CONFIG_IRQ_REMAP also says "x2APIC enhancements or to
support platforms with CPU's having > 8 bit APIC ID, say Y." I guess
my CPU has the latter?


I think it is what Yinghai said - you obviously need x2apic kernel
support if you have IRQ_REMAP on.


Can the kernel panic be improved a bit to help the user understand what
needs to be enabled?


Well, your kernel enables IRQ_REMAP properly:

[0.031115] Enabled IRQ remapping in x2apic mode

I guess at that stage we could probably check for x2apic support and
scream loudly if it is not present... IMHO.


Hmm, I think that would be the wrong place,


It has to be the right place because this "Enabled IRQ..." printk above
is from the IRQ remapping code which detects an x2apic mode in your
case.


as the initial 3.7.0 configuration didn't have IRQ_REMAP enabled.


Huh, so why do I see the above message in your dmesg output in
http://marc.info/?l=linux-kernel&m=135540103415652 then?


Oh huh, that is the dmesg from 3.4.7, which booted fine. I just sent it 
hoping it would help to see where the issue comes from.
When I ran "make oldconfig" for 3.7.0 I accidentally unset 
CONFIG_IRQ_REMAP, which also unset CONFIG_X86_X2APIC :(




Ok, let's sort things out here. Your .config has

# CONFIG_IRQ_REMAP is not set

but in the original dmesg you sent, the printk above comes from
intel_irq_remapping.c which gets enabled by CONFIG_IRQ_REMAP.

So, can you try enabling only CONFIG_IRQ_REMAP and leave
CONFIG_X86_X2APIC off to confirm the original observation?


No need, as I said above, the printk comes from a different kernel with 
a different config. With either CONFIG_IRQ_REMAP=false or 
CONFIG_X86_X2APIC=false the booting ends in a kernel panic.




Also, I'd guess your machine can boot with both options off?


No, only partly with "noapic", but that makes ahci fail to detect 
disks later on.





And that was the reason why x2apic got disabled during the "make
oldconfig" process...

Is this message an indication for missing x2apic?

"smpboot: weird, boot (#255) not listed by the BIOS"


It's an indication that something is fishy with the apic IDs.

Thanks.




Cheers,
Bernd

Re: [regression] 3.7 ends in APIC panic

2012-12-17 Thread Bernd Schubert


On 12/16/2012 09:39 PM, Borislav Petkov wrote:

On Sun, Dec 16, 2012 at 08:46:06PM +0100, Bernd Schubert wrote:

Hmm, I read it the other way around - x2apic depends on interrupt
remapping, but interrupt remapping can be used without x2apic.


Ok, you're right. X2APIC should depend on IRQ_REMAP:
https://lwn.net/Articles/289881/


The help text of CONFIG_IRQ_REMAP also says "x2APIC enhancements or to
support platforms with CPU's having > 8 bit APIC ID, say Y." I guess
my CPU has the latter?


I think it is what Yinghai said - you obviously need x2apic kernel
support if you have IRQ_REMAP on.


Can the kernel panic be improved a bit to help the user understand what
needs to be enabled?


Well, your kernel enables IRQ_REMAP properly:

[0.031115] Enabled IRQ remapping in x2apic mode

I guess at that stage we could probably check for x2apic support and
scream loudly if it is not present... IMHO.



Hmm, I think that would be the wrong place, as the initial 3.7.0 
configuration didn't have IRQ_REMAP enabled. And that was the reason why 
x2apic got disabled during the "make oldconfig" process...


Is this message an indication for missing x2apic?

"smpboot: weird, boot (#255) not listed by the BIOS"


Thanks,
Bernd


Re: [regression] 3.7 ends in APIC panic

2012-12-17 Thread Bernd Schubert


On 12/16/2012 07:07 PM, Yinghai Lu wrote:

On Sun, Dec 16, 2012 at 10:01 AM, Bernd Schubert
 wrote:

can you post your .config for v3.7 ?

wonder if you have x2apic in .config


Which setting is it? Config is attached.


your config does not have

CONFIG_X86_X2APIC=y

set.

please enable that.

your BIOS pre-enable x2apic somehow, so you must have x2apic enabled in kernel.

if x2apic really can not be re-enabled by kernel, kernel would disable
x2apic automatically.


The system boots fine with x2apic enabled.
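
A quick way to spot this kind of "make oldconfig" fallout is to check the two options in the .config before building. The sketch below is only an illustration: the helper name config_has is made up, and the sample config is a stand-in written to a temporary file rather than the real .config from this thread.

```shell
# Return success if the given option is set to y in a kernel config file.
# $1 = config file, $2 = option name without the CONFIG_ prefix
config_has() {
    grep -q "^CONFIG_$2=y\$" "$1"
}

# Stand-in for a .config where IRQ_REMAP was unset and X2APIC vanished.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
# CONFIG_IRQ_REMAP is not set
CONFIG_X86_LOCAL_APIC=y
EOF

for opt in IRQ_REMAP X86_X2APIC; do
    if config_has "$cfg" "$opt"; then
        echo "CONFIG_$opt=y"
    else
        echo "CONFIG_$opt is not enabled"
    fi
done
rm -f "$cfg"
```

Against a real tree one would pass the path of the actual .config instead of the temporary file.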


Thanks a bunch for your help,
Bernd


Re: [regression] 3.7 ends in APIC panic

2012-12-16 Thread Bernd Schubert

On 12/16/2012 08:13 PM, Borislav Petkov wrote:
> On Sun, Dec 16, 2012 at 07:28:59PM +0100, Bernd Schubert wrote:
>> CONFIG_X86_X2APIC depends on CONFIG_IRQ_REMAP, which I disabled as it
>> is marked as experimental...
> 
> You shouldn't pay too much attention to CONFIG_EXPERIMENTAL because it
> is on its way out from the kernel tree.

I usually don't pay much attention to it, as long as I understand what it
is about and what the consequences are.

> 
> But if you don't want to have interrupt remapping on your system,
> you can disable it nevertheless. Wait, you can't: according to
> d0b03bd1c6725a3463290d7f9626e4b583518a5a, you can use x2apic without
> interrupt remapping but interrupt remapping needs to be enabled before
> x2apic.

Hmm, I read it the other way around - x2apic depends on interrupt
remapping, but interrupt remapping can be used without x2apic.

The help text of CONFIG_IRQ_REMAP also says "x2APIC enhancements or to
support platforms with CPU's having > 8 bit APIC ID, say Y." I guess my
CPU has the latter? Can the kernel panic be improved a bit to help the user
understand what needs to be enabled?

Thanks,
Bernd

Re: [regression] 3.7 ends in APIC panic

2012-12-16 Thread Bernd Schubert

On 12/16/2012 07:07 PM, Yinghai Lu wrote:
> On Sun, Dec 16, 2012 at 10:01 AM, Bernd Schubert
>  wrote:
>>> can you post your .config for v3.7 ?
>>>
>>> wonder if you have x2apic in .config
>>
>> Which setting is it? Config is attached.
> 
> your config does not have
> 
> CONFIG_X86_X2APIC=y
> 
> set.
> 
> please enable that.
> 
> your BIOS pre-enable x2apic somehow, so you must have x2apic enabled in 
> kernel.
> 
> if x2apic really can not be re-enabled by kernel, kernel would disable
> x2apic automatically.

Thanks! CONFIG_X86_X2APIC depends on
CONFIG_IRQ_REMAP, which I disabled as it is marked as experimental and
as this is my desktop system. So CONFIG_X86_X2APIC also got unset
automatically.

Will test a new kernel build in the morning.


Thanks,
Bernd

Re: [regression] 3.7 ends in APIC panic

2012-12-16 Thread Bernd Schubert

On 12/16/2012 12:35 PM, Ingo Molnar wrote:
> 
> * Bernd Schubert  wrote:
> 
>> On 12/13/2012 01:16 PM, Bernd Schubert wrote:
>>> Hello,
>>>
>>> I just tried to boot 3.7 and it ends in an APIC panic. I 
>>> tried to use the recommended "apic=debug", but that does not 
>>> change anything in the output, at least not in the visible 
>>> part. The last known kernel to boot was 3.5. If it matters I 
>>> can try to boot 3.6.
>>
>> So linux-3.6 also boots. Any idea what is going on or do I 
>> really need to bisect?
> 
> Yeah, it's hard to tell based on that info alone - would be nice 
> to send in a log/screen capture of the crash and of course 
> bisecting would be useful as well, if the crash is 
> deterministic.

I already sent a screen capture in my first mail, but now uploaded two
pictures here:

http://www.aakef.fastmail.fm/linux/

I think the 2nd one is with apic=debug.

Unfortunately the system does not have ipmi, which would allow me to do
the bisecting now. It is also my desktop system at work and I don't know
yet when (and if at all) I will find the time to do the reboot cycles.


Cheers,
Bernd



Re: [regression] 3.7 ends in APIC panic

2012-12-14 Thread Bernd Schubert


On 12/13/2012 01:16 PM, Bernd Schubert wrote:

Hello,

I just tried to boot 3.7 and it ends in an APIC panic. I tried to use
the recommended "apic=debug", but that does not change anything in the
output, at least not in the visible part. The last known kernel to boot
was 3.5. If it matters I can try to boot 3.6.


So linux-3.6 also boots. Any idea what is going on or do I really need 
to bisect?



Thanks,
Bernd




memory allocation: smap large "Size", but unused

2012-11-27 Thread Bernd Schubert


Hello,

I'm just investigating why a user space program has a rather large 
VmSize, but small VmRSS size. Looking into /proc/$pid/smaps I notice 
several areas with a size of about 64MB that are otherwise 
unused. So far I did not find a way to reproduce that with malloc() 
calls.


7ffd34021000-7ffd38000000 ---p 00000000 00:00 0
Size:  65404 kB
Rss:   0 kB
Pss:   0 kB
Shared_Clean:  0 kB
Shared_Dirty:  0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced:0 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap:  0 kB
KernelPageSize:4 kB
MMUPageSize:   4 kB


Any idea how to do such an allocation from user space?
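
As a cross-check of the numbers in the listing above, the size of the mapping can be recomputed from its address range. This is a sketch under one assumption: the end address is taken to be 7ffd38000000 (the address field in the listing appears truncated), which is the value that matches the reported Size of 65404 kB.

```shell
start=0x7ffd34021000
end=0x7ffd38000000   # assumed end address, see the note above

# Size of the mapping in kB, and how far it falls short of 64 MiB.
size_kb=$(( ($end - $start) / 1024 ))
gap_kb=$(( 64 * 1024 - size_kb ))

echo "mapping size: $size_kb kB"
echo "short of 64 MiB by: $gap_kb kB"
```

So the region ends on a 64 MiB boundary and covers all but the first 132 kB of that 64 MiB block. A "---p" mapping with zero Rss is address space that is reserved but not accessible, which is what an anonymous mmap() with PROT_NONE produces; that this program's allocator reserves such blocks is a guess and not something the listing proves.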


Thanks,
Bernd


Re: [PATCH]:dir.c patch

2012-08-25 Thread Bernd Schubert

On 08/25/2012 10:37 PM, Christopher Sacchi wrote:
> Here is a non-style issue dir.c-patch, and as far as I can see from
> the lines of code, the compilation errors weren't about what I put in.
> This patch fixes a "break" statement inside an "if" statement, as
> obviously not correct.

Why should that not be correct? It breaks the while(1) loop?

> Here's the patch for the kernel version v3.6.0rc3:
> 
> --
> Signed-off-by: Christopher P. Sacchi 
> --- dir.c 2012-08-25 15:47:24.260443900 -0400
> +++ dir.c 2012-08-25 16:02:05.458845600 -0400
> @@ -580,7 +580,6 @@ static int ext4_dx_readdir(struct file *
>   return ret;
>   if (ret == 0) {
>   filp->f_pos = ext4_get_htree_eof(filp);
> - break;

So ext4_htree_fill_tree() did not return more entries and the while(1)
loop shall be stopped?


Cheers,
Bernd

Re: Performance problems with 3ware 9500S-4LP and 2.6.25-rc3

2008-02-26 Thread Bernd Schubert

Hello Andre,

On Tuesday 26 February 2008 18:43:14 Andre Noll wrote:
> Hi
>
> we are experiencing massive performance problems with two of our
> Linux servers that contain 3ware controllers on a Tyan mainboard and
> a couple of 1T disks.
>
> During the daily cron job that uses rsync to sync a 500G file system
> from another machine to the raid on the 3ware controller the load
> jumps up, and the machine becomes sluggish as hell. For example, an
> ssh login to that machine takes minutes to complete and ldap becomes
> unreliable while the rsync job is running. Even Nagios complains
> about the machine being down while rsync is running.

do you have the write-back cache of the controller enabled for your disks? 
When you disable this cache, the controller will also disable the disks' cache, 
resulting in a write performance of 3 to 8MB/s per disk.

Cheers,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH

Re: sleep before boot panic

2008-01-06 Thread Bernd Schubert

Hello Ingo,

On Sunday 06 January 2008, Ingo Oeser wrote:
> Hi Bernd,
>
> On Sunday 06 January 2008, you wrote:
> > Index: zd1211rw.git.beno/init/do_mounts.c
> > ===
> > --- zd1211rw.git.beno.orig/init/do_mounts.c 2008-01-06 18:44:23.0
> > +0100
> > +++ zd1211rw.git.beno/init/do_mounts.c  2008-01-06 18:45:44.0
> > +0100 @@ -330,6 +330,7 @@
> > printk("Please append a correct \"root=\" boot option; here are 
> > the
> > available partitions:\n");
> >
> > printk_all_partitions();
> > +   msleep(60 * 1000);
>
> ssleep(60);

feel free to replace it :)

>
> > panic("VFS: Unable to mount root fs on %s", b);
> > }
>
> Better would be for this and similiar panic()s
> (fatal user/admin errors on boot) to NOT print a stack trace+registers,
> since it is useless and actually hides useful information.

There is no dump_stack() here, but disc detection is relatively early in the boot 
process and all this information has already scrolled off screen when the 
panic is done. For this and any other panic it would be optimal if scrolling 
still worked, but scrolling also requires kernel code, so I see there's a 
reason not to do this for all panics. However, for this boot problem I tend to 
say there's no need to panic at all...

Btw, not all stack traces are useless, *most* of them are actually very 
useful.

Cheers,
Bernd


sleep before boot panic

2008-01-06 Thread Bernd Schubert

Hi,

I just switched to libata (pata) on my laptop and the immediate panic made it 
impossible to figure out why my boot partition wasn't available.
After applying this little patch I could check boot printk output and then saw 
everything was properly recognized and only scsi-disk support was missing.


Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]>

Index: zd1211rw.git.beno/init/do_mounts.c
===
--- zd1211rw.git.beno.orig/init/do_mounts.c 2008-01-06 18:44:23.0 
+0100
+++ zd1211rw.git.beno/init/do_mounts.c  2008-01-06 18:45:44.0 +0100
@@ -330,6 +330,7 @@
printk("Please append a correct \"root=\" boot option; here are 
the 
available partitions:\n");
 
printk_all_partitions();
+   msleep(60 * 1000);
panic("VFS: Unable to mount root fs on %s", b);
}
 


Cheers,
Bernd

Re: everything in wait_for_completion, what is my system doing?

2007-12-07 Thread Bernd Schubert

Hello Andrew,

thanks for your help!

On Friday 07 December 2007 02:09:11 Andrew Morton wrote:
> On Wed, 5 Dec 2007 21:44:54 +0100
>
> Bernd Schubert <[EMAIL PROTECTED]> wrote:
> > after scsi-recovery a system here went into some kind lock-up, everything
> > seems to be in wait_for_completion(). Please see the attached
> > blocked_states.txt and all_states.txt files.
> > This is 2.6.22.12, I can easily find out the line numbers if required.
> >
> > Any help is highly appreciated.
>
> Please cc linux-scsi on scsi-related reports.

Sorry, these traces confused me a bit. I had absolutely no idea about a 
possible reason.

>
> > [blocked_states.txt  text/plain (20.5KB)]
> > [generate break]
> > [ 1818.566436] SysRq : Show Blocked State
> > [ 1818.570260]
> > [ 1818.570261]  free 
> >   sibling [ 1818.579253]   task PCstack   pid
> > father child younger older [ 1818.586987] events/7  D
> > 0155dd642280 026  2 (L-TLB) [ 1818.593747] 
> > 81012b529ac0 0046  810128280d18 [
> > 1818.601321]  8100ba2376f8 81012b689630 81012aff76b0
> > 00078023e215 [ 1818.608870]  00010003ca14 
> > 810001065400 000780430c13 [ 1818.616222] Call Trace:
> > [ 1818.618925]  [] io_schedule+0x28/0x36
> > [ 1818.624207]  [] get_request_wait+0x104/0x158
> > [ 1818.630112]  [] blk_get_request+0x36/0x6b
> > [ 1818.635755]  [] scsi_execute+0x51/0x129
> > [ 1818.641240]  []
> > :scsi_transport_spi:spi_execute+0x87/0xf8 [ 1818.648271] 
> > []
> > :scsi_transport_spi:spi_dv_device_echo_buffer+0x181/0x27d [ 1818.656739] 
> > [] :scsi_transport_spi:spi_dv_retrain+0x4e/0x240 [
> > 1818.664139]  []
> > :scsi_transport_spi:spi_dv_device+0x615/0x69c [ 1818.671542] 
> > [] :mptspi:mptspi_dv_device+0xb3/0x14b [ 1818.678042] 
> > [] :mptspi:mptspi_dv_renegotiate_work+0xcb/0xef [
> > 1818.685348]  [] run_workqueue+0x8e/0x120
> > [ 1818.690905]  [] worker_thread+0x106/0x117
> > [ 1818.696540]  [] kthread+0x4b/0x82
> > [ 1818.701474]  [] child_rip+0xa/0x12
> > [ 1818.706495]
> > [ 1818.708022] unionfs-fuse- D 01a76ef63463 0  1119  1
> > (NOTLB) [ 1818.714764]  810129765988 0082
> >  80337e22 [ 1818.722329]  8101297658c8
> > 81012b652f20 810129eec810 0006 [ 1818.729895] 
> > 00010005204e  81000105c400 000680337c3e [
> > 1818.737249] Call Trace:
> > [ 1818.739953]  [] schedule_timeout+0x8a/0xb6
> > [ 1818.745673]  [] io_schedule_timeout+0x28/0x36
> > [ 1818.751664]  [] congestion_wait+0x9d/0xc2
> > [ 1818.757300]  []
> > balance_dirty_pages_ratelimited_nr+0x196/0x22f [ 1818.764781] 
> > [] generic_file_buffered_write+0x52a/0x60d [
> > 1818.771641]  []
> > __generic_file_aio_write_nolock+0x45a/0x491 [ 1818.778852] 
> > [] generic_file_aio_write+0x61/0xc1 [ 1818.785101] 
> > [] nfs_file_write+0x138/0x1b7
> > [ 1818.790822]  [] do_sync_write+0xcc/0x112
> > [ 1818.796372]  [] vfs_write+0xc3/0x165
> > [ 1818.801575]  [] sys_pwrite64+0x68/0x96
> > [ 1818.806959]  [] system_call+0x7e/0x83
> > [ 1818.812250]  [<2b4eeec3ea73>]
> >
> > [snippage]
>
> Possibly your device driver had conniptions and stopped generating
> completion interrupts.
>
> Which driver is in use?

This time it is easily visible from the traces (mptspi_dv_device) ;) So it's 
the mpt driver, we are using LSI22320 cards (I CC'ed Eric).

>
> I don't suppose it is repeatable.

That's a clear "yes and no". We have hit exactly this state two or three times 
during an exhaustive hardware stress test over the last weeks (with real and 
with simulated errors), but it's not easily reproducible. Furthermore, the 
hardware will go into production soon and I won't have the chance to simulate 
further errors.
However, we can easily get a similar state just on a raid6-rebuild (with high 
end hardware though). 
(You probably won't ever run into it with normal disks; we are doing 
software-raid over a bunch of several hardware raid systems).

In the raid6-rebuild case the system is not completely locked up, just mostly. 
Somehow raid6-rebuild is still working, we can see this by the io usage 
status of the hardware-raids, but the system is completely blocked otherwise. 
Only pings and sysrq's are working.


Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/3] faster workaround

2007-10-23 Thread Bernd Schubert

Hello Tejun,

On Tuesday 23 October 2007 10:08:01 Tejun Heo wrote:
> Jeff Garzik wrote:
> > Alan Cox wrote:
> >>> 2) Once we identified, over time, the set of drives affected by this
> >>> 3112 quirk (aka drives that didn't fully comply to SATA spec), the
> >>> debugging of corruption cases largely shifted to the standard
> >>> routine: update the BIOS, replace the
> >>> cables/RAM/power/mainboard/slot/etc. to be certain of problem location.
> >>
> >> Except for the continued series of later SI + Nvidia chipset (mostly)
> >> pattern which seems unanswered but also being later chips I assume
> >> unrelated to this problem.
> >
> > The SIL_FLAG_MOD15WRITE flag is set in sil_port_info[] is set according
> > to the best info we have from SiI, which indicates that 3114 and 3512 do
> > not have the same problem as the 3112.
>
> I don't think this data corruption problem w/ sil3114 is related to
> m15w.  m15w workaround slows down things quite a bit and is likely to
> hide problems on PCI bus side.  There are reports of data corruption
> with 3114 on nvidia (most common), via and now amd chipsets.  There's
> one on intel too but IIRC wasn't too definite.
>
> According to a user, freebsd didn't have data corruption problem on the
> same hardware.  I copied PCI FIFO setup code (ours is broken BTW) but it
> didn't fix the problem.
>
> I'll try to reproduce the problem locally and hunt it down.

Thanks for your help, and please tell me if I can do anything. We have this 
problem on a production system, but the node in question will be rebooted on 
Thursday (the UPS needs to be replaced). If there are some tests/reboots/whatever 
I could do, it would be best to do them shortly after the scheduled reboot.

Actually, I would now have attempted to port your mod15 patch 
(http://home-tj.org/wiki/index.php/Sil_m15w#Patches) to 2.6.23, hoping it 
would solve Soeren's problem and ours as well (ours magically already went 
away using the mod15 fix). Well, maybe I will port it to 2.6.23 anyway to see if 
it also solves our problem.


Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH

Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

2007-10-22 Thread Bernd Schubert

On Monday 22 October 2007 12:36:32 Soeren Sonnenburg wrote:
> > but as much as it fits onto the disk. On reading back this file, the
> > filesystem will report errors somewhere between 50GB and 230GB (disk size
> > is 250GB).
>
> Wow, I really see lots of corruptions (well every 1-2 GB a couple of
> bytes are corrupted). Are you getting similiarly many in the 50G - 230G
> region?

I never tested what is corrupted. Well, a diff over 250GB would take quite a 
lot of time...

-- 
Bernd Schubert
Q-Leap Networks GmbH

Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

2007-10-22 Thread Bernd Schubert

On Monday 22 October 2007 12:36:32 Soeren Sonnenburg wrote:
> On Mon, 2007-10-22 at 11:48 +0200, Bernd Schubert wrote:
> > Hello,
> >
> > On Monday 22 October 2007 04:12:44 Tejun Heo wrote:
> > > Helo,
> > > [...]
> > >
> > > > Now when I write large files of zeros to root(sda&sdb) and read the
> > > > file back in it contains a few nonzero entries:
> > > >
> > > > # dd if=/dev/zero of=/foo bs=1M count=2000
> > > > # hexdump /foo
> > > > 000        
> > > > *
> > > > 1GB random parts, within large blocks of zeroes>
> > > >
> > > > I can reliably trigger this on the md0 / devmapper-root setup when I
> > > > write about 2GB of data (note that this machine has 1.5G of memory -
> > > > and still 1GB is often enough to see this problem). Here it does not
> > > > matter where in the filesystem I do these writes.
> >
> > Thats almost the same test as I'm always doing. Only I do not write only
> > 2GB,
>
> Well when I read your mail I thought that I could be seeing exactly the
> same bug... it still may be. However ``my'' problem does not go away
> with the mod15fix ...

Yeah, pity it did not fix it :( I will try to port Tejun's patch 
(http://home-tj.org/wiki/index.php/Sil_m15w#Patches) to 2.6.23 today or 
tomorrow. If you are testing anyway, could you then also try this?

>
> > but as much as it fits onto the disk. On reading back this file, the
> > filesystem will report errors somewhere between 50GB and 230GB (disk size
> > is 250GB).
>
> Wow, I really see lots of corruptions (well every 1-2 GB a couple of
> bytes are corrupted). Are you getting similiarly many in the 50G - 230G
> region?
>
> > > Thanks.  I'll try to reproduce the problem here.  What's your
> > > motherboard?
> >
> > All tested S2882 boards here.
>
> I assume all equipped with lots of memory and mostly empty pci slots?

Yes, all PCI slots are free and the systems have between 4 and 16GB of memory 
(ECC, monitored with EDAC). Well, those are cluster systems (actually Tyan 
names those B2882).
Do you think the configuration is related? Here it also happens with O_DIRECT; 
we tested this to minimize memory effects.


Cheers,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH

Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

2007-10-22 Thread Bernd Schubert

Hello,

On Monday 22 October 2007 04:12:44 Tejun Heo wrote:
> Helo,
>
> Soeren Sonnenburg wrote:
> > I finally managed to find a *reproducible* setup and way to trigger
> > random corruptions using a sata sil 3114 controller connected to 4
> > seagate drives
> >
> > port 1: ST3400832AS sda
> > port 2: ST3400620AS sdb
> > port 3: ST3750640AS sdc
> > port 4: ST3750640AS sdd
> >
> > sda & sdb form md0 via a raid1 setup followed by an additional
> > devicemapper layer ( root ). sdc and sdb are separate and also have an
> > additional device mapper layer ( public ) and ( backups ).
> >
> > Now when I write large files of zeros to root(sda&sdb) and read the file
> > back in it contains a few nonzero entries:
> >
> > # dd if=/dev/zero of=/foo bs=1M count=2000
> > # hexdump /foo
> > 000        
> > *
> > 1GB random parts, within large blocks of zeroes>
> >
> > I can reliably trigger this on the md0 / devmapper-root setup when I
> > write about 2GB of data (note that this machine has 1.5G of memory - and
> > still 1GB is often enough to see this problem). Here it does not matter
> > where in the filesystem I do these writes.

That's almost the same test I'm always doing. Only I do not write just 2GB, 
but as much as fits onto the disk. On reading back this file, the 
filesystem will report errors somewhere between 50GB and 230GB (disk size is 
250GB).
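
The zero-fill test described above comes down to looking for any nonzero byte in data that should be all zeroes. A minimal sketch of that check in userspace C follows; the helper name count_nonzero() is hypothetical and is not taken from any tool used in this thread. Applied block by block while reading the file back, it would also show where the corruption sits without a full diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the nonzero bytes in a buffer that is
 * expected to contain only zeroes. If at least one is found and
 * 'first' is not NULL, the offset of the first nonzero byte is
 * stored in *first. */
static size_t count_nonzero(const unsigned char *buf, size_t len, size_t *first)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            if (n == 0 && first)
                *first = i;
            n++;
        }
    }
    return n;
}
```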

>
> Thanks.  I'll try to reproduce the problem here.  What's your motherboard?

All tested S2882 boards here.


Cheers,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH

Re: [PATCH 3/3] faster workaround

2007-10-15 Thread Bernd Schubert

On Friday 12 October 2007 23:08:21 Jeff Garzik wrote:
> Bernd Schubert wrote:
> > a) 2.6.23 + sil-patch I posted, this is on a customer system (though my
> > former group), I wouldn't like to use -mm there.
> >
> > b) .config is attached
> >
> > c) attached
> >
> > d) attached (don't get irritaded by those machine check events, thats
> > "GART TLB errorr", harmless warnings, just not disabled in the bios).
>
> Any chance you could provide dmesg on 2.6.23 without the sil patch?

It's attached.


Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH
[0.00] Linux version 2.6.23-l162 ([EMAIL PROTECTED]) (gcc version 3.4.6 
(Ubuntu 3.4.6-5ubuntu1)) #7 SMP Mon Oct 15 11:50:28 CEST 2007
[0.00] Command line:  root=/dev/ram0 ramdisk_size=110592 console=tty0 
console=ttyS0,115200  
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820:  - 0009f400 (usable)
[0.00]  BIOS-e820: 0009f400 - 000a (reserved)
[0.00]  BIOS-e820: 000e - 0010 (reserved)
[0.00]  BIOS-e820: 0010 - fbff (usable)
[0.00]  BIOS-e820: fbff - fbfff000 (ACPI data)
[0.00]  BIOS-e820: fbfff000 - fc00 (ACPI NVS)
[0.00]  BIOS-e820: ff78 - 0001 (reserved)
[0.00]  BIOS-e820: 0001 - 0004 (usable)
[0.00] Entering add_active_range(0, 0, 159) 0 entries of 3200 used
[0.00] Entering add_active_range(0, 256, 1032176) 1 entries of 3200 used
[0.00] Entering add_active_range(0, 1048576, 4194304) 2 entries of 3200 
used
[0.00] end_pfn_map = 4194304
[0.00] DMI 2.3 present.
[0.00] ACPI: RSDP 000F6F20, 0014 (r0 ACPIAM)
[0.00] ACPI: RSDT FBFF, 0038 (r1 A M I  OEMRSDT   7000626 MSFT  
 97)
[0.00] ACPI: FACP FBFF0200, 0081 (r1 A M I  OEMFACP   7000626 MSFT  
 97)
[0.00] ACPI: DSDT FBFF0410, 3751 (r1  0 00000 INTL  
2002026)
[0.00] ACPI: FACS FBFFF000, 0040
[0.00] ACPI: APIC FBFF0380, 0084 (r1 A M I  OEMAPIC   7000626 MSFT  
 97)
[0.00] ACPI: OEMB FBFFF040, 0041 (r1 A M I  OEMBIOS   7000626 MSFT  
 97)
[0.00] ACPI: SRAT FBFF3B70, 0110 (r1 A M I  OEMSRAT   7000626 MSFT  
 97)
[0.00] ACPI: ASF! FBFF3C80, 0086 (r1 AMIASF AMDSTRET1 INTL  
2002026)
[0.00] SRAT: PXM 0 -> APIC 0 -> Node 0
[0.00] SRAT: PXM 1 -> APIC 1 -> Node 1
[0.00] SRAT: Node 0 PXM 0 10-fc00
[0.00] Entering add_active_range(0, 256, 1032176) 0 entries of 3200 used
[0.00] SRAT: Node 1 PXM 1 2-4
[0.00] Entering add_active_range(1, 2097152, 4194304) 1 entries of 3200 
used
[0.00] SRAT: Node 0 PXM 0 10-2
[0.00] Entering add_active_range(0, 256, 1032176) 2 entries of 3200 used
[0.00] Entering add_active_range(0, 1048576, 2097152) 2 entries of 3200 
used
[0.00] SRAT: Node 0 PXM 0 0-2
[0.00] Entering add_active_range(0, 0, 159) 3 entries of 3200 used
[0.00] Entering add_active_range(0, 256, 1032176) 4 entries of 3200 used
[0.00] Entering add_active_range(0, 1048576, 2097152) 4 entries of 3200 
used
[0.00] NUMA: Using 33 for the hash shift.
[0.00] Bootmem setup node 0 -0002
[0.00] Bootmem setup node 1 0002-0004
[0.00] Zone PFN ranges:
[0.00]   DMA 0 -> 4096
[0.00]   DMA324096 ->  1048576
[0.00]   Normal1048576 ->  4194304
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[4] active PFN ranges
[0.00] 0:0 ->  159
[0.00] 0:  256 ->  1032176
[0.00] 0:  1048576 ->  2097152
[0.00] 1:  2097152 ->  4194304
[0.00] On node 0 totalpages: 2080655
[0.00]   DMA zone: 56 pages used for memmap
[0.00]   DMA zone: 1451 pages reserved
[0.00]   DMA zone: 2492 pages, LIFO batch:0
[0.00]   DMA32 zone: 14280 pages used for memmap
[0.00]   DMA32 zone: 1013800 pages, LIFO batch:31
[0.00]   Normal zone: 14336 pages used for memmap
[0.00]   Normal zone: 1034240 pages, LIFO batch:31
[0.00]   Movable zone: 0 pages used for memmap
[0.00] On node 1 totalpages: 2097152
[0.00]   DMA zone: 0 pages used for memmap
[0.00]   DMA32 zone: 0 pages used for memmap
[0.00]   Normal zone: 28672 pages used for memmap
[0.00]   Normal zone: 2068480 pages, LIFO batch:31
[0.00]   Movable zone: 0 pages used for memmap
[0.00] ACPI: PM-Timer IO Port: 0x1008
[0.00] ACPI: Local APIC address 0xfee0
[0.00] ACPI: LAPIC (acp

Re: [PATCH 3/3] faster workaround

2007-10-11 Thread Bernd Schubert

On Thursday 11 October 2007 17:04:45 Jeff Garzik wrote:
> Bernd Schubert wrote:
> > On Thursday 11 October 2007 16:19:37 Jeff Garzik wrote:
> >> 1) Just about the only valid optimization is to ensure that only the
> >> write path must be limited to small chunks, not both read- and
> >> write-paths.  Tejun had a patch to do this a long time ago, but it's an
> >> open question whether the large amount of code is worth it for a rare
> >> combination.
> >
> > How large? This patch is rather small? Where can I find it?
>
> http://home-tj.org/wiki/index.php/Sil_m15w

Thanks, I will take a look later on.

>
> > The problem came up, when 200GB drives were replaced by *newer* 250GB
> > drives (well maybe not the newest, no idea were they came from).
> >
> > Anyway, I'm testing for more than 24h already and didn't observe any data
> > corruption as without the patch. I know this is only an obersavation and
> > no definite prove...
> > Also, this is with 3114, maybe this chip behaves a bit different than
> > 3112?
>
> 3114 + new SATA drive is definitely a new one for us.
>
> It would help to (a) use the latest kernel, (b) post your .config with
> the latest kernel, (c) post lspci booted into latest kernel, and (d)
> post dmesg booted into latest kernel.

a) 2.6.23 + the sil patch I posted. This is on a customer system (of my former 
group, though); I wouldn't like to use -mm there.

b) .config is attached

c) attached

d) attached (don't get irritated by those machine check events; those are "GART 
TLB error" messages, harmless warnings that are just not disabled in the BIOS).


Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.23
# Thu Oct 11 13:46:30 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION="-l162"
CONFIG_LOCALVERSION_AUTO=y
# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_CPUSETS=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_BLK_DEV_BSG is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
# CONFIG_IOSCHED_AS is not set
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
CONFIG_DEFAULT_DEADLINE=y
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="deadline"

#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_L1_CACHE_BYTES=128
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_INTERNODE_CACHE_BYTES=128
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
CONFIG_X86_HT=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_NODES_SHIFT=6
CONFIG_X86_64_ACPI_NUMA=

Re: [PATCH 3/3] faster workaround

2007-10-11 Thread Bernd Schubert

On Thursday 11 October 2007 16:19:37 Jeff Garzik wrote:
> 1) Just about the only valid optimization is to ensure that only the
> write path must be limited to small chunks, not both read- and
> write-paths.  Tejun had a patch to do this a long time ago, but it's an
> open question whether the large amount of code is worth it for a rare
> combination.

How large? This patch is rather small? Where can I find it?

>
> 2) Once we identified, over time, the set of drives affected by this
> 3112 quirk (aka drives that didn't fully comply to SATA spec), the
> debugging of corruption cases largely shifted to the standard routine:
> update the BIOS, replace the cables/RAM/power/mainboard/slot/etc. to be
> certain of problem location.

Replace this disk or the SATA controller maybe, but usually people don't want 
to replace a big cluster, even if it is already 3 years old; that has to wait 
at least another 3 years.

The problem came up when 200GB drives were replaced by *newer* 250GB drives 
(well, maybe not the newest, no idea where they came from).

Anyway, I have been testing for more than 24h already and have not observed any 
data corruption, as I did without the patch. I know this is only an observation 
and no definite proof...
Also, this is with the 3114; maybe this chip behaves a bit differently from the 3112?

Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH

Re: [PATCH 3/3] faster workaround

2007-10-11 Thread Bernd Schubert

This is based on a patch from Jeff from 2004, but backported to 2.6.23 and 
furthermore, it will use the 7.5kiB/512B splitoff for blacklisted drives 
only.
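
The 7.5kiB/512B splitoff mentioned above is plain arithmetic on the last scatter/gather segment. A minimal userspace sketch follows, assuming a hypothetical struct sketch_seg in place of the driver's PRD entries. It is not the driver code and it ignores endianness and the end-of-table flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one hardware S/G entry. */
struct sketch_seg {
    uint32_t addr;
    uint32_t len;
};

/* If the last segment is exactly 8 KiB, shrink it to 7.5 KiB and
 * describe the remaining 512 bytes in 'extra'. Returns the number
 * of entries that now cover the original segment (1 or 2). */
static int split_last_8k(struct sketch_seg *last, struct sketch_seg *extra)
{
    if (last->len != 8192)
        return 1;

    last->len = 7680;                 /* 7.5 KiB */
    extra->addr = last->addr + 7680;  /* continues right behind it */
    extra->len = 512;
    return 2;
}
```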

Jeff, why did you replace ATA_SHT_USE_CLUSTERING and ATA_DMA_BOUNDARY?

 drivers/ata/libata-core.c |9 -
 drivers/ata/sata_sil.c|   58 ++--
 include/linux/libata.h|6 +++
 3 files changed, 62 insertions(+), 11 deletions(-)

Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]>


Index: linux-2.6.23-rc9/drivers/ata/libata-core.c
===
--- linux-2.6.23-rc9.orig/drivers/ata/libata-core.c 2007-10-02 
17:21:12.0 +0200
+++ linux-2.6.23-rc9/drivers/ata/libata-core.c  2007-10-11 10:46:18.0 
+0200
@@ -4073,7 +4073,7 @@ void ata_sg_clean(struct ata_queued_cmd 
  * spin_lock_irqsave(host lock)
  *
  */
-static void ata_fill_sg(struct ata_queued_cmd *qc)
+void ata_fill_sg(struct ata_queued_cmd *qc)
 {
struct ata_port *ap = qc->ap;
struct scatterlist *sg;
@@ -4217,10 +4217,15 @@ int ata_check_atapi_dma(struct ata_queue
  */
 void ata_qc_prep(struct ata_queued_cmd *qc)
 {
+   struct ata_port *ap = qc->ap;
+
if (!(qc->flags & ATA_QCFLAG_DMAMAP))
return;
 
-   ata_fill_sg(qc);
+   if (ap->ops->fill_sg)
+   ap->ops->fill_sg(qc);
+   else
+   ata_fill_sg(qc);
 }
 
 /**
Index: linux-2.6.23-rc9/drivers/ata/sata_sil.c
===
--- linux-2.6.23-rc9.orig/drivers/ata/sata_sil.c2007-10-11 
10:45:08.0 
+0200
+++ linux-2.6.23-rc9/drivers/ata/sata_sil.c 2007-10-11 10:57:51.0 
+0200
@@ -120,6 +120,7 @@ static int sil_scr_write(struct ata_port
 static int sil_set_mode (struct ata_port *ap, struct ata_device **r_failed);
 static void sil_freeze(struct ata_port *ap);
 static void sil_thaw(struct ata_port *ap);
+static void sil_fill_sg(struct ata_queued_cmd *qc);
 
 
 static const struct pci_device_id sil_pci_tbl[] = {
@@ -174,12 +175,12 @@ static struct scsi_host_template sil_sht
.queuecommand   = ata_scsi_queuecmd,
.can_queue  = ATA_DEF_QUEUE,
.this_id= ATA_SHT_THIS_ID,
-   .sg_tablesize   = LIBATA_MAX_PRD,
+   .sg_tablesize   = 120, /* max 15 kiB sectors ? */
.cmd_per_lun= ATA_SHT_CMD_PER_LUN,
.emulated   = ATA_SHT_EMULATED,
-   .use_clustering = ATA_SHT_USE_CLUSTERING,
+   .use_clustering = 1,
.proc_name  = DRV_NAME,
-   .dma_boundary   = ATA_DMA_BOUNDARY,
+   .dma_boundary   = 0x1fff,
.slave_configure= ata_scsi_slave_config,
.slave_destroy  = ata_scsi_slave_destroy,
.bios_param = ata_std_bios_param,
@@ -187,6 +188,7 @@ static struct scsi_host_template sil_sht
 
 static const struct ata_port_operations sil_ops = {
.port_disable   = ata_port_disable,
+   .fill_sg= sil_fill_sg,
.dev_config = sil_dev_config,
.tf_load= ata_tf_load,
.tf_read= ata_tf_read,
@@ -278,9 +280,9 @@ MODULE_LICENSE("GPL");
 MODULE_DEVICE_TABLE(pci, sil_pci_tbl);
 MODULE_VERSION(DRV_VERSION);
 
-static int slow_down = 0;
-module_param(slow_down, int, 0444);
-MODULE_PARM_DESC(slow_down, "Sledgehammer used to work around random 
problems, by limiting commands to 15 sectors (0=off, 1=on)");
+static int mod15_quirk = 0;
+module_param(mod15_quirk, int, 0444);
+MODULE_PARM_DESC(mod15_quirk, "Some disks from Seagate need a mod15 
workaround.");
 
 
 static unsigned char sil_get_device_cache_line(struct pci_dev *pdev)
@@ -534,6 +536,44 @@ static void sil_thaw(struct ata_port *ap
writel(tmp, mmio_base + SIL_SYSCFG);
 }
 
+static void sil_fill_sg(struct ata_queued_cmd *qc)
+{
+   struct ata_port *ap = qc->ap;
+   u32 addr, len;
+   unsigned int idx;
+
+   ata_fill_sg(qc);
+
+   /* check if we need the MOD15 workaround */
+   if (!(qc->dev->quirk & SIL_FLAG_MOD15WRITE))
+   return;
+
+   if (unlikely(qc->n_elem < 1))
+   return;
+
+   /* hardware S/G list may be longer (or shorter) than number of
+* PCI-mapped S/G entries (qc->n_elem), due to splitting
+* in ata_fill_sg(). Start at zero, and skip to end
+* of list, if we're not already there.
+   */
+   idx = 0;
+   while ((le32_to_cpu(ap->prd[idx].flags_len) & ATA_PRD_EOT) == 0)
+   idx++;
+
+   /* Errata workaround: if last segment is exactly 8K, split
+* into 7.5K and 512b pieces.
+*/
+   len = le32_to_cpu(ap->prd[idx].flags_len) & 0x;
+   if (len == 8192) {
+

Re: [PATCH 2/3] Re: sil3114 data corruption

2007-10-11 Thread Bernd Schubert

This will add the sil3114 back to the controllers with the mod15 bug. Without 
this patch no workaround is applied for this controller and people might/will 
suffer from data corruption.

Also rather trivial, though with a huge effect: the speed of the affected 
disks will go down from about 45-50MB/s to 20-25MB/s. But better safe than 
lost data or a damaged filesystem.

Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]>

Index: linux-2.6.23-rc9/drivers/ata/sata_sil.c
===
--- linux-2.6.23-rc9.orig/drivers/ata/sata_sil.c2007-10-11 
10:45:02.0 
+0200
+++ linux-2.6.23-rc9/drivers/ata/sata_sil.c 2007-10-11 10:45:08.0 
+0200
@@ -241,7 +241,8 @@ static const struct ata_port_info sil_po
},
/* sil_3114 */
{
-   .flags  = SIL_DFL_PORT_FLAGS | SIL_FLAG_RERR_ON_DMA_ACT,
+   .flags  = SIL_DFL_PORT_FLAGS | SIL_FLAG_RERR_ON_DMA_ACT
+   | SIL_FLAG_MOD15WRITE,
.pio_mask   = 0x1f, /* pio0-4 */
.mwdma_mask = 0x07, /* mwdma0-2 */
.udma_mask  = ATA_UDMA5,



-- 
Bernd Schubert
Q-Leap Networks GmbH

[PATCH 1/3] Re: sil3114 data corruption

2007-10-11 Thread Bernd Schubert

This will add the Seagate ST3250820AS to the mod15 blacklist.

I think this is rather trivial and should go into any release as soon as 
possible, since there will be data corruption without it for this disk.

Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]>

Index: linux-2.6.23-rc9/drivers/ata/sata_sil.c
===
--- linux-2.6.23-rc9.orig/drivers/ata/sata_sil.c2007-10-11 
10:44:57.0 
+0200
+++ linux-2.6.23-rc9/drivers/ata/sata_sil.c 2007-10-11 10:45:02.0 
+0200
@@ -151,6 +151,7 @@ static const struct sil_drivelist {
{ "ST380011ASL",SIL_QUIRK_MOD15WRITE },
{ "ST3120022ASL",   SIL_QUIRK_MOD15WRITE },
{ "ST3160021ASL",   SIL_QUIRK_MOD15WRITE },
+   { "ST3250820AS",SIL_QUIRK_MOD15WRITE },
{ "Maxtor 4D060H3", SIL_QUIRK_UDMA5MAX },
{ }
 };



-- 
Bernd Schubert
Q-Leap Networks GmbH

[PATCHES] Re: sil3114 data corruption

2007-10-11 Thread Bernd Schubert

On Wednesday 10 October 2007 11:12:20 Bernd Schubert wrote:
> On Monday 08 October 2007 17:09:17 Bernd Schubert wrote:
> > [sorry for sending twice, but after I read the sil sources, I see the
> > mail address had been wrong]
> >
> > Hi,
> >
> > somehow the sil3114 causes data corruption with some (newer?) disks.
> > Simply filling the filesystem with zeros and reading the these data will
> > make the kernel to report filesystem corruption.
> > This is definitely not an issue of memory, since the systems (several
> > tested) do have ECC memory and the memory is monitored with EDAC.
> >
> > kernel versions tested: 2.6.15-2.6.20
>
> Update: Setting sata_sil.slow_down=1 fill fix the problem, seems there are
> some drives missing in the quirk table.
>
> Jeff, I found an old patch/workaround from you
> (http://uwsg.indiana.edu/hypermail/linux/kernel/0403.1/1957.html), can you
> give me any further information why this never went into the driver?
>

I will send 3 mails with patches to fix this corruption.


-- 
Bernd Schubert
Q-Leap Networks GmbH

Re: 2.6.23 regression: do_nanosleep will not return

2007-10-08 Thread Bernd Schubert

On Monday 08 October 2007 16:32:52 Rik van Riel wrote:
> On Mon, 08 Oct 2007 15:20:26 +0200
>
> Bernd Schubert <[EMAIL PROTECTED]> wrote:
> > Bernd Schubert wrote:
> > > we have a system here were e.g. "sleep 1" will never finish. This
> > > is an issue of 2.6.23, on all older kernel versions it did work
> > > fine.
> > >
> > > Seems to hang in do_nanosleep()
> >
> > Update: Enabling hpet in the bios and setting clocksource=hpet as
> > command line parameter will fix it, but still its not nice that
> > something that worked without a problem in 2.6.22 and below suddenly
> > doesn't work in 2.6.23.
>
> Which timer source is in use when the system hangs?

Well, it is not the system that hangs, only processes calling nanosleep. Since 
the system is booted diskless, one of the very first commands is to 
run "/etc/init.d/portmap start", which has a sleep call in its script, and so 
it halts the boot process.

The problematic timer source is acpi_pm. It's also interesting that setting the 
timer source 
via /sys/devices/system/clocksource/clocksource0/current_clocksource won't 
fix the problem. Only the boot option clocksource={other than acpi_pm} does 
help.
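
To double-check which clocksource is actually active at runtime, the sysfs file named above can simply be read. A minimal sketch follows; the helper name read_first_line() is hypothetical, and the only assumption about the file is that its first line holds the clocksource name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the first line of 'path' into 'buf'
 * (at most len - 1 characters) and strip the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
static int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (len == 0)
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;

    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

Called with /sys/devices/system/clocksource/clocksource0/current_clocksource as the path, it would leave a name such as acpi_pm or hpet in buf.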

Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH

Re: 2.6.23 regression: do_nanosleep will not return

2007-10-08 Thread Bernd Schubert

Bernd Schubert wrote:

> Hi,
> 
> we have a system here were e.g. "sleep 1" will never finish. This is an
> issue of 2.6.23, on all older kernel versions it did work fine.
> 
> Seems to hang in do_nanosleep()
> 

Update: Enabling hpet in the BIOS and setting clocksource=hpet as a command
line parameter will fix it, but still it's not nice that something that
worked without a problem in 2.6.22 and below suddenly doesn't work in
2.6.23.

Bernd


2.6.23 regression: do_nanosleep will not return

2007-10-08 Thread Bernd Schubert

Hi,

we have a system here where e.g. "sleep 1" will never finish. This is an 
issue with 2.6.23; on all older kernel versions it worked fine.

Seems to hang in do_nanosleep() 

[  153.775792] sleep S  0  5372   5341  
[  153.782385]  81007f0a9ea8 0082  
8efc
 
[  153.790635]  81007f0a9e48 802447b4 81007f0c3080 
0003
 
[  153.798938]  81007f0c39c8 81007f0c37c0 4001d908 

 
[  153.806991] Call Trace:  
[  153.809937]  [] do_nanosleep+0x42/0x75 
[  153.815727]  [<0001>]
[  153.819383]  
[  153.775792] sleep S  0  5372   5341  


[  330.669444] SysRq : Show Pending Timers  
[  330.673552] Timer List Version: v0.3 
[  330.677326] HRTIMER_MAX_CLOCK_BASES: 2   
[  330.681282] now at 255011372633 nsecs
[  330.829981] active timers:   
[  330.832859]  #0: , hrtimer_wakeup, S:01
[  330.838805]  # expires at 260156346358 nsecs [in 5144973725 nsecs]   

[  337.046189] now at 261387685432 nsecs
[  337.194966] active timers:   
[  337.197834]  #0: , hrtimer_wakeup, S:01
[  337.203793]  # expires at 260156346358 nsecs [in 18446744072478212542 nsecs] 
[  330.669444] SysRq : Show Pending Timers
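
The huge "in ... nsecs" value in the second dump is what an already expired timer looks like when the remaining time is printed as an unsigned 64-bit number: "now" (261387685432) is past the expiry (260156346358), so the difference is negative and wraps around. A minimal sketch of that arithmetic follows; the helper name remaining_ns() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: time left until expiry, as a signed value.
 * The subtraction is done in unsigned 64-bit arithmetic and then
 * reinterpreted, so an expired timer yields a negative result. */
static int64_t remaining_ns(uint64_t expires, uint64_t now)
{
    return (int64_t)(expires - now);
}
```

With the values from the two dumps, the first timer has 5144973725 ns left, while the second is 1231339074 ns overdue; printed as unsigned, the latter becomes 18446744072478212542.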


Any ideas?

Thanks,
Bernd


[PATCH 1/2][RESEND] improve generic_file_buffered_write()

2007-09-07 Thread Bernd Schubert

No further response to our patches yet, so we are sending them again, 
re-diffed against 2.6.23-rc5

Hi,

recently we discovered that writing to an NFS-exported lustre filesystem is 
rather slow (20-40 MB/s writing, but over 200 MB/s reading).

As I already explained on the nfs mailing list, this happens because there is an 
offset on the very first page due to the NFS header.

http://sourceforge.net/mailarchive/forum.php?thread_name=200708312003.30446.bernd-schubert%40gmx.de&forum_name=nfs

While this especially affects lustre, Olaf Kirch also noticed it on another 
filesystem before and wrote an NFS patch for it. This patch has two 
disadvantages: it requires moving all data within the pages, which is IMHO 
rather CPU time consuming, and furthermore it presently causes data corruption 
when more than one nfs thread is running.

After thinking it over and over again we (Goswin and I) believe it would be 
best to improve generic_file_buffered_write().
If there is sufficient data, as is usual for aio writes, 
generic_file_buffered_write() will now fill each page as much as possible and 
only then prepare/commit it. Before, generic_file_buffered_write() committed 
chunks of pages even though there was still more data.

Some statistics:

num_writes = 4669440, bytes_total = 20231249633, segs_total = 5738644, 
commit_loops = 7697604, commits_total = 6628750

commit_loops is the number of commits without the patch and commits_total the 
number of commits we actually have now. This shows a saving of nearly 14% of 
prepare, commit and cond_resched calls.
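
The "nearly 14%" follows directly from the two counters above. A small sketch
of the arithmetic, assuming a hypothetical helper name (saving_percent is
only for illustration):

```c
/* Relative saving in percent when the number of prepare/commit rounds
 * drops from `before` to `after`. Hypothetical helper for illustration. */
static double saving_percent(unsigned long long before,
			     unsigned long long after)
{
	return 100.0 * (double)(before - after) / (double)before;
}
```

For before = 7697604 (commit_loops) and after = 6628750 (commits_total) the
difference is 1068854 rounds, which is a little under 13.9% of the original
count.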

<   1:  Write size =0,  Num segs =0
<   2:  Write size =20244,  Num segs =  4455583
<   4:  Write size = 6722,  Num segs =   24
<   8:  Write size =19653,  Num segs =   213842
<  16:  Write size =31778,  Num segs =0
<  32:  Write size =73395,  Num segs =0
<  64:  Write size =   148840,  Num segs =0
< 128:  Write size =   310178,  Num segs =0
< 256:  Write size =89027,  Num segs =0
< 512:  Write size =   111903,  Num segs =0
<1024:  Write size =   140509,  Num segs =0
<2048:  Write size =   244052,  Num segs =0
<4096:  Write size =   217164,  Num segs =0
<8192:  Write size =  2784875,  Num segs =0
<   16384:  Write size =   433506,  Num segs =0
<   32768:  Write size =11742,  Num segs =0
<   65536:  Write size =15783,  Num segs =0
<  131072:  Write size = 6851,  Num segs =0
<  262144:  Write size = 1562,  Num segs =0
<  524288:  Write size =  755,  Num segs =0
< 1048576:  Write size =  531,  Num segs =0
< 2097152:  Write size =  272,  Num segs =0
< 4194304:  Write size =  107,  Num segs =0
< 8388608:  Write size =0,  Num segs =0

Write size shows the number of writes with a total size smaller than denoted 
in the first column. Num segs shows the number of writes with fewer segments 
than denoted in the first column. Most writes (~95%) only have one segment. 
However, no nfs activity took place during this measurement, although that is 
actually the case we made the patches for.

size\num    1   2   3   4   5   6   7+   
<   1:  0   24  0   0   0   0   0   
<   2:  20244   0   0   0   0   641526  0   
<   4:  6722    0   0   0   0   0   0   
<   8:  19653   0   0   0   0   213842  0   
<  16:  31778   0   0   0   0   213856  0   
<  32:  73395   0   0   0   0   590 0   
<  64:  147730  0   0   0   0   93626   0   
< 128:  100888  0   0   0   0   119597  0   
< 256:  85588   0   0   0   0   12  0   
< 512:  111900  0   0   0   0   3   0   
<1024:  140509  0   0   0   0   0   0   
<2048:  244052  0   0   0   0   0   0   
<4096:  217160  4   0   0   0   0   0   
<8192:  2784855 20  0   0   0   0   0   
<   16384:  433506  0   0   0   0   0   0   
<   32768:  11742   0   0   0   0   0   0   
<   65536:  15783   0   0   0   0   0   0   
<  131072:  6851    0   0   0   0   0   0   
<  262144:  1562    0   0   0   0   0   0   
<  524288:  755 0   0   0   0   0   0   
&l

Re: patch: improve generic_file_buffered_write() (2nd try 2/2)

2007-09-05 Thread Bernd Schubert

I guess when aio was introduced this was probably forgotten. For small chunks 
or synchronous i/o the likelihood hint is correct, but for big data chunks and 
aio it is wrong.
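
For readers unfamiliar with the hints: the kernel's likely()/unlikely() are
thin wrappers around the compiler's __builtin_expect(). They only tell the
compiler which branch to lay out as the hot path and never change the result
of the condition. A sketch with local stand-in macros, assuming GCC or Clang
(my_likely, my_unlikely and both functions are illustrative names, not kernel
code):

```c
/* Local stand-ins for the kernel's likely()/unlikely(); assumes a compiler
 * that provides __builtin_expect (GCC, Clang). */
#define my_likely(x)   __builtin_expect(!!(x), 1)
#define my_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: returns 1 for the single-segment path and 2 for
 * the multi-segment path, with a branch hint. */
static int pick_path_hinted(unsigned long nr_segs)
{
	if (my_likely(nr_segs == 1))
		return 1;
	return 2;
}

/* Same logic without the hint; the return values are identical. */
static int pick_path_plain(unsigned long nr_segs)
{
	if (nr_segs == 1)
		return 1;
	return 2;
}
```

Dropping the hint, as the patch below does, therefore cannot change
behaviour; it only stops biasing the generated code towards nr_segs == 1.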

Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]>
Signed-off-by: Goswin von Brederlow <[EMAIL PROTECTED]>

Index: linux-2.6.20.3/mm/filemap.c
===
--- linux-2.6.20.3.orig/mm/filemap.c2007-09-05 18:51:59.0 +0200
+++ linux-2.6.20.3/mm/filemap.c 2007-09-05 18:53:12.0 +0200
@@ -2100,7 +2100,7 @@
/*
 * handle partial DIO write.  Adjust cur_iov if needed.
 */
-   if (likely(nr_segs == 1))
+   if (nr_segs == 1)
buf = iov->iov_base + written;
else {
filemap_set_next_iovec(&cur_iov, &iov_base, written);
@@ -2167,7 +2167,7 @@
vmtruncate(inode, isize);
}
}
-   if (likely(nr_segs == 1))
+   if (nr_segs == 1)
copied = filemap_copy_from_user(page, offset,
buf, bytes);
else
@@ -2213,7 +2213,7 @@
count -= copied;
pos += copied;
buf += copied;
-   if (unlikely(nr_segs > 1)) {
+   if (nr_segs > 1) {
filemap_set_next_iovec(&cur_iov,
&iov_base, copied);
if (count)

Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

2007-09-05 Thread Bernd Schubert

Hello Randy,

thanks for your review.

On Wednesday 05 September 2007 17:35:29 Randy Dunlap wrote:
> On Wed, 5 Sep 2007 15:45:36 +0200 Bernd Schubert wrote:
> > Hi,
>
> meta-comments:
> > filemap.c |  144
> > +-
> >  1 file changed, 96 insertions(+), 48 deletions(-)
>
> Use "diffstat -p 1 -w 70" per Documentation/SubmittingPatches.

Thanks, never would have thought this is documented.

[...]

> Use proper kernel-doc notation, per
> Documentation/kernel-doc-nano-HOWTO.txt.

Ouch, I really should have read these files before.
 
Now I know why there are so few functions commented. Nobody wants to read the 
documentation.


>
> This comment block should be:
>
> /**
>  * generic_file_buffered_write - handle an iov
>  * @iocb: file operations
>  * @iov:  vector of data to write
>  * @nr_segs:  number of iov segments
>  * @pos:  position in the file
>  * @ppos: position in the file after this function
>  * @count:number of bytes to write
>  * @written:  offset in iov->base (data to skip on write)
>  *
>  * This function will do 3 main tasks for each iov:
>  * - prepare a write
>  * - copy the data from iov into a new page
>  * - commit this page

Thanks, done.

I also removed the FIXMEs and created a second patch.

Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]>
Signed-off-by: Goswin von Brederlow <[EMAIL PROTECTED]>

 mm/filemap.c |  142 +
 1 file changed, 96 insertions(+), 46 deletions(-)

Index: linux-2.6.20.3/mm/filemap.c
===
--- linux-2.6.20.3.orig/mm/filemap.c2007-09-05 14:04:18.0 +0200
+++ linux-2.6.20.3/mm/filemap.c 2007-09-05 18:50:26.0 +0200
@@ -2057,6 +2057,21 @@
 }
 EXPORT_SYMBOL(generic_file_direct_write);
 
+/**
+ * generic_file_buffered_write - handle iov'ectors
+ * @iocb:  file operations
+ * @iov:   vector of data to write
+ * @nr_segs:   number of iov segments
+ * @pos:   position in the file
+ * @ppos:  position in the file after this function
+ * @count: number of bytes to write
+ * @written:   offset in iov->base (data to skip on write)
+ *
+ * This function will do 3 main tasks for each iov:
+ * - prepare a write
+ * - copy the data from iov into a new page
+ * - commit this page
+ */
 ssize_t
 generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -2074,6 +2089,11 @@
const struct iovec *cur_iov = iov; /* current iovec */
size_t  iov_base = 0;  /* offset in the current iovec */
char __user *buf;
+   unsigned long   data_start = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+   loff_t  wpos = pos; /* the position in the file we will return */
+
+   /* position in file as index of pages */
+   unsigned long   index = pos >> PAGE_CACHE_SHIFT;
 
pagevec_init(&lru_pvec, 0);
 
@@ -2087,9 +2107,15 @@
buf = cur_iov->iov_base + iov_base;
}
 
+   page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
+   if (!page) {
+   status = -ENOMEM;
+   goto out;
+   }
+
do {
-   unsigned long index;
unsigned long offset;
+   unsigned long data_end; /* end of data within the page */
size_t copied;
 
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
@@ -2106,6 +2132,8 @@
 */
bytes = min(bytes, cur_iov->iov_len - iov_base);
 
+   data_end = offset + bytes;
+
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
@@ -2114,34 +2142,30 @@
 */
fault_in_pages_readable(buf, bytes);
 
-   page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
-   if (!page) {
-   status = -ENOMEM;
-   break;
-   }
-
if (unlikely(bytes == 0)) {
status = 0;
copied = 0;
goto zero_length_segment;
}
 
-   status = a_ops->prepare_write(file, page, offset, offset+bytes);
-   if (unlikely(status)) {
-   loff_t isize = i_size_read(inode);
-
-   if (status != AOP_TRUNCATED_PAGE)
-   unlock_page(page);
-   page_cache_release(page);
-   if (status == AOP_TRUNCATED_PAGE)
-   contin

patch: improve generic_file_buffered_write()

2007-09-05 Thread Bernd Schubert

Hi,

recently we discovered writing to a nfs-exported lustre filesystem is rather 
slow (20-40 MB/s writing, but over 200 MB/s reading).

As I already explained on the nfs mailing list, this happens since there is an 
offset on the very first page due to the nfs header.

http://sourceforge.net/mailarchive/forum.php?thread_name=200708312003.30446.bernd-schubert%40gmx.de&forum_name=nfs

While this especially affects lustre, Olaf Kirch also noticed it on another 
filesystem before and wrote an nfs patch for it. This patch has two 
disadvantages: it requires moving all data within the pages, which IMHO is 
rather cpu time consuming, and it presently causes data corruption when 
more than one nfs thread is running.

After thinking it over and over again we (Goswin and I) believe it would be 
best to improve generic_file_buffered_write().
If there is sufficient data, as is usual for aio writes, 
generic_file_buffered_write() will now fill each page as much as possible and 
only then prepare/commit it. Before, generic_file_buffered_write() committed 
chunks of pages even though there was still more data.
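
To illustrate the difference, here is a small counting model. It is only a
sketch under stated assumptions, not the logic of the patch: PAGE_SZ is an
assumed 4096-byte page, and commits_per_chunk() / commits_per_page() are
hypothetical names. The first function counts one prepare/commit round per
chunk, where a chunk ends at a page boundary or at a segment boundary; the
second counts one round per page touched.

```c
#include <stddef.h>

#define PAGE_SZ 4096UL /* assumed page size for this sketch */

/* Old behaviour as described above: every chunk (bounded by the page end
 * or the segment end) gets its own prepare/commit round. */
static unsigned long commits_per_chunk(unsigned long pos,
				       const size_t *seg_len, size_t nr_segs)
{
	unsigned long commits = 0;
	size_t i;

	for (i = 0; i < nr_segs; i++) {
		size_t left = seg_len[i];

		while (left > 0) {
			unsigned long offset = pos & (PAGE_SZ - 1);
			size_t bytes = PAGE_SZ - offset;

			if (bytes > left)
				bytes = left;
			pos += bytes;
			left -= bytes;
			commits++;
		}
	}
	return commits;
}

/* New behaviour: keep copying into the page and commit only once it is
 * full or the data has run out, i.e. one round per page touched. */
static unsigned long commits_per_page(unsigned long pos,
				      const size_t *seg_len, size_t nr_segs)
{
	unsigned long total = 0, first, last;
	size_t i;

	for (i = 0; i < nr_segs; i++)
		total += seg_len[i];
	if (total == 0)
		return 0;
	first = pos / PAGE_SZ;
	last = (pos + total - 1) / PAGE_SZ;
	return last - first + 1;
}
```

With made-up segment lengths {3996, 4096, 4096, 100} written at file position
0 (data shifted by a 100-byte header, 12288 bytes in total), the per-chunk
model needs 6 rounds while the per-page model needs 3.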

The attached patch still has two FIXMEs, both for likely()/unlikely() 
conditions which IMHO don't reflect the likelihood for the new aio data 
functions.

filemap.c |  144 
+-
 1 file changed, 96 insertions(+), 48 deletions(-)

Signed-off-by: Bernd Schubert <[EMAIL PROTECTED]>
Signed-off-by: Goswin von Brederlow <[EMAIL PROTECTED]>


Cheers,
Goswin and Bernd


Index: linux-2.6.20.3/mm/filemap.c
===
--- linux-2.6.20.3.orig/mm/filemap.c2007-09-04 13:43:04.0 +0200
+++ linux-2.6.20.3/mm/filemap.c 2007-09-05 12:39:23.0 +0200
@@ -2057,6 +2057,19 @@
 }
 EXPORT_SYMBOL(generic_file_direct_write);
 
+/**
+ * This function will do 3 main tasks for each iov:
+ * - prepare a write
+ * - copy the data from iov into a new page
+ * - commit this page
+ * @iob:   file operations
+ * @iov:   vector of data to write
+ * @nr_segs:   number of iov segments
+ * @pos:   position in the file
+ * @ppos:  position in the file after this function
+ * @count: number of bytes to write
+ * written:offset in iov->base (data to skip on write)
+ */
 ssize_t
 generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -2074,6 +2087,11 @@
const struct iovec *cur_iov = iov; /* current iovec */
size_t  iov_base = 0;  /* offset in the current iovec */
char __user *buf;
+   unsigned long   data_start = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+   loff_t  wpos = pos; /* the position in the file we will return */
+
+   /* position in file as index of pages */
+   unsigned long   index = pos >> PAGE_CACHE_SHIFT;
 
pagevec_init(&lru_pvec, 0);
 
@@ -2087,9 +2105,15 @@
buf = cur_iov->iov_base + iov_base;
}
 
+   page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
+   if (!page) {
+   status = -ENOMEM;
+   goto out;
+   }
+
do {
-   unsigned long index;
unsigned long offset;
+   unsigned long data_end; /* end of data within the page */
size_t copied;
 
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
@@ -2106,6 +2130,8 @@
 */
bytes = min(bytes, cur_iov->iov_len - iov_base);
 
+   data_end = offset + bytes;
+
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
@@ -2114,95 +2140,117 @@
 */
fault_in_pages_readable(buf, bytes);
 
-   page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
-   if (!page) {
-   status = -ENOMEM;
-   break;
-   }
-
if (unlikely(bytes == 0)) {
status = 0;
copied = 0;
goto zero_length_segment;
}
 
-   status = a_ops->prepare_write(file, page, offset, offset+bytes);
-   if (unlikely(status)) {
-   loff_t isize = i_size_read(inode);
-
-   if (status != AOP_TRUNCATED_PAGE)
-   unlock_page(page);
-   page_cache_release(page);
-   if (status == AOP_TRUNCATED_PAGE)
-   continue;
+   if (data_end == PAGE_CACHE_SIZE || count == bytes) {
/*
-* prepare_write() may have instantiated a few

Re: API changes / 2.6.21 sysctl changes

2007-06-11 Thread Bernd Schubert

On Monday 11 June 2007 17:46:27 Alexey Dobriyan wrote:
> On Mon, Jun 11, 2007 at 03:13:12PM +0200, Bernd Schubert wrote:
> > in 2.6.21 register_sysctl_table(), struct ctl_table and probably
> > something else did change. Unfortunately so far I didn't figure out the
> > "something else".
>
> Do you have a problem porting your sysctls to newer kernels?

A little bit, yes. Well, I got it working, but I don't understand why I had to 
do it that way.

I'm porting lustre to newer kernel versions and up to 2.6.20 the procfs/sysctl 
logic was 

1.) register_sysctl_table() -> creates /proc/sys/lnet

2.) create_proc_entry()  -> add additional files in /proc/sys/lnet 

With 2.6.21 creating additional entries in /proc/sys/lnet fails and I have to 
first call "proc_mkdir("sys/lnet", NULL)". I did this proc_mkdir() call even 
before the register_sysctl_table() call, hoping that it's correct.

My guess is that register_sysctl_table() doesn't create /proc/sys/lnet 
anymore, but I have no idea why. Either I did something wrong or it's 
intended. 

Since I don't like guessing, I am asking for more documentation. However, I 
think in general each interface change should be documented. It's just such a 
waste of time for many people, just because one person doesn't want to spend 
an additional 5 min to write down what changed.

Thanks

API changes / 2.6.21 sysctl changes

2007-06-11 Thread Bernd Schubert

Hi,

in 2.6.21 register_sysctl_table(), struct ctl_table and probably something 
else did change. Unfortunately so far I didn't figure out the "something 
else". 

Please, if generic interface modifications render all available documentation 
on the web invalid, is it so hard to also write kernel api documentation then 
(even if it so far does not exist in the Documentation/ dir)?

I mean the time overhead of thousands of coders digging through git commits is 
huge, just because API changes are not properly documented.

E.g.: Documentation/api/sysctl.txt

Up to 2.6.20:
struct ctl_table_header *register_sysctl_table(ctl_table * table, int insert_at_head);


Beginning with 2.6.21-rcX:
struct ctl_table_header *register_sysctl_table(ctl_table * table);

struct ctl_table:
removed entry struct proc_dir_entry *de
added entry ctl_table *parent

[Maybe also something like]

In addition to the changed function calls, programmers also need to change ...


Thanks,
Bernd


Re: mkfs.ext2 triggered softlockup

2007-05-16 Thread Bernd Schubert

On Wednesday 16 May 2007 18:49:57 Michal Piotrowski wrote:
> Hi Bernd,
>
> On 16/05/07, Bernd Schubert <[EMAIL PROTECTED]> wrote:
> > Maybe you still remember my report about an mkfs.ext2 triggered ram disk
> > corruption?
> >
> > http://lkml.org/lkml/2007/5/4/272
> >
> > Well, in principle I'm now doing the same stuff, only this time with
> > another initrd, which mounts the root-fs over nfs.
> >
> > [ 1596.928552] BUG: soft lockup detected on CPU#2!
> > [ 1596.933109]
> > [ 1596.933110] Call Trace:
> > [ 1596.933111][] softlockup_tick+0xd8/0xef
> > [ 1596.933129]  [] run_local_timers+0x13/0x15
> > [ 1596.933132]  [] update_process_times+0x4a/0x77
> > [ 1596.933138]  [] smp_local_timer_interrupt+0x34/0x54
> > [ 1596.933143]  [] smp_apic_timer_interrupt+0x61/0x78
> > [ 1596.933147]  [] apic_timer_interrupt+0x6b/0x70
> > [ 1596.933151][] free_buffer_head+0x24/0x3e
> > [ 1596.933162]  [] kmem_cache_free+0x1f4/0x201
> > [ 1596.933170]  [] free_buffer_head+0x24/0x3e
> > [ 1596.933175]  [] try_to_free_buffers+0x88/0x9f
> > [ 1596.933181]  [] try_to_release_page+0x39/0x40
> > [ 1596.933188]  [] invalidate_mapping_pages+0x9d/0x121
> > [ 1596.933196]  [] invalidate_inode_pages+0xf/0x11
> > [ 1596.933200]  [] invalidate_bdev+0x3b/0x3f
> > [ 1596.933203]  [] kill_bdev+0x13/0x29
> > [ 1596.933208]  [] __blkdev_put+0x62/0x141
> > [ 1596.933213]  [] blkdev_put+0xb/0xd
> > [ 1596.933218]  [] blkdev_close+0x2e/0x33
> > [ 1596.933222]  [] __fput+0xc3/0x172
> > [ 1596.933228]  [] fput+0x14/0x16
> > [ 1596.933233]  [] filp_close+0x61/0x6d
> > [ 1596.933238]  [] sys_close+0x8c/0xce
> > [ 1596.933244]  [] system_call+0x7e/0x83
> > [ 1596.933250]
>
> Can you tell me which kernel version you are using?

Sorry, forgot that. I think 2.6.20.6 or 2.6.20.7 (I always rename them to .3, 
for some reason that's easier than changing our tftp-rembo config). The 
kernel is patched with lustre patches, hmm, one of them also adds a read-only 
test to the block device layer.
Probably I should test a vanilla kernel. Going to do that now...

Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH

mkfs.ext2 triggered softlockup

2007-05-16 Thread Bernd Schubert

Maybe you still remember my report about an mkfs.ext2 triggered ram disk
corruption?

http://lkml.org/lkml/2007/5/4/272

Well, in principle I'm now doing the same stuff, only this time with another
initrd, which mounts the root-fs over nfs.

[ 1596.928552] BUG: soft lockup detected on CPU#2!
[ 1596.933109]
[ 1596.933110] Call Trace:
[ 1596.933111][] softlockup_tick+0xd8/0xef
[ 1596.933129]  [] run_local_timers+0x13/0x15
[ 1596.933132]  [] update_process_times+0x4a/0x77
[ 1596.933138]  [] smp_local_timer_interrupt+0x34/0x54
[ 1596.933143]  [] smp_apic_timer_interrupt+0x61/0x78
[ 1596.933147]  [] apic_timer_interrupt+0x6b/0x70
[ 1596.933151][] free_buffer_head+0x24/0x3e
[ 1596.933162]  [] kmem_cache_free+0x1f4/0x201
[ 1596.933170]  [] free_buffer_head+0x24/0x3e
[ 1596.933175]  [] try_to_free_buffers+0x88/0x9f
[ 1596.933181]  [] try_to_release_page+0x39/0x40
[ 1596.933188]  [] invalidate_mapping_pages+0x9d/0x121
[ 1596.933196]  [] invalidate_inode_pages+0xf/0x11
[ 1596.933200]  [] invalidate_bdev+0x3b/0x3f
[ 1596.933203]  [] kill_bdev+0x13/0x29
[ 1596.933208]  [] __blkdev_put+0x62/0x141
[ 1596.933213]  [] blkdev_put+0xb/0xd
[ 1596.933218]  [] blkdev_close+0x2e/0x33
[ 1596.933222]  [] __fput+0xc3/0x172
[ 1596.933228]  [] fput+0x14/0x16
[ 1596.933233]  [] filp_close+0x61/0x6d
[ 1596.933238]  [] sys_close+0x8c/0xce
[ 1596.933244]  [] system_call+0x7e/0x83
[ 1596.933250]



Re: mkfs.ext2 triggerd RAM corruption

2007-05-05 Thread Bernd Schubert

On Sat, May 05, 2007 at 02:57:35PM -0400, Theodore Tso wrote:
> On Sat, May 05, 2007 at 03:36:37AM +0200, Bernd Schubert wrote:
> > distribution: modified debian sarge, in which aspect is the distribution
> > important for this problem? mkfs2.ext2 is supposed to write to /dev/sdaX
> > and not /dev/rd/0. Stracing it and grepping for open calls shows that
> > only /dev/sdaX is opened in read-write mode.
> 
> /dev/rd/0?  What's this?  Is this the partition where your root
> partition is found?  What is it?  Is it a ramdisk?  Or is it some kind
> of persistent storage device?
> 
> If it is a persistant storage device, do the corrupted files stay
> corrupted when you reboot?  (If it's a ramdisk which you load, then
> obviously it's getting reloaded on reboot.)  You didn't give enough
> information to be sure exactly what's going on.

Sorry, should have expressed myself more clearly, /dev/rd/0 is the
devfs-style name of the first ram disk device (don't like those devfs
names myself, but since I'm rather new in this group I couldn't convince
my boss to switch to short names yet ;) ). However, it's only the
devfs-style naming of udev and not devfs itself.

> 
> The next thing to ask is how the files are corrupted.  Can you see
> save a copy of the corrupted files to stable storage, so you can see
> *how* they were corrupted.  Were large swaths of zeros getting written
> into it?

Yes, many zeros. Binary files, hexdump and diff are here:
http://www.q-leap.com/~bschubert/data-corruption

> 
> Next question; if you don't use these mke2fs parameters, can you
> reproduce the corruption?
> 
>   mkfs.ext2 -j -b 4096 -F -i 4096 -J size=400 -I 512 /dev/sda4
> 
> What if you change the it to:
> 
>   mkfs.ext2 -j -b 4096  /dev/sda4
> 
> Do you still see corruption problems?

No, no observable corruption.

> 
> > I already tested several partition types, e.g. something like this for a
> > test on sda3
> > 
> > beo-05:~# sfdisk -d /dev/sda
> > # partition table of /dev/sda
> > unit: sectors
> > 
> > /dev/sda1 : start=   63, size=  4208967, Id=83
> > /dev/sda2 : start=  4209030, size=  4209030, Id=83
> > /dev/sda3 : start=  8418060, size=313251435, Id=83
> > /dev/sda4 : start=0, size=0, Id= 0
> 
> What if the partition size is smaller; does that make the problem go
> away?  If so, can you do a binary search on the partition size where
> the problem appears?

Need to test this thoroughly, but will do it tomorrow, it's too late
here for this kind of test.

> 
> And what can you say about the SATA driver you were using; were all of
> the machines that you tested this on using the same SATA controller
> and same driver?  

As you can see from my previous reply ;) tested with at least two
different controllers - intel and nvidia (will reboot the 4th system on
Monday to figure out its hardware; once the corruption has happened, the
system tends to stop working).

> 
> Obviously if this were a generic kernel problem, we'd been hearing
> about this from a lot more people.  So there has to be something
> unique to your setup, and we need to figure out what that might happen
> to be.

I also still have trouble believing it's a generic problem...


Thanks for your help,
Bernd


Re: mkfs.ext2 triggerd RAM corruption

2007-05-05 Thread Bernd Schubert

On Sat, May 05, 2007 at 09:12:02PM +0200, Jan Engelhardt wrote:
> 
> On May 5 2007 14:57, Theodore Tso wrote:
> >On Sat, May 05, 2007 at 03:36:37AM +0200, Bernd Schubert wrote:
> >> distribution: modified debian sarge, in which aspect is the distribution
> >> important for this problem? mkfs2.ext2 is supposed to write to /dev/sdaX
> >> and not /dev/rd/0. Stracing it and grepping for open calls shows that
> >> only /dev/sdaX is opened in read-write mode.
> >
> >/dev/rd/0?  What's this?
> 
> devfs (hint hint) naming for /dev/ram0.

Yep, but udev knows devfs style ... (I already said I tested vanilla
kernels, so no patches).

Bernd

Re: mkfs.ext2 triggerd RAM corruption

2007-05-04 Thread Bernd Schubert

Jan-Benedict Glaw wrote:

> On Fri, 2007-05-04 16:59:51 +0200, Bernd Schubert <[EMAIL PROTECTED]>
> wrote:
>> To see whats going on, I copied the entire / (so the initrd) into a
>> tmpfs
>> root, chrooted into it, also bind mounted the main / into this chroot
>> and
>> compared several times /bin of chroot/bin and the bind-mounted /bin
>> while
>> the mkfs.ext2 command was running.
>> 
>> beo-05:/# diff -r /bin /oldroot/bin/
>> beo-05:/# diff -r /bin /oldroot/bin/
>> beo-05:/# diff -r /bin /oldroot/bin/
>> Binary files /bin/sleep and /oldroot/bin/sleep differ
>> beo-05:/# diff -r /bin /oldroot/bin/
>> Binary files /bin/bsd-csh and /oldroot/bin/bsd-csh differ
>> Binary files /bin/cat and /oldroot/bin/cat differ
>> ...
>> 
>> Also tested different schedulers, at least happens with deadline and
>> anticipatory.
>> 
>> The corruption does NOT happen on running the mkfs command on
>> /dev/sda1,
>> but happens with sda2, sda3 and sda3. Also doesn't happen with
>> extended
>> partitions of sda1.
> 
> Is sda2 the largest filesystem out of sda2, sda3 (and the logical
> partitions within the extended sda1, if these get mkfs'ed, too)?

I tested it that way:

- test on sda1, no further partitions
- test on sda2, sda1: ~2MB, everything else for sda2
- test on sda3, sda1: ~2MB, sda2: ~2MB, everything else for sda3
...
- test on sda5: sda1: partition that has the extended partition,
  everything in sda5

> 
> I'm not too sure that this is a kernel bug, but probably a bad RAM
> chip. Did you run memtest86 for a while? ...and can you reproduce this
> problem on different machines?

Reproducible on 4 test-systems (2 with identical hardware, but then the
2 + 1 + 1 with entirely different hardware combinations) with ECC memory,
which is monitored by EDAC. Memory, CPU, etc. are already real life stress
tested with several applications, e.g. linpack. 
Though I don't entirely agree, my colleagues in this group are always
telling me that their real life stress test shows more memory
corruptions than memtest. As soon as I have physical access again, I can
also do a memtest86 run (would like to do it over the weekend, but don't
know how to convince stupid rembo to boot memtest).
Anyway, a memory corruption is more than unlikely on these systems for
several reasons.


Thanks,
Bernd

Re: mkfs.ext2 triggerd RAM corruption

2007-05-04 Thread Bernd Schubert

Theodore Tso wrote:

> On Fri, May 04, 2007 at 04:59:51PM +0200, Bernd Schubert wrote:
>> 
>> I'm presently rather puzzled, if this is really a kernel bug, its a
>> big
>> bug.
>> 
>> Summary: The system ramdisk (initrd) gets corrupted while running
>> mkfs.ext2 on a local sata disk partition.
> 
> What distribution are you using?  What's the hardware configuration,

distribution: modified debian sarge. In which respect is the distribution
important for this problem? mkfs.ext2 is supposed to write to /dev/sdaX
and not /dev/rd/0. Stracing it and grepping for open calls shows that
only /dev/sdaX is opened in read-write mode.

hardware:
beo-05 and beo-06: cpu: xeon, acpi shows S3000PTH board, memory 2GB
(board too new for EDAC), piix sata controller

beo-106: Dual Core AMD Opteron, no idea what kind of board, 4GB memory
(k8_edac monitored), nforce sata controller

beo-01: Presently can't connect to it, afaik another intel system

(all systems are running in x86_64 mode)

> including amount of memory?  What is the partition table look
> like for /dev/sda?  What filesystems are mounted?  If you have any

I already tested several partition types, e.g. something like this for a
test on sda3

beo-05:~# sfdisk -d /dev/sda
# partition table of /dev/sda
unit: sectors

/dev/sda1 : start=   63, size=  4208967, Id=83
/dev/sda2 : start=  4209030, size=  4209030, Id=83
/dev/sda3 : start=  8418060, size=313251435, Id=83
/dev/sda4 : start=0, size=0, Id= 0

For the tests nothing was mounted. 

> soft RAID partitions, are any of them using part of /dev/sda?  What

No raid during the tests on sda, of course. 
When sdaX was part of a raid and I tested on the raid device, the corruption
did NOT happen.

> swap partitions are you using?  And do any of the swap partitions

Swap already entirely disabled.

> overlap with /dev/sda?  :-)

Suspected this first too, but the tested partition was never used as
swap partition (first always tested on sda4 and sda2 was used for swap),
later I entirely disabled the swap.

Thanks,
Bernd

PS: It took me about 10 hours of testing before I wrote the first mail.
It took me that long to believe that it's really a kernel bug.

mkfs.ext2 triggerd RAM corruption

2007-05-04 Thread Bernd Schubert

Hi,

I'm presently rather puzzled; if this is really a kernel bug, it's a big bug. 

Summary: The system ramdisk (initrd) gets corrupted while running mkfs.ext2 on 
a local sata disk partition.

Reproduced on kernel versions: vanilla 2.6.16 - 2.6.20 (<2.6.16 doesn't run on 
any of the systems I can do tests with).
Please note: I could reproduce this on several systems, all of them use ECC 
memory and the memory of most of them is monitored using EDAC. 

Details:

1.) Our systems boot from an initrd, all system services are running from the 
initrd/ramdisk.

2.) While setting up a lustre meta data storage server, lustre runs 
mkfs.ext2 -j -b 4096 -F -i 4096 -J size=400 -I 512 /dev/sda4
(Please note, I first observed this while using a lustre patched kernel, but I 
could reproduce this with vanilla kernels).


While this mkfs.ext2 command was running, suddenly running commands such as 
ps, top, ls, etc. resulted in segmentation faults.

To see what's going on, I copied the entire / (so the initrd) into a tmpfs 
root, chrooted into it, also bind mounted the main / into this chroot and 
compared several times the chroot's /bin with the bind-mounted /bin while the 
mkfs.ext2 command was running.

beo-05:/# diff -r /bin /oldroot/bin/
beo-05:/# diff -r /bin /oldroot/bin/
beo-05:/# diff -r /bin /oldroot/bin/
Binary files /bin/sleep and /oldroot/bin/sleep differ
beo-05:/# diff -r /bin /oldroot/bin/
Binary files /bin/bsd-csh and /oldroot/bin/bsd-csh differ
Binary files /bin/cat and /oldroot/bin/cat differ
...

Also tested different schedulers, at least happens with deadline and 
anticipatory.

The corruption does NOT happen on running the mkfs command on /dev/sda1, but 
happens with sda2, sda3 and sda3. Also doesn't happen with extended 
partitions of sda1.

Any idea what's going on?


Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH

aic79xx oops

2005-09-09 Thread Bernd Schubert

Hello,

this morning our server crashed without any log messages, nothing was captured 
via serial cable and magic sysrqs also didn't work.
Anyway, during the reboot an oops in the aic79xx module occurred, see below.

This is a 2.6.11.12 kernel, patched with bluesmoke and the bio-clone fix. 
Furthermore, the drbd module is loaded. You may find a dmesg, lsmod and lspci 
information and the kernel config here:

http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/aic79xx-oops/


Oops:

(none) login: ACPI: PCI interrupt :02:06.0[A] -> GSI 24 (level, low) -> 
IRQ 24
ACPI: PCI interrupt :02:06.1[B] -> GSI 25 (level, low) -> IRQ 25
scsi4 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 1.3.11

aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs

scsi4:A:0:0: DV failed to configure device.  Please file a bug report against 
this driver.
(scsi4:A:0): 160.000MB/s transfers (80.000MHz DT, 16bit)
  Vendor: transtec  Model: T5008 Rev: 0001
  Type:   Direct-Access  ANSI SCSI revision: 03
scsi4:A:0:0: Tagged Queuing enabled.  Depth 32
SCSI device sdc: 4101521408 512-byte hdwr sectors (2099979 MB)
SCSI device sdc: drive cache: write back
SCSI device sdc: 4101521408 512-byte hdwr sectors (2099979 MB)
SCSI device sdc: drive cache: write back
 sdc: sdc1 sdc2 sdc3 < sdc5 sdc6 sdc7 sdc8 >
Attached scsi disk sdc at scsi4, channel 0, id 0, lun 0
Attached scsi generic sg2 at scsi4, channel 0, id 0, lun 0,  type 0
scsi: host 4 channel 0 id 0 lun 0x0200080c0400 has a LUN larger than 
currently supported.
scsi: host 4 channel 0 id 0 lun1002486961 has a LUN larger than allowed by the 
host adapter
scsi: host 4 channel 0 id 0 lun 0x0100407a27c0 has a LUN larger than 
currently supported.
scsi: host 4 channel 0 id 0 lun 0x007a27c0d05d27c0 has a LUN larger than 
currently supported.
scsi: host 4 channel 0 id 0 lun 0x305e27c0907b27c0 has a LUN larger than 
currently supported.
scsi: host 4 channel 0 id 0 lun 0xf08227c0b08d27c0 has a LUN larger than 
currently supported.
scsi: host 4 channel 0 id 0 lun 0x307827c0008527c0 has a LUN larger than 
currently supported.
scsi: host 4 channel 0 id 0 lun 0xb06727c0 has a LUN larger than 
currently supported.
scsi: host 4 channel 0 id 0 lun 0x306727c0706727c0 has a LUN larger than 
currently supported.
  Vendor: transtec  Model: T5008 Rev: 0001
  Type:   Direct-Access  ANSI SCSI revision: 03
Unable to handle kernel NULL pointer dereference at virtual address 0403
 printing eip:
f8a4de8e
*pde = 
Oops:  [#1]
SMP 
Modules linked in: aic79xx
CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00010246   (2.6.11.12-tc2) 
EIP is at ahd_send_async+0xde/0x2a0 [aic79xx]
eax: 000f   ebx: 0042   ecx: f7f05d28   edx: 
esi: 0400   edi: f7f74000   ebp:    esp: f7f05c64
ds: 007b   es: 007b   ss: 0068
Process modprobe (pid: 1081, threadinfo=f7f04000 task=f7ee1540)
Stack: c0135aae 0006162f  0282  f7f05c88 0050  
    0001 c01230a0 0001 f7f05d00 c0107376 f7f05d00 c03fb620 
    f7f05d0a  c038f3a0 2000 f7323536 2000  
Call Trace:
 [] __do_IRQ+0x10e/0x160
 [] do_timer+0xc0/0xd0
 [] timer_interrupt+0xb6/0x130
 [] __do_IRQ+0x10e/0x160
 [] ahd_set_tags+0x55/0x70 [aic79xx]
 [] ahd_linux_device_queue_depth+0xa7/0xd0 [aic79xx]
 [] ahd_linux_free_target+0x112/0x160 [aic79xx]
 [] ahd_linux_slave_configure+0x72/0xe0 [aic79xx]
 [] ahd_linux_slave_configure+0x0/0xe0 [aic79xx]
 [] scsi_add_lun+0x2aa/0x300
 [] scsi_probe_and_add_lun+0xd9/0x220
 [] scsi_report_lun_scan+0x330/0x480
 [] scsi_probe_and_add_lun+0xf2/0x220
 [] scsi_scan_target+0x102/0x130
 [] scsi_scan_channel+0x7a/0x90
 [] scsi_scan_host_selected+0xb5/0xf0
 [] scsi_scan_host+0x2f/0x40
 [] ahd_linux_register_host+0x21e/0x270 [aic79xx]
 [] sysfs_add_file+0x58/0x80
 [] sysfs_create_file+0x2e/0x50
 [] pci_create_newid_file+0x27/0x30
 [] pci_register_driver+0x8e/0x90
 [] ahd_linux_detect+0x4c/0x70 [aic79xx]
 [] ahd_linux_init+0xf/0x13 [aic79xx]
 [] sys_init_module+0x167/0x200
 [] syscall_call+0x7/0xb
Code: c7 44 24 20 00 00 00 00 80 fb 42 0f b6 87 41 1d 00 00 8d 50 08 0f 44 c2 
8d 54 ed 00 01 d2 03 94 87 6c 18 00 00 8d b2 00 04 00 00 <0f> b6 56 03 3a 56 
09 0f 84 2b 01 00 00 a1 28 d1 a6 f8 85 c0 0f 



Even though there was an oops, the aic79xx module and the card still seem to 
work. We are currently unsure whether we should reboot again and/or try a 
newer kernel, or whether we can leave the system as it is without another 
reboot. Of course, we would also like to know the reason for the oops. Any 
help is appreciated.

Thanks in advance,
    Bernd


-- 
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [EMAIL PROTECTED]

Re: 2.6.12.3 network slowdown?

2005-07-27 Thread Bernd Schubert

On Wednesday 27 July 2005 12:30, Mihai Rusu wrote:
> On Wed, Jul 27, 2005 at 01:44:43AM -0700, Howard Chu wrote:
> > I just recently compiled the 2.6.12.3 kernel for my x86_64 machine
> > (Asus A8V motherboard); was previously running a SuSE-compiled 2.6.8
> > kernel (SuSE 9.2 distro). I'm now seeing extremely slow throughput on
> > the onboard Yukon (Marvell) ethernet interface, but only in certain
> > conditions. Going back to the 2.6.8 kernel shows no slowdown.
>
> You might try the other SysKonnect driver as 2.6.12 ships with 2
> different drivers for this family of NICs.
>

No, AFAIK the rewritten driver is only in 2.6.13-rc or 2.6.12-mm (also already 
in previous -mm kernel versions).

Bernd

-- 
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [EMAIL PROTECTED]

Re: Problem with kernel 2.6.11

2005-07-14 Thread Bernd Schubert

Hello Francois,

> > > I have a problem with a program named Gaussian (http://www.gaussian.com)
> > > (versions g98 or g03) and FC 4.0 (default kernel 2.6.11): I am used to take
> > > Gaussian binaries compiled on the RedHat 9.0 version, and used them on FC
> > > 2.0 or FC 3.0. If I try to do so, on FC 4.0. (with the default kernel)
> > > Gaussian stops (both g98 and g03 versions) with the following error
> > > message:

could you please tell me which compiler you used to compile Gaussian?
It's most likely pgf77 (PGI), but the version is also important. If
it was 5.2, you just ran into bugs we already experienced some time ago.
I also posted a warning about that to the CCL list. On the CCL list I also saw
there were problems with PGI-6.0, but I never bothered to test this
myself, as our Gaussian binaries compiled with PGI-5.1 seem to work
fine. Also, binaries from the PGI compiler are, in our experience, rather
sensitive to the glibc version. I'm not absolutely sure what's causing
that, but somehow I'm under the impression that the PGI libraries, which all
binaries created with the PGI compiler are linked with, do some odd
optimizations. So to make sure that it's really a kernel issue, you should use
the libc of the system Gaussian was compiled on (via LD_LIBRARY_PATH) or
compile Gaussian statically.

> stat64("/home/fyd/0QM_SCR/Gau-3174.inp", 0xbf9db114) = -1 ENOENT (No such file

I'm a bit tired now and maybe I'm interpreting it wrong, but I think you
should use strace -f ...

> rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
> --- SIGCHLD (Child exited) @ 0 (0) ---

Same here.

Cheers,
Bernd

-- 
Bernd Schubert
PCI / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [EMAIL PROTECTED]


2.6.11.12: oops + panic

2005-07-10 Thread Bernd Schubert

Hello,

below is an oops that happened on one of our servers last week. This oops was 
still logged by syslog, but it seems it was immediately followed by another 
oops, which made the kernel panic. That last oops could no longer be logged 
by syslogd, and unfortunately capturing via a serial cable was also 
disabled. But I guess the second oops is only a result of the first one, 
so does anyone know where it comes from or how to debug it further? 

Thanks,
Bernd

Unable to handle kernel NULL pointer dereference at virtual address 0004
 printing eip:
c013f958
*pde = 
Oops: 0002 [#1]
SMP
Modules linked in: quota_v2 drbd parport_pc lp parport ohci_hcd usbcore 
i2c_amd756 i2c_amd8111
dm_mod w83627hf eeprom lm85 i2c_sensor i2c_isa i2c_core sk98lin tg3 aic79xx
CPU:0
EIP:0060:[free_block+72/208]Not tainted VLI
EFLAGS: 00010016   (2.6.11.12)
EIP is at free_block+0x48/0xd0
eax:    ebx: f027cd80   ecx: cc5be600   edx: 00580a78
esi: c2b93ac0   edi:    ebp: c2b93ae8   esp: c2991edc
ds: 007b   es: 007b   ss: 0068
Process events/0 (pid: 6, threadinfo=c299 task=c2962a60)
Stack: c0407b60 c011440a c2b93af8 c2a0abd0 c2a0abc0 0002 c2b93ac0 c014010a
   c2b93ac0 c2a0abd0 0002 c2b93a38 c2b93ac0 0005 c2b93a60 c01401d4
   c2b93ac0 c2a0abc0  c2b93a38 c299 c2b93b50 0296 c2814ac0
Call Trace:
 [finish_task_switch+58/128] finish_task_switch+0x3a/0x80
 [drain_array_locked+122/192] drain_array_locked+0x7a/0xc0
 [cache_reap+132/496] cache_reap+0x84/0x1f0
 [worker_thread+441/592] worker_thread+0x1b9/0x250
 [cache_reap+0/496] cache_reap+0x0/0x1f0
 [default_wake_function+0/32] default_wake_function+0x0/0x20
 [default_wake_function+0/32] default_wake_function+0x0/0x20
 [worker_thread+0/592] worker_thread+0x0/0x250
 [kthread+183/192] kthread+0xb7/0xc0
 [kthread+0/192] kthread+0x0/0xc0
 [kernel_thread_helper+5/20] kernel_thread_helper+0x5/0x14
Code: 46 38 8d 6e 28 89 44 24 08 8b 44 24 24 8b 15 50 ea 4e c0 8b 0c b8 8d 81 
00 00 00 40 c1 e8
 0c c1 e0 05 8b 5c 02 1c 8b 53 04 8b 03 <89> 50 04 89 02 c7 43 04 00 02 20 00 
2b 4b 0c c7 03 00 01 10 00


-- 
Bernd Schubert
PCI / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [EMAIL PROTECTED]

Re: CDR read problems with 2.6.11?

2005-04-13 Thread Bernd Schubert

[...]
> 
> [EMAIL PROTECTED]:~[1009]# mount /mnt/cdrom
> mount: wrong fs type, bad option, bad superblock on /dev/cdrom,
>missing codepage or other error
>In some cases useful info is found in syslog - try
>dmesg | tail  or so
>

[...]

> The drive is a NEC DVD+RW ND-5100A
> 
> Any suggestions on why I can't read (or burn correctly) the disks with 2.6.11?
>

I have seen exactly the same on my father's computer and could solve this
by not starting the udftools. I didn't have the time to dig further into
this... 
Can you confirm that it's really a UDF problem? Just run
"/etc/init.d/udftools stop" or the equivalent for your distribution and try
mounting again.

Cheers,
Bernd

Re: x86_64: 32bit emulation problems

2005-03-03 Thread Bernd Schubert

On Thursday 03 March 2005 10:19, Andi Kleen wrote:
> On Wed, Mar 02, 2005 at 08:53:07AM -0800, Trond Myklebust wrote:
> > on den 02.03.2005 Klokka 12:33 (+0100) skreiv Bernd Schubert:
> > > > I can see no good reason for truncating inode number values on
> > > > platforms that actually do support 64-bit inode numbers, but I can
> > > > see several
> > >
> > > Well, at least we would have a reason ;)
> >
> > A 32-bit emulation mode is clearly a "platform" which does NOT support
> > 64-bit inode numbers, however there is (currently) no way for the kernel
> > to detect that you are running that. Any extra truncation should
> > therefore ideally be done by the emulation layer rather than the kernel
> > itself.
>
> The problem here is that glibc uses stat64() which supports
> 64bit inode numbers. But glibc does the overflow checking itself
> and generates the EOVERFLOW in user space. Nothing we can do
> about that. The 64bit inodes work under 32bit too, so your
> code checking for 64bitness is totally bogus.
>
> The old stat interface doesn't check that case currently either
> (will fix that), but that's not the problem here.
>
> But in general the emulation layer cannot do truncation because
> it doesn't know if it is ok to do for the low level file system.
> If anything this has to be done in the fs.
>

So what do you actually suggest? On the one hand you say even 32bit userspace 
supports 64bit inodes, if it wants. On the other hand you say the truncation 
needs to be done at the file system level. 
To my mind these two statements contradict each other: the first suggests 
doing the truncation in userspace, the second says it can only be done in the 
kernel?

Cheers,
 Bernd

Re: x86_64: 32bit emulation problems

2005-03-02 Thread Bernd Schubert

On Wednesday 02 March 2005 17:53, Trond Myklebust wrote:
> on den 02.03.2005 Klokka 12:33 (+0100) skreiv Bernd Schubert:
> > > I can see no good reason for truncating inode number values on
> > > platforms that actually do support 64-bit inode numbers, but I can see
> > > several
> >
> > Well, at least we would have a reason ;)
>
> A 32-bit emulation mode is clearly a "platform" which does NOT support
> 64-bit inode numbers, however there is (currently) no way for the kernel
> to detect that you are running that. Any extra truncation should
> therefore ideally be done by the emulation layer rather than the kernel
> itself.
>

I already found the function in glibc and it looks as if it would be rather 
easy to do it there. I only hope the glibc maintainers will accept this kind 
of fix (I hope they won't say that nobody needs this).

Cheers,
 Bernd

PS: Also many thanks for fixing other bugs in the NFS client. Until 2.6.9 init 
somehow could not open /dev/console on a read-only mount point. With 2.6.11 
this problem has disappeared, thanks a lot for fixing this and other 
problems. I never had the time to write a bug report for that.

-- 
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [EMAIL PROTECTED]

Re: x86_64: 32bit emulation problems

2005-03-02 Thread Bernd Schubert

On Wednesday 02 March 2005 10:13, Trond Myklebust wrote:
> on den 02.03.2005 Klokka 09:18 (+0100) skreiv Andi Kleen:
> > On Wed, Mar 02, 2005 at 12:46:23AM +0100, Andreas Schwab wrote:
> > > Bernd Schubert <[EMAIL PROTECTED]> writes:
> > > > Hmm, after compiling with -D_FILE_OFFSET_BITS=64 it works fine. But
> > > > why does it work without this option on a 32bit kernel, but not on a
> > > > 64bit kernel?
> > >
> > > See nfs_fileid_to_ino_t for why the inode number is different between
> > > 32bit and 64bit kernels.
> >
> > Ok that explains it. Thanks.

Many thanks also from me!

> >
> > Best would be probably to just do the shift unconditionally on 64bit
> > kernels too.
> >
> > Trond, what do you think?
>
> Why would this be more appropriate than defining __kernel_ino_t on the
> x86_64 platform to be of the size that you actually want the kernel to
> support?
>
> I can see no good reason for truncating inode number values on platforms
> that actually do support 64-bit inode numbers, but I can see several

Well, at least we would have a reason ;)

> reasons why you might want not to (utilities that need to detect hard
> linked files for instance).

Anyway, glibc already seems to have a condition for that, so IMHO glibc could 
also truncate the inode numbers if needed. And after all, glibc probably knows 
best whether it's compiled as 32bit or 64bit. I will take a look at the glibc 
sources.

Many, many thanks to all for their help!

Best wishes,
 Bernd

Re: x86_64: 32bit emulation problems

2005-03-01 Thread Bernd Schubert

On Tuesday 01 March 2005 23:10, Andreas Schwab wrote:
> Bernd Schubert <[EMAIL PROTECTED]> writes:
> >> It is most likely some kind of user space problem.  I would change
> >> it to int err = stat(dir, &buf);
> >> and then go through it with gdb and see what value err gets assigned.
> >>
> >> I cannot see any kernel problem.
> >
> > The err value will become -1 here.
>
> That's because there are some values in the stat64 buffer delivered by the
> kernel which cannot be packed into the stat buffer that you pass to stat.
> Use stat64 or _FILE_OFFSET_BITS=64.

Hmm, after compiling with -D_FILE_OFFSET_BITS=64 it works fine. But why does 
it work without this option on a 32bit kernel, but not on a 64bit kernel?

32bit kernel, 32bit binary: always works
64bit kernel, 64bit binary: always works

64bit kernel, 32bit binary:
 - always works on knfsd mount points
 - always works with -D_FILE_OFFSET_BITS=64
 - only works on unfs3 mount points with _FILE_OFFSET_BITS=64 


Do I really have to write a bug report for every single Debian package that 
accesses /etc and /var to make the maintainers recompile it with 
-D_FILE_OFFSET_BITS=64? 
Btw, what about SuSE, are all packages there compiled with this option? ;)


Cheers, 
(a completely confused) Bernd

Re: x86_64: 32bit emulation problems

2005-03-01 Thread Bernd Schubert

> strace didn't say so, and normally it doesn't lie about things like this.

Well, I'll show you the updated source code and strace output, and if you 
still don't believe me, ask me for a login to our system ;)


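/* Headers needed by the calls below (names not preserved in the archive). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>


int main(int argc, char **argv)
{
    char *dir;
    struct stat *buf;
    int err;

    dir = argv[1];

    buf = malloc(sizeof(struct stat));

    /* clear errno so that a stale value cannot be reported below */
    errno = 0;

    err = stat(dir, buf);
    if ( err ) {
        fprintf(stderr, "err = %i\n", err);
        fprintf(stderr, "stat for %s failed \n", dir);
        fprintf(stderr, "ernno: %i (%s)\n", errno, strerror(errno));
    } else
        fprintf(stderr, "stat() works fine.\n");

    return (0);
}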


>
> > > [EMAIL PROTECTED] tests>./test_stat32 /mnt/test/yp
> > > stat for /mnt/test/yp failed
> > > ernno: 75 (Value too large for defined data type)
>
> errno is undefined unless a system call returned -1 before or
> you set it to 0 before.

See above.

>
> > > But why does stat64() on a 64-bit kernel tries to fill in larger data
> > > than
>
> A 64bit kernel has no stat64(). All stats are 64bit.

[EMAIL PROTECTED] tests>strace32 ./test_stat32 /mnt/test/yp
execve("./test_stat32", ["./test_stat32", "/mnt/test/yp"], [/* 43 vars */]) = 
0
uname({sys="Linux", node="hitchcock", ...}) = 0
brk(0)  = 0x80ad000
brk(0x80ce000)  = 0x80ce000
stat64("/mnt/test/yp", {st_mode=S_IFDIR|0755, st_size=2704, ...}) = 0
write(2, "err = -1\n", 9err = -1
)   = 9
write(2, "stat for /mnt/test/yp failed \n", 30stat for /mnt/test/yp failed
) = 30
write(2, "ernno: 75 (Value too large for d"..., 50ernno: 75 (Value too large 
for defined data type)
) = 50
exit_group(0)   = ?

You certainly know much better than me, but I think strace shows that it's 
calling stat64.

>
> > > on a 32-bit kernel and larger data also only for nfs-mount points? Hmm,
> > > I will tomorrow compare the tcp-packges sent by the server.
> >
> > So I still think thats a kernel bug.
>
> Your data so far doesn't support this assertion.

I have to admit that knfsd mount points are not affected, but on the other 
hand, I really can't see anything in the ethereal captures. If someone 
should be interested, I have uploaded them:

http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/nfs-stat/


Cheers,
 Bernd


-- 
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [EMAIL PROTECTED]

Re: x86_64: 32bit emulation problems

2005-03-01 Thread Bernd Schubert

Hello Andi,

sorry, due to some mail sending/refusing problems, I had to resend to the 
nfs-list, which prevented the answers there to be posted to the other CCs.

> It is most likely some kind of user space problem.  I would change
> it to int err = stat(dir, &buf);
> and then go through it with gdb and see what value err gets assigned.
>
> I cannot see any kernel problem.

The err value will become -1 here.

Trond Myklebust already suggested looking at the value of errno:

On Tuesday 01 March 2005 00:43, Bernd Schubert wrote:
> On Monday 28 February 2005 23:26, you wrote:
> > Given that strace shows that both syscalls (stat64() and stat())
> > succeed, I expect the "problem" is probably just glibc setting an
> > EOVERFLOW error in the 32-bit case. That's what it is supposed to do if
> > a 64 bit value overflows the 32-bit buffers.
>
> Right, thanks.
>
> > Have you tried looking at errno?
>
> [EMAIL PROTECTED] tests>./test_stat32 /mnt/test/yp
> stat for /mnt/test/yp failed
> ernno: 75 (Value too large for defined data type)
>
> But why does stat64() on a 64-bit kernel tries to fill in larger data than
> on a 32-bit kernel and larger data also only for nfs-mount points? Hmm, I
> will tomorrow compare the tcp-packges sent by the server.

So I still think that's a kernel bug.


Thanks,
 Bernd

-- 
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [EMAIL PROTECTED]

Re: x86_64: 32bit emulation problems

2005-02-28 Thread Bernd Schubert


> As usual we are using unfs3 for /etc and /var, but for me that looks like a
> client problem. I'm even not sure if this is limited to NFS at all.

Sorry, that was easy to test, of course. This problem doesn't seem to exist on 
a local disk.

x86_64: 32bit emulation problems

2005-02-28 Thread Bernd Schubert

Hi,

I'm just looking into a very strange problem. Some of our systems have 
athlon64 CPUs. Due to our diskless nfs environment we currently still prefer 
a 32bit userspace environment, but would like to be able to use a 64-bit 
chroot environment.

Well, currently there seems to be a stat64() NFS problem when an x86_64 kernel 
is booted and stat64() comes from a 32bit libc.

Here's just an example:

hitchcock:/home/bernd/src/tests# ./test_stat64 /mnt/test/yp
stat() works fine.


hitchcock:/home/bernd/src/tests# ./test_stat32 /mnt/test/yp
stat for /mnt/test/yp failed 


The test program looks rather simple:

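/* Headers needed by the calls below (names not preserved in the archive). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>


int main(int argc, char **argv)
{
    char *dir;
    struct stat buf;

    dir = argv[1];

    if (stat (dir, &buf) == -1)
        fprintf(stderr, "stat for %s failed \n", dir);
    else
        fprintf(stderr, "stat() works fine.\n");
    return (0);
}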


Here are the strace outputs:
============================

32bit:
------
hitchcock:/home/bernd/src/tests# strace32 ./test_stat32 /mnt/test/yp
execve("./test_stat32", ["./test_stat32", "/mnt/test/yp"], [/* 39 vars */]) = 
0
uname({sys="Linux", node="hitchcock", ...}) = 0
brk(0)  = 0x80ad000
brk(0x80ce000)  = 0x80ce000
stat64("/mnt/test/yp", {st_mode=S_IFDIR|0755, st_size=2704, ...}) = 0
write(2, "stat for /mnt/test/yp failed \n", 30stat for /mnt/test/yp failed 
) = 30
exit_group(0)   = ?

64bit:
------
hitchcock:/home/bernd/src/tests# strace ./test_stat64 /mnt/test/yp
execve("./test_stat64", ["./test_stat64", "/mnt/test/yp"], [/* 39 vars */]) = 
0
uname({sys="Linux", node="hitchcock", ...}) = 0
brk(0)  = 0x572000
brk(0x593000)   = 0x593000
stat("/mnt/test/yp", {st_mode=S_IFDIR|0755, st_size=2704, ...}) = 0
write(2, "stat() works fine.\n", 19stat() works fine.
)= 19
_exit(0)= ?



Does anyone have an idea what's going on? The ethereal capture also looks 
pretty normal. The kernel of this system is 2.6.9, but it also happens on 
another system with 2.6.11-rc5.
As usual we are using unfs3 for /etc and /var, but to me that looks like a 
client problem. I'm not even sure if this is limited to NFS at all.


Thanks in advance,
 Bernd

Re: swapper: page allocation failure. order:1, mode:0x20

2005-02-28 Thread Bernd Schubert

Hello Benjamin,

On Monday 28 February 2005 16:23, Benjamin L. Shi wrote:
> We've seen these; adding the following tunables resolved the problem.
> More specifically, the lower zone protection made the difference.
>
> vm.vfs_cache_pressure=1000
> vm.lower_zone_protection=100
> vm.max_map_count = 32668
> vm.min_free_kbytes = 1
>

many thanks, we will test this now and set those values on all of our 2.6. 
systems.

Thanks a lot again,
 Bernd

swapper: page allocation failure. order:1, mode:0x20

2005-02-28 Thread Bernd Schubert

Oh no, not these page allocation problems again. In summer I already posted 
about page allocation errors with 2.6.7, but to me it seemed that 
nobody cared. Back then we got those problems every morning during the cron 
jobs, and our main file server always crashed completely.
This time it's our cluster master system, and it first happened after an 
uptime of 89 days; the kernel is 2.6.9. Apart from those messages, the system 
still seems to run stably.

I really beg for help here, so please please please help me solve this 
problem. What can I do to solve it?

First a (dumb) question, what does 'page allocation failure' really mean? 
Is it some out of memory case?


Thanks a lot in advance for any help,
 Bernd





Feb 28 10:04:45 hitchcock kernel: swapper: page allocation failure. order:1, 
mode:0x20
Feb 28 10:04:45 hitchcock kernel:
Feb 28 10:04:45 hitchcock kernel: Call Trace: 
{__alloc_pages+878} {__get_free_pages+14}
Feb 28 10:04:45 hitchcock kernel:{kmem_getpages+38} 
{ip_frag_create+26}
Feb 28 10:04:45 hitchcock kernel:{cache_grow+190} 
{cache_alloc_refill+560}
Feb 28 10:04:45 hitchcock kernel:{__kmalloc+195} 
{alloc_skb+64}
Feb 28 10:04:45 hitchcock kernel:
{tg3_alloc_rx_skb+222} {tg3_rx+371}
Feb 28 10:04:45 hitchcock kernel:{tg3_poll+183} 
{net_rx_action+134}
Feb 28 10:04:45 hitchcock kernel:{__do_softirq+123} 
{do_softirq+50}
Feb 28 10:04:45 hitchcock kernel:{do_IRQ+347} 
{ret_from_intr+0}
Feb 28 10:04:45 hitchcock kernel:  
{default_idle+0} {default_idle+36}
Feb 28 10:04:45 hitchcock kernel:{cpu_idle+39}
Feb 28 10:05:41 hitchcock rpc.mountd: authenticated unmount request from 
beo-04:666 for /lib64 (/lib64)
Feb 28 10:04:45 hitchcock kernel: swapper: page allocation failure. order:1, 
mode:0x20
Feb 28 10:07:36 hitchcock kernel:
Feb 28 10:07:36 hitchcock kernel: Call Trace: 
{__alloc_pages+878} {__get_free_pages+14}
Feb 28 10:07:36 hitchcock kernel:{kmem_getpages+38} 
{cache_grow+190}
Feb 28 10:07:36 hitchcock kernel:
{cache_alloc_refill+560} {__kmalloc+195}
Feb 28 10:07:36 hitchcock kernel:{alloc_skb+64} 
{tg3_alloc_rx_skb+222}
Feb 28 10:07:36 hitchcock kernel:{tg3_rx+371} 
{tg3_poll+183}
Feb 28 10:07:36 hitchcock kernel:{net_rx_action+134} 
{__do_softirq+123}
Feb 28 10:07:36 hitchcock kernel:{do_softirq+50} 
{do_IRQ+347}
Feb 28 10:07:36 hitchcock kernel:{ret_from_intr+0}  
 {default_idle+0}
Feb 28 10:07:36 hitchcock kernel:{default_idle+36} 
{cpu_idle+39}
Feb 28 10:07:36 hitchcock kernel:
Feb 28 10:07:36 hitchcock kernel: swapper: page allocation failure. order:1, 
mode:0x20
Feb 28 10:07:36 hitchcock kernel:
Feb 28 10:07:36 hitchcock kernel: Call Trace: 
{__alloc_pages+878} {__get_free_pages+14}
Feb 28 10:07:36 hitchcock kernel:{kmem_getpages+38} 
{cache_grow+190}
Feb 28 10:07:36 hitchcock kernel:
{cache_alloc_refill+560} {__kmalloc+195}
Feb 28 10:07:36 hitchcock kernel:{alloc_skb+64} 
{tg3_alloc_rx_skb+222}
Feb 28 10:07:36 hitchcock kernel:{tg3_rx+371} 
{tg3_poll+183}
Feb 28 10:07:36 hitchcock kernel:{net_rx_action+134} 
{__do_softirq+123}
Feb 28 10:07:36 hitchcock kernel:{do_softirq+50} 
{do_IRQ+347}
Feb 28 10:07:36 hitchcock kernel:{ret_from_intr+0}  
 {default_idle+0}
Feb 28 10:07:36 hitchcock kernel:{default_idle+36} 
{cpu_idle+39}


-- 
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [EMAIL PROTECTED]
