Re: [braindump][RFC] signals and syscall restarts (Re: [PATCH v2 19/44] metag: Signal handling)

2012-12-07 Thread Al Viro
On Thu, Dec 06, 2012 at 10:09:55PM +, Al Viro wrote:
>   "Subtle and undocumented" is an extremely polite way to describe that.
> By now we had at least a dozen architectures step on that trap, simply because
> they had different calling conventions and the same logics did *not* "just
> work" there.  
> 
>   What we need to guarantee is
> * restarts do not happen on signals caught in interrupts or exceptions
> * restarts do not happen on signals caught in sigreturn()
> * restart should happen only once, even if we get through do_signal() many
> times.

FWIW, here's the current situation:

alpha: works.  Double restarts are prevented by the loop in do_work_pending()
resetting 'r0' (syscall number or 0 if restarts should not be done)
to 0 after the first call of do_signal(); all restart logics is conditional
on r0 != 0.  The logics making sure that we get the right value passed
to do_work_pending() is convoluted and had cost us at least one bug
(sigreturn/rt_sigreturn had stopped only once in case of straced process;
strace(1) got seriously confused and produced garbage).

arm: works.  Double restarts are prevented by logics similar to alpha
do_work_pending(); prevention of restarts on non-syscalls and sigreturn is
done by asm glue setting r8 ('why', aka 'tbl') to 0 in non-syscall entry points
and to syscall table address in syscall entry; zeroed in asm wrappers for
sigreturn/rt_sigreturn.  Used to be broken until several years ago.

arm64: works.  Syscall number is in pt_regs (->syscallno); -1 for non-syscall
ones.  Reaching do_signal() the first time around will set it to -1 and so will
sigreturn (in restore_sigframe()).  Restart logics is conditional on
->syscallno being non-negative.
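
The arm64 scheme above can be modelled in a few lines of userspace C.  This
is a sketch under stated assumptions, not the kernel code: struct model_regs,
model_do_signal() and INSN_SIZE are hypothetical names, handler setup and
SA_RESTART are ignored, and only the ERESTART* values are taken from the
kernel's errno definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel-internal restart codes; never visible to userland. */
#define ERESTARTSYS            512
#define ERESTARTNOINTR         513
#define ERESTARTNOHAND         514
#define ERESTART_RESTARTBLOCK  516

#define INSN_SIZE 4  /* assumed size of the syscall instruction */

/* Hypothetical stand-in for the relevant parts of pt_regs. */
struct model_regs {
    long syscallno;  /* >= 0 inside a syscall, -1 otherwise */
    long retval;     /* return-value register (shared with first argument) */
    long orig_arg0;  /* first argument as saved at syscall entry */
    long pc;         /* user program counter */
};

static int restart_worthy(long ret)
{
    return ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
           ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK;
}

/*
 * One pass through signal delivery.  Returns 1 if this pass rewound pc
 * to re-execute the syscall, 0 otherwise.
 */
int model_do_signal(struct model_regs *regs)
{
    long syscallno;

    assert(regs != NULL);
    syscallno = regs->syscallno;

    /* Forget that we came from a syscall, so later passes cannot restart. */
    regs->syscallno = -1;

    if (syscallno >= 0 && restart_worthy(regs->retval)) {
        regs->retval = regs->orig_arg0;  /* restore the clobbered argument */
        regs->pc -= INSN_SIZE;           /* back up to the syscall insn */
        return 1;
    }
    return 0;
}
```

Calling model_do_signal() twice on the same registers rewinds pc only once,
even when the restored first argument is itself -514; a non-syscall entry
(syscallno == -1) is never rewound.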

avr32: _very_ odd logics used to decide whether to do restarts or not and
frankly, I do not believe that it could possibly work correctly - whatever
we do when building a sigframe, we don't touch SYSREG_SR in process, so
that won't prevent double restarts.  And if we had r12 (first argument of
syscall) restart-worthy at the entry, setup_syscall_restart() will leave
us with restart-worthy value in ->r12.  So e.g. pause(2) called when
r12 happened to contain -514 (it's a zero-argument syscall, so calling it
doesn't involve assignments to r12) will happily hit double restarts if
we have e.g. SIGCHLD coming often enough.  If that thing works, I would
really appreciate a detailed explanation of how it manages to do that - it
definitely deserves one.

blackfin: doesn't handle multiple signals; if you get a SIGSEGV generated
by failing attempt to set a sigframe up, too bad - you'll pass to userland
and coredump will hit at some later point when a hardware interrupt happens.
Restarts on sigreturn and non-syscalls are prevented by checking if
->orig_p0 is non-negative (similar to arm64 solution above) and it's easy to
turn into prevention of double restarts, which will become necessary as soon
as multiple signal handling gets fixed.  Actually, it's almost OK as is -
ERESTART_RESTARTBLOCK case is the only problematic one.

c6x: there's a flag next to pt_regs on stack and it's non-zero if and only
if we have a syscall.  Passed explicitly to do_notify_resume() to tell if
restarts are allowed.  As far as I can see, it is vulnerable to bogus
restarts on sigreturn (can be fixed by clearing the same flag in
do_rt_sigreturn() - simple *(long *)(regs + 1) = 0 in there will do).
It might be vulnerable to double restarts as well - pause(2) is not
enabled there, but it's not much comfort.  In the best case we are relying
on the following property:
no syscall can return -ERESTART... when called with the first
argument equal to that value.
Might be true (the usual suspects are pause() and ancient sigsuspend() of
3-argument variety and neither is used here), but it's brittle as hell.
Come to think of that, clone(2) would probably fit the bill - we ignore
all unknown bits in the mask and clone(2) *can* return -ERESTARTNOINTR.
So it's almost certainly vulnerable.  The same fix would do, but explicit
loop a-la arm might be better.

cris: same lossage as on blackfin (quits after the first signal).  Vulnerable
to bogus restarts on sigreturn.  Would be vulnerable to double restarts if it
handled multiple signals.

frv: works (similar to the situation on arm64).  Used to be broken until a
couple of years ago.

h8300: works.  Prevention of restarts on sigreturn and non-syscalls based
on sign of ->orig_er0; the first pass through syscall restart logics renders
regs->er0 (return value) non-restart-worthy - anything that used to be
restart-worthy becomes either non-negative (->orig_er0 has to be, or we
won't touch that at all) or -EINTR.  In other words, we can't hit that
sucker twice.

hexagon: broken.  Prevention of restarts on non-syscalls is based on
sign of ->syscall_nr.  sigreturn carefully sets it *positive* and that
makes it vulnerable to bogus restarts.  Moreover, double restarts are
possible as well, same as on c6x.

ia64: 

Re: Look Ma, da kernel is b0rken

2012-12-07 Thread Andreas Mohr
Hi,

On Fri, Dec 07, 2012 at 06:44:05PM +0100, Borislav Petkov wrote:
> On Fri, Dec 07, 2012 at 05:52:18PM +0100, Andreas Mohr wrote:
> > Hmm, anyone deeply familiar with ISA PnP ID magic? :)
> 
> Even if this is violating the ACPI spec, any fix for this needs to be
> tested on the hardware (and I can very well imagine that the hardware
> might be violating the spec too, nothing new here).
> 
> So even if you had a fix, you need to run it on the hardware to verify
> that it actually works.

And that demand actually applies to both the '@' change (questionable)
and the much less disputed (obviously correct) wrong conditional fixup,
since both introduce a notable change (either large, or possibly
improper) in behaviour.

> So the actual practical question turns into: do you have such hardware
> to verify your or anyone else's fix on?

Not the ALS100 (only ALS4000 here).
I possibly have some other ISA hardware, but probably none which contains
'@' data in their PnP id struct.
The driver for the well-known case of ISDN PnP cards
does not seem to contain it.
However ISTR that CMI8330 was quite widespread (did I have one? Do I??).
For identification, see http://www.yjfy.com/C/C-Media/soundchipset/CMI8330A.htm

I'm afraid I should get an old system back up and running,
exactly for such validation work cases (and perhaps so should a select few
other developers, too).

BTW, "my" fix? I thought that everybody had come to the conclusion by now
that I merely pointed out (in no uncertain terms to boot)
that something was broken :)

Andreas Mohr
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A typo about kernelcore= ?

2012-12-07 Thread anish kumar
On Fri, 2012-12-07 at 15:02 +0800, Han Pingtian wrote:
> Hi there,
> 
> I'm wondering this is a typo in Documentation/kernel-parameters.txt
> about "kernelcore=":
> 
>  In the event, a node is too small to have both
> kernelcore and Movable pages, kernelcore pages will
> take priority and other nodes will have a larger number
> of kernelcore pages.
> 
> I think it should be 
> 
>  In the event, a node is too small to have both
> kernelcore and Movable pages, kernelcore pages will
> take priority and other nodes will have a larger number
> of *Movable* pages.
> 
> Is it right? Thanks in advance!
adding the maintainer.




[PATCH] Documentation: cgroup: update the index file

2012-12-07 Thread Namjae Jeon
From: Namjae Jeon 

New files have been added to the cgroup documentation. Let's
update the index file to list the remaining files.

Signed-off-by: Namjae Jeon 
Signed-off-by: Amit Sahrawat 
---
 Documentation/cgroups/00-INDEX |8 
 1 file changed, 8 insertions(+)

diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX
index 3f58fa3..f78b90a 100644
--- a/Documentation/cgroups/00-INDEX
+++ b/Documentation/cgroups/00-INDEX
@@ -1,7 +1,11 @@
 00-INDEX
- this file
+blkio-controller.txt
+   - Description for Block IO Controller, implementation and usage details.
 cgroups.txt
- Control Groups definition, implementation details, examples and API.
+cgroup_event_listener.c
+   - A user program for cgroup listener.
 cpuacct.txt
- CPU Accounting Controller; account CPU usage for groups of tasks.
 cpusets.txt
@@ -10,9 +14,13 @@ devices.txt
- Device Whitelist Controller; description, interface and security.
 freezer-subsystem.txt
- checkpointing; rationale to not use signals, interface.
+hugetlb.txt
+   - HugeTLB Controller implementation and usage details.
 memcg_test.txt
- Memory Resource Controller; implementation details.
 memory.txt
- Memory Resource Controller; design, accounting, interface, testing.
+net_prio.txt
+   - Network priority cgroups details and usages.
 resource_counter.txt
- Resource Counter API.
-- 
1.7.9.5



[PATCH] lib/parser.c: Fix up comments for valid return values from match_number

2012-12-07 Thread Namjae Jeon
From: Namjae Jeon 

match_number() can return -ENOMEM, -EINVAL and -ERANGE. The documented
return values of all the functions calling match_number() should
therefore include these values. Fix up the comments to reflect
the correct values.

Signed-off-by: Namjae Jeon 
Signed-off-by: Amit Sahrawat 
---
 lib/parser.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/parser.c b/lib/parser.c
index 52cfa69..807b2aa 100644
--- a/lib/parser.c
+++ b/lib/parser.c
@@ -157,7 +157,7 @@ static int match_number(substring_t *s, int *result, int base)
  *
  * Description: Attempts to parse the &substring_t @s as a decimal integer. On
  * success, sets @result to the integer represented by the string and returns 0.
- * Returns either -ENOMEM or -EINVAL on failure.
+ * Returns -ENOMEM, -EINVAL, or -ERANGE on failure.
  */
 int match_int(substring_t *s, int *result)
 {
@@ -171,7 +171,7 @@ int match_int(substring_t *s, int *result)
  *
  * Description: Attempts to parse the &substring_t @s as an octal integer. On
  * success, sets @result to the integer represented by the string and returns
- * 0. Returns either -ENOMEM or -EINVAL on failure.
+ * 0. Returns -ENOMEM, -EINVAL, or -ERANGE on failure.
  */
 int match_octal(substring_t *s, int *result)
 {
@@ -185,7 +185,7 @@ int match_octal(substring_t *s, int *result)
  *
  * Description: Attempts to parse the &substring_t @s as a hexadecimal integer.
  * On success, sets @result to the integer represented by the string and
- * returns 0. Returns either -ENOMEM or -EINVAL on failure.
+ * returns 0. Returns -ENOMEM, -EINVAL, or -ERANGE on failure.
  */
 int match_hex(substring_t *s, int *result)
 {
-- 
1.7.9.5
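
The three failure codes named in the updated comments can be illustrated
with a userspace analogue. This is only a sketch, assuming strtol() in place
of the kernel's parsing helper and malloc() in place of the kernel allocator;
struct substring and sketch_match_number() are hypothetical names, not the
lib/parser.c implementation.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for substring_t: a [from, to) character range. */
struct substring {
    const char *from;
    const char *to;
};

/*
 * Parse the range as an integer in the given base.
 * Returns 0 on success, or -ENOMEM, -EINVAL or -ERANGE on failure.
 */
int sketch_match_number(const struct substring *s, int *result, int base)
{
    size_t len;
    char *buf, *endp;
    long val;
    int ret = 0;

    assert(s != NULL && result != NULL && s->to >= s->from);

    /* Copy the range so that it is NUL-terminated for strtol(). */
    len = (size_t)(s->to - s->from);
    buf = malloc(len + 1);
    if (!buf)
        return -ENOMEM;
    memcpy(buf, s->from, len);
    buf[len] = '\0';

    errno = 0;
    val = strtol(buf, &endp, base);
    if (endp == buf)
        ret = -EINVAL;   /* no digits were consumed */
    else if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
        ret = -ERANGE;   /* does not fit in an int */
    else
        *result = (int)val;

    free(buf);
    return ret;
}
```

A caller that only checks for -ENOMEM and -EINVAL would treat an overflowing
value as an unknown error, which is why the comments need to list -ERANGE.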



[PATCH 5/5] f2fs: Fix for parent inode information during server cache eviction

2012-12-07 Thread Namjae Jeon
From: Namjae Jeon 

Test Case:
[NFS Client]
ls -lR .

[NFS Server]
while [ 1 ]
do
echo 3 > /proc/sys/vm/drop_caches
done

Error: "No such file or directory"

When the cache is dropped on the server, lookups fail on the NFS client
even though the file exists. The code that rebuilds the inode after cache
eviction tries to initiate a lookup operation for ".." to get the parent
information using the on-disk inode number.

In the case of f2fs, however, we do not need to perform a name-based lookup:
in the f2fs layout, the f2fs inode already holds a reference to the parent
inode number. So, like a normal inode build-up, the parent inode can also be
regenerated by reading the parent inode number.

f2fs_inode_info represents the in-memory inode, while f2fs_inode represents
the on-disk one. So we also introduce the parent inode number in
f2fs_inode_info. While do_read_inode() is reading the on-disk information,
we populate the parent inode number as well. This way, whenever we reach
f2fs_inode_info from a VFS inode using F2FS_I(), we are sure to have the
parent inode number available.

Signed-off-by: Namjae Jeon 
Signed-off-by: Amit Sahrawat 
---
 fs/f2fs/dir.c   |   16 
 fs/f2fs/f2fs.h  |1 +
 fs/f2fs/inode.c |1 +
 fs/f2fs/namei.c |5 +++--
 4 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index d900c08..a4c9c9d 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -226,22 +226,6 @@ struct f2fs_dir_entry *f2fs_parent_dir(struct inode *dir, 
struct page **p)
return de;
 }
 
-ino_t f2fs_inode_by_name(struct inode *dir, struct qstr *qstr)
-{
-   ino_t res = 0;
-   struct f2fs_dir_entry *de;
-   struct page *page;
-
-   de = f2fs_find_entry(dir, qstr, &page);
-   if (de) {
-   res = le32_to_cpu(de->ino);
-   kunmap(page);
-   f2fs_put_page(page, 0);
-   }
-
-   return res;
-}
-
 void f2fs_set_link(struct inode *dir, struct f2fs_dir_entry *de,
struct page *page, struct inode *inode)
 {
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 8c3f1ef..0b56cbb 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -146,6 +146,7 @@ struct f2fs_inode_info {
unsigned int clevel;/* maximum level of given file name */
nid_t i_xattr_nid;  /* node id that contains xattrs */
struct extent_info ext; /* in-memory extent cache entry */
+   unsigned long parent_ino;
 };
 
 static inline void get_extent_info(struct extent_info *ext,
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index aa4ef4f..1880d8a 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -107,6 +107,7 @@ static int do_read_inode(struct inode *inode)
fi->flags = 0;
fi->data_version = le64_to_cpu(F2FS_CKPT(sbi)->checkpoint_ver) - 1;
fi->i_advise = ri->i_advise;
+   fi->parent_ino = le32_to_cpu(ri->i_pino);
get_extent_info(&fi->ext, ri->i_ext);
f2fs_put_page(node_page, 1);
return 0;
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index 89b7675..ab021cc 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -183,8 +183,9 @@ out:
 
 struct dentry *f2fs_get_parent(struct dentry *child)
 {
-   struct qstr dotdot = QSTR_INIT("..", 2);
-   unsigned long ino = f2fs_inode_by_name(child->d_inode, &dotdot);
+   unsigned long ino;
+   struct f2fs_inode_info *fi = F2FS_I(child->d_inode);
+   ino = fi->parent_ino;
if (!ino)
return ERR_PTR(-ENOENT);
return d_obtain_alias(f2fs_iget(child->d_inode->i_sb, ino));
-- 
1.7.9.5



[PATCH 4/5] f2fs: introduce accessor to retrieve number of dentry slots

2012-12-07 Thread Namjae Jeon
From: Namjae Jeon 

Simplify code by providing the accessor macro to retrieve the
number of dentry slots for a given filename length.

Signed-off-by: Namjae Jeon 
Signed-off-by: Amit Sahrawat 
---
 fs/f2fs/dir.c   |   13 +
 include/linux/f2fs_fs.h |3 +++
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index fc02d8b..d900c08 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -99,8 +99,7 @@ static struct f2fs_dir_entry *find_in_block(struct page 
*dentry_page,
NR_DENTRY_IN_BLOCK, 0);
while (bit_pos < NR_DENTRY_IN_BLOCK) {
de = &dentry_blk->dentry[bit_pos];
-   slots = (le16_to_cpu(de->name_len) + F2FS_NAME_LEN - 1) /
-   F2FS_NAME_LEN;
+   slots = GET_DENTRY_SLOTS(le16_to_cpu(de->name_len));
 
if (early_match_name(name, namelen, namehash, de)) {
if (!memcmp(dentry_blk->filename[bit_pos],
@@ -130,7 +129,7 @@ static struct f2fs_dir_entry *find_in_level(struct inode 
*dir,
unsigned int level, const char *name, int namelen,
f2fs_hash_t namehash, struct page **res_page)
 {
-   int s = (namelen + F2FS_NAME_LEN - 1) / F2FS_NAME_LEN;
+   int s = GET_DENTRY_SLOTS(namelen);
unsigned int nbucket, nblock;
unsigned int bidx, end_block;
struct page *dentry_page;
@@ -383,7 +382,7 @@ int f2fs_add_link(struct dentry *dentry, struct inode 
*inode)
int namelen = dentry->d_name.len;
struct page *dentry_page = NULL;
struct f2fs_dentry_block *dentry_blk = NULL;
-   int slots = (namelen + F2FS_NAME_LEN - 1) / F2FS_NAME_LEN;
+   int slots = GET_DENTRY_SLOTS(namelen);
int err = 0;
int i;
 
@@ -465,8 +464,7 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, 
struct page *page,
struct address_space *mapping = page->mapping;
struct inode *dir = mapping->host;
struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb);
-   int slots = (le16_to_cpu(dentry->name_len) + F2FS_NAME_LEN - 1) /
-   F2FS_NAME_LEN;
+   int slots = GET_DENTRY_SLOTS(le16_to_cpu(dentry->name_len));
void *kaddr = page_address(page);
int i;
 
@@ -641,8 +639,7 @@ static int f2fs_readdir(struct file *file, void *dirent, 
filldir_t filldir)
file->f_pos += bit_pos - start_bit_pos;
goto success;
}
-   slots = (le16_to_cpu(de->name_len) + F2FS_NAME_LEN - 1)
-   / F2FS_NAME_LEN;
+   slots = GET_DENTRY_SLOTS(le16_to_cpu(de->name_len));
bit_pos += slots;
}
bit_pos = 0;
diff --git a/include/linux/f2fs_fs.h b/include/linux/f2fs_fs.h
index c2fbbc3..fdcef79 100644
--- a/include/linux/f2fs_fs.h
+++ b/include/linux/f2fs_fs.h
@@ -363,6 +363,9 @@ typedef __le32  f2fs_hash_t;
 
 /* One directory entry slot covers 8bytes-long file name */
 #define F2FS_NAME_LEN  8
+#define F2FS_NAME_LEN_BITS 3
+
+#define GET_DENTRY_SLOTS(x)((x + F2FS_NAME_LEN - 1) >> F2FS_NAME_LEN_BITS)
 
 /* the number of dentry in a block */
 #define NR_DENTRY_IN_BLOCK 214
-- 
1.7.9.5
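
The shift in GET_DENTRY_SLOTS() relies on F2FS_NAME_LEN being 2 to the power
of F2FS_NAME_LEN_BITS, so that it rounds up exactly like the open-coded
division it replaces. A small userspace check of that equivalence, as a
sketch: slots_by_division() and slots_forms_agree() are hypothetical helpers,
and the macro argument is parenthesized here, which the posted patch does not
do.

```c
#include <assert.h>

#define F2FS_NAME_LEN       8  /* bytes of file name covered by one slot */
#define F2FS_NAME_LEN_BITS  3  /* log2(F2FS_NAME_LEN) */

/* As in the patch, but with (x) parenthesized for safe expansion. */
#define GET_DENTRY_SLOTS(x) \
    (((x) + F2FS_NAME_LEN - 1) >> F2FS_NAME_LEN_BITS)

/* The open-coded round-up division that the patch removes. */
static inline int slots_by_division(int namelen)
{
    return (namelen + F2FS_NAME_LEN - 1) / F2FS_NAME_LEN;
}

/* Returns 1 if both forms agree for every length in [0, max_len]. */
int slots_forms_agree(int max_len)
{
    int len;

    assert(max_len >= 0);
    for (len = 0; len <= max_len; len++)
        if (GET_DENTRY_SLOTS(len) != slots_by_division(len))
            return 0;
    return 1;
}
```

For example, a 9-byte name needs two slots and an 8-byte name needs one. The
two forms only agree for non-negative lengths, since a right shift and a
division round negative values differently.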



[PATCH 3/5] f2fs: remove redundant call to f2fs_put_page in delete entry

2012-12-07 Thread Namjae Jeon
From: Namjae Jeon 

Since we need to put the page after deleting the entry in either case, there
is no need to make the same call under different conditions.
Move f2fs_put_page() out of the two branches and call it once.

Signed-off-by: Namjae Jeon 
Signed-off-by: Amit Sahrawat 
---
 fs/f2fs/dir.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index 2a20c50..fc02d8b 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -514,10 +514,9 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, 
struct page *page,
ClearPageUptodate(page);
dec_page_count(sbi, F2FS_DIRTY_DENTS);
inode_dec_dirty_dents(dir);
-   f2fs_put_page(page, 1);
-   } else {
-   f2fs_put_page(page, 1);
}
+   f2fs_put_page(page, 1);
+
mutex_unlock_op(sbi, DENTRY_OPS);
 }
 
-- 
1.7.9.5



[PATCH 1/5] f2fs: rewrite f2fs_bio_alloc to make it simpler

2012-12-07 Thread Namjae Jeon
From: Namjae Jeon 

GFP_NOFS (which includes __GFP_WAIT) is used for bio allocation requests in
f2fs, so there is no chance of bio_alloc() returning NULL.

Make the bio allocation routine for f2fs simpler accordingly.

Signed-off-by: Namjae Jeon 
Signed-off-by: Amit Sahrawat 
---
 fs/f2fs/segment.c |   24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 969df1a..8894b39 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -647,28 +647,18 @@ struct bio *f2fs_bio_alloc(struct block_device *bdev, 
sector_t first_sector,
int nr_vecs, gfp_t gfp_flags)
 {
struct bio *bio;
-repeat:
+
/* allocate new bio */
bio = bio_alloc(gfp_flags, nr_vecs);
 
-   if (bio == NULL && (current->flags & PF_MEMALLOC)) {
-   while (!bio && (nr_vecs /= 2))
-   bio = bio_alloc(gfp_flags, nr_vecs);
-   }
-   if (bio) {
-   bio->bi_bdev = bdev;
-   bio->bi_sector = first_sector;
+   bio->bi_bdev = bdev;
+   bio->bi_sector = first_sector;
 retry:
-   bio->bi_private = kmalloc(sizeof(struct bio_private),
-   GFP_NOFS | __GFP_HIGH);
-   if (!bio->bi_private) {
-   cond_resched();
-   goto retry;
-   }
-   }
-   if (bio == NULL) {
+   bio->bi_private = kmalloc(sizeof(struct bio_private),
+   GFP_NOFS | __GFP_HIGH);
+   if (!bio->bi_private) {
cond_resched();
-   goto repeat;
+   goto retry;
}
return bio;
 }
-- 
1.7.9.5



[PATCH 2/5] f2fs: make use of GFP_F2FS_ZERO for setting gfp_mask

2012-12-07 Thread Namjae Jeon
From: Namjae Jeon 

GFP_NOFS | __GFP_ZERO is being used to set the gfp_mask.
We can instead make use of the predefined macro GFP_F2FS_ZERO.

Signed-off-by: Namjae Jeon 
Signed-off-by: Amit Sahrawat 
---
 fs/f2fs/namei.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index 2d720ca..89b7675 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -293,7 +293,7 @@ static int f2fs_mkdir(struct inode *dir, struct dentry 
*dentry, umode_t mode)
inode->i_op = &f2fs_dir_inode_operations;
inode->i_fop = &f2fs_dir_operations;
inode->i_mapping->a_ops = &f2fs_dblock_aops;
-   mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS | __GFP_ZERO);
+   mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
 
set_inode_flag(F2FS_I(inode), FI_INC_LINK);
err = f2fs_add_link(dentry, inode);
-- 
1.7.9.5



Re: [ 00/27] 3.6.10-stable review

2012-12-07 Thread satoru takeuchi
Hi Greg,

2012/12/7 Greg Kroah-Hartman :
> This is the start of the stable review cycle for the 3.6.10 release.
> There are 27 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.

This kernel builds and boots without any problem.
Building a kernel with this kernel also works fine.

 - Build Machine: debian wheezy x86_64
   CPU: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz x 4
   memory: 8GB

 - Test machine: debian wheezy x86_64(KVM guest on the Build Machine)
   vCPU: x2
   memory: 2GB

I reviewed the following patches and they look good to me.

> Alan Cox 
> ACPI: missing break
>
> Mike Galbraith 
> Revert "sched, autogroup: Stop going ahead if autogroup is disabled"
>
> Mike Galbraith 
> workqueue: exit rescuer_thread() as TASK_RUNNING
>
> Naoya Horiguchi 
> mm: soft offline: split thp at the beginning of soft_offline_page()
>
> Jianguo Wu 
> mm/vmemmap: fix wrong use of virt_to_page

Thanks,
Satoru

>
> -
>
> Diffstat:
>
>  Makefile |   4 +-
>  arch/arm/Kconfig |   1 +
>  arch/arm/mach-dove/include/mach/pm.h |   2 +-
>  arch/arm/mach-dove/irq.c |  14 ++-
>  arch/arm/mach-kirkwood/pcie.c|  11 ++-
>  arch/s390/Kconfig|   1 +
>  arch/x86/include/asm/fpu-internal.h  |  15 +--
>  arch/x86/kernel/cpu/amd.c|  14 +++
>  arch/x86/kernel/smpboot.c|   5 +
>  drivers/acpi/processor_driver.c  |   1 +
>  drivers/edac/i7300_edac.c|   8 +-
>  drivers/edac/i7core_edac.c   |   6 +-
>  drivers/gpu/drm/i915/intel_lvds.c|  16 
>  drivers/gpu/drm/radeon/evergreen.c   | 191 
> +++--
>  drivers/gpu/drm/radeon/evergreen_reg.h   |   2 +
>  drivers/gpu/drm/radeon/evergreend.h  |   7 ++
>  drivers/gpu/drm/radeon/radeon_asic.h |   1 +
>  drivers/md/raid1.c   |   2 +-
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  11 ++-
>  drivers/net/ethernet/realtek/8139cp.c|  22 ++---
>  drivers/net/usb/qmi_wwan.c   |  42 
>  drivers/net/wireless/iwlwifi/dvm/rxon.c  |  12 +--
>  drivers/target/target_core_transport.c   |   6 +-
>  kernel/sched/auto_group.c|   4 -
>  kernel/sched/auto_group.h|   5 -
>  kernel/workqueue.c   |   4 +-
>  mm/memory-failure.c  |   8 ++
>  mm/sparse.c  |  10 +-
>  mm/vmscan.c  |  27 --
>  net/mac80211/offchannel.c|   2 -
>  30 files changed, 288 insertions(+), 166 deletions(-)
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe stable" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ 00/20] 3.4.23-stable review

2012-12-07 Thread satoru takeuchi
Hi Greg,

2012/12/7 Greg Kroah-Hartman :
> This is the start of the stable review cycle for the 3.4.23 release.
> There are 20 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Sun Dec  9 00:50:03 UTC 2012.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.4.23-rc1.gz
> and the diffstat can be found below.

This kernel builds and boots without any problem.
Building a kernel with this kernel also works fine.

 - Build Machine: debian wheezy x86_64
   CPU: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz x 4
   memory: 8GB

 - Test machine: debian wheezy x86_64(KVM guest on the Build Machine)
   vCPU: x2
   memory: 2GB

I reviewed the following patches and they look good to me.

> Alan Cox 
> ACPI: missing break
>
> Mike Galbraith 
> Revert "sched, autogroup: Stop going ahead if autogroup is disabled"
>
> Mike Galbraith 
> workqueue: exit rescuer_thread() as TASK_RUNNING
>
> Naoya Horiguchi 
> mm: soft offline: split thp at the beginning of soft_offline_page()
>
> Jianguo Wu 
> mm/vmemmap: fix wrong use of virt_to_page

Thanks,
Satoru

>
> -
>
> Diffstat:
>
>  Makefile |   4 +-
>  arch/arm/Kconfig |   1 +
>  arch/arm/mach-dove/include/mach/pm.h |   2 +-
>  arch/arm/mach-dove/irq.c |  14 ++-
>  arch/arm/mach-kirkwood/pcie.c|  11 ++-
>  arch/s390/Kconfig|   1 +
>  arch/x86/include/asm/fpu-internal.h  |  15 +--
>  arch/x86/kernel/smpboot.c|   5 +
>  drivers/acpi/processor_driver.c  |   1 +
>  drivers/edac/i7300_edac.c|   8 +-
>  drivers/gpu/drm/i915/intel_lvds.c|  16 
>  drivers/gpu/drm/radeon/evergreen.c   | 191 
> +++--
>  drivers/gpu/drm/radeon/evergreen_reg.h   |   2 +
>  drivers/gpu/drm/radeon/evergreend.h  |   7 ++
>  drivers/gpu/drm/radeon/radeon_asic.h |   1 +
>  drivers/md/raid10.c  | 100 ++-
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  11 ++-
>  fs/nfs/blocklayout/blocklayout.c | 176 
> +++---
>  fs/nfs/blocklayout/blocklayout.h |   1 +
>  kernel/sched/auto_group.c|   4 -
>  kernel/sched/auto_group.h|   5 -
>  kernel/workqueue.c   |   4 +-
>  mm/memory-failure.c  |   8 ++
>  mm/sparse.c  |  10 +-
>  scripts/package/buildtar |   2 +-
>  25 files changed, 408 insertions(+), 192 deletions(-)
>
>


ext3 journal commit while seek & write to file

2012-12-07 Thread Keith Chew
Hi

There is a thread in the sqlite mailing list that was started by me,
but it did not finish because it appears that my findings are more
related to the kernel instead of sqlite. I really hope someone here
can give me some guidance.

The summary of my system is:
- kernel 2.6.39.4 (also tested with 3.6.9)
- ext3 with data=ordered,commit=5
- disk has write-cache off
- sqlite does an insert to the DB every second

I have found that it takes 1ms to write to the DB each second, except
for when the kernel commits its journal (ie every 5 seconds). At those
times, the write goes up to 160ms.

You can see from the strace below that the write() after the seek does
take longer (in this case 148ms) compared to the usual 1ms:
---
[pid 17913] 17:58:14.390431 _llseek(98, 4826072, [4826072], SEEK_SET) = 0 <0.000013>
[pid 17913] 17:58:14.390667 write(98, "\0\0\0\5\0\0\0\215\"'\201\230\305\360\331\370G\305\25\3358W\234\336", 24) = 24 <0.000137>
[pid 17913] 17:58:14.390956 _llseek(98, 4826096, [4826096], SEEK_SET) = 0 <0.000012>
[pid 17913] 17:58:14.391134 write(98, "\r\0\0\0\1\3<\0\3<\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.148882>
---

I have also tried to write a small program which appends 1KB to the
end of a file every second, and I do not see this latency on that app.
Profiling mysql while it does a write every second shows that it does not
suffer from this problem either. I have looked into the sqlite code, but cannot find
anything unusual.
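
For reference, the seek-then-write pattern seen in the strace output can be
timed from userspace with something like the following. This is a minimal
sketch, not the test program mentioned above: timed_write_at() is a
hypothetical helper, and CLOCK_MONOTONIC is an assumed choice of clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Seek to 'offset' and write 'len' bytes from 'buf'.
 * Returns the time spent in write() alone, in nanoseconds, or -1 on error
 * (including a short write).
 */
long long timed_write_at(int fd, off_t offset, const void *buf, size_t len)
{
    struct timespec t0, t1;
    ssize_t n;

    assert(fd >= 0 && buf != NULL);

    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1;
    n = write(fd, buf, len);
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1;
    if (n < 0 || (size_t)n != len)
        return -1;

    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
           (long long)(t1.tv_nsec - t0.tv_nsec);
}
```

Calling this once per second with a 1024-byte buffer, first at increasing
offsets in an existing file and then only at the end of the file, would show
whether the latency depends on the seek pattern rather than on sqlite itself.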

Is there anything I can do to improve this situation?

PS: When I set commit=1, strace shows it takes 160ms every second to
write to disk.

Regards
Keith


Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Dave Chinner
On Fri, Dec 07, 2012 at 06:52:51PM -0800, Joel Becker wrote:
> On Sat, Dec 08, 2012 at 11:39:36AM +1100, Dave Chinner wrote:
> > On Fri, Dec 07, 2012 at 05:02:32PM -0500, Ric Wheeler wrote:
> > > On 12/07/2012 04:57 PM, Theodore Ts'o wrote:
> > > >On Fri, Dec 07, 2012 at 04:42:06PM -0500, Ric Wheeler wrote:
> > > >>The other things that I think we should try would be to convert over
> > > >>larger chunks as we discussed on the list back in the summer (just
> > > >>because the user writes 4KB does not mean that we cannot flip over
> > > >>1MB and zero that).
> > > >Writing a megabyte is not free.  If you assume that your HDD has a
> > > >sustained write throughput of 100-125 MB/s, writing a megabyte will
> > > >take 8-10ms.  It might be a win if you amortize it over a large number
> > > >of writes, but it doesn't help your 99.9 percentile latency numbers.
> > > >(99.9 percentile latency numbers matters because eventually you'll
> > > >have a user request which hits multiple serial long latency
> > > >operations, and then the delay looks **really** user visible.)
> > > >
> > > > - Ted
> > > 
> > > Writing 4KB at a time to a disk cost XX units of time.
> > > 
> > > Writing to the same sector (especially for a HDD), cost XX units + a 
> > > small amount.
> > > 
> > > I suggest that we try it out.
> > > 
> > > For SSD's, much better to use specific HW offload commands if
> > > possible like WRITE_SAME (zeroed) or UNMAP/TRIM to get that
> > > performance boost since no actual data is moved...
> > 
> > Yup, that could be done quite trivially in XFS. Just mark the
> > preallocated extents as "busy" rather than unwritten, mark the
> > transaction as synchronous and the transaction commit will issue a
> > discard on the preallocated ranges before returning to userspace.
> > The extra overhead to the preallocation command is unlikely to be
> > noticed, and unwritten extent conversion overhead just goes away...
> > 
> > No fallocate() API changes necessary, though I think it would be
> > better if the user application gave a hint that it preferred "writing
> > zeros" (i.e. FALLOC_FL_WRITE_ZEROS) to allocating unwritten extents
> > as there are workloads where one will always be clearly better than
> > the other...
> 
>   Wait, I missed something.  We're letting fallocate be dumb?
> Let's not do that, then.

No, not at all. Read again. There are workloads where explicitly
using unwritten extents is the best thing to do. For others,
zeroing rather than using unwritten extents may be better. All I
suggested was an additional flag that allows applications to tell
the filesystem to preallocate zeroed space, but to do it via writing
zeros rather than unwritten extents.

>   Over in ocfs2-land, we CoW in 1MB hunks.  That's the entire
> extent if it is 1MB or less, or some MB multiple if it is large enough
> to slice it.  This is for very similar reasons to unwritten clearing,
> with the added benefit of less fragmentation from CoW.
>   On spinning media, any read/write of up to 1MB is roughly about
> the same penalty as reading/writing a sector.  You're already paying the
> seek.

Sure, but that's filesystem implementation details, and something
that I don't care about. Every filesystem implements preallocation
via fallocate in a similar manner (i.e. all have copied XFS's
unwritten extents technique) but that doesn't mean it's optimal for
every workload that needs preallocation.

The deficiencies of ext4's unwritten extent implementation are what
Ted has been trying to address by exposing stale data, rather than
looking at the problem as "is there a better way to preallocate for
this workload?"

That's where:

> On SSD, WRITE_SAME is *way* better than leaking data.

TRIM/WRITE_SAME can be used. It's way faster than actually writing
zeros, but from the user perspective, that's exactly how it appears
to them. IOWs, filesystems can implement the FALLOC_FL_WRITE_ZEROS method 
using this hardware offload, just be dumb and write zeros through
the page cache or ignore it altogether and just use unwritten
extents anyway...
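
To make the "application states its preference" idea concrete, here is a minimal user-space sketch. Everything in it is an assumption for illustration: `enum prealloc_pref`, `zero_fill()` and `preallocate()` are hypothetical names, not kernel or libc interfaces, and it uses the portable posix_fallocate() wrapper rather than a new fallocate() flag. It only shows the two strategies side by side: reserve space and defer zeroing, or pay for the zeros up front.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical application-side preference; this is NOT a kernel flag. */
enum prealloc_pref {
	PREALLOC_UNWRITTEN,	/* reserve space, defer zeroing to write time */
	PREALLOC_WRITE_ZEROS	/* pay for the zeroing up front */
};

/* Write len zero bytes starting at offset 0; returns 0 or an errno value. */
static int zero_fill(int fd, off_t len)
{
	static const char zeros[65536];	/* static storage is zero-initialized */
	off_t off = 0;

	while (off < len) {
		off_t left = len - off;
		size_t chunk = left < (off_t)sizeof(zeros) ?
			       (size_t)left : sizeof(zeros);
		ssize_t n = pwrite(fd, zeros, chunk, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry the same chunk */
			return errno;
		}
		if (n == 0)
			return EIO;	/* no progress; avoid looping forever */
		off += n;
	}
	return 0;
}

/* Preallocate [0, len) the way the caller prefers; returns 0 or an errno value. */
int preallocate(int fd, off_t len, enum prealloc_pref pref)
{
	assert(len > 0);
	if (pref == PREALLOC_UNWRITTEN)
		return posix_fallocate(fd, 0, len);
	return zero_fill(fd, len);
}
```

A caller that always overwrites the whole range before use would pass PREALLOC_UNWRITTEN; one that does small random writes into the range afterwards might prefer PREALLOC_WRITE_ZEROS. Which one wins depends on the workload and the filesystem, which is the point of letting the application choose.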

>   At the end of the day, you have to pay for zeroing.  You can do
> it up front, or you can do it at write time.

And the application should be able to tell us which it prefers



>   We should not be leaking data so that we can be lazy.

You're in violent agreement with me about that.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [tpmdd-devel] [PATCH 1/1] TPM: STMicroelectronics ST33 I2C KERNEL 3.x.x OOPS!

2012-12-07 Thread Peter Hüwe
Hi,

Since I don't have a TPM, I simply tested whether the driver handles a
missing TPM correctly. Unfortunately it doesn't: my kernel oopses.

Steps to reproduce:
# insmod  /data/data-old/linux-2.6/drivers/char/tpm/tpm_i2c_stm_st33.ko
# echo st33zp24_i2c 0x20 > /sys/bus/i2c/devices/i2c-1/new_device



BUG: unable to handle kernel NULL pointer dereference at 0018
IP: [] tpm_st33_i2c_probe+0x87/0x21a [tpm_i2c_stm_st33]
PGD 3dcbbe067 PUD 3e994b067 PMD 0
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: tpm_i2c_stm_st33(O) tpm(O) tpm_bios(O) w83627ehf hwmon_vid 
ipv6 snd_hda_codec_hdmi snd_hda_codec_realtek joydev usbhid coretemp kvm_intel 
kvm ghash_clmulni_intel sg pcspkr snd_hda_intel snd_hda_codec snd_hwdep 
snd_pcm snd_timer snd i2c_i801 snd_page_alloc mei [last unloaded: 
tpm_i2c_stm_st33]
CPU 2
Pid: 4454, comm: bash Tainted: GW  O 3.6.9 #14 To Be Filled By O.E.M. 
To Be Filled By O.E.M./Z77 Pro4
RIP: 0010:[]  [] 
tpm_st33_i2c_probe+0x87/0x21a 
[tpm_i2c_stm_st33]
RSP: 0018:8803dcb87c18  EFLAGS: 00010286
RAX: 8803eb0d1800 RBX: 8803ff710400 RCX: 
RDX:  RSI: 88041d0026d0 RDI: 0001
RBP: 8803dcb87c38 R08: 000b R09: 0001
R10: 0001 R11: 8803ff710400 R12: 
R13: 8803ff710c00 R14: 8803ff710c28 R15: 8803ff710c04
FS:  7f520567a700() GS:88041f28() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0018 CR3: 0003e48b CR4: 001407e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process bash (pid: 4454, threadinfo 8803dcb86000, task 8803c87226c0)
Stack:
 8803ff710c28 a02ffb80 a02ff426 8803ff710c00
 8803dcb87c78 8134079a a02ffe00 8803ff710c28
 ffed a02ffe00  88040ebe4460
Call Trace:
 [] ? request_locality+0xc2/0xc2 [tpm_i2c_stm_st33]
 [] i2c_device_probe+0xa4/0xcd
 [] driver_probe_device+0xa9/0x1bf
 [] __device_attach+0x35/0x3a
 [] ? __driver_attach+0x7a/0x7a
 [] bus_for_each_drv+0x51/0x87
 [] device_attach+0x6e/0x8e
 [] bus_probe_device+0x2d/0x98
 [] device_add+0x3e1/0x548
 [] ? device_pm_init+0x60/0x84
 [] device_register+0x17/0x1b
 [] i2c_new_device+0x12b/0x175
 [] i2c_sysfs_new_device+0xd6/0x15a
 [] dev_attr_store+0x1b/0x1d
 [] sysfs_write_file+0xef/0x12b
 [] vfs_write+0xa9/0x119
 [] sys_write+0x45/0x6c
 [] system_call_fastpath+0x16/0x1b
Code: ff 48 c7 c6 a7 fa 2f a0 48 85 c0 48 89 c3 74 cd 48 8b 3d d7 d0 2a e1 be 
d0 00 00 00 4d 8b a5 c8 00 00 00 e8 3b 32 de e0 48 85 c0 <49> 89 44 24 18 0f 
84 5a 01 00 00 48 8b 3d b1 d0 2a e1 be d0 00
RIP  [] tpm_st33_i2c_probe+0x87/0x21a [tpm_i2c_stm_st33]
 RSP 
CR2: 0018
---[ end trace ddc5676681e8ed72 ]---




Thanks,
Peter
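
A note on reading the fault address: the faulting bytes in the Code: line, <49> 89 44 24 18, decode to a store of %rax at offset 0x18 from %r12, which matches CR2 = 0x18 if %r12 held NULL. The trace alone does not say which structure that pointer was meant to reference. The sketch below is a user-space illustration only; `struct demo_platform_data` and `attach_buffer()` are hypothetical names, chosen so that a member lands at offset 0x18, and they are not taken from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for some per-device data reached via a pointer. */
struct demo_platform_data {
	uint64_t a;
	uint64_t b;
	uint64_t c;
	void *buffer;	/* three 8-byte fields precede it: offset 0x18 */
};

/* A store to ->buffer through a NULL pointer would fault at address 0x18. */
static_assert(offsetof(struct demo_platform_data, buffer) == 0x18,
	      "buffer is expected at offset 0x18");

/* Hypothetical probe-time helper: bail out instead of writing to NULL + 0x18. */
int attach_buffer(struct demo_platform_data *pdata, void *buf)
{
	if (!pdata)
		return -EINVAL;
	pdata->buffer = buf;
	return 0;
}
```

With a check of this kind, registering the device without the data it expects would make probe fail with an error code rather than oops.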


Re: [PATCH] char/tpm: Use struct dev_pm_ops for power management

2012-12-07 Thread Peter Hüwe
On Thursday, 6 December 2012, 17:27:02, Kent Yoder wrote:
> On Thu, Dec 06, 2012 at 01:20:51AM +0100, Peter Huewe wrote:
> > This patch converts the suspend and resume functions for
> > tpm_i2c_stm_st33 to the new dev_pm_ops.
> > 
> > Signed-off-by: Peter Huewe 
> 
>   One minor tweak, the PM funcs need to be inside CONFIG_PM_SLEEP to
> avoid warnings when compiled without PM support.  Applied with that
> change only.
> 
> Thanks Peter!
> Kent
Great to hear!

Peter


Re: [PATCH v2 03/44] Add CONFIG_HAVE_64BIT_ALIGNED_STRUCT for taskstats

2012-12-07 Thread H. Peter Anvin

On 12/05/2012 08:08 AM, James Hogan wrote:
> On 64 bit architectures with no efficient unaligned access, taskstats
> has to add some padding to a reply to prevent unaligned access warnings.
> However this also needs to apply to 32 bit architectures with 64 bit
> struct alignment such as metag (which has 64 bit memory accesses).

Wait... 64-bit struct alignment on structures with only 32-bit members?
That might be... interesting... in a number of places...


-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
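
The effect being asked about can be reproduced on an ordinary host compiler. This is a sketch under stated assumptions: it emulates 64-bit struct alignment with the GCC/Clang `aligned(8)` attribute, whereas an ABI such as the one described for metag would apply it implicitly; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Three 32-bit members: 12 bytes under a typical 4-byte struct alignment. */
struct plain_words {
	uint32_t a, b, c;
};

/* Same members, but the struct as a whole is forced to 8-byte alignment. */
struct padded_words {
	uint32_t a, b, c;
} __attribute__((aligned(8)));

/* Bytes of tail padding that the stricter alignment adds. */
size_t tail_padding(void)
{
	/* Member offsets do not move; only the total size is rounded up. */
	static_assert(offsetof(struct padded_words, c) ==
		      offsetof(struct plain_words, c),
		      "member offsets are unchanged");
	return sizeof(struct padded_words) - sizeof(struct plain_words);
}
```

Any code that assumes sizeof() of such a structure equals the sum of its members (wire formats, netlink attribute lengths, arrays shared with userspace) would see the extra four bytes.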



Re: [PATCH] Debugging: Keep track of page owners

2012-12-07 Thread Steven Rostedt
On Fri, Dec 07, 2012 at 04:24:17PM -0500, Dave Hansen wrote:
> 
> diff -puN /dev/null Documentation/page_owner.c

Can we stop putting code into Documentation? We have tools, samples and
usr directories. I'm sure this could fit into one of them.

-- Steve

> --- /dev/null 2012-06-13 15:09:09.708529931 -0400
> +++ linux-2.6.git-dave/Documentation/page_owner.c 2012-12-07 16:22:43.872270758 -0500
> @@ -0,0 +1,141 @@
> +/*
> + * User-space helper to sort the output of /sys/kernel/debug/page_owner
> + *
> + * Example use:
> + * cat /sys/kernel/debug/page_owner > page_owner_full.txt
> + * grep -v ^PFN page_owner_full.txt > page_owner.txt
> + * ./sort page_owner.txt sorted_page_owner.txt
> +*/
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +#include <string.h>
> +
> +struct block_list {
> + char *txt;
> + int len;
> + int num;
> +};
> +
> +
> +static struct block_list *list;
> +static int list_size;
> +static int max_size;
> +
> +struct block_list *block_head;
> +
> +int read_block(char *buf, FILE *fin)
> +{
> + int ret = 0;
> + int hit = 0;
> + char *curr = buf;
> +
> + for (;;) {
> + *curr = getc(fin);
> + if (*curr == EOF) return -1;
> +
> + ret++;
> + if (*curr == '\n' && hit == 1)
> + return ret - 1;
> + else if (*curr == '\n')
> + hit = 1;
> + else
> + hit = 0;
> + curr++;
> + }
> +}
> +
> +static int compare_txt(struct block_list *l1, struct block_list *l2)
> +{
> + return strcmp(l1->txt, l2->txt);
> +}
> +
> +static int compare_num(struct block_list *l1, struct block_list *l2)
> +{
> + return l2->num - l1->num;
> +}
> +
> +static void add_list(char *buf, int len)
> +{
> + if (list_size != 0 &&
> + len == list[list_size-1].len &&
> + memcmp(buf, list[list_size-1].txt, len) == 0) {
> + list[list_size-1].num++;
> + return;
> + }
> + if (list_size == max_size) {
> + printf("max_size too small??\n");
> + exit(1);
> + }
> + list[list_size].txt = malloc(len+1);
> + list[list_size].len = len;
> + list[list_size].num = 1;
> + memcpy(list[list_size].txt, buf, len);
> + list[list_size].txt[len] = 0;
> + list_size++;
> + if (list_size % 1000 == 0) {
> + printf("loaded %d\r", list_size);
> + fflush(stdout);
> + }
> +}
> +
> +int main(int argc, char **argv)
> +{
> + FILE *fin, *fout;
> + char buf[1024];
> + int ret, i, count;
> + struct block_list *list2;
> + struct stat st;
> +
> + fin = fopen(argv[1], "r");
> + fout = fopen(argv[2], "w");
> + if (!fin || !fout) {
> + printf("Usage: ./program <input> <output>\n");
> + perror("open: ");
> + exit(2);
> + }
> +
> + fstat(fileno(fin), &st);
> + max_size = st.st_size / 100; /* hack ... */
> +
> + list = malloc(max_size * sizeof(*list));
> +
> + for(;;) {
> + ret = read_block(buf, fin);
> + if (ret < 0)
> + break;
> +
> + buf[ret] = '\0';
> + add_list(buf, ret);
> + }
> +
> + printf("loaded %d\n", list_size);
> +
> + printf("sorting \n");
> +
> + qsort(list, list_size, sizeof(list[0]), compare_txt);
> +
> + list2 = malloc(sizeof(*list) * list_size);
> +
> + printf("culling\n");
> +
> + for (i=count=0;i<list_size;i++) {
> + if (count == 0 ||
> + strcmp(list2[count-1].txt, list[i].txt) != 0) {
> + list2[count++] = list[i];
> + } else {
> + list2[count-1].num += list[i].num;
> + }
> + }
> +
> + qsort(list2, count, sizeof(list[0]), compare_num);
> +
> + for (i=0;i<count;i++) {
> + fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt);
> + }
> + return 0;
> +}
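
Separately from where the file lives: read_block() in the quoted helper stores the result of getc() in a char before comparing it with EOF and never checks the buffer size. A minimal sketch of a safer variant follows; `read_block_checked()` and its `cap` parameter are hypothetical and not part of the posted patch. It also NUL-terminates the block itself and does not count the blank separator line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Read one blank-line-terminated block into buf (capacity cap bytes).
 * Returns the block length without the terminating empty line,
 * or -1 on EOF or if the block does not fit.
 */
int read_block_checked(char *buf, size_t cap, FILE *fin)
{
	size_t len = 0;
	int prev_newline = 0;
	int c;	/* int, not char, so EOF stays distinct from byte 0xff */

	assert(buf != NULL && cap > 0);
	while ((c = getc(fin)) != EOF) {
		if (c == '\n' && prev_newline) {
			buf[len] = '\0';	/* second newline ends the block */
			return (int)len;
		}
		if (len + 1 >= cap)
			return -1;	/* keep room for the terminating NUL */
		buf[len++] = (char)c;
		prev_newline = (c == '\n');
	}
	return -1;
}
```

The caller loop in main() could stay as it is, passing sizeof(buf) as the capacity.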


Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Joel Becker
On Sat, Dec 08, 2012 at 11:39:36AM +1100, Dave Chinner wrote:
> On Fri, Dec 07, 2012 at 05:02:32PM -0500, Ric Wheeler wrote:
> > On 12/07/2012 04:57 PM, Theodore Ts'o wrote:
> > >On Fri, Dec 07, 2012 at 04:42:06PM -0500, Ric Wheeler wrote:
> > >>The other things that I think we should try would be to convert over
> > >>larger chunks as we discussed on the list back in the summer (just
> > >>because the user writes 4KB does not mean that we cannot flip over
> > >>1MB and zero that).
> > >Writing a megabyte is not free.  If you assume that your HDD has a
> > >sustained write throughput of 100-125 MB/s, writing a megabyte will
> > >take 8-10ms.  It might be a win if you amortize it over a large number
> > >of writes, but it doesn't help your 99.9 percentile latency numbers.
> > >(99.9 percentile latency numbers matters because eventually you'll
> > >have a user request which hits multiple serial long latency
> > >operations, and then the delay looks **really** user visible.)
> > >
> > >   - Ted
> > 
> > Writing 4KB at a time to a disk cost XX units of time.
> > 
> > Writing to the same sector (especially for a HDD), cost XX units + a small 
> > amount.
> > 
> > I suggest that we try it out.
> > 
> > For SSD's, much better to use specific HW offload commands if
> > possible like WRITE_SAME (zeroed) or UNMAP/TRIM to get that
> > performance boost since no actual data is moved...
> 
> Yup, that could be done quite trivially in XFS. Just mark the
> preallocated extents as "busy" rather than unwritten, mark the
> transaction as synchronous and the transaction commit will issue a
> discard on the preallocated ranges before returning to userspace.
> The extra overhead to the preallocation command is unlikely to be
> noticed, and unwritten extent conversion overhead just goes away...
> 
> No fallocate() API changes necessary, though I think it would be
> better if the user application gave a hint that it preferred "writing
> zeros" (i.e. FALLOC_FL_WRITE_ZEROS) to allocating unwritten extents
> as there are workloads where one will always be clearly better than
> the other...

Wait, I missed something.  We're letting fallocate be dumb?
Let's not do that, then.

Over in ocfs2-land, we CoW in 1MB hunks.  That's the entire
extent if it is 1MB or less, or some MB multiple if it is large enough
to slice it.  This is for very similar reasons to unwritten clearing,
with the added benefit of less fragmentation from CoW.

On spinning media, any read/write of up to 1MB is roughly about
the same penalty as reading/writing a sector.  You're already paying the
seek.  On SSD, WRITE_SAME is *way* better than leaking data.

At the end of the day, you have to pay for zeroing.  You can do
it up front, or you can do it at write time.  A certain large commercial
database takes advantage of fallocate+unwritten by getting large swaths
of contiguous storage; it then writes to the whole space before using
it.  This allows the allocation benefits of fallocate, doesn't pay for
unneeded zeros, and yet performs correctly at runtime.

We should not be leaking data so that we can be lazy.
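
The numbers traded in this thread can be checked with simple arithmetic. The helpers below are a hedged sketch: `transfer_ms()` and `request_ms()` are hypothetical names, sizes are in decimal megabytes, and the positioning cost (seek plus rotation) is a figure the caller has to supply, since the thread gives none.

```c
#include <assert.h>

/* Time in milliseconds to move size_mb megabytes at mb_per_s MB/s. */
double transfer_ms(double size_mb, double mb_per_s)
{
	assert(mb_per_s > 0.0);
	return size_mb * 1000.0 / mb_per_s;
}

/* Total request time: caller-supplied positioning cost plus transfer time. */
double request_ms(double positioning_ms, double size_mb, double mb_per_s)
{
	return positioning_ms + transfer_ms(size_mb, mb_per_s);
}
```

transfer_ms(1.0, 100.0) and transfer_ms(1.0, 125.0) give the 10 ms and 8 ms quoted above for a 1MB write at 100-125 MB/s; request_ms() lets you compare a 4KB write with a 1MB write for whatever positioning cost you consider realistic for the drive in question.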

Joel


> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 

Life's Little Instruction Book #182

"Be romantic."

http://www.jlbec.org/
jl...@evilplan.org


Re: [PATCH] Debugging: Keep track of page owners

2012-12-07 Thread Steven Rostedt
On Fri, Dec 07, 2012 at 02:26:14PM -0800, Andrew Morton wrote:
> On Fri, 07 Dec 2012 16:24:17 -0500
> Dave Hansen  wrote:
> 
> > To: a...@osdl.org
> 
> It's years since I was called that.

"Help me a...@osdl.org. You're my only hope"...

> 
> > From: m...@skynet.ie (Mel Gorman)
> 
> And him that.
> 

"m...@skynet.ie is my father!" - Luke Skynet.ie

-- Steve ;-)



Re: [update] Re: new execve/kernel_thread design

2012-12-07 Thread Chris Metcalf
On 12/7/2012 5:23 PM, Al Viro wrote:
> Current situation:
>
> * most of the architectures are OK - alpha arm arm64 c6x frv hexagon ia64 m68k
> microblaze mips openrisc parisc sparc s390 tile um unicore32 x86 xtensa
>
> * powerpc *still* awaits an ACK from maintainers; no reports of any breakage
> on linux-next and seems to be doing fine on my tests.
>
> * sh - still nothing from Paul; I'm going to assume that what we have in
> linux-next is OK
>
> * mn10300 - untested, AFAIK
> * avr32, blackfin, cris, h8300, score - maintainers seem to be MIA
> * m32r - maintainer is not MIA, but I'm not sure if anyone, including
> maintainer, has working m32r test boxen anymore...  Anyway, not a word on
> m32r patches in that pile.
>
> Folks, this is the final warning - I *will* send a pull request on the
> stuff currently in linux-next as soon as the merge window opens.  It had
> been sitting there for a long time by now and you've all been Cc'd on
> that thread all along.

IMHO, Al has done a great job reaching out to the architecture maintainers with 
this round of changes.  I think it should be a model for any kind of similar 
tree-wide work.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com



[PATCH v2] gpio: add TS-5500 DIO blocks support

2012-12-07 Thread Vivien Didelot
Technologic Systems TS-5500 provides digital I/O lines exposed through
pin blocks. On this platform, there are three of them, named DIO1, DIO2
and LCD port, that may be used as a DIO block.

The TS-5500 pin blocks are described in the product's wiki:
http://wiki.embeddedarm.com/wiki/TS-5500#Digital_I.2FO

This driver is not limited to the TS-5500 blocks. It can be extended to
support similar boards pin blocks, such as on the TS-5600.

This patch is the V2 of the previous https://lkml.org/lkml/2012/9/25/671
with corrections suggested by Linus Walleij.

Signed-off-by: Vivien Didelot 
Signed-off-by: Jerome Oufella 
---
 drivers/gpio/Kconfig  |   8 +
 drivers/gpio/Makefile |   1 +
 drivers/gpio/gpio-ts5500.c| 466 ++
 include/linux/platform_data/gpio-ts5500.h |  27 ++
 4 files changed, 502 insertions(+)
 create mode 100644 drivers/gpio/gpio-ts5500.c
 create mode 100644 include/linux/platform_data/gpio-ts5500.h

diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 47150f5..d1e6474 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -189,6 +189,14 @@ config GPIO_STA2X11
  Say yes here to support the STA2x11/ConneXt GPIO device.
  The GPIO module has 128 GPIO pins with alternate functions.
 
+config GPIO_TS5500
+   tristate "TS-5500 DIO blocks and compatibles"
+   help
+ This driver supports Digital I/O exposed by pin blocks found on some
+ Technologic Systems platforms. It includes, but is not limited to, 3
+ blocks of the TS-5500: DIO1, DIO2 and the LCD port, and the TS-5600
+ LCD port.
+
 config GPIO_VT8500
bool "VIA/Wondermedia SoC GPIO Support"
depends on ARCH_VT8500
diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile
index 9aeed67..e33f344 100644
--- a/drivers/gpio/Makefile
+++ b/drivers/gpio/Makefile
@@ -68,6 +68,7 @@ obj-$(CONFIG_ARCH_DAVINCI_TNETV107X) += gpio-tnetv107x.o
 obj-$(CONFIG_GPIO_TPS6586X)+= gpio-tps6586x.o
 obj-$(CONFIG_GPIO_TPS65910)+= gpio-tps65910.o
 obj-$(CONFIG_GPIO_TPS65912)+= gpio-tps65912.o
+obj-$(CONFIG_GPIO_TS5500)  += gpio-ts5500.o
 obj-$(CONFIG_GPIO_TWL4030) += gpio-twl4030.o
 obj-$(CONFIG_GPIO_TWL6040) += gpio-twl6040.o
 obj-$(CONFIG_GPIO_UCB1400) += gpio-ucb1400.o
diff --git a/drivers/gpio/gpio-ts5500.c b/drivers/gpio/gpio-ts5500.c
new file mode 100644
index 000..0634cee
--- /dev/null
+++ b/drivers/gpio/gpio-ts5500.c
@@ -0,0 +1,466 @@
+/*
+ * Digital I/O driver for Technologic Systems TS-5500
+ *
+ * Copyright (c) 2012 Savoir-faire Linux Inc.
+ * Vivien Didelot 
+ *
+ * Technologic Systems platforms have pin blocks, exposing several Digital
+ * Input/Output lines (DIO). This driver aims to support single pin blocks.
+ * In that sense, the support is not limited to the TS-5500 blocks.
+ * Actually, the following platforms have DIO support:
+ *
+ * TS-5500:
+ *   Documentation: http://wiki.embeddedarm.com/wiki/TS-5500
+ *   Blocks: DIO1, DIO2 and LCD port.
+ *
+ * TS-5600:
+ *   Documentation: http://wiki.embeddedarm.com/wiki/TS-5600
+ *   Blocks: LCD port (identical to TS-5500 LCD).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* List of supported Technologic Systems platforms DIO blocks */
+enum ts5500_blocks { TS5500_DIO1, TS5500_DIO2, TS5500_LCD, TS5600_LCD };
+
+struct ts5500_priv {
+   const struct ts5500_dio *pinout;
+   struct gpio_chip gpio_chip;
+   spinlock_t lock;
+   bool strap;
+   u8 hwirq;
+};
+
+/*
+ * Hex 7D is used to control several blocks (e.g. DIO2 and LCD port).
+ * This flag ensures that the region has been requested by this driver.
+ */
+static bool hex7d_reserved;
+
+/*
+ * This structure is used to describe capabilities of DIO lines,
+ * such as available directions and connected interrupt (if any).
+ */
+struct ts5500_dio {
+   const u8 value_addr;
+   const u8 value_mask;
+   const u8 control_addr;
+   const u8 control_mask;
+   const bool no_input;
+   const bool no_output;
+   const u8 irq;
+};
+
+#define TS5500_DIO_IN_OUT(vaddr, vbit, caddr, cbit)\
+   {   \
+   .value_addr = vaddr,\
+   .value_mask = BIT(vbit),\
+   .control_addr = caddr,  \
+   .control_mask = BIT(cbit),  \
+   }
+
+#define TS5500_DIO_IN(addr, bit)   \
+   {   \
+   .value_addr = addr, \
+   .value_mask = BIT(bit), \
+   .no_output = true,  \
+   }
+
+#define TS5500_DIO_IN_IRQ(addr, bit, _irq) \

Re: [PATCH] X86/acpi: remove redundant logic of acpi memory hotadd

2012-12-07 Thread Wen Congyang
At 12/08/2012 06:19 AM, Rafael J. Wysocki Wrote:
> On Tuesday, December 04, 2012 01:39:54 AM Liu, Jinsong wrote:
>> Resend it, add Rafael and linux-a...@vger.kernel.org
> 
> I wonder what memory hotplug people think about that.
> 
> Thanks,
> Rafael
> 
> 
>> ===
>> From 1d39279e45c54ce531691da5ffe261e7689dd92c Mon Sep 17 00:00:00 2001
>> From: Liu Jinsong 
>> Date: Wed, 14 Nov 2012 18:52:06 +0800
>> Subject: [PATCH] X86/acpi: remove redundant logic of acpi memory hotadd
>>
>> When memory hotadd, acpi_memory_enable_device has already been done
>> at drv->ops.add (acpi_memory_device_add), no need to do it again
>> at notify callback.
>>
>> At acpi_memory_enable_device, acpi_memory_get_device_resources
>> is also a redundant action, since it has been done at drv->ops.add.
>>
>> Signed-off-by: Liu Jinsong 
>> ---
>>  drivers/acpi/acpi_memhotplug.c |   17 -
>>  1 files changed, 0 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
>> index 24c807f..a6489fd 100644
>> --- a/drivers/acpi/acpi_memhotplug.c
>> +++ b/drivers/acpi/acpi_memhotplug.c
>> @@ -220,15 +220,6 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>>  struct acpi_memory_info *info;
>>  int node;
>>  
>> -
>> -/* Get the range from the _CRS */
>> -result = acpi_memory_get_device_resources(mem_device);
>> -if (result) {
>> -printk(KERN_ERR PREFIX "get_device_resources failed\n");
>> -mem_device->state = MEMORY_INVALID_STATE;
>> -return result;
>> -}
>> -
>>  node = acpi_get_node(mem_device->device->handle);
>>  /*
>>   * Tell the VM there is more memory here...
>> @@ -357,14 +348,6 @@ static void acpi_memory_device_notify(acpi_handle handle, u32 event, void *data)
>>  break;
>>  }
>>  
>> -if (acpi_memory_check_device(mem_device))
>> -break;

Hmm, if acpi_memory_check_device() fails, it means the memory device has
disappeared. I don't know whether any real hardware removes a memory
device this way.

>> -
>> -if (acpi_memory_enable_device(mem_device)) {
>> -printk(KERN_ERR PREFIX "Cannot enable memory device\n");
>> -break;
>> -}

If acpi_memory_get_device() doesn't fail, it means that the device is
already managed by this driver, so I think we can do this cleanup.

Thanks
Wen Congyang

>> -
>>  ost_code = ACPI_OST_SC_SUCCESS;
>>  break;
>>  
>>



Re: [PATCH v4 7/9] trace: use this_cpu_ptr per-cpu helper

2012-12-07 Thread Steven Rostedt
On Fri, 2012-11-30 at 14:38 +0800, Shan Wei wrote:
> Shan Wei said, at 2012/11/16 16:34:
> > Shan Wei said, at 2012/11/13 9:53:
> >> From: Shan Wei 
> >>
> >> typeof(&buffer) is a pointer to array of 1024 char, or char (*)[1024].
> >> But, typeof(&buffer[0]) is a pointer to char which match the return type
> >> of get_trace_buf().
> >> As well-known, the value of &buffer is equal to &buffer[0].
> >> so return this_cpu_ptr(&percpu_buffer->buffer[0]) can avoid type cast.
> >>
> >> Signed-off-by: Shan Wei 
> > 
> > Steven Rostedt,  would you like to pick it up to your tree?
> 
> ping..

Sorry for the late reply. I've been hacking on other things and my queue
of stuff for tip still hasn't been pulled yet. I'll pull this into my
tree, and see if I can get it in by the merge window. If the merge
window opens this weekend, I'll add this to my 3.9 queue.

-- Steve
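
For anyone following along, the type distinction in the quoted changelog (pointer to an array of 1024 char versus pointer to char) can be checked in plain user-space C. This is an illustration only: `struct buf_holder` is a hypothetical stand-in, and no per-cpu helpers are involved.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_SIZE 1024

/* Hypothetical stand-in for a structure wrapping a fixed-size buffer. */
struct buf_holder {
	char buffer[BUF_SIZE];
};

/* Same address either way, but only &h->buffer[0] is already a char *. */
char *first_char(struct buf_holder *h)
{
	char (*whole)[BUF_SIZE] = &h->buffer;	/* type: char (*)[1024] */
	char *first = &h->buffer[0];		/* type: char * */

	assert((void *)whole == (void *)first);
	return first;	/* no cast needed */
}

/* Size of what each pointer type points at. */
size_t pointee_size_whole(struct buf_holder *h)
{
	return sizeof(*(&h->buffer));		/* the whole array */
}

size_t pointee_size_first(struct buf_holder *h)
{
	return sizeof(*(&h->buffer[0]));	/* a single char */
}
```

Returning `whole` from a function declared to return char * would need a cast; returning `first` does not, which is the cleanup the patch is after.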




[PATCH] regulator: lp872x: Kill _rdev_to_offset() function

2012-12-07 Thread Axel Lin
There is only one user calling the _rdev_to_offset() function.
Removing _rdev_to_offset() makes the code simpler.

Signed-off-by: Axel Lin 
---
 drivers/regulator/lp872x.c |   36 +++-
 1 file changed, 7 insertions(+), 29 deletions(-)

diff --git a/drivers/regulator/lp872x.c b/drivers/regulator/lp872x.c
index 9289ead..8e3c7ae 100644
--- a/drivers/regulator/lp872x.c
+++ b/drivers/regulator/lp872x.c
@@ -181,20 +181,6 @@ static inline int lp872x_update_bits(struct lp872x *lp, u8 addr,
return regmap_update_bits(lp->regmap, addr, mask, data);
 }
 
-static int _rdev_to_offset(struct regulator_dev *rdev)
-{
-   enum lp872x_regulator_id id = rdev_get_id(rdev);
-
-   switch (id) {
-   case LP8720_ID_LDO1 ... LP8720_ID_BUCK:
-   return id;
-   case LP8725_ID_LDO1 ... LP8725_ID_BUCK2:
-   return id - LP8725_ID_BASE;
-   default:
-   return -EINVAL;
-   }
-}
-
 static int lp872x_get_timestep_usec(struct lp872x *lp)
 {
enum lp872x_id chip = lp->chipid;
@@ -234,28 +220,20 @@ static int lp872x_get_timestep_usec(struct lp872x *lp)
 static int lp872x_regulator_enable_time(struct regulator_dev *rdev)
 {
struct lp872x *lp = rdev_get_drvdata(rdev);
-   enum lp872x_regulator_id regulator = rdev_get_id(rdev);
+   enum lp872x_regulator_id rid = rdev_get_id(rdev);
int time_step_us = lp872x_get_timestep_usec(lp);
-   int ret, offset;
+   int ret;
u8 addr, val;
 
if (time_step_us < 0)
return -EINVAL;
 
-   switch (regulator) {
-   case LP8720_ID_LDO1 ... LP8720_ID_LDO5:
-   case LP8725_ID_LDO1 ... LP8725_ID_LILO2:
-   offset = _rdev_to_offset(rdev);
-   if (offset < 0)
-   return -EINVAL;
-
-   addr = LP872X_LDO1_VOUT + offset;
-   break;
-   case LP8720_ID_BUCK:
-   addr = LP8720_BUCK_VOUT1;
+   switch (rid) {
+   case LP8720_ID_LDO1 ... LP8720_ID_BUCK:
+   addr = LP872X_LDO1_VOUT + rid;
break;
-   case LP8725_ID_BUCK1:
-   addr = LP8725_BUCK1_VOUT1;
+   case LP8725_ID_LDO1 ... LP8725_ID_BUCK1:
+   addr = LP872X_LDO1_VOUT + rid - LP8725_ID_BASE;
break;
case LP8725_ID_BUCK2:
addr = LP8725_BUCK2_VOUT1;
-- 
1.7.9.5





Re: [ 00/11] 3.0.56-stable review

2012-12-07 Thread satoru takeuchi
Hi Greg,

2012/12/7 Greg Kroah-Hartman :
> This is the start of the stable review cycle for the 3.0.56 release.
> There are 11 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.

This kernel builds and boots without any problems.
Building a kernel while running this kernel also works fine.

 - Build Machine: debian wheezy x86_64
   CPU: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz x 4
   memory: 8GB

 - Test machine: debian wheezy x86_64(KVM guest on the Build Machine)
   vCPU: x2
   memory: 2GB

I reviewed the following patches and they look good to me.

> Jan Kara 
>scsi: Silence unnecessary warnings about ioctl to partition
>
> Alan Cox 
> ACPI: missing break
>
> Mike Galbraith 
> Revert "sched, autogroup: Stop going ahead if autogroup is disabled"
>
> Mike Galbraith 
> workqueue: exit rescuer_thread() as TASK_RUNNING
>
> Naoya Horiguchi 
> mm: soft offline: split thp at the beginning of soft_offline_page()
>
> Jianguo Wu 
> mm/vmemmap: fix wrong use of virt_to_page

Thanks,
Satoru

>
> Responses should be made by Sun Dec  9 00:55:07 UTC 2012.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.0.56-rc1.gz
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
>
> -
> Pseudo-Shortlog of commits:
>
> Greg Kroah-Hartman 
> Linux 3.0.56-rc1
>
> Jan Kara 
> scsi: Silence unnecessary warnings about ioctl to partition
>
> Chris Wilson 
> drm/i915: Add no-lvds quirk for Supermicro X7SPA-H
>
> Calvin Walton 
> i915: Quirk no_lvds on Gigabyte GA-D525TUD ITX motherboard
>
> Alan Cox 
> ACPI: missing break
>
> Michal Kubecek 
> route: release dst_entry.hh_cache when handling redirects
>
> Mike Galbraith 
> Revert "sched, autogroup: Stop going ahead if autogroup is disabled"
>
> Mike Galbraith 
> workqueue: exit rescuer_thread() as TASK_RUNNING
>
> Naoya Horiguchi 
> mm: soft offline: split thp at the beginning of soft_offline_page()
>
> Jianguo Wu 
> mm/vmemmap: fix wrong use of virt_to_page
>
> Russell King - ARM Linux 
> Dove: Fix irq_to_pmu()
>
> Russell King - ARM Linux 
> Dove: Attempt to fix PMU/RTC interrupts
>
>
> -
>
> Diffstat:
>
>  Makefile |  4 ++--
>  arch/arm/mach-dove/include/mach/pm.h |  2 +-
>  arch/arm/mach-dove/irq.c | 14 +-
>  block/scsi_ioctl.c   |  5 -
>  drivers/acpi/processor_driver.c  |  1 +
>  drivers/gpu/drm/i915/intel_lvds.c| 16 
>  kernel/sched_autogroup.c |  4 
>  kernel/sched_autogroup.h |  5 -
>  kernel/workqueue.c   |  4 +++-
>  mm/memory-failure.c  |  8 
>  mm/sparse.c  | 10 --
>  net/ipv4/route.c |  4 
>  12 files changed, 56 insertions(+), 21 deletions(-)
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe stable" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] tty: remove trailing spaces in tty/vt

2012-12-07 Thread Cong Ding
Remove trailing whitespace in tty/vt using the shell command:
sed 's/\s\+$//g' -i

I have manually reviewed that everything is correct.
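
For reference, the per-line effect of that substitution can be sketched in C. This is an illustration under assumptions, not what was used for the patch: `strip_trailing_blanks()` is a hypothetical helper, and it only handles spaces and tabs, whereas \s in GNU sed also matches other whitespace characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Strip trailing spaces and tabs from a NUL-terminated line in place,
 * keeping a final '\n' if there is one. Returns the number of bytes removed.
 */
size_t strip_trailing_blanks(char *line)
{
	size_t len, end;
	int had_newline = 0;

	assert(line != NULL);
	len = end = strlen(line);
	if (end > 0 && line[end - 1] == '\n') {
		had_newline = 1;
		end--;
	}
	while (end > 0 && (line[end - 1] == ' ' || line[end - 1] == '\t'))
		end--;
	if (had_newline)
		line[end++] = '\n';
	line[end] = '\0';
	return len - end;
}
```

Leading indentation is left alone, so a line consisting only of blanks collapses to an empty line, matching the "-   " to "+" hunks in the diff below.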


Signed-off-by: Cong Ding 
---
 drivers/tty/vt/consolemap.c |   62 +-
 drivers/tty/vt/keyboard.c   |4 +-
 drivers/tty/vt/vc_screen.c  |2 +-
 drivers/tty/vt/vt.c |   14 +-
 drivers/tty/vt/vt_ioctl.c   |   24 
 5 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
index 248381b..ffb5855 100644
--- a/drivers/tty/vt/consolemap.c
+++ b/drivers/tty/vt/consolemap.c
@@ -58,7 +58,7 @@ static unsigned short translations[][256] = {
 0x00e8, 0x00e9, 0x00ea, 0x00eb, 0x00ec, 0x00ed, 0x00ee, 0x00ef,
 0x00f0, 0x00f1, 0x00f2, 0x00f3, 0x00f4, 0x00f5, 0x00f6, 0x00f7,
 0x00f8, 0x00f9, 0x00fa, 0x00fb, 0x00fc, 0x00fd, 0x00fe, 0x00ff
-  }, 
+  },
   /* VT100 graphics mapped to Unicode */
   {
 0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
@@ -96,7 +96,7 @@ static unsigned short translations[][256] = {
   },
   /* IBM Codepage 437 mapped to Unicode */
   {
-0x0000, 0x263a, 0x263b, 0x2665, 0x2666, 0x2663, 0x2660, 0x2022, 
+0x0000, 0x263a, 0x263b, 0x2665, 0x2666, 0x2663, 0x2660, 0x2022,
 0x25d8, 0x25cb, 0x25d9, 0x2642, 0x2640, 0x266a, 0x266b, 0x263c,
 0x25b6, 0x25c0, 0x2195, 0x203c, 0x00b6, 0x00a7, 0x25ac, 0x21a8,
 0x2191, 0x2193, 0x2192, 0x2190, 0x221f, 0x2194, 0x25b2, 0x25bc,
@@ -128,7 +128,7 @@ static unsigned short translations[][256] = {
 0x03a6, 0x0398, 0x03a9, 0x03b4, 0x221e, 0x03c6, 0x03b5, 0x2229,
 0x2261, 0x00b1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00f7, 0x2248,
 0x00b0, 0x2219, 0x00b7, 0x221a, 0x207f, 0x00b2, 0x25a0, 0x00a0
-  }, 
+  },
   /* User mapping -- default to codes for direct font mapping */
   {
 0xf000, 0xf001, 0xf002, 0xf003, 0xf004, 0xf005, 0xf006, 0xf007,
@@ -189,12 +189,12 @@ static void set_inverse_transl(struct vc_data *conp, 
struct uni_pagedir *p, int
int j, glyph;
unsigned short *t = translations[i];
unsigned char *q;
-   
+
if (!p) return;
q = p->inverse_translations[i];
 
if (!q) {
-   q = p->inverse_translations[i] = (unsigned char *) 
+   q = p->inverse_translations[i] = (unsigned char *)
kmalloc(MAX_GLYPH, GFP_KERNEL);
if (!q) return;
}
@@ -284,7 +284,7 @@ static void update_user_maps(void)
 {
int i;
struct uni_pagedir *p, *q = NULL;
-   
+
for (i = 0; i < MAX_NR_CONSOLES; i++) {
if (!vc_cons_allocated(i))
continue;
@@ -375,12 +375,12 @@ int con_get_trans_new(ushort __user * arg)
for (i=0; i current font conversion 
+ * Unicode -> current font conversion
  *
  * A font has at most 512 chars, usually 256.
  * But one font position may represent several Unicode chars.
@@ -397,7 +397,7 @@ static void con_release_unimap(struct uni_pagedir *p)
u16 **p1;
int i, j;
 
-   if (p == dflt) dflt = NULL;  
+   if (p == dflt) dflt = NULL;
for (i = 0; i < 32; i++) {
if ((p1 = p->uni_pgdir[i]) != NULL) {
for (j = 0; j < 32; j++)
@@ -428,12 +428,12 @@ void con_free_unimap(struct vc_data *vc)
con_release_unimap(p);
kfree(p);
 }
-  
+
 static int con_unify_unimap(struct vc_data *conp, struct uni_pagedir *p)
 {
int i, j, k;
struct uni_pagedir *q;
-   
+
for (i = 0; i < MAX_NR_CONSOLES; i++) {
if (!vc_cons_allocated(i))
continue;
@@ -489,7 +489,7 @@ con_insert_unipair(struct uni_pagedir *p, u_short unicode, 
u_short fontpos)
}
 
p2[unicode & 0x3f] = fontpos;
-   
+
p->sum += (fontpos << 20) + unicode;
 
return 0;
@@ -531,7 +531,7 @@ int con_clear_unimap(struct vc_data *vc, struct unimapinit 
*ui)
console_unlock();
return ret;
 }
-   
+
 int con_set_unimap(struct vc_data *vc, ushort ct, struct unipair __user *list)
 {
int err = 0, err1, i;
@@ -545,22 +545,22 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct 
unipair __user *list)
console_unlock();
return -EIO;
}
-   
+
if (!ct) {
console_unlock();
return 0;
}
-   
+
if (p->refcount > 1) {
int j, k;
u16 **p1, *p2, l;
-   
+
err1 = con_do_clear_unimap(vc, NULL);
if (err1) {
console_unlock();
return err1;
}
-   
+
/*
 * Since refcount was > 1, con_clear_unimap() allocated a
 * a new uni_pagedir for this vc.  Re: p != q
@@ -591,7 +591,7 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct 
unipair 

[PATCH 1/2] tty: remove trailing spaces in tty/hvc

2012-12-07 Thread Cong Ding
remove trailing blank spaces in tty/hvc by shell command:
sed 's/\s\+$//g' -i

I have manually reviewed that everything is correct.

Signed-off-by: Cong Ding 
---
 drivers/tty/hvc/hvc_console.c |6 +++---
 drivers/tty/hvc/hvc_xen.c |2 +-
 drivers/tty/hvc/hvcs.c|4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index 13ee53b..088fdfc 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -11,12 +11,12 @@
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
  * (at your option) any later version.
- * 
+ *
  * This program is distributed in the hope that it will be useful,
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
- * 
+ *
  * You should have received a copy of the GNU General Public License
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
@@ -187,7 +187,7 @@ static struct tty_driver *hvc_console_device(struct console 
*c, int *index)
 }
 
 static int __init hvc_console_setup(struct console *co, char *options)
-{  
+{
if (co->index < 0 || co->index >= MAX_NR_HVC_CONSOLES)
return -ENODEV;
 
diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c
index 19843ec..c598d55 100644
--- a/drivers/tty/hvc/hvc_xen.c
+++ b/drivers/tty/hvc/hvc_xen.c
@@ -126,7 +126,7 @@ static int domU_write_console(uint32_t vtermno, const char 
*data, int len)
 */
while (len) {
int sent = __write_console(cons, data, len);
-   
+
data += sent;
len -= sent;
 
diff --git a/drivers/tty/hvc/hvcs.c b/drivers/tty/hvc/hvcs.c
index 8776357..5f5ddf5 100644
--- a/drivers/tty/hvc/hvcs.c
+++ b/drivers/tty/hvc/hvcs.c
@@ -166,8 +166,8 @@ MODULE_VERSION(HVCS_DRIVER_VERSION);
 /*
  * The hcall interface involves putting 8 chars into each of two registers.
  * We load up those 2 registers (in arch/powerpc/platforms/pseries/hvconsole.c)
- * by casting char[16] to long[2].  It would work without __ALIGNED__, but a 
- * little (tiny) bit slower because an unaligned load is slower than aligned 
+ * by casting char[16] to long[2].  It would work without __ALIGNED__, but a
+ * little (tiny) bit slower because an unaligned load is slower than aligned
  * load.
  */
 #define __ALIGNED__ __attribute__((__aligned__(8)))
-- 
1.7.4.5



Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Chris Mason
On Fri, Dec 07, 2012 at 05:17:05PM -0700, Dave Chinner wrote:
> On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:

[ dead and beaten fallocate ponies ]

> 
> > On a single flash drive doing random 4K writes, xfs does 950MB/s into
> > regular extents but only 400MB/s into preallocated extents.
> > 
> > http://masoncoding.com/presentation/perf-linuxcon12/fallocate.png
> 
> This is bordering on irrelevancy, but can you provide the workload
> you were running to generate this graph?  Random 4k writes could be
> anything, really.

This one was fio aio/dio, I'll dig out the job file and rerun it on
3.7-rc on Monday.  Any real random write is going to show this with
enough load.

> 
> In my experience, applications that actually do processing between
> random write IOs don't see anywhere near the same degradation as
> such micro-benchmarks tend to indicate can occur with unwritten
> extents. Are you seeing this level of degradation in real-world applications?
> If you give me a reason to fix it (and the hardware to test it on),
> I'm pretty sure I can bring the overhead down to just a few percent
> on fully featured SSDs like FusionIO devices...

We should have a card I can send, drop me the address.

For the workload...that's harder.  We can talk all day about what a
normal random write workload is, but if you have a fio job that you
think represents real world, I can run that.

[ much nodding ;) ]


Incrementing module reference count

2012-12-07 Thread Aaron Williams

Hi,

I have a kernel module which other modules register with in order to 
export access functions. So far I have everything working but I want to 
prevent a module that is registered with my module from unloading since 
now my module is dependent on the other module.


Is there a way I can cause the reference count of the module registering 
with my module to increase? I tried calling get_device with the device 
structure of the module that is registering but that does not seem to work.


For example, I have the following function:

/**
 * Adds a mapping of a device node to a memory accessor
 *
 * @param[in] dev - device
 * @param[in] macc - memory accessor
 *
 * @returns 0 for success or -ENOMEM
 */
int of_memory_accessor_register(struct device *dev,
				struct memory_accessor *macc)
{
	struct of_macc_entry *mentry;

	mentry = kmalloc(sizeof(*mentry), GFP_KERNEL);
	if (mentry == NULL)
		return -ENOMEM;

	mentry->dev = dev;
	mentry->macc = macc;

	mutex_lock();

	get_device(dev);
	list_add(&(mentry->list), _list);

	mutex_unlock();

	return 0;
}
EXPORT_SYMBOL(of_memory_accessor_register);

Basically my module is used for things like serial EEPROMs and whatnot 
so that external modules can find the accessor functions based on the 
device tree. In my case I am updating the Vitesse VSC848X driver so that 
it can read the SFP module when it is plugged in using the AT24 I2C 
EEPROM module. I want to prevent the at24 module from unloading while 
other modules in turn are using it. The at24 module does not export any 
symbols.


NOTE that I plan to change the above code so that I only increment the
EEPROM module's reference count when another module actually uses it and
release it only when all other modules are no longer using the accessor
functions.


I also am not subscribed to the LKML so please CC me on any responses.

-Aaron

--
Aaron Williams
Software Engineer
Cavium, Inc.
(408) 943-7198  (510) 789-8988 (cell)
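
For reference: get_device() only pins the struct device object; it does not
keep the code of the module that owns the bound driver loaded. The usual
kernel idiom for that is try_module_get() on the driver's owner, paired with
module_put(). What follows is a minimal userspace model of that pattern, not
kernel code. The struct definitions are cut-down stand-ins for the kernel
types, and macc_pin_provider()/macc_unpin_provider() are hypothetical helper
names that do not appear in the original post.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures; only the fields used here. */
struct module { int refcnt; bool going; };
struct device_driver { struct module *owner; };
struct device { struct device_driver *driver; };

/*
 * Model of the kernel's try_module_get(): a NULL module (built-in code)
 * always succeeds, a module that has started unloading cannot be pinned.
 */
bool try_module_get(struct module *mod)
{
	if (mod == NULL)
		return true;
	if (mod->going)
		return false;
	mod->refcnt++;
	return true;
}

/* Model of module_put(): drop a reference taken by try_module_get(). */
void module_put(struct module *mod)
{
	if (mod != NULL)
		mod->refcnt--;
}

/* Hypothetical helper: pin the module that owns the driver bound to dev. */
int macc_pin_provider(struct device *dev)
{
	if (dev == NULL || dev->driver == NULL)
		return -ENODEV;
	if (!try_module_get(dev->driver->owner))
		return -ENODEV;
	return 0;
}

/* Hypothetical helper: release the reference taken by macc_pin_provider(). */
void macc_unpin_provider(struct device *dev)
{
	if (dev != NULL && dev->driver != NULL)
		module_put(dev->driver->owner);
}
```

In this model the pin would be taken when a consumer starts using the
accessor and dropped when it stops, which matches the plan described in the
NOTE above.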



Re: [PATCH] Document how capability bits work

2012-12-07 Thread Andy Lutomirski
On Fri, Dec 7, 2012 at 5:10 PM, Rob Landley  wrote:
> On 12/07/2012 01:32:18 PM, Andy Lutomirski wrote:
>>
>> On Fri, Dec 7, 2012 at 11:21 AM, Serge Hallyn
>>  wrote:
>> > Quoting Andy Lutomirski (l...@amacapital.net):
>> >> Signed-off-by: Andy Lutomirski 
>> >> ---
>> >>  Documentation/security/capabilities.txt | 161
>> >> 
>> >>  1 file changed, 161 insertions(+)
>> >>  create mode 100644 Documentation/security/capabilities.txt
>> >
>> > TBH, I think a pointer to the capabilities.7 man page would be better.
>> > (plus, if you feel they are needed, updates to the man page)
>>
>> Updating capabilities.7 wouldn't be a bad idea, but IMO it certainly
>> needs work.  For example, it says:
>
> ...
>
>> I would be happy to revise this patch to reference capabilities.7.
>
>
> The capabilities.7 man page is existing maintained documentation on how to
> use this from userspace, which seems to be the point of your document.
> Having include/linux/uapi/capability.h mention its existence might be good.
> Feeding fixes to the documentation we've already got would be good.
>
> I read your document having largely ignored capabilities for years, and
> don't feel I have a better understanding of them after reading it. (I'm
> aware they exist, I'm aware they're used as a justification for extended
> attributes, I'm aware people think breaking a fireplace into a bunch of
> candleflames increases fire safety. I'm aware of
> http://forums.grsecurity.net/viewtopic.php?f=7=2522 and I _used_ to be
> aware of
> http://userweb.kernel.org/~morgan/sendmail-capabilities-war-story.html but
> kernel.org never bothered putting most of itself back together after the
> breakin last year and archive.org doesn't have a copy. I'm aware that a
> decade ago at Atlanta Linux Showcase in california Ted Tso was sad nobody
> was using them yet. But I haven't hugely been tracking changes over the last
> 5 years in how they work. It looks like figuring out who has what involves
> working through exercises in set theory that cannot be explained using a 127
> bit ascii set. Personally, I prefer "more dangerous" security setups that
> don't require I pull out scratch paper to reason about the state of the
> system, so perhaps I'm biased here.)

Heh.  I agree this stuff is shockingly complicated.  (And this
document isn't written in ASCII...)

I actually wrote this file because I was reading the code and trying
to figure out wtf was going on.  This is the result :)  I'll see if I
can improve capabilities.7.

Any pointers to things you wanted to understand?

--Andy
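
To make the "set theory" concrete, here is a small sketch of the execve()
transformation as capabilities(7) described it at the time of this thread. It
is a simplified model under stated assumptions: capability sets are plain
64-bit masks, the set-user-ID-root special cases and securebits are ignored,
and the type and function names (cap_set, proc_caps, file_caps,
caps_after_exec) are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cap_set;	/* one bit per capability (illustrative) */

struct proc_caps {
	cap_set permitted;
	cap_set inheritable;
	cap_set effective;
};

struct file_caps {
	cap_set permitted;
	cap_set inheritable;
	bool effective;		/* the file "effective" flag is a single bit */
};

/*
 * P'(permitted)   = (P(inheritable) & F(inheritable)) | (F(permitted) & bset)
 * P'(effective)   = F(effective) ? P'(permitted) : 0
 * P'(inheritable) = P(inheritable)
 */
struct proc_caps caps_after_exec(struct proc_caps p, struct file_caps f,
				 cap_set bset)
{
	struct proc_caps n;

	n.permitted = (p.inheritable & f.inheritable) | (f.permitted & bset);
	n.effective = f.effective ? n.permitted : 0;
	n.inheritable = p.inheritable;
	return n;
}
```

Note that the old permitted and effective sets do not appear on the
right-hand side at all; only the inheritable set carries across the exec in
this simplified form.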


Re: [PATCH v2 04/44] trace/ring_buffer: handle 64bit aligned structs

2012-12-07 Thread Steven Rostedt
On Wed, 2012-12-05 at 16:08 +0000, James Hogan wrote:
> Some 32 bit architectures have 64 bit struct alignment (for example
> Meta which has 64 bit read/write instructions). These require 8 byte
> alignment of event data too, so use CONFIG_HAVE_64BIT_ALIGNED_STRUCT
> instead of CONFIG_64BIT to decide alignment, and align
> buffer_data_page::data accordingly.
> 
> Signed-off-by: James Hogan 

Acked-by: Steven Rostedt 

-- Steve

> Cc: Frederic Weisbecker 
> Cc: Ingo Molnar 
> ---
>  kernel/trace/ring_buffer.c |7 +--
>  1 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index b979426..c4dc029 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -177,7 +177,8 @@ void tracing_off_permanent(void)
>  #define RB_MAX_SMALL_DATA (RB_ALIGNMENT * RINGBUF_TYPE_DATA_TYPE_LEN_MAX)
>  #define RB_EVNT_MIN_SIZE 8U  /* two 32bit words */
>  
> -#if !defined(CONFIG_64BIT) || defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
> +#if !defined(CONFIG_HAVE_64BIT_ALIGNED_STRUCT) || \
> + defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
>  # define RB_FORCE_8BYTE_ALIGNMENT 0
>  # define RB_ARCH_ALIGNMENT   RB_ALIGNMENT
>  #else
> @@ -185,6 +186,8 @@ void tracing_off_permanent(void)
>  # define RB_ARCH_ALIGNMENT   8U
>  #endif
>  
> +#define RB_ALIGN_DATA __aligned(RB_ARCH_ALIGNMENT)
> +
>  /* define RINGBUF_TYPE_DATA for 'case RINGBUF_TYPE_DATA:' */
>  #define RINGBUF_TYPE_DATA 0 ... RINGBUF_TYPE_DATA_TYPE_LEN_MAX
>  
> @@ -333,7 +336,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_event_data);
>  struct buffer_data_page {
>   u64  time_stamp; /* page time stamp */
>   local_t  commit; /* write committed index */
> - unsigned char data[]; /* data of buffer page */
> + unsigned char data[] RB_ALIGN_DATA;  /* data of buffer page */
>  };
>  
>  /*
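
The effect of aligning the flexible array member can be checked in userspace.
This is only a sketch of the idea in the patch, assuming a GCC or Clang
compiler: the two structs are stand-ins for buffer_data_page, with a uint32_t
modelling a 4-byte local_t as found on a 32-bit architecture, and the struct
and function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Without the attribute, data[] starts right after the 4-byte commit field. */
struct page_plain {
	uint64_t time_stamp;
	uint32_t commit;		/* models a 32-bit local_t */
	unsigned char data[];
};

/* With the attribute, the compiler pads so that data[] is 8-byte aligned. */
struct page_aligned {
	uint64_t time_stamp;
	uint32_t commit;
	unsigned char data[] __attribute__((aligned(8)));
};

size_t plain_data_offset(void)
{
	return offsetof(struct page_plain, data);
}

size_t aligned_data_offset(void)
{
	return offsetof(struct page_aligned, data);
}
```

With these layouts the plain struct places data[] at offset 12, while the
aligned one moves it to offset 16, so 64-bit event payloads stored there start
on an 8-byte boundary.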




Re: PATCH reduce impact of FIFREEZE on userland processes

2012-12-07 Thread Dave Chinner
On Fri, Dec 07, 2012 at 08:59:52AM +0000, Alun wrote:
> Dave Chinner  said, in message
> 20121207004255.GC27172@dastard:
> > 
> > The problem with doing this is that the sync can delay the freeze
> > process by quite some time under the exact conditions you describe.
> > If you want freeze to take effect immediately (i.e. instantly stop
> > new modifications), then adding a sync will break this semantic.
> > There are existing users of freeze that require this behaviour...
> 
> Ahh, that would be the subtlety I was worried might exist! Thanks.
> 
> The specific issue that brought me here was that, on a fairly heavily
> loaded file server (>1000 connected Windows clients), taking an LVM
> snapshot caused enough of an interruption to service that many of the
> Windows clients disconnected and reconnected, so causing a huge process
> load on the server - enough that we'd completely lose service and have
> to reboot. Chasing this down, I noticed that FIFREEZE does a filesystem
> sync, and it seemed to me that adding another one prior to blocking
> writes was an easy hit. 

Yup, that's typical.

> I'm not trying to argue my case here - you've convinced me that this
> change in semantics is risky and removes flexibility.
> 
> I'll try and chase this up by submitting patches to lvcreate and
> fsfreeze (in the former case, I think there's no reason not to run
> syncfs; in the latter perhaps it should be a command line option).

Is that even necessary? Users can issue the sync themselves if
necessary.

> > That, to me, is irrelevant, because something is normally done while
> > the filesystem is frozen. It's not uncommon for freeze periods to
> > extend to minutes while work is done by whatever required the
> > freeze. Hence the few seconds it takes to achieve the frozen state is
> > mostly irrelevant.
> 
> You've referred twice to existing systems that would break in the
> presence of this change.  I'm really having trouble thinking of a
> situation where it's critical to have writes suspended *NOW* and where
> it's valid to keep them suspended for minutes.

Say you get your filesystem reporting a read error in a directory.
There are people out there that will immediately freeze the
filesystem (to prevent potential damage from being propagated) while
they investigate the problem and determine their next action. This
may even involve running non-modifying fsck on the underlying block
device while the filesystem is frozen...

Then there are systems like HA servers that share a filesystem in a
primary/secondary setup - freezes are often used in failover
situations. This ensures all cached dirty data is written to disk
in preparation for the other node to mount it. Freezing the
filesystem ensures that spurious errors are not returned to
applications/clients while the failover takes place. Hence the
filesystem can remain frozen for some time while everything on the
new primary node is started up and fences/STONITHs the frozen
node.

Then there's co-ordinating management operations on filesystems that
span multiple storage arrays (e.g. for hardware based snapshots,
cloning, etc), VM guest migration between two physical hosts, and so
on. Freeze is used for a lot more things than LVM snapshots...

> I'd have thought that,
> in the vast majority of cases, the critical thing was to minimise the
> time for which writes were suspended.

In the obvious use cases, yes. Once you look outside snapshots to
consider applications that need a stable, unchanging filesystem in
an application transparent manner, you'll find lots of interesting
uses for FIFREEZE.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
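
The "issue the sync themselves" approach can be sketched from userspace as
follows. This is an illustration only, assuming Linux with glibc: syncfs() and
the FIFREEZE/FITHAW ioctls are real interfaces, while the function names
sync_then_freeze() and thaw_fs() are invented here. The caller is assumed to
pass an open descriptor on the mount point and to hold CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE		/* for syncfs() */
#include <errno.h>		/* callers inspect errno on failure */
#include <linux/fs.h>		/* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Flush dirty data while writers can still make progress, then freeze.
 * Returns 0 on success, -1 with errno set on failure.
 */
int sync_then_freeze(int fd)
{
	if (syncfs(fd) < 0)
		return -1;
	return ioctl(fd, FIFREEZE, 0) < 0 ? -1 : 0;
}

/* Undo a freeze. Returns 0 on success, -1 with errno set on failure. */
int thaw_fs(int fd)
{
	return ioctl(fd, FITHAW, 0) < 0 ? -1 : 0;
}
```

Keeping the sync in the caller leaves FIFREEZE itself with its current
"stop new modifications now" behaviour for the users described above, and
lets snapshot tools opt in to the shorter stall.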


Re: [RFC PATCH v3 0/3] acpi: Introduce prepare_remove device operation

2012-12-07 Thread Toshi Kani
On Fri, 2012-12-07 at 13:57 +0800, Jiang Liu wrote:
> On 2012-12-7 10:57, Toshi Kani wrote:
> > On Fri, 2012-12-07 at 00:40 +0800, Jiang Liu wrote:
> >> On 12/04/2012 08:10 AM, Toshi Kani wrote:
> >>> On Mon, 2012-12-03 at 12:25 +0800, Hanjun Guo wrote:
>  On 2012/11/30 6:27, Toshi Kani wrote:
> > On Thu, 2012-11-29 at 12:48 +0800, Hanjun Guo wrote:
 :
> >>>
> >>> If I read the code right, the framework calls ACPI drivers differently
> >>> at boot-time and hot-add as follows.  That is, the new entry points are
> >>> called at hot-add only, but .add() is called at both cases.  This
> >>> requires .add() to work differently.
> >>>
> >>> Boot: .add()
> >>> Hot-Add : .add(), .pre_configure(), configure(), etc.
> >>>
> >>> I think the boot-time and hot-add initialization should be done
> >>> consistently.  While there is difficulty with the current boot sequence,
> >>> the framework should be designed to allow them consistent, not make them
> >>> diverged.
> >> Hi Toshi,
> >> We have separated hotplug operations from the driver binding/unbinding
> >> interface due to the following considerations.
> >> 1) Physical CPU and memory devices are initialized/used before the ACPI
> >>    subsystem is initialized. So in the normal case, .add() of processor
> >>    and acpi_memhotplug only figures out information about a device that
> >>    is already in working state instead of starting the device.
> > 
> > I agree that the current boot sequence is not very hot-plug friendly...
> > 
> >> 2) It's impossible to rmmod the processor and acpi_memhotplug drivers at
> >>    runtime if .remove() of the CPU and memory drivers really removes the
> >>    CPU/memory device from the system. And the ACPI processor driver also
> >>    implements CPU PM functionality other than hotplug.
> > 
> > Agreed.
> > 
> >> And recently Rafael has mentioned that he has a long term view to get rid
> >> of the concept of "ACPI device". If that happens, we could easily move the
> >> hotplug logic from ACPI device drivers into the hotplug framework if the
> >> hotplug logic is separated from the .add()/.remove() callbacks. Actually
> >> we could even move all hotplug only logic into the hotplug framework and
> >> not rely on any ACPI device driver any more. So we could get rid of all
> >> these messy things. We could achieve that by:
> >> 1) moving code shared by ACPI device drivers and the hotplug framework
> >>    into the core.
> >> 2) moving hotplug only code to the framework.
> > 
> > Yes, the framework should allow such future work.  I also think that the
> > framework itself should be independent from such ACPI issue.  Ideally,
> > it should be able to support non-ACPI platforms.
> The same point here. The ACPI based hotplug framework is designed as:
> 1) an ACPI based hotplug slot driver to handle platform specific logic.
>Platform may provide platform specific slot drivers to discover, manage
>hotplug slots. We have provided a default implementation of slot driver
>according to the ACPI spec.

The ACPI spec does not define that _EJ0 is required to receive a hot-add
request, i.e. bus/device check.  This is a major issue.  Since Windows
only supports hot-add, I think there are platforms that only support
hot-add today.

> 2) an ACPI based hotplug manager driver, which is a platform independent
>driver and manages all hotplug slot created by the slot driver.

It is surely impressive work, but I think it is a bit overdone.  I
expect hot-pluggable servers to come with a management console and/or GUI
where a user can manage hardware units and initiate hot-plug operations.
I do not think the kernel needs to step into such an area since it tends to
be platform-specific.

> We haven't gone far enough to provide an ACPI independent hotplug framework
> because we only have experience with x86 and Itanium, both of which are ACPI
> based. We may try to implement an ACPI independent hotplug framework by
> pushing all ACPI specific logic into the slot driver, I think it's doable.
> But we need suggestions from experts of other architectures, such as SPARC
> and Power. But it seems Power already has some sort of hotplug framework,
> right?

I do not know about the Linux hot-plug support on other architectures.
PA-RISC SuperDome also supports Node hot-plug, but it is not supported
by Linux.  Since ARM is getting used by servers, I would not be surprised if
there is an ARM based server with hot-plug support in the future.

> >> Hi Rafael, what's your thoughts here?
> >>
> >>>
> >>> 1. Validate phase - Verify if the request is a supported operation.  All
> >>> known restrictions are verified at this phase.  For instance, if a
> >>> hot-remove request involves kernel memory, it is failed in this phase.
> >>> Since this phase makes no change, no rollback is necessary to fail.
> >>
> >> Yes, we have done this in acpihp_drv_pre_execute, and check following 
> 

Re: [PATCH v2 1/2] zsmalloc: add function to query object size

2012-12-07 Thread Nitin Gupta
On Sun, Dec 2, 2012 at 11:52 PM, Minchan Kim  wrote:
> On Sun, Dec 02, 2012 at 11:20:42PM -0800, Nitin Gupta wrote:
>>
>>
>> On Nov 30, 2012, at 5:54 AM, Minchan Kim  wrote:
>>
>> > On Thu, Nov 29, 2012 at 10:54:48PM -0800, Nitin Gupta wrote:
>> >> Changelog v2 vs v1:
>> >> - None
>> >>
>> >> Adds zs_get_object_size(handle) which provides the size of
>> >> the given object. This is useful since users (zram etc.)
>> >> no longer have to maintain object sizes separately, saving
>> >> on some metadata size (4b per page).
>> >>
>> >> The object handle encodes <page, offset> pair which currently points
>> >> to the start of the object. Now, the handle implicitly stores the size
>> >> information by pointing to the object's end instead. Since zsmalloc is
>> >> a slab based allocator, the start of the object can be easily determined
>> >> and the difference between the end offset encoded in the handle and the
>> >> start gives us the object size.
>> >>
>> >> Signed-off-by: Nitin Gupta 
>> > Acked-by: Minchan Kim 
>> >
>> > I already had a few comments on your previous version.
>> > I'm OK even though you ignored them, because I can make a follow-up patch
>> > for my nitpicks, but could you answer my question below?
>> >
>> >> ---
>> >> drivers/staging/zsmalloc/zsmalloc-main.c |  177 
>> >> +-
>> >> drivers/staging/zsmalloc/zsmalloc.h  |1 +
>> >> 2 files changed, 127 insertions(+), 51 deletions(-)
>> >>
>> >> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c 
>> >> b/drivers/staging/zsmalloc/zsmalloc-main.c
>> >> index 09a9d35..65c9d3b 100644
>> >> --- a/drivers/staging/zsmalloc/zsmalloc-main.c
>> >> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c
>> >> @@ -112,20 +112,20 @@
>> >> #define MAX_PHYSMEM_BITS 36
>> >> #else /* !CONFIG_HIGHMEM64G */
>> >> /*
>> >> - * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
>> >> + * If this definition of MAX_PHYSMEM_BITS is used, OFFSET_BITS will just
>> >>  * be PAGE_SHIFT
>> >>  */
>> >> #define MAX_PHYSMEM_BITS BITS_PER_LONG
>> >> #endif
>> >> #endif
>> >> #define _PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
>> >> -#define OBJ_INDEX_BITS (BITS_PER_LONG - _PFN_BITS)
>> >> -#define OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
>> >> +#define OFFSET_BITS (BITS_PER_LONG - _PFN_BITS)
>> >> +#define OFFSET_MASK ((_AC(1, UL) << OFFSET_BITS) - 1)
>> >>
>> >> #define MAX(a, b) ((a) >= (b) ? (a) : (b))
>> >> /* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
>> >> #define ZS_MIN_ALLOC_SIZE \
>> >> -MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
>> >> +MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OFFSET_BITS))
>> >> #define ZS_MAX_ALLOC_SIZE PAGE_SIZE
>> >>
>> >> /*
>> >> @@ -256,6 +256,11 @@ static int is_last_page(struct page *page)
>> >>return PagePrivate2(page);
>> >> }
>> >>
>> >> +static unsigned long get_page_index(struct page *page)
>> >> +{
>> >> +return is_first_page(page) ? 0 : page->index;
>> >> +}
>> >> +
>> >> static void get_zspage_mapping(struct page *page, unsigned int *class_idx,
>> >>enum fullness_group *fullness)
>> >> {
>> >> @@ -433,39 +438,86 @@ static struct page *get_next_page(struct page *page)
>> >>return next;
>> >> }
>> >>
>> >> -/* Encode <page, obj_idx> as a single handle value */
>> >> -static void *obj_location_to_handle(struct page *page, unsigned long obj_idx)
>> >> +static struct page *get_prev_page(struct page *page)
>> >> {
>> >> -unsigned long handle;
>> >> +struct page *prev, *first_page;
>> >>
>> >> -if (!page) {
>> >> -BUG_ON(obj_idx);
>> >> -return NULL;
>> >> -}
>> >> +first_page = get_first_page(page);
>> >> +if (page == first_page)
>> >> +prev = NULL;
>> >> +else if (page == (struct page *)first_page->private)
>> >> +prev = first_page;
>> >> +else
>> >> +prev = list_entry(page->lru.prev, struct page, lru);
>> >>
>> >> -handle = page_to_pfn(page) << OBJ_INDEX_BITS;
>> >> -handle |= (obj_idx & OBJ_INDEX_MASK);
>> >> +return prev;
>> >>
>> >> -return (void *)handle;
>> >> }
>> >>
>> >> -/* Decode <page, obj_idx> pair from the given object handle */
>> >> -static void obj_handle_to_location(unsigned long handle, struct page **page,
>> >> -unsigned long *obj_idx)
>> >> +static void *encode_ptr(struct page *page, unsigned long offset)
>> >> {
>> >> -*page = pfn_to_page(handle >> OBJ_INDEX_BITS);
>> >> -*obj_idx = handle & OBJ_INDEX_MASK;
>> >> +unsigned long ptr;
>> >> +ptr = page_to_pfn(page) << OFFSET_BITS;
>> >> +ptr |= offset & OFFSET_MASK;
>> >> +return (void *)ptr;
>> >> +}
>> >> +
>> >> +static void decode_ptr(unsigned long ptr, struct page **page,
>> >> +unsigned int *offset)
>> >> +{
>> >> +*page = pfn_to_page(ptr >> OFFSET_BITS);
>> >> +*offset = ptr & OFFSET_MASK;
>> >> +}
>> >> +
>> >> +static struct page *obj_handle_to_page(unsigned long handle)
>> >> +{

Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes

2012-12-07 Thread Darrick J. Wong
On Wed, Dec 05, 2012 at 01:12:28PM +0100, Jan Kara wrote:
> On Mon 26-11-12 18:17:40, Darrick J. Wong wrote:
> > On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote:
> > > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> > > > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > > > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara  wrote:
> > > > > 
> > > > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > > > > > ext3 doesn't properly isolate pages from changes during
> > > > > > > > > > writeback.  Since the recommended fix is to use ext4, for
> > > > > > > > > > now we'll just print a warning if the user tries to mount
> > > > > > > > > > in write mode.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Darrick J. Wong 
> > > > > > > > > ---
> > > > > > > > >  fs/ext3/super.c |8 
> > > > > > > > >  1 file changed, 8 insertions(+)
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > > > > index 5366393..5b3725d 100644
> > > > > > > > > --- a/fs/ext3/super.c
> > > > > > > > > +++ b/fs/ext3/super.c
> > > > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct 
> > > > > > > > > super_block *sb, struct ext3_super_block *es,
> > > > > > > > >   "forcing read-only mode");
> > > > > > > > >   res = MS_RDONLY;
> > > > > > > > >   }
> > > > > > > > > + if (!read_only &&
> > > > > > > > > + 
> > > > > > > > > queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > > > > + ext3_msg(sb, KERN_ERR,
> > > > > > > > > + "error: ext3 cannot safely write data 
> > > > > > > > > to a disk "
> > > > > > > > > + "requiring stable pages writes; forcing 
> > > > > > > > > read-only "
> > > > > > > > > + "mode.  Upgrading to ext4 is 
> > > > > > > > > recommended.");
> > > > > > > > > + res = MS_RDONLY;
> > > > > > > > > + }
> > > > > > > > >   if (read_only)
> > > > > > > > >   return res;
> > > > > > > > >   if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > > > > >   Why this? ext3 should be fixed by your change to
> > > > > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > > > > 
> > > > > > > > Yes, it's still broken even with this new set of changes.  Now
> > > > > > > > that I think about it a little more, I recall that writeback
> > > > > > > > mode was actually fine, so this is a little harsh.
> > > > > > > > 
> > > > > > > > Hm... looking at the ordered code a little more, it looks like
> > > > > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which
> > > > > > > > (I guess?) tries to write mapped buffers back through the
> > > > > > > > journal?  Taking it out seems to fix ordered mode, though I
> > > > > > > > have a suspicion that it might very well break ordered mode
> > > > > > > > too.
> > > > > >   Oh, right. kjournald writing buffers directly (without setting
> > > > > > PageWriteback) will break things. So please, change warning to:
> > > > 
> > > > Maybe we should just fix this anyway?
> > > > 
> > > > I still have the patch that adds PG_stable (and changes the
> > > > wait_for_page_stable() test to use this flag instead of PG_writeback) 
> > > > kicking
> > > > around in my tree.  I wrote a patch to jbd that changes 
> > > > journal_do_submit_data
> > > > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable 
> > > > bit in
> > > > the end_io processing.
> > > > 
> > > > It seems to get rid of the checksum-after-write errors, though I'm not
> > > > convinced it's correct.  But, I'll send both patches along.
> > >   I'll check the patches. Fixing PageWriteback logic for ext3 is not 
> > > easily
> > > doable due to lock ranking constraints - PageWriteback has to be set under
> > > PageLocked but that ranks above transaction start so kjournald cannot grab
> > > page locks so it cannot set PageWriteback... And changing the lock 
> > > ordering
> > > is a major surgery.
> > > 
> > > What could be doable is waiting for buffer locks from ext3's ->write_begin
> > > and ->page_mkwrite implementations in case stable writes are required. If
> > > your approach with a separate page bit doesn't work out (and I have some
> > > doubts about that as mm people are *really* thrifty with page bits).
> > > 
> > > > > > /*
> > > > > >  * In data=ordered mode, kjournald writes buffers without 
> > > > > > setting
> > > > > >  * PageWriteback bit thus generic code does not properly wait 
> > > > > > for
> > > > > >  * writeback of those buffers to finish.
> > > > > >  */
> > > > > > if (!read_only &&
> > > > > > test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> > > 

Re: [PATCH] Document how capability bits work

2012-12-07 Thread Rob Landley

On 12/07/2012 01:32:18 PM, Andy Lutomirski wrote:

> On Fri, Dec 7, 2012 at 11:21 AM, Serge Hallyn wrote:
> > Quoting Andy Lutomirski (l...@amacapital.net):
> >> Signed-off-by: Andy Lutomirski
> >> ---
> >>  Documentation/security/capabilities.txt | 161
> >>  1 file changed, 161 insertions(+)
> >>  create mode 100644 Documentation/security/capabilities.txt
> >
> > TBH, I think a pointer to the capabilities.7 man page would be better.
> > (plus, if you feel they are needed, updates to the man page)
>
> Updating capabilities.7 wouldn't be a bad idea, but IMO it certainly
> needs work.  For example, it says:
>
> ...
>
> I would be happy to revise this patch to reference capabilities.7.


The capabilities.7 man page is existing maintained documentation on how  
to use this from userspace, which seems to be the point of your  
document. Having include/linux/uapi/capability.h mention its existence  
might be good. Feeding fixes to the documentation we've already got  
would be good.


I read your document having largely ignored capabilities for years, and  
don't feel I have a better understanding of them after reading it. (I'm  
aware they exist, I'm aware they're used as a justification for  
extended attributes, I'm aware people think breaking a fireplace into a  
bunch of candleflames increases fire safety. I'm aware of  
http://forums.grsecurity.net/viewtopic.php?f=7=2522 and I _used_ to  
be aware of  
http://userweb.kernel.org/~morgan/sendmail-capabilities-war-story.html  
but kernel.org never bothered putting most of itself back together  
after the breakin last year and archive.org doesn't have a copy. I'm  
aware that a decade ago at Atlanta Linux Showcase in california Ted Tso  
was sad nobody was using them yet. But I haven't hugely been tracking  
changes over the last 5 years in how they work. It looks like figuring  
out who has what involves working through exercises in set theory that  
cannot be explained using a 127 bit ascii set. Personally, I prefer  
"more dangerous" security setups that don't require I pull out scratch  
paper to reason about the state of the system, so perhaps I'm biased  
here.)


Rob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] tty: vt/Makefile: set the variables to static

2012-12-07 Thread Cong Ding
tty: vt/Makefile: set the variables to static

In the file drivers/tty/vt/defkeymap.c generated by command
loadkeys --mktable defkeymap.map > defkeymap.c
the 6 variables: shift_map, altgr_map, ctrl_map, shift_ctrl_map, alt_map,
and ctrl_alt_map should be static because they are used only in this file.
There is no reason to remove the static by sed command.

Signed-off-by: Cong Ding 
---
 drivers/tty/vt/Makefile |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/vt/Makefile b/drivers/tty/vt/Makefile
index 14a51c9..17ae94c 100644
--- a/drivers/tty/vt/Makefile
+++ b/drivers/tty/vt/Makefile
@@ -27,8 +27,6 @@ $(obj)/defkeymap.o:  $(obj)/defkeymap.c
 ifdef GENERATE_KEYMAP
 
 $(obj)/defkeymap.c: $(obj)/%.c: $(src)/%.map
-   loadkeys --mktable $< > $@.tmp
-   sed -e 's/^static *//' $@.tmp > $@
-   rm $@.tmp
+   loadkeys --mktable $< > $@
 
 endif
-- 
1.7.4.5



Re: [PATCH v2 -tip 3/4] tracing: make a snapshot feature available from userspace

2012-12-07 Thread Steven Rostedt
On Fri, 2012-12-07 at 11:07 +0900, Hiraku Toyooka wrote:
> Hi, Steven,
> 
> (2012/11/30 23:17), Steven Rostedt wrote:
> [snip]
>  >
>  > Actually, I would have:
>  >
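>  >   status\input  |     0      |     1      |    else    |
>  >  ---------------+------------+------------+------------+
>  >  not allocated  |(do nothing)| alloc+swap |   EINVAL   |
>  >  ---------------+------------+------------+------------+
>  >    allocated    |    free    |    swap    |   clear    |
>  >  ---------------+------------+------------+------------+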
>  >
>  > Perhaps we don't need to do the clear on swap, just let the trace
>  > continue where it left off? But in case we should swap...
>  >
> 
> I think we don't need the clear on swap too.
> I'll update my patches like this table.
> 
>  > There's a fast way to clear the tracer. Look at what the wakeup tracer
>  > does. We can make that generic. If you want, I can write that code up
>  > too. Hmm, maybe I'll do that, as it will speed things up for
>  > everyone :-)
>  >
> 
> (I looked over the wakeup tracer, but I couldn't find that code...)

Heh, sorry, you needed to look at the "update_max_tr()" in
kernel/trace/trace.c. Where we update the time_start value. Then the
output skips all timestamps before that start. This is much more
efficient than a 'reset', as we don't need to sync or anything. Just
record the timestamp of where we want to consider the buffer started,
and ignore any event before that.
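
A minimal userspace sketch of that idea, for illustration only. The struct
and function names below are hypothetical and are not the kernel's ring
buffer API; the point is just that "clearing" becomes an O(1) update of a
start timestamp, and the reader skips anything older.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event record; not the kernel's ring buffer entry. */
struct event {
	uint64_t ts;	/* timestamp of the event */
	int data;
};

/* Hypothetical buffer: stored events plus a logical start time. */
struct trace_buf {
	const struct event *events;
	size_t nr_events;
	uint64_t time_start;	/* events older than this are ignored */
};

/* "Clear" the buffer in O(1): only record where it now logically starts. */
void trace_buf_mark_start(struct trace_buf *buf, uint64_t now)
{
	buf->time_start = now;
}

/* Copy visible events to out[], skipping those before time_start. */
size_t trace_buf_read(const struct trace_buf *buf,
		      struct event *out, size_t max)
{
	size_t i, n = 0;

	for (i = 0; i < buf->nr_events && n < max; i++) {
		if (buf->events[i].ts < buf->time_start)
			continue;	/* logically cleared, data untouched */
		out[n++] = buf->events[i];
	}
	return n;
}
```

No synchronization with writers is needed for the "clear" itself in this
sketch, which is the efficiency argument made above.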


> I think that seq_read() calls s_stop() even if s_start() failed.
> 
> seq_read()@fs/seq_file.c:
> 
>  p = m->op->start(m, &pos);
>  while (1) {
>  err = PTR_ERR(p);
>  if (!p || IS_ERR(p))
>  break;
>  ...
>  }
>  m->op->stop(m, p);
> 
> So, I think we need the check in s_stop(), don't we?

Crap, you're right. Hmm, why was I thinking that it didn't. I better go
and review some of my recent code to make sure that I didn't have that
wrong assumption.
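
For reference, the contract discussed above can be sketched in plain C.
This is a simplified model, not fs/seq_file.c: the names are hypothetical,
a failed start is modelled as NULL only (no ERR_PTR), and the "lock" is
just a flag. It shows why stop() has to check what start() actually did.

```c
#include <stddef.h>

/* Hypothetical iterator state; stands in for whatever start() sets up. */
struct iter_state {
	int locked;		/* 1 while the (pretend) lock is held */
	int unlocks;		/* how many times stop() really unlocked */
	int fail_start;		/* make start() fail, for illustration */
};

/* Hypothetical start(): returns NULL on failure without taking the lock. */
void *demo_start(struct iter_state *st)
{
	if (st->fail_start)
		return NULL;
	st->locked = 1;
	return st;
}

/* stop() runs even after a failed start(), so it checks before unlocking. */
void demo_stop(struct iter_state *st, void *p)
{
	(void)p;
	if (!st->locked)
		return;		/* start() failed: nothing to undo */
	st->locked = 0;
	st->unlocks++;
}

/* Same shape as the quoted loop: stop() is called unconditionally. */
int demo_read(struct iter_state *st)
{
	void *p = demo_start(st);
	int shown = 0;

	while (p) {
		shown++;	/* one record "shown" */
		p = NULL;	/* single-record sequence */
	}
	demo_stop(st, p);
	return shown;
}
```

Without the check in demo_stop(), the failing case would drop a lock that
was never taken.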

-- Steve




Re: [ 00/20] 3.4.23-stable review

2012-12-07 Thread Shuah Khan
On Fri, Dec 7, 2012 at 5:52 PM, Shuah Khan  wrote:
> On Fri, Dec 7, 2012 at 5:49 PM, Shuah Khan  wrote:
>> On Thu, Dec 6, 2012 at 5:54 PM, Greg Kroah-Hartman
>>  wrote:
>>> This is the start of the stable review cycle for the 3.4.23 release.
>>> There are 20 patches in this series, all will be posted as a response
>>> to this one.  If anyone has any issues with these being applied, please
>>> let me know.
>>>
>>> Responses should be made by Sun Dec  9 00:50:03 UTC 2012.
>>> Anything received after that time might be too late.
>>>
>>> The whole patch series can be found in one patch at:
>>> kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.4.23-rc1.gz
>>> and the diffstat can be found below.
>>>
>>> thanks,
>>>
>>> greg k-h
>>
>> Patches applied cleanly and compiled and booted on the following systems:
>>
>> HP EliteBook 6930p Intel(R) Core(TM)2 Duo CPU T9400  @ 2.53GHz
>> HP ProBook 6475b AMD A10-4600M APU with Radeon(tm) HD Graphics
>>
>> Cross-compile tests
>> alpha: defconfig passed
>> arm: defconfig passed
>> c6x: defconfig passed
>> mips: defconfig passed
>> mipsel: defconfig passed
>> powerpc: wii_defconfig passed
>> sh: defconfig passed
>> sparc: defconfig passed
>> tile: tilegx_defconfig passed
>
> Correction.
> tile: tilegx_defconfig failed:
>
> tile tilegx_defconfig
>
> Just a snippet, I can send a full log.
>
>  LD  init/built-in.o
>   HOSTCC  usr/gen_init_cpio
>   /home/shuah/lkml/linux_stable_testing_3.0.56/scripts/gen_initramfs_list.sh:
> Cannot open 'usr/contents.txt'
> make[1]: *** [usr/initramfs_data.cpio] Error 1
> make: *** [usr] Error 2
>
> Didn't get a chance to investigate.

Sorry, the tile failure is for 3.0.56.

-- Shuah


Re: [ 00/27] 3.6.10-stable review

2012-12-07 Thread Shuah Khan
On Fri, Dec 7, 2012 at 5:46 PM, Shuah Khan  wrote:
> On Thu, Dec 6, 2012 at 5:58 PM, Greg Kroah-Hartman
>  wrote:
>> This is the start of the stable review cycle for the 3.6.10 release.
>> There are 27 patches in this series, all will be posted as a response
>> to this one.  If anyone has any issues with these being applied, please
>> let me know.
>>
>> Responses should be made by Sun Dec  9 00:57:22 UTC 2012.
>> Anything received after that time might be too late.
>>
>> The whole patch series can be found in one patch at:
>> kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.6.10-rc1.gz
>> and the diffstat can be found below.
>>
>> thanks,
>
> Patch applied cleanly and compiled and booted on the following systems:
>
> HP EliteBook 6930p Intel(R) Core(TM)2 Duo CPU T9400  @ 2.53GHz
> HP ProBook 6475b AMD A10-4600M APU with Radeon(tm) HD Graphics
>
> Cross-compile tests
> alpha: defconfig passed
> arm: defconfig passed
> c6x: not applicable
> mips: defconfig passed
> mipsel: defconfig passed
> powerpc: wii_defconfig failed on 3.0.56 - fixed it and sending a patch.
> sh: defconfig passed on all
> sparc: defconfig passed on all
> tile: tilegx_defconfig passed on all

Correction: c6x passed.

-- Shuah


Re: Read O_DIRECT regression in 3.7-rc8 (bisected)

2012-12-07 Thread Linus Torvalds


On Fri, 7 Dec 2012, Linus Torvalds wrote:
> 
> This (TOTALLY UNTESTED) patch adds the same iov_shorten() logic that 
> the write side has. It does it differently (in fs/block_dev.c rather than 
> in mm/filemap.c), but I actually suspect this is a nicer way to do it, and 
> maybe we should do the write side truncation this way too.

Ok, rebooted and tested with your test-case, and it seems to work fine. 
But I presume that you actually found the problem some other way, so you'd 
want to verify that it fixes whatever original bigger issue you hit.

Thanks,
Linus


Re: [ 00/11] 3.0.56-stable review

2012-12-07 Thread Shuah Khan
On Thu, Dec 6, 2012 at 5:56 PM, Greg Kroah-Hartman
 wrote:
> This is the start of the stable review cycle for the 3.0.56 release.
> There are 11 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Sun Dec  9 00:55:07 UTC 2012.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.0.56-rc1.gz
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h

Patches applied, compiled, and booted on the following systems:

HP EliteBook 6930p Intel(R) Core(TM)2 Duo CPU T9400  @ 2.53GHz
HP ProBook 6475b AMD A10-4600M APU with Radeon(tm) HD Graphics

Cross-compile tests
alpha: defconfig passed
arm: defconfig passed
c6x: not applicable to 3.0.56
mips: defconfig passed
mipsel: defconfig passed
powerpc: wii_defconfig failed on 3.0.56 - sending a patch
sh: defconfig passed on all
sparc: defconfig passed on all
tile: tilegx_defconfig failed:

 LD  init/built-in.o
  HOSTCC  usr/gen_init_cpio
  /home/shuah/lkml/linux_stable_testing_3.0.56/scripts/gen_initramfs_list.sh:
Cannot open 'usr/contents.txt'
make[1]: *** [usr/initramfs_data.cpio] Error 1
make: *** [usr] Error 2

-- Shuah


mmotm 2012-12-07-16-53 uploaded

2012-12-07 Thread akpm
The mm-of-the-moment snapshot 2012-12-07-16-53 has been uploaded to

   http://www.ozlabs.org/~akpm/mmotm/

mmotm-readme.txt says

README for mm-of-the-moment:

http://www.ozlabs.org/~akpm/mmotm/

This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
more than once a week.

You will need quilt to apply these patches to the latest Linus release (3.x
or 3.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
http://ozlabs.org/~akpm/mmotm/series

The file broken-out.tar.gz contains two datestamp files: .DATE and
.DATE-yyyy-mm-dd-hh-mm-ss.  Both contain the string yyyy-mm-dd-hh-mm-ss,
followed by the base kernel version against which this patch series is to
be applied.

This tree is partially included in linux-next.  To see which patches are
included in linux-next, consult the `series' file.  Only the patches
within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
linux-next.

A git tree which contains the memory management portion of this tree is
maintained at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
by Michal Hocko.  It contains the patches which are between the
"#NEXT_PATCHES_START mm" and "#NEXT_PATCHES_END" markers, from the series
file, http://www.ozlabs.org/~akpm/mmotm/series.


A full copy of the full kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release.  Individual mmotm releases are tagged.  The master branch always
points to the latest release, so it's constantly rebasing.

http://git.cmpxchg.org/?p=linux-mmotm.git;a=summary

To develop on top of mmotm git:

  $ git remote add mmotm git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
  $ git remote update mmotm
  $ git checkout -b topic mmotm/master
  
  $ git send-email mmotm/master.. [...]

To rebase a branch with older patches to a new mmotm release:

  $ git remote update mmotm
  $ git rebase --onto mmotm/master  topic




The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
contains daily snapshots of the -mm tree.  It is updated more frequently
than mmotm, and is untested.

A git copy of this tree is available at

http://git.cmpxchg.org/?p=linux-mmots.git;a=summary

and use of this tree is similar to
http://git.cmpxchg.org/?p=linux-mmotm.git, described above.


This mmotm tree contains the following patches against 3.7-rc8:
(patches marked "*" will be included in linux-next)

  origin.patch
  linux-next.patch
  i-need-old-gcc.patch
  arch-alpha-kernel-systblss-remove-debug-check.patch
* thp-fix-update_mmu_cache_pmd-calls.patch
* cris-fix-i-o-macros.patch
* vfs-d_obtain_alias-needs-to-use-as-default-name.patch
* fs-block_devc-page-cache-wrongly-left-invalidated-after-revalidate_disk.patch
* arch-x86-platform-iris-irisc-register-a-platform-device-and-a-platform-driver.patch
* x86-numa-dont-check-if-node-is-numa_no_node.patch
* arch-x86-tools-insn_sanityc-identify-source-of-messages.patch
* uv-fix-incorrect-tlb-flush-all-issue.patch
* olpc-fix-olpc-xo1-scic-build-errors.patch
* x86-convert-update_mmu_cache-and-update_mmu_cache_pmd-to-functions.patch
* x86-fix-the-argument-passed-to-sync_global_pgds.patch
* x86-fix-a-compile-error-a-section-type-conflict.patch
* x86-make-mem=-option-to-work-for-efi-platform.patch
* audit-create-explicit-audit_seccomp-event-type.patch
* audit-catch-possible-null-audit-buffers.patch
* ceph-fix-dentry-reference-leak-in-ceph_encode_fh.patch
* cris-use-int-for-ssize_t-to-match-size_t.patch
* pcmcia-move-unbind-rebind-into-dev_pm_opscomplete.patch
* drivers-video-add-support-for-the-solomon-ssd1307-oled-controller.patch
* drivers-video-console-softcursorc-remove-redundant-null-check-before-kfree.patch
* fb-rework-locking-to-fix-lock-ordering-on-takeover.patch
* fb-rework-locking-to-fix-lock-ordering-on-takeover-fix.patch
* cyber2000fb-avoid-palette-corruption-at-higher-clocks.patch
* irq-tsk-comm-is-an-array.patch
* timeconstpl-remove-deprecated-defined-array.patch
* time-dont-inline-export_symbol-functions.patch
* coccinelle-add-api-d_find_aliascocci.patch
* h8300-select-generic-atomic64_t-support.patch
* mm-mempolicy-introduce-spinlock-to-read-shared-policy-tree.patch
* drivers-message-fusion-mptscsihc-missing-break.patch
* block-restore-proc-partitions-to-not-display-non-partitionable-removable-devices.patch
* block-remove-deadlock-in-disk_clear_events.patch
* block-remove-deadlock-in-disk_clear_events-fix.patch
* block-prevent-race-cleanup.patch
* block-prevent-race-cleanup-fix.patch
* vfs-increment-iversion-when-a-file-is-truncated.patch
* fs-change-return-values-from-eacces-to-eperm.patch
* fs-block_devc-need-not-to-check-inode-i_bdev-in-bd_forget.patch
* watchdog-trigger-all-cpu-backtrace-when-locked-up-and-going-to-panic.patch
* mm-slab-remove-duplicate-check.patch
  mm.patch
* writeback-remove-nr_pages_dirtied-arg-from-balance_dirty_pages_ratelimited_nr.patch
* mm-show-migration-types-in-show_mem.patch
* 

Re: [ 00/20] 3.4.23-stable review

2012-12-07 Thread Shuah Khan
On Fri, Dec 7, 2012 at 5:49 PM, Shuah Khan  wrote:
> On Thu, Dec 6, 2012 at 5:54 PM, Greg Kroah-Hartman
>  wrote:
>> This is the start of the stable review cycle for the 3.4.23 release.
>> There are 20 patches in this series, all will be posted as a response
>> to this one.  If anyone has any issues with these being applied, please
>> let me know.
>>
>> Responses should be made by Sun Dec  9 00:50:03 UTC 2012.
>> Anything received after that time might be too late.
>>
>> The whole patch series can be found in one patch at:
>> kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.4.23-rc1.gz
>> and the diffstat can be found below.
>>
>> thanks,
>>
>> greg k-h
>
> Patches applied cleanly and compiled and booted on the following systems:
>
> HP EliteBook 6930p Intel(R) Core(TM)2 Duo CPU T9400  @ 2.53GHz
> HP ProBook 6475b AMD A10-4600M APU with Radeon(tm) HD Graphics
>
> Cross-compile tests
> alpha: defconfig passed
> arm: defconfig passed
> c6x: defconfig passed
> mips: defconfig passed
> mipsel: defconfig passed
> powerpc: wii_defconfig passed
> sh: defconfig passed
> sparc: defconfig passed
> tile: tilegx_defconfig passed

Correction.
tile: tilegx_defconfig failed:

tile tilegx_defconfig

Just a snippet, I can send a full log.

 LD  init/built-in.o
  HOSTCC  usr/gen_init_cpio
  /home/shuah/lkml/linux_stable_testing_3.0.56/scripts/gen_initramfs_list.sh:
Cannot open 'usr/contents.txt'
make[1]: *** [usr/initramfs_data.cpio] Error 1
make: *** [usr] Error 2

Didn't get a chance to investigate.

-- Shuah


Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-12-07 Thread Marcelo Tosatti
On Thu, Dec 06, 2012 at 12:29:02PM +0530, Raghavendra K T wrote:
> On 12/04/2012 01:26 AM, Marcelo Tosatti wrote:
> >On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:
> >>On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:
> >>>
> >>>Don't understand the reasoning behind why 3 is a good choice.
> >>
> >>Here is where I came from. (explaining from scratch for
> >>completeness, forgive me :))
> >>In moderate overcommits, we can falsely exit from ple handler even when
> >>we have preempted task of same VM waiting on other cpus. To reduce this
> >>problem, we try few times before exiting.
> >>The problem boils down to:
> >>what is the probability that we exit ple handler even when we have more
> >>than 1 task in other cpus. Theoretical worst case should be around 1.5x
> >>overcommit (As also pointed by Andrew Theurer). [But practical
> >>worstcase may be around 2x,3x overcommits as indicated by the results
> >>for the patch series]
> >>
> >>So if p is the probability of finding rq length one on a particular cpu,
> >>and if we do n tries, then probability of exiting ple handler is:
> >>
> >>  p^(n+1) [ because we would have come across one source with rq length
> >>1 and n target cpu rqs  with length 1 ]
> >>
> >>so
> >>num tries:  probability of aborting ple handler (1.5x overcommit)
> >>  1         1/4
> >>  2         1/8
> >>  3         1/16
> >>
> >>We can reduce this probability with more tries, but the problem is
> >>the overhead.
> >>Also, If we have tried three times that means we would have iterated
> >>over 3 good eligible vcpus along with many non-eligible candidates. In
> >>worst case if we iterate all the vcpus, we reduce 1x performance and
> >>overcommit performance get hit. [ as in results ].
> >>
> >>I have tried num_tries = 1,2,3 and n already ( not 4 yet). So I
> >>concluded 3 is enough.
> >>
> >>In fact I have also run kernbench and hackbench which are giving 5-20%
> >>improvement.
> >>
> >>[ As a side note , I also thought how about having num_tries = f(n) =
> >>ceil ( log(num_online_cpus)/2 ) But I thought calculation is too much
> >>overhead and also there is no point in probably making it dependent on
> >>online cpus ]
> >>
> >>Please let me know if you are happy with this rationale/ or correct me
> >>if you foresee some problem. (In fact Avi, Rik's concern about false
> >>exiting made me arrive at 'try' logic which I did not have earlier).
> >>
> >>I am currently trying out the result for 1.5x overcommit will post the
> >>result.
> >
> >Raghavendra
> >
> >Makes sense to me. Thanks.
> >
> 
> Marcelo,
> Do you think this can be considered for next merge window? or you are
> expecting anything else on this patchset.

Nope, not expecting anything else. About merge window, depends on
upstream.
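
For the record, the estimate in the quoted rationale can be written out as
below. This is only a restatement under two assumptions that the thread
does not spell out: the run-queue checks are treated as independent, and
p = 1/2 is taken because it is the value that reproduces the quoted table
for the 1.5x overcommit case.

```latex
P_{\text{abort}}(n) = p^{\,n+1}, \qquad
p = \tfrac{1}{2}: \quad
P_{\text{abort}}(1) = \tfrac{1}{4}, \;
P_{\text{abort}}(2) = \tfrac{1}{8}, \;
P_{\text{abort}}(3) = \tfrac{1}{16}.
```

Each extra try halves the chance of a false exit under these assumptions,
so the gain from a fourth try is small compared with the added scan
overhead described above.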



Re: zram: fix invalid memory references during disk write

2012-12-07 Thread Nitin Gupta
On Thu, Nov 29, 2012 at 10:45 PM, Nitin Gupta  wrote:
> Fixes a bug introduced by commit c8f2f0db1 ("zram: Fix handling
> of incompressible pages") which caused invalid memory references
> during disk write. Invalid references could occur in two cases:
>  - Incoming data expands on compression: In this case, reference was
> made to kunmap()'ed bio page.
>  - Partial (non PAGE_SIZE) write with incompressible data: In this
> case, reference was made to a kfree()'ed buffer.
>
> Fixes bug 50081:
> https://bugzilla.kernel.org/show_bug.cgi?id=50081
>
> Upstream commit ID: c8f2f0d: zram: Fix handling of incompressible pages
> Apply to versions: 3.6.5, 3.6.6, 3.6.7, 3.6.8
>

Greg: can you please apply these patches to above stable versions
and also 3.6.9?

Thanks,
Nitin


Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Dave Chinner
On Fri, Dec 07, 2012 at 03:25:53PM -0800, Howard Chu wrote:
> Ric Wheeler wrote:
> >On 12/07/2012 04:14 PM, Theodore Ts'o wrote:
> >>On Fri, Dec 07, 2012 at 02:30:19PM -0500, Steven Rostedt wrote:
> >>>How is this similar? By adding this bit, we removed incentive from a
> >>>group of developers that have the means to fix the real issue at hand
> >>>(the performance problem with ext4). Thus, it means that they have a work
> >>>around that's good enough for them, but the rest of us suffer.
> >>That assumes that there **is** a way to claw back the performance
> >>loss, and Chris Mason has demonstrated the performance hit exists with
> >>xfs as well (950 MB/s vs. 400 MB/s; that's more than a factor of two).
> >>Sometimes, you have to make the engineering tradeoffs.  That's why
> >>we're engineers, for goodness sakes.  Sometimes, it's just not
> >>possible to square the circle.
> >>
> >>I don't believe that the technique of forcing people who need that
> >>performance to suffer in order to induce them to try to engineer a
> >>solution which may or may not exist is really the best or fairest way
> >>to go about things.
> >>
> >>- Ted
> >
> >This is not a generally useful feature and won't ship in a way that helps 
> >most
> >users with this issue.
> 
> >Let's fix the problem properly.
> >
> >In the meantime, there are several obvious ways to avoid this performance hit
> >without changing the kernel (fully allocate and write the data, certainly
> >reasonable for even reasonable sized files).
> 
> I have to agree that, if this is going to be an ext4-specific
> feature, then it can just be implemented via an ext4-specific ioctl
> and be done with it. But I'm not convinced this should be an
> ext4-specific feature.
> 
> As for "fix the problem properly" - you're fixing the wrong problem.
> This type of feature is important to me, not just because of the
> performance issue. As has already been pointed out, the performance
> difference may even be negligible.
> 
> But on SSDs, the issue is write endurance. The whole point of
> preallocating a file is to avoid doing incremental metadata updates.
> Particularly when each of those 1-bit status updates costs entire
> blocks, and gratuitously shortens the life of the media. The fact
> that avoiding the unnecessary wear and tear may also yield a
> performance boost is just icing on the cake. (And if the perf boost
> is over a factor of 2:1 that's some pretty damn good icing.)

That's a filesystem implementation specific problem, not a generic
fallocate() or unwritten extent conversion problem.

Besides, ext4 doesn't write back every metadata modification that is
made - they are aggregated in memory and only written when the
journal is full or the metadata ages out. Hence unwritten extent
conversion has very little impact on the amount of writes that are
done to the flash because it is vastly dominated by the data writes.

Similarly, in XFS you might see a few thousand or tens of thousands
of metadata blocks get written once every 30s under such a random
write workload, but each metadata block might have gone through a
million changes in memory since the last time it was written.
Indeed, in that 30s, there would have been a few million random data
writes so the metadata writes are well and truly lost in the
noise...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC v2] Support volatile range for anon vma

2012-12-07 Thread John Stultz

On 12/04/2012 08:18 PM, Minchan Kim wrote:

> On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> > I don't think the problem is when vmas being marked VM_VOLATILE are
> > being merged, it's that when we mark the vma as *non-volatile*, and
> > remove the VM_VOLATILE flag we merge the non-volatile vmas with
> > neighboring vmas. So preserving the purged flag during that merge is
> > important. Again, the example I used to trigger this was an
> > alternating pattern of volatile and non volatile vmas, then marking
> > the entire range non-volatile (though sometimes in two overlapping
> > passes).
>
> Understood. Thanks.
> Does the patch below solve your problems? It's simpler than yours.


Yea, this is nicer than my fix.
Although I still need the purged handling in the vma merge code for me 
to see the behavior I expect in my tests.
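
To make the requirement concrete, here is a small userspace sketch of the
merge rule. It is an illustration only: struct range and range_merge()
are hypothetical stand-ins, not the vma structure or the merge code from
either patch, and no locking is modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a vma; only the fields the sketch needs. */
struct range {
	unsigned long start, end;	/* covers [start, end) */
	bool is_volatile;
	bool purged;			/* some page in the range was discarded */
};

/*
 * Merge b into a if they are adjacent and have the same volatility.
 * The merged range counts as purged if either half was, so the
 * information survives the merge. Returns true if a merge happened.
 */
bool range_merge(struct range *a, const struct range *b)
{
	if (a->end != b->start || a->is_volatile != b->is_volatile)
		return false;
	a->end = b->end;
	a->purged = a->purged || b->purged;
	return true;
}
```

If the flag were simply taken from the left-hand range, a purge recorded
only in the right-hand neighbour would be lost once the ranges are marked
non-volatile and merged.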


I've integrated your patch and repushed my queue here:
http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol

git://git.linaro.org/people/jstultz/android-dev.git dev/minchan-anonvol


> Anyway, neither yours nor mine is the right fix.
> As I mentioned, the locking scheme is broken.
> We need anon_vma_lock to handle purged and we should consider the fork
> case, too.

Hrm. I'm sure you're right, as I've not yet fully grasped all the 
locking rules here.  Could you clarify how it is broken? And why is the 
anon_vma_lock needed to manage the purged state that is part of the vma 
itself?


thanks
-john



Re: [ 00/20] 3.4.23-stable review

2012-12-07 Thread Shuah Khan
On Thu, Dec 6, 2012 at 5:54 PM, Greg Kroah-Hartman
 wrote:
> This is the start of the stable review cycle for the 3.4.23 release.
> There are 20 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Sun Dec  9 00:50:03 UTC 2012.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.4.23-rc1.gz
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h

Patches applied cleanly and compiled and booted on the following systems:

HP EliteBook 6930p Intel(R) Core(TM)2 Duo CPU T9400  @ 2.53GHz
HP ProBook 6475b AMD A10-4600M APU with Radeon(tm) HD Graphics

Cross-compile tests
alpha: defconfig passed
arm: defconfig passed
c6x: defconfig passed
mips: defconfig passed
mipsel: defconfig passed
powerpc: wii_defconfig passed
sh: defconfig passed
sparc: defconfig passed
tile: tilegx_defconfig passed

-- Shuah


Re: Read O_DIRECT regression in 3.7-rc8 (bisected)

2012-12-07 Thread Linus Torvalds


On Sat, 8 Dec 2012, Milan Broz wrote:
> 
> seems this commit in 3.7-rc8 caused regression for O_DIRECT
> read near the end of the device.

Oh, good find, and thanks for the test-case. I had looked at the O_DIRECT 
side, and convinced myself that it already truncates to i_size_read(), but 
it looks like that actually only happens for the *write* side for some 
reason.

So apparently the read side doesn't have anything like that.

This (TOTALLY UNTESTED) patch adds the same iov_shorten() logic that 
the write side has. It does it differently (in fs/block_dev.c rather than 
in mm/filemap.c), but I actually suspect this is a nicer way to do it, and 
maybe we should do the write side truncation this way too.

But as mentioned, it's untested.. Does it work for you? I'll reboot and 
test myself, but I'm on my laptop right now, so it's easier to send it out 
before the compile has even finished..
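
For illustration only, here is a user-space sketch of the truncation idea
(it is not the patch itself). iov_shorten_sketch() mirrors the shape of the
kernel's iov_shorten(); clamp_read_len() is a made-up helper standing in for
the i_size_read() check, and plain byte counts stand in for loff_t. The
INT_MAX guard from the patch is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Trim iov[] so that it describes at most `to` bytes and return the
 * number of segments still in use. User-space illustration only. */
unsigned long iov_shorten_sketch(struct iovec *iov, unsigned long nr_segs,
				 size_t to)
{
	unsigned long seg = 0;
	size_t len = 0;

	assert(iov != NULL || nr_segs == 0);
	while (seg < nr_segs) {
		seg++;
		if (len + iov->iov_len >= to) {
			/* This segment crosses the limit: cut it short. */
			iov->iov_len = to - len;
			break;
		}
		len += iov->iov_len;
		iov++;
	}
	return seg;
}

/* Hypothetical helper: how many bytes a read at `pos` may cover on a
 * device of `size` bytes. Returns 0 at or past the end of the device. */
size_t clamp_read_len(long long pos, long long size, size_t want)
{
	long long left;

	if (pos >= size)
		return 0;
	left = size - pos;
	return (unsigned long long)left < want ? (size_t)left : want;
}
```

With two 4096-byte segments and 5000 bytes left on the device, the second
segment is cut to 904 bytes and both segments stay in use.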

Linus

---

 fs/block_dev.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index a1e09b4fe1ba..ab3a456f6650 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1544,6 +1544,22 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 }
 EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
+static ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			 unsigned long nr_segs, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *bd_inode = file->f_mapping->host;
+	loff_t size = i_size_read(bd_inode);
+
+	if (pos >= size)
+		return 0;
+
+	size -= pos;
+	if (size < INT_MAX)
+		nr_segs = iov_shorten((struct iovec *)iov, nr_segs, size);
+	return generic_file_aio_read(iocb, iov, nr_segs, pos);
+}
+
 /*
  * Try to release a page associated with block device when the system
  * is under memory pressure.
@@ -1574,7 +1590,7 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.aio_write	= blkdev_aio_write,
 	.mmap		= generic_file_mmap,
 	.fsync		= blkdev_fsync,


Re: [ 00/27] 3.6.10-stable review

2012-12-07 Thread Shuah Khan
On Thu, Dec 6, 2012 at 5:58 PM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 3.6.10 release.
> There are 27 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Sun Dec  9 00:57:22 UTC 2012.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.6.10-rc1.gz
> and the diffstat can be found below.
>
> thanks,

Patch applied cleanly and compiled and booted on the following systems:

HP EliteBook 6930p Intel(R) Core(TM)2 Duo CPU T9400  @ 2.53GHz
HP ProBook 6475b AMD A10-4600M APU with Radeon(tm) HD Graphics

Cross-compile tests
alpha: defconfig passed
arm: defconfig passed
c6x: not applicable
mips: defconfig passed
mipsel: defconfig passed
powerpc: wii_defconfig failed on 3.0.56 - fixed it and sending a patch.
sh: defconfig passed on all
sparc: defconfig passed on all
tile: tilegx_defconfig passed on all

-- Shuah


Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Dave Chinner
On Fri, Dec 07, 2012 at 05:02:32PM -0500, Ric Wheeler wrote:
> On 12/07/2012 04:57 PM, Theodore Ts'o wrote:
> >On Fri, Dec 07, 2012 at 04:42:06PM -0500, Ric Wheeler wrote:
> >>The other things that I think we should try would be to convert over
> >>larger chunks as we discussed on the list back in the summer (just
> >>because the user writes 4KB does not mean that we cannot flip over
> >>1MB and zero that).
> >Writing a megabyte is not free.  If you assume that your HDD has a
> >sustained write throughput of 100-125 MB/s, writing a megabyte will
> >take 8-10ms.  It might be a win if you amortize it over a large number
> >of writes, but it doesn't help your 99.9 percentile latency numbers.
> >(99.9 percentile latency numbers matters because eventually you'll
> >have a user request which hits multiple serial long latency
> >operations, and then the delay looks **really** user visible.)
> >
> > - Ted
> 
> Writing 4KB at a time to a disk costs XX units of time.
> 
> Writing to the same sector (especially for an HDD) costs XX units + a small
> amount.
> 
> I suggest that we try it out.
> 
> For SSD's, much better to use specific HW offload commands if
> possible like WRITE_SAME (zeroed) or UNMAP/TRIM to get that
> performance boost since no actual data is moved...

Yup, that could be done quite trivially in XFS. Just mark the
preallocated extents as "busy" rather than unwritten, mark the
transaction as synchronous and the transaction commit will issue a
discard on the preallocated ranges before returning to userspace.
The extra overhead to the preallocation command is unlikely to be
noticed, and unwritten extent conversion overhead just goes away...

No fallocate() API changes necessary, though I think it would be
better if the user application gave a hint that it preferred "writing
zeros" (i.e. FALLOC_FL_WRITE_ZEROS) to allocating unwritten extents
as there are workloads where one will always be clearly better than
the other...
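
As a quick sanity check of the 8-10ms figure quoted above, here is a small
sketch. It assumes 1 MB = 10^6 bytes and a purely sequential transfer with
no seek or rotational cost; transfer_time_us() is a made-up helper, not
anything from the kernel.

```c
#include <assert.h>

/* Microseconds needed to move `bytes` at `bytes_per_sec`, ignoring seek
 * and rotational latency. Hypothetical helper for illustration only. */
unsigned long long transfer_time_us(unsigned long long bytes,
				    unsigned long long bytes_per_sec)
{
	assert(bytes_per_sec > 0);
	return bytes * 1000000ULL / bytes_per_sec;
}
```

At 125 MB/s this gives 8000us for one megabyte, and at 100 MB/s it gives
10000us, which matches the quoted range.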

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/2] fs: logfs NULL pointer check added

2012-12-07 Thread Andrew Morton
On Tue, 27 Nov 2012 16:54:58 +0530 Abhijit Pawar wrote:

> This patch fixes  Bug 49921 - Missing NULL check of return value of 
> logfs_get_write_page() in function btree_write_block()
> 
> Signed-off-by: Abhijit Pawar 
> ---
>  fs/logfs/readwrite.c |1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
> index e1a3b6b..53596ce 100644
> --- a/fs/logfs/readwrite.c
> +++ b/fs/logfs/readwrite.c
> @@ -2202,6 +2202,7 @@ void btree_write_block(struct logfs_block *block)
>  
>   inode = logfs_safe_iget(block->sb, block->ino, &cookie);
>   page = logfs_get_write_page(inode, block->bix, block->level);
> + BUG_ON(!page);
>  
>   err = logfs_readpage_nolock(page);
>   BUG_ON(err);

We don't gain anything from this change.  If page==NULL then
logfs_readpage_nolock() will oops and will provide the same information
as BUG().


A better fix would be to teach logfs_get_write_page() to return a
proper ERR_PTR errno (not just a gee-i-goofed boolean, guys) then teach
btree_write_block() to process that error appropriately: clean up and
propagate it back.

Right now, a visit from the oom-killer will cause btree_write_block()
to kill the kernel.
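
A rough user-space sketch of the pattern suggested above, for readers who
have not met it. err_ptr(), ptr_err() and is_err() are stand-ins modelled on
the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(); get_write_page_sketch() and
write_block_sketch() are hypothetical and are not the logfs functions. The
simulate_oom parameter exists only to exercise the failure path.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top page of the address space. */
void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct page_sketch { int dirty; };

/* Returns a page or an encoded errno, never NULL. */
struct page_sketch *get_write_page_sketch(int simulate_oom)
{
	struct page_sketch *page;

	if (simulate_oom)
		return err_ptr(-ENOMEM);
	page = calloc(1, sizeof(*page));
	if (!page)
		return err_ptr(-ENOMEM);
	return page;
}

/* The caller cleans up and propagates the error instead of BUG_ON(). */
int write_block_sketch(int simulate_oom)
{
	struct page_sketch *page = get_write_page_sketch(simulate_oom);

	if (is_err(page))
		return (int)ptr_err(page);
	page->dirty = 1;
	free(page);
	return 0;
}
```

The point is that an allocation failure becomes an ordinary -ENOMEM return
that callers can handle, rather than an oops.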



Re: [RFC v2] Support volatile range for anon vma

2012-12-07 Thread John Stultz

On 12/04/2012 11:01 PM, Minchan Kim wrote:
> Hi John,
>
> On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> > I don't think the problem is when vmas being marked VM_VOLATILE are
> > being merged, it's that when we mark the vma as *non-volatile*, and
> > remove the VM_VOLATILE flag we merge the non-volatile vmas with
> > neighboring vmas. So preserving the purged flag during that merge is
> > important. Again, the example I used to trigger this was an
> > alternating pattern of volatile and non volatile vmas, then marking
> > the entire range non-volatile (though sometimes in two overlapping
> > passes).
>
> If I understand correctly, you mean the following:
>
> chunk1 = mmap(8M);
> chunk2 = chunk1 + 2M;
> chunk3 = chunk2 + 2M;
> chunk4 = chunk3 + 2M;
>
> madvise(chunk1, 2M, VOLATILE);
> madvise(chunk4, 2M, VOLATILE);
>
> /*
>  * V : volatile vma
>  * N : non volatile vma
>  * So now the vma layout is VNVN.
>  */
> And chunk4 is purged.
>
> int ret = madvise(chunk1, 8M, NOVOLATILE);
> ASSERT(ret == 1);
> /* And you expect VNVN->N ?*/
>
> Right?


Yes. That's exactly right.


> If so, why should the non-volatile semantics allow a call that crosses over
> non-volatile areas in a range? I would like to fail in such a case, because
> MADV_REMOVE likewise fails in the middle of the operation if it encounters
> VM_LOCKED.
>
> What do you think about it?

Right, so I think this issue is maybe a problematic part of the VMA 
based approach.  While marking an area as nonvolatile twice might not 
make a lot of sense, I think userland applications would not appreciate 
the constraint that madvise(VOLATILE/NONVOLATILE) calls be made in 
perfect pairs of identical sizes.


For instance, if a browser has rendered a web page, but the page is so
large that only a sliding window/view of that page is visible at one
time, it may want to mark the regions not currently in the view as
volatile.  So it would be nice (albeit naive) for that application that
when the view location changed, it would just mark the new region as
non-volatile, and any region not in the current view as volatile.  This
would be easier than trying to calculate the diff of the old view region
boundaries vs the new and modifying only the ranges that changed.
Granted, doing so might be more efficient, but I'm not sure we can
assume every similar case would be more efficient.


So in my mind, double-clearing a flag should be allowed (as well as 
double-setting), as well as allowing for setting/clearing overlapping 
regions.
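
To make the "calculate the diff" alternative concrete, here is a rough
user-space sketch of what an application would have to do if overlapping or
repeated calls were not allowed. struct range and range_subtract() are
hypothetical names; no volatile-range API is called, since none of this is
in mainline.

```c
#include <assert.h>
#include <stddef.h>

struct range { size_t start, end; };	/* half-open: [start, end) */

/* Store the parts of `a` not covered by `b` in out[0..1] and return how
 * many pieces were written (0, 1 or 2). */
int range_subtract(struct range a, struct range b, struct range out[2])
{
	int n = 0;

	assert(a.start <= a.end && b.start <= b.end);
	if (b.end <= a.start || b.start >= a.end) {
		/* No overlap: all of `a` survives (if it is non-empty). */
		if (a.start < a.end)
			out[n++] = a;
		return n;
	}
	if (a.start < b.start)
		out[n++] = (struct range){ a.start, b.start };
	if (b.end < a.end)
		out[n++] = (struct range){ b.end, a.end };
	return n;
}
```

With an old view of [0, 8) and a new view of [2, 10), old minus new gives
[0, 2) to mark volatile, and new minus old gives [8, 10) to mark
non-volatile. Allowing overlapping calls lets the application skip this
bookkeeping and simply mark the whole new view.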


Aside from whether the behavior should be allowed or not, the error mode of
madvise is problematic as well, since failures can happen midway
through the operation, leaving the vmas in the specified range
inconsistent. Since usually it's only advisory, such inconsistent states
aren't really problematic, and repeating the last action is probably fine.


The problem with NOVOLATILE's purged state, with vmas, is that if we
hit an error midway through, it's hard to figure out what the state of
the pages is for the range specified. Some of them could have been
purged and set to non-volatile, while some may not be purged, and still
left volatile. You can't just repeat the last action and get a sane
result (as we lose the purged flag state).


With my earlier fallocate implementations, I tried to avoid this by
making any memory allocations that might be required before making any
state changes, so there wasn't a chance for a partial failure from
-ENOMEM.  (It was also simpler because in my own range management code
there were only volatile ranges; non-volatility was simply the absence
of a volatile range. With vmas we have to manage both volatile and
nonvolatile vmas).  I'm not sure how this could be done with the vma
method other than by maybe reworking the merge/split logic, but I'm wary
of mucking with that too much as I know it's performance sensitive.
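
To illustrate the "allocate first, then change state" ordering described
above, here is a small user-space sketch. struct vrange and
split_and_mark() are hypothetical names, not the actual fallocate or vma
code, and the fail_alloc parameter exists only to simulate -ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct vrange {
	size_t start, end;	/* half-open: [start, end) */
	int is_volatile;
	struct vrange *next;
};

/* Split `r` at `at` and set the volatile flag on the new tail half. */
int split_and_mark(struct vrange *r, size_t at, int is_volatile,
		   int fail_alloc)
{
	struct vrange *tail;

	assert(r->start < at && at < r->end);

	/* Step 1: acquire everything that can fail, touching no state. */
	tail = fail_alloc ? NULL : malloc(sizeof(*tail));
	if (!tail)
		return -ENOMEM;	/* `r` is exactly as it was */

	/* Step 2: commit. Nothing below this line can fail. */
	tail->start = at;
	tail->end = r->end;
	tail->is_volatile = is_volatile;
	tail->next = r->next;
	r->end = at;
	r->next = tail;
	return 0;
}
```

Because the only failure point comes before any mutation, a caller that
sees -ENOMEM can retry the same call and get a sane result.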


Your thoughts?  Am I just being too set in my way of thinking here?

thanks
-john



[ANNOUNCE] Git v1.8.1-rc1

2012-12-07 Thread Junio C Hamano
A release candidate Git v1.8.1-rc1 is now available for testing
at the usual places.

The release tarballs are found at:

http://code.google.com/p/git-core/downloads/list

and their SHA-1 checksums are:

4b451bb5b7125349c35cf15118e8f1893569e48f  git-1.8.1.rc1.tar.gz
7416b28a0917fef26ca06f22bc493ebea371267f  git-htmldocs-1.8.1.rc1.tar.gz
b5758adf5814d64ee8e0d26bdfb919be1c605071  git-manpages-1.8.1.rc1.tar.gz

Also the following public repositories all have a copy of the v1.8.1-rc1
tag and the master branch that the tag points at:

  url = git://repo.or.cz/alt-git.git
  url = https://code.google.com/p/git-core/
  url = git://git.sourceforge.jp/gitroot/git-core/git.git
  url = git://git-core.git.sourceforge.net/gitroot/git-core/git-core
  url = https://github.com/gitster/git

Git v1.8.1 Release Notes (draft)


Backward compatibility notes


In the next major release (not *this* one), we will change the
behavior of the "git push" command.

When "git push [$there]" does not say what to push, we have used the
traditional "matching" semantics so far (all your branches were sent
to the remote as long as there already are branches of the same name
over there).  We will use the "simple" semantics that pushes the
current branch to the branch with the same name, only when the current
branch is set to integrate with that remote branch.  There is a user
preference configuration variable "push.default" to change this, and
"git push" will warn about the upcoming change until you set this
variable in this release.

"git branch --set-upstream" is deprecated and may be removed in a
relatively distant future.  "git branch [-u|--set-upstream-to]" has
been introduced with a saner order of arguments to replace it.


Updates since v1.8.0


UI, Workflows & Features

 * Command-line completion scripts for tcsh and zsh have been added.

 * A new remote-helper interface for Mercurial has been added to
   contrib/remote-helpers.

 * We used to have a workaround for a bug in ancient "less" that
   causes it to exit without any output when the terminal is resized.
   The bug has been fixed in "less" version 406 (June 2007), and the
   workaround has been removed in this release.

 * Some documentation pages that used to ship only in the plain text
   format are now formatted in HTML as well.

 * "git-prompt" scriptlet (in contrib/completion) can be told to paint
   pieces of the hints in the prompt string in colors.

 * A new configuration variable "diff.context" can be used to
   give the default number of context lines in the patch output, to
   override the hardcoded default of 3 lines.

 * When "git checkout" checks out a branch, it tells the user how far
   behind (or ahead) the new branch is relative to the remote tracking
   branch it builds upon.  The message now also advises how to sync
   them up by pushing or pulling.  This can be disabled with the
   advice.statusHints configuration variable.

 * "git config --get" used to diagnose presence of multiple
   definitions of the same variable in the same configuration file as
   an error, but it now applies the "last one wins" rule used by the
   internal configuration logic.  Strictly speaking, this may be an
   API regression but it is expected that nobody will notice it in
   practice.

 * "git log -p -S<string>" now looks for the <string> after applying
   the textconv filter (if defined); earlier it inspected the contents
   of the blobs without filtering.

 * "git format-patch" learned the "--notes=<ref>" option to give
   notes for the commit after the three-dash lines in its output.

 * "git log --grep=<pcre>" learned to honor the "grep.patterntype"
   configuration set to "perl".

 * "git replace -d <object>" now interprets <object> as an extended
   SHA-1 (e.g. HEAD~4 is allowed), instead of only accepting full hex
   object name.

 * "git rm $submodule" used to punt on removing a submodule working
   tree to avoid losing the repository embedded in it.  Because
   recent git uses a mechanism to separate the submodule repository
   from the submodule working tree, "git rm" learned to detect this
   case and removes the submodule working tree when it is safe to do so.

 * "git send-email" used to prompt for the sender address, even when
   the committer identity is well specified (e.g. via user.name and
   user.email configuration variables).  The command no longer gives
   this prompt when not necessary.

 * "git send-email" did not allow non-address garbage strings to
   appear after addresses on Cc: lines in the patch files (and when
   told to pick them up to find more recipients), e.g.

 Cc: Stable Kernel  # for v3.2 and up

   The command now strips " # for v3.2 and up" part before adding the
   remainder of this line to the list of recipients.

 * "git submodule add" learned to add a new submodule at the same
   path as the path where an unrelated submodule was bound to in an
   existing revision via the "--name" option.

 * 

Re: [PATCH] mm: add node physical memory range to sysfs

2012-12-07 Thread Dave Hansen
On 12/07/2012 03:51 PM, Andrew Morton wrote:
>> > +static ssize_t node_read_memrange(struct device *dev,
>> > +struct device_attribute *attr, char *buf)
>> > +{
>> > +  int nid = dev->id;
>> > +  unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
>> > +  unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
> hm.  Is this correct for all of
> FLATMEM/SPARSEMEM/SPARSEMEM_VMEMMAP/DISCONTIGMEM/etc?

It's not _wrong_ per se, but it's not super precise, either.

The problem is, it's quite valid to have these node_start/spanned ranges
overlap between two or more nodes on some hardware.  So, if the desired
purpose is to map nodes to DIMMs, then this can only accomplish this on
_some_ hardware, not all.  It would be completely useless for that
purpose for some configurations.

Seems like the better way to do this would be to expose the DIMMs
themselves in some way, and then map _those_ back to a node.
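
A tiny sketch of why the start/spanned pair is ambiguous, assuming a
hypothetical layout where node 0's memory is split around node 1's.
struct node_span and spans_overlap() are made-up names and the PFN values
in the note below are invented.

```c
#include <assert.h>

struct node_span { unsigned long start_pfn, spanned_pages; };

/* Nonzero if the PFN ranges [start, start + spanned) of a and b share
 * at least one PFN. Empty spans never overlap anything. */
int spans_overlap(struct node_span a, struct node_span b)
{
	unsigned long a_end = a.start_pfn + a.spanned_pages;
	unsigned long b_end = b.start_pfn + b.spanned_pages;

	assert(a_end >= a.start_pfn && b_end >= b.start_pfn); /* no wrap */
	if (a.spanned_pages == 0 || b.spanned_pages == 0)
		return 0;
	return a.start_pfn < b_end && b.start_pfn < a_end;
}
```

If node 0 owns PFNs 0x0-0xfffff and 0x200000-0x2fffff, its span is
{0x0, 0x300000}, which fully contains node 1's {0x100000, 0x100000}. A PFN
inside node 1 then falls within both spans, so the span alone cannot say
which node (or DIMM) it belongs to.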



Re: [PATCH] HOWTO: fix double words typo

2012-12-07 Thread Rob Landley

On 12/07/2012 08:37:11 AM, Cristian Stoica wrote:

Signed-off-by: Cristian Stoica 


Acked-by: Rob Landley 


---
 Documentation/HOWTO |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Documentation/HOWTO b/Documentation/HOWTO
index 59c080f..a9f288f 100644
--- a/Documentation/HOWTO
+++ b/Documentation/HOWTO
@@ -462,7 +462,7 @@ Differences between the kernel community and corporate structures


 The kernel community works differently than most traditional corporate
 development environments.  Here are a list of things that you can try to
-do to try to avoid problems:
+do to avoid problems:
   Good things to say regarding your proposed changes:
 - "This solves multiple problems."
 - "This deletes 2000 lines of code."
--
1.7.8.6








Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Dave Chinner
On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 11:18:00AM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Fri, 7 Dec 2012, Ric Wheeler wrote:
> > > 
> > > > Review is part of the way we work as a community and we should figure out
> > > > how to fix our review process so that we can have meaningful results from
> > > > the review or we lose confidence in the process and it makes it much harder
> > > > to get reviewers to spend time reviewing when their reviews are ultimately
> > > > ignored.
> > 
> > Christ, I promised myself to not respond any more to this thread, but the 
> > insanity just continues, from people who damn well should know better.
> > 
> > The code wasn't merged. The review worked.
> > 
> > What you (and Dave, and Christoph) are trying to do is shut down a feature 
> > that somebody else decided they needed. That's not what code review is all 
> > about, and dammit, don't try to even claim it is.
> > 
> > So stop these dishonest and disingenuous arguments. They are full of crap.
> > 
> > No amount of "review" has any meaning what-so-ever on whether somebody 
> > else decides they need a feature or not. You can review all you want, but 
> > it's irrelevant - if some company decides they are going to ship or use a 
> > feature, it's out of your hands.
> > 
> > What got merged was a ONE-LINER to make sure that possible future 
> > development didn't unnecessarily make things any more confusing, with the 
> > knowledge that there was a user of the code you didn't like. 
> > 
> > Every single argument I've heard of from the "please revert" camp has been 
> > inane. And they've been *transparently* inane, to the point where I don't 
> > understand how you can make them with a straight face and not be ashamed.
> 
> I really agree with Dave's statement that we should ioctl for private
> features and system call for features other filesystems are likely to
> implement.  So we really shouldn't have private bits in fallocate in use
> in production systems.
> 
> That's not what happened though, and the right way forward from here is
> to give the bit to the feature, maybe with a generic name like
> FALLOCATE_WITHOUT_BEING_HORRIBLY_SLOW.  It should have been done
> differently, but it wasn't.  And it's a problem we all have, so it makes
> sense that we'll all want to address it somehow.

Well, we could have a discussion about that if Linus were
to revert the original change. Not so much the name (that's just
bikeshedding), but if there is a better way to expand the fallocate
interface to allow people to sanely work around the supposedly
unfixable ext4 unwritten extent performance problems.

But he's not going to, so it is pointless to even suggest such
things.

> On a single flash drive doing random 4K writes, xfs does 950MB/s into
> regular extents but only 400MB/s into preallocated extents.
> 
> http://masoncoding.com/presentation/perf-linuxcon12/fallocate.png

This is bordering on irrelevancy, but can you provide the workload
you were running to generate this graph?  Random 4k writes could be
anything, really.

In my experience, applications that actually do processing between
random write IOs don't see anywhere near the same degradation as
such micro-benchmarks tend to indicate can occur with unwritten
extents. Are you seeing this level of degradation in real-world applications?
If you give me a reason to fix it (and the hardware to test it on),
I'm pretty sure I can bring the overhead down to just a few percent
on fully featured SSDs like FusionIO devices...

[ slightly more on topic ]

FWIW, if this was your production workload and you are using XFS
then you could always use XFS_IOC_ALLOCSP to write zeros during
preallocation rather than using unwritten extents. i.e. trade off
setup-time overhead for higher run-time performance.

[ Have I mentioned before that XFS has several custom ioctls for
issuing different forms of preallocation? :) ]

I wouldn't recommend XFS_IOC_ALLOCSP as a user-friendly interface.
The concept, however, implemented by a new fallocate()
flag (say FALLOC_FL_WRITE_ZEROS) so that the filesystem knows that
the application considers unwritten extents undesirable is exactly
the sort of thing that we should be considering implementing.
Indeed, if the filesystem is on something with WRITE_SAME or
discards to zero, no data would need to be written, you wouldn't
have any unwritten extent overhead, and no stale data exposure.
And it's not a filesystem specific interface or optimisation...
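
For comparison, the same trade-off (pay at setup time, avoid unwritten
extents at run time) can be approximated from user space today by simply
writing the zeros. This is only a sketch under that assumption: it is
neither XFS_IOC_ALLOCSP nor the hypothetical FALLOC_FL_WRITE_ZEROS flag,
it always moves real data, and prealloc_with_zeros() is a made-up name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* pwrite() under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define ZERO_CHUNK 65536

/* Back [0, len) of fd with explicitly written zeros, so that later
 * writes land in ordinary written blocks. Returns 0 or -errno. */
int prealloc_with_zeros(int fd, off_t len)
{
	char *zeros;
	off_t off = 0;
	int ret = 0;

	assert(len >= 0);
	zeros = calloc(1, ZERO_CHUNK);
	if (!zeros)
		return -ENOMEM;

	while (off < len) {
		size_t chunk = ZERO_CHUNK;
		ssize_t n;

		if ((off_t)chunk > len - off)
			chunk = (size_t)(len - off);
		n = pwrite(fd, zeros, chunk, off);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			ret = -errno;
			break;
		}
		if (n == 0) {			/* no progress: give up */
			ret = -EIO;
			break;
		}
		off += n;
	}
	/* Make sure the zeros reach stable storage before reporting success. */
	if (ret == 0 && fsync(fd) < 0)
		ret = -errno;
	free(zeros);
	return ret;
}
```

A filesystem-level flag would still be preferable where the device can zero
ranges without data transfer, which this loop cannot take advantage of.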

[ back to original topic ]

This is exactly why Ted should have posted the patch for review. He
may not have got the flag through, but the discussion might just end
up in a place that is *better for everyone*. By subverting the
review process, he's deprived the community of that opportunity. Now
we're stuck with a shitty change that we can't improve on and will
have to explain repeatedly over the next 15 years why it's not
implemented in any kernel 

Re: [PATCH v2 1/1] uio.c: solve memory leak

2012-12-07 Thread Cong Ding
On Sat, Dec 08, 2012 at 01:10:40AM +0100, Hans J. Koch wrote:
> On Fri, Dec 07, 2012 at 12:02:11AM +0100, Cong Ding wrote:
> > ping Hans, did you have any comment on this?
> 
> Sounds right what you say. Is your patch v2 your final solution, or would
> you like to come up with v3?
> 
> Thanks a lot for your patience and your thorough analysis.
If you have no further comments or objections, v2 is the final version.

Thanks

- cong
> > On Fri, Nov 30, 2012 at 12:03 PM, Cong Ding  wrote:
> > > Hi Hans, I think the memory allocated with kzalloc is properly freed
> > > by calling kobject_put.
> > >
> > > I can give a simple explanation.
> > >
> > > 1)  when we call kobject_init, the parameter portio_attr_type is
> > > passed in. portio_attr_type includes a function pointer to
> > > portio_release, which releases the memory of portio.
> > >
> > > 2) when we call kobject_put, kref_put is called with a pointer to the
> > > function kobject_release.
> > > 3) kref_put calls kref_sub with the same pointer to kobject_release.
> > > 4) kref_sub then calls the function kobject_release if
> > > atomic_sub_and_test returns true.
> > >
> > > 5) let's look at what kobject_release is. It calls kobject_cleanup,
> > > and kobject_cleanup calls t->release(kobj), where t->release is exactly
> > > the function we passed in through kobject_init at step (1). So the
> > > function portio_release is called, and the memory allocated with kzalloc
> > > is freed.
> > >
> > > If there is anything wrong in my analysis, please feel free to let me
> > > know.
> > >
> > > Personally, I suggest adding functions to create and release
> > > uio_portio, similar to kobject_create and kobject_put in
> > > lib/kobject.c. That way, other readers are not misled into thinking the
> > > memory is not freed (and we should add some comments here). For example,
> > > uio_portio_create calls kzalloc and kobject_init and returns
> > > uio_portio, similar to kobject_create; and
> > > uio_portio_release calls kobject_put to release the memory. And we do
> > > the same thing for uio_map.
> > >
> > > The usage here is quite strange, but it works. If I were writing this
> > > from scratch, I would use a pointer to a kobject in the uio_portio
> > > struct instead of the kobject struct itself. In that case I could call
> > > kobject_create instead of kobject_init, and then do both
> > > kzalloc(uio_portio) and kfree(uio_portio) in the file uio.c.
> > >
> > > Best,
> > > Cong
> > >
> > > On Fri, Nov 30, 2012 at 1:13 AM, Hans J. Koch  wrote:
> > >> There's still another bug: The memory allocated with kzalloc is
> > >> never freed.
> > 


Re: [PATCH v2 1/1] uio.c: solve memory leak

2012-12-07 Thread Hans J. Koch
On Fri, Dec 07, 2012 at 12:02:11AM +0100, Cong Ding wrote:
> ping Hans, did you have any comment on this?

Sounds right what you say. Is your patch v2 your final solution, or would
you like to come up with v3?

Thanks a lot for your patience and your thorough analysis.

Hans

> 
> - cong
> 
> On Fri, Nov 30, 2012 at 12:03 PM, Cong Ding  wrote:
> > Hi Hans, I think the memory allocated with kzalloc is properly freed
> > by calling kobject_put.
> >
> > I can give a simple explanation.
> >
> > 1)  when we call kobject_init, the parameter portio_attr_type is
> > passed in. portio_attr_type includes a function pointer to
> > portio_release, which releases the memory of portio.
> >
> > 2) when we call kobject_put, kref_put is called with a pointer to the
> > function kobject_release.
> > 3) kref_put calls kref_sub with the same pointer to kobject_release.
> > 4) kref_sub then calls the function kobject_release if
> > atomic_sub_and_test returns true.
> >
> > 5) let's look at what kobject_release is. It calls kobject_cleanup,
> > and kobject_cleanup calls t->release(kobj), where t->release is exactly
> > the function we passed in through kobject_init at step (1). So the
> > function portio_release is called, and the memory allocated with kzalloc
> > is freed.
> >
> > If there is anything wrong in my analysis, please feel free to let me know.
> >
> > Personally, I suggest adding functions to create and release
> > uio_portio, similar to kobject_create and kobject_put in
> > lib/kobject.c. That way, other readers are not misled into thinking the
> > memory is not freed (and we should add some comments here). For example,
> > uio_portio_create calls kzalloc and kobject_init and returns
> > uio_portio, similar to kobject_create; and
> > uio_portio_release calls kobject_put to release the memory. And we do
> > the same thing for uio_map.
> >
> > The usage here is quite strange, but it works. If I were writing this
> > from scratch, I would use a pointer to a kobject in the uio_portio
> > struct instead of the kobject struct itself. In that case I could call
> > kobject_create instead of kobject_init, and then do both
> > kzalloc(uio_portio) and kfree(uio_portio) in the file uio.c.
> >
> > Best,
> > Cong
> >
> > On Fri, Nov 30, 2012 at 1:13 AM, Hans J. Koch  wrote:
> >> There's still another bug: The memory allocated with kzalloc is
> >> never freed.
> 


[RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2012-12-07 Thread Steven Rostedt
When debugging the latencies on a 40 core box, where we hit 300 to
500 microsecond latencies, I found there was a huge contention on the
runqueue locks.

Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.

The test that was run was the following:

 cyclictest --numa -p95 -m -d0 -i100

This created a thread on each CPU that would set its wakeup in iterations
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us and wakes up and measures
its latencies.

What happened was another RT task would be scheduled on one of the CPUs
that was running our test, when the other CPUs' tests went to sleep and
scheduled idle. This caused the "pull" operation to execute on all
these CPUs. Each one of these saw the RT task that was overloaded on
the CPU of the test that was still running, and each one tried
to grab that task in a thundering herd way.

To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq of the overloaded CPU. As the sched
domains on this box were rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple effect with the
rq locks. As these locks were blocked, any wakeups on these CPUs
would also block on these locks, and the wait time escalated.

I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, only to find out that the
CPU we picked isn't in the task's affinity.

Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.

With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.
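
To illustrate why letting the overloaded CPU choose the destination
sidesteps the affinity guesswork described above, here is a minimal
userspace sketch. It is not the kernel's cpupri code: the function name
pick_push_target(), the flat cpu_prio[] array and the 64-bit affinity mask
are all assumptions made for the example. The overloaded CPU already knows
the task's affinity, so it can scan for a suitable target without any
remote CPU taking its rq lock.

```c
#include <stdint.h>

#define NO_CPU (-1)

/*
 * Hypothetical userspace model (not kernel code).
 * cpu_prio[i] is the priority of whatever CPU i is running right now;
 * as in the kernel, a lower value means a higher priority.
 * affinity is a bitmask of CPUs the task to be pushed may run on.
 * Returns the chosen CPU, or NO_CPU if the task has to stay put.
 */
int pick_push_target(const int *cpu_prio, int ncpus, int self,
                     uint64_t affinity, int task_prio)
{
    int best = NO_CPU;
    int cpu;

    for (cpu = 0; cpu < ncpus && cpu < 64; cpu++) {
        if (cpu == self)
            continue;   /* never push to ourselves */
        if (!(affinity & (UINT64_C(1) << cpu)))
            continue;   /* task is not allowed to run there */
        if (cpu_prio[cpu] <= task_prio)
            continue;   /* that CPU runs something at least as important */
        if (best == NO_CPU || cpu_prio[cpu] > cpu_prio[best])
            best = cpu; /* prefer the least important running task */
    }
    return best;
}
```

With cpu_prio = {10, 50, 99, 70}, self = 0 and a task of priority 20, an
unrestricted mask selects CPU 2, while a mask limited to CPUs 1 and 3
selects CPU 3. A pulling CPU could not have known either answer without
first taking the overloaded rq lock.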

Signed-off-by: Steven Rostedt 

Index: linux-rt.git/kernel/sched/core.c
===
--- linux-rt.git.orig/kernel/sched/core.c
+++ linux-rt.git/kernel/sched/core.c
@@ -1538,6 +1538,8 @@ static void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
+   sched_rt_push_check();
+
if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
return;
 
Index: linux-rt.git/kernel/sched/rt.c
===
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -1425,53 +1425,6 @@ static void put_prev_task_rt(struct rq *
 /* Only try algorithms three times */
 #define RT_MAX_TRIES 3
 
-static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
-{
-   if (!task_running(rq, p) &&
-   (cpu < 0 || cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) &&
-   (p->nr_cpus_allowed > 1))
-   return 1;
-   return 0;
-}
-
-/* Return the second highest RT task, NULL otherwise */
-static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu)
-{
-   struct task_struct *next = NULL;
-   struct sched_rt_entity *rt_se;
-   struct rt_prio_array *array;
-   struct rt_rq *rt_rq;
-   int idx;
-
-   for_each_leaf_rt_rq(rt_rq, rq) {
-   array = &rt_rq->active;
-   idx = sched_find_first_bit(array->bitmap);
-next_idx:
-   if (idx >= MAX_RT_PRIO)
-   continue;
-   if (next && next->prio <= idx)
-   continue;
-   list_for_each_entry(rt_se, array->queue + idx, run_list) {
-   struct task_struct *p;
-
-   if (!rt_entity_is_task(rt_se))
-   continue;
-
-   p = rt_task_of(rt_se);
-   if (pick_rt_task(rq, p, cpu)) {
-   next = p;
-   break;
-   }
-   }
-   if (!next) {
-   idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-   goto next_idx;
-   }
-   }
-
-   return next;
-}
-
 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);
 
 static int find_lowest_rq(struct task_struct *task)
@@ -1723,10 +1676,24 @@ static void push_rt_tasks(struct rq *rq)
;
 }
 
+void sched_rt_push_check(void)
+{
+   struct rq *rq = cpu_rq(smp_processor_id());
+
+   if (WARN_ON_ONCE(!irqs_disabled()))
+   return;
+
+   if (!has_pushable_tasks(rq))
+   return;
+
+   raw_spin_lock(&rq->lock);
+   push_rt_tasks(rq);
+   raw_spin_unlock(&rq->lock);
+}
+
 static int pull_rt_task(struct rq *this_rq)
 {
int this_cpu = this_rq->cpu, ret = 0, cpu;
-   struct task_struct *p;
struct rq *src_rq;
 
if 

[RFC][PATCH RT 4/4] sched/rt: Initiate a pull when the priority of a task is lowered

2012-12-07 Thread Steven Rostedt
If a task lowers its priority (say by losing priority inheritance)
and a higher priority task is waiting on another CPU, initiate a pull.

Signed-off-by: Steven Rostedt 

Index: linux-rt.git/kernel/sched/rt.c
===
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -997,6 +997,8 @@ inc_rt_prio(struct rt_rq *rt_rq, int pri
inc_rt_prio_smp(rt_rq, prio, prev_prio);
 }
 
+static int pull_rt_task(struct rq *this_rq);
+
 static void
 dec_rt_prio(struct rt_rq *rt_rq, int prio)
 {
@@ -1021,6 +1023,9 @@ dec_rt_prio(struct rt_rq *rt_rq, int pri
rt_rq->highest_prio.curr = MAX_RT_PRIO;
 
dec_rt_prio_smp(rt_rq, prio, prev_prio);
+
+   if (prev_prio < rt_rq->highest_prio.curr)
+   pull_rt_task(rq_of_rt_rq(rt_rq));
 }
 
 #else



[RFC][PATCH RT 1/4] sched/rt: Fix push_rt_task() to have the same checks as the caller did

2012-12-07 Thread Steven Rostedt
Currently, the push_rt_task() only pushes the task if it is lower
priority than the currently running task.

But this is not the only check. If the currently running task is also
pinned, we may want to push as well, and we do this check when we wake
up a task, but then we are guaranteed to fail pushing the task because
the internal checks may fail.

Make the check the same as the wakeup checks. We could remove the
check in the wake up and just let the push_rt_task() do the work,
but this makes the wake up exit this check on the likely case that
"ok_to_push_task()" will fail, and that we don't need to do the
iterative loop of checks on the pushable task list.
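
As a quick way to see which combinations the unified check accepts, the
predicate can be modelled in userspace. This is only a sketch: struct
model_task and its is_rt flag are hypothetical stand-ins for the few
task_struct fields (and rt_task()) that the real check reads.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the task_struct fields the check looks at. */
struct model_task {
    int prio;             /* lower value = higher priority */
    int nr_cpus_allowed;  /* size of the affinity mask */
    bool migrate_disable; /* -rt only: task temporarily pinned */
    bool is_rt;           /* stands in for rt_task() */
};

/*
 * Push p away if it can migrate at all, and the running task is an RT
 * task that either cannot move itself or is not lower priority than p.
 */
bool model_ok_to_push(const struct model_task *p,
                      const struct model_task *curr)
{
    return p->nr_cpus_allowed > 1 &&
           curr->is_rt &&
           (curr->migrate_disable ||
            curr->nr_cpus_allowed < 2 ||
            curr->prio <= p->prio);
}
```

The case that the old prio-only test in push_rt_task() rejected is the
one where p has the higher priority but curr is pinned: with this
predicate p is still a push candidate.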

Signed-off-by: Steven Rostedt 

Index: linux-rt.git/kernel/sched/rt.c
===
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -1615,6 +1615,15 @@ static struct task_struct *pick_next_pus
return p;
 }
 
+static int ok_to_push_task(struct task_struct *p, struct task_struct *curr)
+{
+   return p->nr_cpus_allowed > 1 &&
+   rt_task(curr) &&
+   (curr->migrate_disable ||
+curr->nr_cpus_allowed < 2 ||
+curr->prio <= p->prio);
+}
+
 /*
  * If the current CPU has more than one RT task, see if the non
  * running task can migrate over to a CPU that is running a task
@@ -1649,7 +1658,7 @@ retry:
 * higher priority than current. If that's the case
 * just reschedule current.
 */
-   if (unlikely(next_task->prio < rq->curr->prio)) {
+   if (!ok_to_push_task(next_task, rq->curr)) {
resched_task(rq->curr);
return 0;
}
@@ -1814,10 +1823,7 @@ static void task_woken_rt(struct rq *rq,
if (!task_running(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
has_pushable_tasks(rq) &&
-   p->nr_cpus_allowed > 1 &&
-   rt_task(rq->curr) &&
-   (rq->curr->nr_cpus_allowed < 2 ||
-rq->curr->prio <= p->prio))
+   ok_to_push_task(p, rq->curr))
push_rt_tasks(rq);
 }
 



[RFC][PATCH RT 0/4] sched/rt: Lower rq lock contention latencies on many CPU boxes

2012-12-07 Thread Steven Rostedt
I've been debugging large latencies on a 40 core box and found a major
cause due to the thundering herd like grab of the rq lock due to the
pull_rt_task() logic.

Basically, if a large number of CPUs were to lower their priority at roughly
the same time, they would all trigger a pull. If there happens to be
only one CPU available to get a task, all CPUs doing the pull will try
to grab it. In doing so, they will all contend on the rq lock of
the overloaded CPU. Only one CPU will succeed in pulling the task
and unfortunately, there's no quick way to know which, as it's dependent
on the affinity of the task that needs to be pulled, and to look at that,
we need to grab its rq lock!

Instead of having the pull logic grab the rq locks and do the work to
switch the task over to the pulling CPU, this patch series (well patch
#3) has the pulling CPU send an IPI to the overloaded CPU and that
CPU will do the push instead. The push logic uses the cpupri.c code
to quickly find the best CPU to offload the overloaded RT task to, so
it makes it quite efficient to do this.

Receiving multiple IPIs has a much lower overhead than all the CPUs
grabbing the rq lock.

The other three patches are fixes/enhancements to the push/pull code
that I found while doing the debugging of the latencies.

Note, although this patch series is made for the -rt patch, the issues
apply to mainline as well. But because -rt has the migrate_disable() code,
this patch series is tailored to that. But if we can vet this out in
-rt, all this code should make its way quickly to mainline.

I tested this code out, but it probably needs some clean up and definitely
more comments. I'm only posting this as an RFC for now to get feedback
on the idea.

Thanks!

-- Steve




[RFC][PATCH RT 2/4] sched/rt: Try to migrate task if preempting pinned rt task

2012-12-07 Thread Steven Rostedt
If a higher priority task is about to preempt a task that has been
pinned to a CPU, first try to see if the higher priority task can
preempt another task instead.

That is, if a high priority process wakes up on a CPU while the currently
running task can still migrate, the wakeup will skip pushing that high
priority task to another CPU. By the time the task schedules, the task
that it's about to preempt could have changed its affinity and become
pinned. At that point, it may be better to move the high priority task to
another CPU, if one exists that is currently running a lower priority task
than the one about to be preempted.

Signed-off-by: Steven Rostedt 

Index: linux-rt.git/kernel/sched/rt.c
===
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -1804,8 +1804,22 @@ skip:
 
 static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
 {
+   struct task_struct *p = prev;
+
+   /*
+* If we are preempting a migrate disabled task
+* see if we can push the higher tasks first.
+*/
+   if (prev->on_rq && (prev->nr_cpus_allowed <= 1 || prev->migrate_disable) &&
+   has_pushable_tasks(rq) && rq->rt.highest_prio.next < prev->prio) {
+   p = _pick_next_task_rt(rq);
+
+   if (p != prev && p->nr_cpus_allowed > 1 && push_rt_task(rq))
+   p = _pick_next_task_rt(rq);
+   }
+
/* Try to pull RT tasks here if we lower this rq's prio */
-   if (rq->rt.highest_prio.curr > prev->prio)
+   if (rq->rt.highest_prio.curr > p->prio)
pull_rt_task(rq);
 }
 



[PATCH v4 1/5] net: Add support for hardware-offloaded encapsulation

2012-12-07 Thread Joseph Gasparakis
This patch adds support in the kernel for offloading in the NIC Tx and Rx
checksumming for encapsulated packets (such as VXLAN and IP GRE).

For Tx encapsulation offload, the driver will need to set the right bits
in netdev->hw_enc_features. The protocol driver will have to set the
skb->encapsulation bit and populate the inner headers, so the NIC driver will
use those inner headers to calculate the csum in hardware.

For Rx encapsulation offload, the driver will again need to set the
skb->encapsulation flag and set skb->ip_summed to CHECKSUM_UNNECESSARY.
In that case the protocol driver should push the decapsulated packet up
to the stack, again with CHECKSUM_UNNECESSARY. In either case, the protocol
driver should set the skb->encapsulation flag back to zero. Finally, the
protocol driver should have the NETIF_F_RXCSUM flag set in its features.
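
The Rx rule can be sketched as a small userspace model. Everything named
model_* or MODEL_* below is hypothetical and exists only for this
example; the real fields live in struct sk_buff and the real flag is
NETIF_F_RXCSUM. The point of the sketch is that the hardware verdict
survives decapsulation only when all three conditions hold, and that the
encapsulation bit is cleared in every case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for CHECKSUM_NONE / CHECKSUM_UNNECESSARY. */
enum model_csum { MODEL_CSUM_NONE = 0, MODEL_CSUM_UNNECESSARY = 1 };

/* Hypothetical stand-in for NETIF_F_RXCSUM. */
#define MODEL_F_RXCSUM (UINT32_C(1) << 0)

/* Hypothetical stand-in for the two sk_buff fields involved. */
struct model_skb {
    enum model_csum ip_summed;
    bool encapsulation;
};

/*
 * Called by the tunnel (protocol) driver after stripping the outer
 * headers. Keep CHECKSUM_UNNECESSARY only if the NIC marked the packet
 * as encapsulated and verified, and the tunnel device has Rx checksum
 * offload enabled. Otherwise make the stack verify the inner checksum.
 */
void model_decap_rx(struct model_skb *skb, uint32_t tunnel_features)
{
    if (skb->ip_summed != MODEL_CSUM_UNNECESSARY ||
        !skb->encapsulation ||
        !(tunnel_features & MODEL_F_RXCSUM))
        skb->ip_summed = MODEL_CSUM_NONE;

    /* The packet handed up the stack is no longer encapsulated. */
    skb->encapsulation = false;
}
```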

Signed-off-by: Joseph Gasparakis 
Signed-off-by: Peter P Waskiewicz Jr 
Signed-off-by: Alexander Duyck 
---
 include/linux/ip.h|  5 +++
 include/linux/ipv6.h  |  5 +++
 include/linux/netdevice.h |  6 +++
 include/linux/skbuff.h| 95 ++-
 include/linux/tcp.h   | 10 +
 include/linux/udp.h   |  5 +++
 net/core/skbuff.c |  9 +
 7 files changed, 134 insertions(+), 1 deletion(-)

diff --git a/include/linux/ip.h b/include/linux/ip.h
index 58b82a2..492bc65 100644
--- a/include/linux/ip.h
+++ b/include/linux/ip.h
@@ -25,6 +25,11 @@ static inline struct iphdr *ip_hdr(const struct sk_buff *skb)
return (struct iphdr *)skb_network_header(skb);
 }
 
+static inline struct iphdr *inner_ip_hdr(const struct sk_buff *skb)
+{
+   return (struct iphdr *)skb_inner_network_header(skb);
+}
+
 static inline struct iphdr *ipip_hdr(const struct sk_buff *skb)
 {
return (struct iphdr *)skb_transport_header(skb);
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 12729e9..faed1e3 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -67,6 +67,11 @@ static inline struct ipv6hdr *ipv6_hdr(const struct sk_buff *skb)
return (struct ipv6hdr *)skb_network_header(skb);
 }
 
+static inline struct ipv6hdr *inner_ipv6_hdr(const struct sk_buff *skb)
+{
+   return (struct ipv6hdr *)skb_inner_network_header(skb);
+}
+
 static inline struct ipv6hdr *ipipv6_hdr(const struct sk_buff *skb)
 {
return (struct ipv6hdr *)skb_transport_header(skb);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 18c5dc9..c6a14d4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1063,6 +1063,12 @@ struct net_device {
netdev_features_t   wanted_features;
/* mask of features inheritable by VLAN devices */
netdev_features_t   vlan_features;
+   /* mask of features inherited by encapsulating devices
+* This field indicates what encapsulation offloads
+* the hardware is capable of doing, and drivers will
+* need to set them appropriately.
+*/
+   netdev_features_t   hw_enc_features;
 
/* Interface index. Unique device identifier*/
int ifindex;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2af494..320e976 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -376,6 +376,8 @@ typedef unsigned char *sk_buff_data_t;
  * @mark: Generic packet mark
  * @dropcount: total number of sk_receive_queue overflows
  * @vlan_tci: vlan tag control information
+ * @inner_transport_header: Inner transport layer header (encapsulation)
+ * @inner_network_header: Network layer header (encapsulation)
  * @transport_header: Transport layer header
  * @network_header: Network layer header
  * @mac_header: Link layer header
@@ -471,7 +473,13 @@ struct sk_buff {
__u8wifi_acked:1;
__u8no_fcs:1;
__u8head_frag:1;
-   /* 8/10 bit hole (depending on ndisc_nodetype presence) */
+   /* Encapsulation protocol and NIC drivers should use
+* this flag to indicate to each other if the skb contains
+* encapsulated packet or not and maybe use the inner packet
+* headers if needed
+*/
+   __u8encapsulation:1;
+   /* 7/9 bit hole (depending on ndisc_nodetype presence) */
kmemcheck_bitfield_end(flags2);
 
 #ifdef CONFIG_NET_DMA
@@ -486,6 +494,8 @@ struct sk_buff {
__u32   avail_size;
};
 
+   sk_buff_data_t  inner_transport_header;
+   sk_buff_data_t  inner_network_header;
sk_buff_data_t  transport_header;
sk_buff_data_t  network_header;
sk_buff_data_t  mac_header;
@@ -1435,12 +1445,53 @@ static inline void skb_reserve(struct sk_buff *skb, int len)
skb->tail += len;
 }
 
+static inline void skb_reset_inner_headers(struct sk_buff *skb)
+{
+   

[PATCH v4 5/5] vxlan: Add capability of Rx checksum offload for inner packet

2012-12-07 Thread Joseph Gasparakis
This patch adds capability in vxlan to identify received
checksummed inner packets and signal them to the upper layers of
the stack. The driver needs to set the skb->encapsulation bit
and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.

Signed-off-by: Joseph Gasparakis 
---
 drivers/net/vxlan.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 88b31f2..3b3fdf6 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -607,7 +607,17 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 
__skb_tunnel_rx(skb, vxlan->dev);
skb_reset_network_header(skb);
-   skb->ip_summed = CHECKSUM_NONE;
+
+   /* If the NIC driver gave us an encapsulated packet with
+* CHECKSUM_UNNECESSARY and Rx checksum feature is enabled,
+* leave the CHECKSUM_UNNECESSARY, the device checksummed it
+* for us. Otherwise force the upper layers to verify it.
+*/
+   if (skb->ip_summed != CHECKSUM_UNNECESSARY || !skb->encapsulation ||
+   !(vxlan->dev->features & NETIF_F_RXCSUM))
+   skb->ip_summed = CHECKSUM_NONE;
+
+   skb->encapsulation = 0;
 
err = IP_ECN_decapsulate(oip, skb);
if (unlikely(err)) {
@@ -1175,7 +1185,9 @@ static void vxlan_setup(struct net_device *dev)
dev->features   |= NETIF_F_LLTX;
dev->features   |= NETIF_F_NETNS_LOCAL;
dev->features   |= NETIF_F_SG | NETIF_F_HW_CSUM;
-   dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM;
+   dev->features   |= NETIF_F_RXCSUM;
+
+   dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
 
spin_lock_init(&vxlan->hash_lock);
-- 
1.7.11.7



[PATCH v4 3/5] vxlan: capture inner headers during encapsulation

2012-12-07 Thread Joseph Gasparakis
Allow VXLAN to make use of Tx checksum offloading and Tx scatter-gather.
The advantage to these two changes is that it also allows the VXLAN to
make use of GSO.

Signed-off-by: Joseph Gasparakis 
Signed-off-by: Peter P Waskiewicz Jr 
Signed-off-by: Alexander Duyck 
---
 drivers/net/vxlan.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ce77b8b..88b31f2 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -876,6 +876,11 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
goto drop;
}
 
+   if (!skb->encapsulation) {
+   skb_reset_inner_headers(skb);
+   skb->encapsulation = 1;
+   }
+
/* Need space for new headers (invalidates iph ptr) */
if (skb_cow_head(skb, VXLAN_HEADROOM))
goto drop;
@@ -947,7 +952,8 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
vxlan_set_owner(dev, skb);
 
/* See iptunnel_xmit() */
-   skb->ip_summed = CHECKSUM_NONE;
+   if (skb->ip_summed != CHECKSUM_PARTIAL)
+   skb->ip_summed = CHECKSUM_NONE;
ip_select_ident(iph, &rt->dst, NULL);
 
err = ip_local_out(skb);
@@ -1168,6 +1174,8 @@ static void vxlan_setup(struct net_device *dev)
dev->tx_queue_len = 0;
dev->features   |= NETIF_F_LLTX;
dev->features   |= NETIF_F_NETNS_LOCAL;
+   dev->features   |= NETIF_F_SG | NETIF_F_HW_CSUM;
+   dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM;
dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
 
spin_lock_init(&vxlan->hash_lock);
-- 
1.7.11.7



[PATCH v4 4/5] ixgbe: Adding tx encapsulation capability

2012-12-07 Thread Joseph Gasparakis
This patch allows ixgbe to recognize encapsulated packets and do the tx
checksum offload in hardware. This patch is only for demonstration
purposes and should not be applied.
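
Since the driver can no longer trust first->protocol for an encapsulated
frame, the patch derives everything from the (inner) header pointers.
The following is a userspace sketch of those three derivations on raw
bytes; the model_* helper names are made up for the example and are not
part of ixgbe.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * The IP version is the top four bits of the first header byte,
 * for both IPv4 and IPv6.
 */
unsigned int model_ip_version(const uint8_t *network_hdr)
{
    return network_hdr[0] >> 4;
}

/*
 * L3 header length as the patch computes it: the distance between the
 * network and transport header pointers.
 */
size_t model_l3_len(const uint8_t *network_hdr, const uint8_t *transport_hdr)
{
    return (size_t)(transport_hdr - network_hdr);
}

/*
 * TCP header length: the data offset field (top nibble of byte 12)
 * counts 32-bit words, hence the multiplication by 4.
 */
unsigned int model_tcp_hdrlen(const uint8_t *tcp_hdr)
{
    return (unsigned int)(tcp_hdr[12] >> 4) * 4;
}
```

For a plain IPv4 header (first byte 0x45) followed by an option-less TCP
header (byte 12 equal to 0x50), these helpers report version 4, an L3
length of 20 bytes and a TCP header length of 20 bytes.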

Signed-off-by: Joseph Gasparakis 
Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 46 +--
 1 file changed, 37 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fb165b6..62a7d6e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5972,17 +5972,42 @@ static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
if (!(first->tx_flags & IXGBE_TX_FLAGS_TXSW))
return;
}
+   vlan_macip_lens |= skb_network_offset(skb)
+  << IXGBE_ADVTXD_MACLEN_SHIFT;
} else {
u8 l4_hdr = 0;
-   switch (first->protocol) {
-   case __constant_htons(ETH_P_IP):
-   vlan_macip_lens |= skb_network_header_len(skb);
+   union {
+   struct iphdr *ipv4;
+   struct ipv6hdr *ipv6;
+   u8 *raw;
+   } network_hdr;
+   union {
+   struct tcphdr *tcphdr;
+   u8 *raw;
+   } transport_hdr;
+
+   if (skb->encapsulation) {
+   network_hdr.raw = skb_inner_network_header(skb);
+   transport_hdr.raw = skb_inner_transport_header(skb);
+   vlan_macip_lens |= skb_inner_network_offset(skb) <<
+  IXGBE_ADVTXD_MACLEN_SHIFT;
+   } else {
+   network_hdr.raw = skb_network_header(skb);
+   transport_hdr.raw = skb_transport_header(skb);
+   vlan_macip_lens |= skb_network_offset(skb) <<
+  IXGBE_ADVTXD_MACLEN_SHIFT;
+   }
+
+   /* use first 4 bits to determine IP version */
+   switch (network_hdr.ipv4->version) {
+   case 4:
+   vlan_macip_lens |= transport_hdr.raw - network_hdr.raw;
type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;
-   l4_hdr = ip_hdr(skb)->protocol;
+   l4_hdr = network_hdr.ipv4->protocol;
break;
-   case __constant_htons(ETH_P_IPV6):
-   vlan_macip_lens |= skb_network_header_len(skb);
-   l4_hdr = ipv6_hdr(skb)->nexthdr;
+   case 6:
+   vlan_macip_lens |= transport_hdr.raw - network_hdr.raw;
+   l4_hdr = network_hdr.ipv6->nexthdr;
break;
default:
if (unlikely(net_ratelimit())) {
@@ -5996,7 +6021,7 @@ static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
switch (l4_hdr) {
case IPPROTO_TCP:
type_tucmd |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
-   mss_l4len_idx = tcp_hdrlen(skb) <<
+   mss_l4len_idx = (transport_hdr.tcphdr->doff * 4) <<
IXGBE_ADVTXD_L4LEN_SHIFT;
break;
case IPPROTO_SCTP:
@@ -6022,7 +6047,6 @@ static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
}
 
/* vlan_macip_lens: MACLEN, VLAN tag */
-   vlan_macip_lens |= skb_network_offset(skb) << IXGBE_ADVTXD_MACLEN_SHIFT;
vlan_macip_lens |= first->tx_flags & IXGBE_TX_FLAGS_VLAN_MASK;
 
ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, 0,
@@ -7383,6 +7407,10 @@ static int ixgbe_probe(struct pci_dev *pdev,
 
netdev->hw_features = netdev->features;
 
+   netdev->hw_enc_features = NETIF_F_IP_CSUM |
+ NETIF_F_IPV6_CSUM |
+ NETIF_F_SG;
+
switch (adapter->hw.mac.type) {
case ixgbe_mac_82599EB:
case ixgbe_mac_X540:
-- 
1.7.11.7



[PATCH v4 2/5] net: Handle encapsulated offloads before fragmentation or handing to lower dev

2012-12-07 Thread Joseph Gasparakis
From: Alexander Duyck 

This change allows the VXLAN to enable Tx checksum offloading even on
devices that do not support encapsulated checksum offloads. The
advantage to this is that it allows for the lower device to change due
to routing table changes without impacting features on the VXLAN itself.
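
A rough userspace model of the decision made at transmit time may help
here. The MODEL_F_* bits and both function names are assumptions for the
sketch (the kernel tests NETIF_F_ALL_CSUM and calls skb_checksum_help()).
The idea is that an encapsulated skb is checked against the lower
device's hw_enc_features rather than its normal features, and that a
pending checksum is completed in software whenever the resulting mask
has no checksum offload left.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits standing in for NETIF_F_SG / NETIF_F_HW_CSUM. */
#define MODEL_F_SG      (UINT32_C(1) << 0)
#define MODEL_F_HW_CSUM (UINT32_C(1) << 1)

/*
 * Features that apply to this particular skb: encapsulated packets are
 * limited to what the hardware can do for inner headers.
 */
uint32_t model_xmit_features(uint32_t dev_features, uint32_t hw_enc_features,
                             bool encapsulation)
{
    return encapsulation ? (dev_features & hw_enc_features) : dev_features;
}

/*
 * True when a checksum still pending on the skb (CHECKSUM_PARTIAL in
 * the kernel) must be finished in software before the frame is handed
 * to the device.
 */
bool model_needs_sw_csum(uint32_t features, bool csum_partial)
{
    return csum_partial && !(features & MODEL_F_HW_CSUM);
}
```

A device that advertises checksum offload in its normal features but
leaves hw_enc_features at zero therefore still works under a VXLAN
device with Tx checksum offload enabled: the encapsulated packet simply
takes the software fallback.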

Signed-off-by: Alexander Duyck 
---
 net/core/dev.c   | 15 +--
 net/ipv4/ip_output.c |  4 
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 307142a..a4c4a1b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2324,6 +2324,13 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
skb->vlan_tci = 0;
}
 
+   /* If encapsulation offload request, verify we are testing
+* hardware encapsulation features instead of standard
+* features for the netdev
+*/
+   if (skb->encapsulation)
+   features &= dev->hw_enc_features;
+
if (netif_needs_gso(skb, features)) {
if (unlikely(dev_gso_segment(skb, features)))
goto out_kfree_skb;
@@ -2339,8 +2346,12 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 * checksumming here.
 */
if (skb->ip_summed == CHECKSUM_PARTIAL) {
-   skb_set_transport_header(skb,
-   skb_checksum_start_offset(skb));
+   if (skb->encapsulation)
+   skb_set_inner_transport_header(skb,
+   skb_checksum_start_offset(skb));
+   else
+   skb_set_transport_header(skb,
+   skb_checksum_start_offset(skb));
if (!(features & NETIF_F_ALL_CSUM) &&
 skb_checksum_help(skb))
goto out_kfree_skb;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6537a40..3e98ed2 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -595,6 +595,10 @@ slow_path_clean:
}
 
 slow_path:
+   /* for offloaded checksums cleanup checksum before fragmentation */
+   if ((skb->ip_summed == CHECKSUM_PARTIAL) && skb_checksum_help(skb))
+   goto fail;
+
left = skb->len - hlen; /* Space per frame */
ptr = hlen; /* Where to start from */
 
-- 
1.7.11.7



[PATCH v4 net-next 0/5] tunneling: Add support for hardware-offloaded encapsulation

2012-12-07 Thread Joseph Gasparakis
The series contains updates to add in the NIC Rx and Tx checksumming support
for encapsulated packets.

The sk_buff needs to somehow carry information about the inner packet, and adding
three fields for the inner mac, network and transport headers was the preferred
approach.

Not adding these fields would mean that the drivers would need to parse the
sk_buff data in the hot path, with a negative impact on performance.

Adding to sk_buff a pointer to the skbuff of the inner packet made sense, but
would be a complicated change, as assumptions would need to be made with regard to
helper functions such as skb_clone() and skb_copy(). Also, code for the existing
encapsulation protocols (such as VXLAN and IP GRE) would have to be reworked, so the
decision was to take the simple approach of adding these three fields.

v2 Makes sure that checksumming for IP GRE does not take place if the offload
   flag is set in the skb's netdev features

v3 Fixes issues picked up by the community in v2 and is intended to provide
   ability to demo vxlan Tx offloading with Intel's ixgbe. As part of this, 
   it provides an RFC patch for ixgbe to take advantage of the offloading
   mechanism

   Now it is possible to create a vxlan interface like this:
#ip link add vxlan0 type vxlan id 40 ttl 10 group 239.1.1.1 dev eth0

   Then turn on/off the encapsulation offload mechanism by doing:
#ethtool -K eth0 tx-checksum-ip-generic on

   In v3 ipgre work got paused (and therefore patches not included) and I will
   come back to it when vxlan is accepted by the community.

v4 Added more detailed commit logs and code comments as per request in v3
   Also now the Rx offload encapsulation patch is included in the series.


RE: [PATCH 0/3] Add O_DENY* flags to fcntl and cifs

2012-12-07 Thread Myklebust, Trond
> -Original Message-
> From: linux-nfs-ow...@vger.kernel.org [mailto:linux-nfs-
> ow...@vger.kernel.org] On Behalf Of Pavel Shilovsky
> Sent: Friday, December 07, 2012 9:43 PM
> To: Christoph Hellwig
> Cc: linux-c...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> fsde...@vger.kernel.org; wine-de...@winehq.org; linux-
> n...@vger.kernel.org
> Subject: Re: [PATCH 0/3] Add O_DENY* flags to fcntl and cifs
> 
> Christoph Hellwig wrote on 07.12.2012 20:16:
> > On Thu, Dec 06, 2012 at 10:26:28PM +0400, Pavel Shilovsky wrote:
> >> Network filesystems CIFS, SMB2.0, SMB3.0 and NFSv4 have such flags -
> >> this change can benefit cifs and nfs modules. While this change is ok
> >> for network filesystems, it isn't targeted for local filesystems
> >> due to security problems (e.g. when a user process can deny root to
> >> delete a file).
> >>
> >> Share flags are used by Windows applications and WINE has to deal
> >> with them too. While WINE can process open share flags itself on
> >> local filesystems, it can't do it if a file is stored on a network share
> >> and is used by several clients. This patchset makes it possible for
> >> CIFS/SMB2.0/SMB3.0.
> >
> > I don't think introducing user visible flags that are only supported
> > on a single network filesystem is a good idea.
> 
> It can bring benefits for both CIFS and NFS filesystems - so, at least two
> of them.
> 
> >
> > I'm not even sure adding these flags does make a lot of sense, but
> > assuming we'd actually want this (and I'd like some more detailed
> > explanation) I think we'd at least need to make sure that:
> >
> >  a) opening files with the new modes gives a proper error message if not
> > supported
> 
> It makes us add such checks for all other filesystems, if I understand right -
> not a problem, I think.
> 
> >  b) there needs to be local support for them as well
> >  c) we need to think really hard when they should be supported, and need
> > a good rationale for it.  I can't see how we could do it
> > unconditionally for all users as that would introduce easy denial
> > of services attacks the way I understand the semantics (correct me
> > if I'm wrong).  So a mount option like you currently do probably is
> > the least bad even if I don't feel overly happy about that version.
> >
> > What is the reason your special wine use case can't simply use a
> > userspace cifs client?  Given that wine uses windows filesystem
> > semantics and cifs does as well tunnelling it through a Posix-like API
> > inbetween is never going to be perfect.
> 
> Ideally we should not make any difference between underlying filesystems
> in Wine: an application requests an open of the file and we issue this open
> with flags it passed. Since Wineserver can process share flags locally itself
> (for one linux user), we only need to add this support for CIFS (that is
> actively used by Wine applications because of its Windows nature). Bringing
> these flags for local filesystems can benefit Wine too: it will help in cases
> when Wine applications of different users on the same machine use the same
> file and can make all those things easier, of course.
> 
> The problem is the possibility of denial-of-service attacks here. We can try
> to prevent them by:
> 1) specifying an extra security bit on the file that indicates that share
> flags are accepted (like we have for mandatory locks now) and setting it for
> necessary files only, or
> 2) adding a special mount option (but it probably makes sense if we
> decide to add this support for CIFS and NFS only).

Why not just put it under the control of LSM? It seems to me that this doesn't 
so much want to be a per-mount switch but rather deserves to be a per-process 
MAC (i.e. is this running in a Wine sandbox or not)...

Cheers
   Trond


Read O_DIRECT regression in 3.7-rc8 (bisected)

2012-12-07 Thread Milan Broz
Hi Linus,

it seems this commit in 3.7-rc8 caused a regression for O_DIRECT
reads near the end of the device.

bbec0270bdd887f96377065ee38b8848b5afa395 is the first bad commit
commit bbec0270bdd887f96377065ee38b8848b5afa395
Author: Linus Torvalds 
Date:   Thu Nov 29 12:31:52 2012 -0800

blkdev_max_block: make private to fs/buffer.c


With the reproducer below (tested on i386), read should return
half of the buffer (8192 bytes); with the patch above it fails
completely.

Milan

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define BLOCK 8192
int main (int argc, char *argv[])
{
        char *buf;
        int fd, r;

        /* O_DIRECT needs an aligned buffer */
        if (posix_memalign((void **)&buf, 4096, 2 * BLOCK)) {
                printf("alloc fail\n");
                return 1;
        }

        fd = open("/dev/sdb", O_RDONLY|O_DIRECT);
        if (fd == -1) {
                printf("open fail\n");
                return 1;
        }

        /* position one BLOCK before the end of the device */
        if (lseek(fd, -BLOCK, SEEK_END) < 0) {
                printf("seek fail\n");
                close(fd);
                return 2;
        }

        /* ask for two BLOCKs; only one is left, so expect a short read */
        r = read(fd, buf, 2 * BLOCK);
        printf("Read returned %d.\n", r);

        close(fd);
        return 0;
}
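
For reference, the expected return value follows from simple clamping
arithmetic. The helper below is only a sketch of that arithmetic (its
name and signature are made up for this note, and it does not touch any
device): a read that starts before the end of the device should return
the bytes that remain, not an error.

```c
#include <stdint.h>

/*
 * Number of bytes a read of `request` bytes starting at `offset` is
 * expected to return on a device of `dev_size` bytes: the request is
 * clamped to what is left before the end of the device.
 */
uint64_t model_expected_read(uint64_t dev_size, uint64_t offset,
                             uint64_t request)
{
    uint64_t remaining = offset < dev_size ? dev_size - offset : 0;

    return request < remaining ? request : remaining;
}
```

With the offset set 8192 bytes before the end and a 16384 byte request,
the helper yields 8192, which is the short read the reproducer expects.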


Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Howard Chu

Ric Wheeler wrote:
> On 12/07/2012 04:14 PM, Theodore Ts'o wrote:
> > On Fri, Dec 07, 2012 at 02:30:19PM -0500, Steven Rostedt wrote:
> > > How is this similar? By adding this bit, we removed incentive from a
> > > group of developers that have the means to fix the real issue at hand
> > > (the performance problem with ext4). Thus, it means that they have a work
> > > around that's good enough for them, but the rest of us suffer.
> >
> > That assumes that there **is** a way to claw back the performance
> > loss, and Chris Mason has demonstrated the performance hit exists with
> > xfs as well (950 MB/s vs. 400 MB/s; that's more than a factor of two).
> > Sometimes, you have to make the engineering tradeoffs.  That's why
> > we're engineers, for goodness sakes.  Sometimes, it's just not
> > possible to square the circle.
> >
> > I don't believe that the technique of forcing people who need that
> > performance to suffer in order to induce them to try to engineer a
> > solution which may or may not exist is really the best or fairest way
> > to go about things.
> >
> > - Ted
>
> This is not a generally useful feature and won't ship in a way that helps most
> users with this issue.
>
> Let's fix the problem properly.
>
> In the meantime, there are several obvious ways to avoid this performance hit
> without changing the kernel (fully allocate and write the data, certainly
> reasonable for even reasonable sized files).

I have to agree that, if this is going to be an ext4-specific feature, then it 
can just be implemented via an ext4-specific ioctl and be done with it. But 
I'm not convinced this should be an ext4-specific feature.


As for "fix the problem properly" - you're fixing the wrong problem. This type 
of feature is important to me, not just because of the performance issue. As 
has already been pointed out, the performance difference may even be negligible.


But on SSDs, the issue is write endurance. The whole point of preallocating a 
file is to avoid doing incremental metadata updates. Particularly when each of 
those 1-bit status updates costs entire blocks, and gratuitously shortens the 
life of the media. The fact that avoiding the unnecessary wear and tear may 
also yield a performance boost is just icing on the cake. (And if the perf 
boost is over a factor of 2:1 that's some pretty damn good icing.)


There are certainly ways in which a feature like this could be deployed 
safely, or at least, without violating anyone's expectations of security. For 
example, you have braindead filesystems like FAT that don't actually support 
per-file owner/group info. If you have a filesystem where all of the files are 
known to belong to the same user, then the whole argument about "seeing 
someone else's data" is moot. If you provide the uid=/gid= mount options 
generically across all (or most) filesystem types, then you can let a sysadmin 
decide if they want to play this way or not.
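
For readers less familiar with preallocation, here is a minimal userspace
sketch of what an application can do today: reserve the space up front so
that later writes do not have to allocate blocks. It uses the standard
posix_fallocate() call, not the flag under discussion in this thread;
prealloc_file and its arguments are names chosen for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative helper: create 'path' and reserve 'len' bytes for it.
 * Returns an open file descriptor on success, -1 on failure.
 */
int prealloc_file(const char *path, off_t len)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	/* posix_fallocate() returns 0 or an error number; errno is not set */
	if (posix_fallocate(fd, 0, len) != 0) {
		close(fd);
		unlink(path);
		return -1;
	}
	return fd;
}
```

Note that a file preallocated this way still reads back as zeroes, so the
filesystem has to track which extents were actually written; that tracking
is the metadata update cost I am talking about.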


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: [PATCH] mm: add node physical memory range to sysfs

2012-12-07 Thread Andrew Morton
On Fri, 07 Dec 2012 14:34:56 -0800
Davidlohr Bueso  wrote:

> This patch adds a new 'memrange' file that shows the starting and
> ending physical addresses that are associated to a node. This is
> useful for identifying specific DIMMs within the system.

I was going to bug you about documentation, but apparently we didn't
document /sys/devices/system/node/node*/.  A great labor-saving device,
that!

> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -211,6 +211,19 @@ static ssize_t node_read_distance(struct device *dev,
>  }
>  static DEVICE_ATTR(distance, S_IRUGO, node_read_distance, NULL);
>  
> +static ssize_t node_read_memrange(struct device *dev,
> +   struct device_attribute *attr, char *buf)
> +{
> + int nid = dev->id;
> + unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
> + unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;

Hm.  Is this correct for all of
FLATMEM/SPARSEMEM/SPARSEMEM_VMEMMAP/DISCONTIGMEM/etc?
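
To make the concern concrete: node_spanned_pages counts every page frame
between the node's first and last page, holes included, so
start_pfn + spanned_pages is the end of the span, not a measure of the memory
actually present. A small userspace sketch of the arithmetic, assuming 4 KiB
pages and a simplified stand-in for pg_data_t (fake_node and the helper names
are illustrative only):

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Simplified, hypothetical stand-in for the pg_data_t fields involved. */
struct fake_node {
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;	/* includes holes */
	unsigned long node_present_pages;	/* excludes holes */
};

/* First physical address of the node's span. */
uint64_t node_start_addr(const struct fake_node *n)
{
	return (uint64_t)n->node_start_pfn << PAGE_SHIFT;
}

/* Last physical address of the span, inclusive (spanned pages must be > 0). */
uint64_t node_end_addr(const struct fake_node *n)
{
	uint64_t end_pfn = (uint64_t)n->node_start_pfn + n->node_spanned_pages;

	return (end_pfn << PAGE_SHIFT) - 1;
}

/* Bytes inside the span that are not backed by memory. */
uint64_t node_hole_bytes(const struct fake_node *n)
{
	return (uint64_t)(n->node_spanned_pages - n->node_present_pages)
		<< PAGE_SHIFT;
}
```

Whenever node_hole_bytes() is non-zero, a single start-end pair overstates
what the node owns, which is why the memory model matters here.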




Re: [PATCH v2] drivers/uio/uio_pdrv_genirq.c: Fix memory freeing issues

2012-12-07 Thread Hans J. Koch
On Fri, Dec 07, 2012 at 05:00:54PM +0200, Vitalii Demianets wrote:
> >
> > On second thought, we can't call enable_irq()/disable_irq() unconditionally
> > because of the potential disable counter (irq_desc->depth) imbalance.
> > That's why we need UIO_IRQ_DISABLED flag, and that's why we should check it
> > in uio_pdrv_genirq_irqcontrol().
> > On the other hand, you are right in that we don't need to check it inside
> > irq handler. Inside irq handler we can disable irq and set the flag
> > unconditionally, because:
> > a) We know for sure that irqs are enabled, because we are inside
> > (not-shared) irq handler;
> >  and
> > b) We are guarded from potential race conditions by spin_lock_irqsave()
> > call in uio_pdrv_genirq_irqcontrol().
> >
> > So,yes, we can get rid of costly atomic call to
> > test_and_set_bit(UIO_IRQ_DISABLED,..) inside irq handler. But I still don't
> > like the idea of mixing this optimization with bug fixes in a single patch.
> 
> On the third thought, we can't ;)
> Imagine the SMP system where uio_pdrv_genirq_irqcontrol() is being executed
> on CPU0 and irq handler is running concurrently on CPU1. To protect from
> disable_irq counter imbalance we must first check current irq status, and in
> atomic manner. Thus we prevent double-disable, one from
> uio_pdrv_genirq_irqcontrol() running on CPU0 and another from irq handler
> running on CPU1.
> Above consideration justifies current code.
> 
> But it seems we have potential concurrency problem here anyway. Here is 
> theoretical scenario:
> 1) Initial state: irq is enabled, uio_pdrv_genirq_irqcontrol() starts being 
> executed on CPU0 with irq_on=1 and at the same time, concurrently, irq 
> handler starts being executed on CPU1;
> 2) irq handler executes line
>     if (!test_and_set_bit(UIO_IRQ_DISABLED, &priv->flags))
> as irq was enabled, the condition holds. And now UIO_IRQ_DISABLED is set.
> 3) uio_pdrv_genirq_irqcontrol() executes line
>     if (test_and_clear_bit(UIO_IRQ_DISABLED, &priv->flags))
> as UIO_IRQ_DISABLED was set by CPU1, the condition holds. And now 
> UIO_IRQ_DISABLED is cleared.
> 4) CPU0 executes enable_irq() . IRQ is enabled. "Unbalanced enable for 
> IRQ %d\n" warning is issued.
> 5) irq handler on CPU1 executes disable_irq_nosync(). IRQ is disabled.
> 
> And now we are in situation where IRQ is disabled, but UIO_IRQ_DISABLED is 
> cleared. Bad.
> 
> The above scenario is purely theoretical, I have no means to check it in 
> hardware.
> The (theoretical) solution is to guard UIO_IRQ_DISABLED test and actual irq 
> disabling inside irq handler by priv->lock.
> 
> What do you think about it? Is it worth worrying about?
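
The interleaving quoted above can be replayed deterministically in a toy
userspace model. This is only a sketch of the five steps as described, not
kernel code: irq_model, the model_* helpers and replay_race are invented for
illustration, and 'depth' merely imitates irq_desc->depth.

```c
#include <stdbool.h>

/* Toy model of the state involved; not kernel code. */
struct irq_model {
	bool flag_disabled;	/* models UIO_IRQ_DISABLED */
	int depth;		/* models irq_desc->depth, 0 means enabled */
	int unbalanced;		/* counts "Unbalanced enable" warnings */
};

bool test_and_set(bool *bit)
{
	bool old = *bit;

	*bit = true;
	return old;
}

bool test_and_clear(bool *bit)
{
	bool old = *bit;

	*bit = false;
	return old;
}

void model_enable_irq(struct irq_model *m)
{
	if (m->depth == 0)
		m->unbalanced++;	/* irq was already enabled */
	else
		m->depth--;
}

void model_disable_irq(struct irq_model *m)
{
	m->depth++;
}

/* Replays steps 1 to 5 in the quoted order and returns the final state. */
struct irq_model replay_race(void)
{
	struct irq_model m = { false, 0, 0 };	/* 1) irq enabled */
	bool handler_disables = !test_and_set(&m.flag_disabled);  /* 2) CPU1 */
	bool control_enables = test_and_clear(&m.flag_disabled);  /* 3) CPU0 */

	if (control_enables)
		model_enable_irq(&m);		/* 4) CPU0, unbalanced */
	if (handler_disables)
		model_disable_irq(&m);		/* 5) CPU1 */
	return m;
}
```

In this model the replay ends with the flag clear, depth at 1 and one
unbalanced-enable warning, which matches the end state described in the
quoted text.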

Hi Vitalii,
thanks a lot for analyzing the problem so thoroughly. It made me review
uio_pdrv_genirq.c again, and I noticed several issues and came to the
following conclusions:

1.) priv->lock is completely unnecessary. It is only used in one function,
so there's nothing it could possibly protect.

2.) All these "test_and_clear_bit" and "test_and_set_bit" calls are also
unnecessary. We can simply use enable_irq and disable_irq in both the
irq handler and in uio_pdrv_genirq_irqcontrol.

We should go "back to the roots" and have a look at how UIO works.
The workflow it is intended for is like this:

1.) Hardware is in Reset State (e.g. after boot). Any decent hardware
has its interrupt disabled at that time.

2.) uio_pdrv_genirq is loaded. Kernel enables the irq.

3.) Userspace part of the driver comes up. It will initialize the hardware
(including setting the bits that enable the interrupt).

4.) Userspace will then issue a blocking read() on /dev/uioX. Typically,
there'll be a loop or a thread with this blocking read() at the beginning
(usually using the select() call).

5.) At some time, a hardware interrupt will occur. The irq handler in kernel
space will be called, only to disable the irq. This will also cause the UIO
core to make /dev/uioX readable.

6.) Userspace's blocking read returns. Userspace does its work by
reading/writing device memory.

7.) As the last thing, Userspace writes a "1" to /dev/uioX, which causes
uio_pdrv_genirq_irqcontrol to be called, re-enabling the interrupt.

8.) Goto 4.)

We should also remember that uio_pdrv_genirq_handler() is NOT a real hardware
irq handler. The real handler is in the UIO core, which will increment the
event number and wake up userspace.

So, although your scenario seems to show a subtle race condition, in the
intended workflow there is none. If userspace does stupid things, no harm will
be done to the kernel.
If userspace is designed the way described above (and in the documentation),
it will always wake up with its interrupt disabled, do its work, and then
re-enable the interrupt. You can probably think of a few things userspace
could do to screw things up. But that's not our problem.

Could you hack up a patch for that? I think it should start with removing
uio_pdrv_genirq_platdata->lock and uio_pdrv_genirq_platdata->flags...

Thanks again for your work. What do you 

RE: [RFC][PATCH] pstore: Skip spinlock when just one cpu is online

2012-12-07 Thread Seiji Aguchi
> Can all these things really happen (did you run into this problem on a real 
> system?). Or is this just a theoretical problem.  Ugly (but
> practical) hacks might be OK to solve real problems. 

It is a theoretical problem right now.
But it is a timing issue, so it could actually happen.

> But do we really want them to fix problems that actually never happen?

If we find a problem (even a theoretical one), we can't say "It never
actually happens."

I have some reasons for submitting this patch before actually reproducing the
problem.

1)
When Linux, including pstore, is used in mission-critical systems, it is too
late to fix a problem after it has actually happened, because a failure of
those systems has a great impact on society as a whole.
Customers in this area ask us to fix problems as soon as possible.
On the other hand, this kind of timing issue is hard to reproduce,
so our support service engineers often work all night to reproduce it.
It is a nightmare for us.

If we can fix it with a small patch in advance, it is really helpful for us.

2)
In the long term, I plan to add a kmsg_dump to the kexec path because kdump may
fail in the real world.
In that case, we need another source of troubleshooting information, like
pstore, to find the root cause of the failure.

Actually, someone questioned the reliability of kdump at LinuxCon Europe.
http://events.linuxfoundation.org/images/stories/pdf/lceu2012_holzheu.pdf

To convince the kexec maintainer to add a kmsg_dump, I need to prove that
nothing in the pstore code can cause kdump to fail.

Seiji



Re: question about configfs_attribute

2012-12-07 Thread Joel Becker
On Mon, Oct 15, 2012 at 01:45:56PM +0200, Constantine Shulyupin wrote:
> Hi
> 
> I wonder why show and store methods are not stored inside of
> configfs_attribute but stored in wrapper struct defined with
> CONFIGFS_ATTR_STRUCT ?

If you created a custom attribute struct and methods for
yourself, you would not have to have the show/store methods on the same
structure.  The structure, and the ##item_show/store macros, are a
convenient way to do it.

It certainly could be done differently.  I cloned the sysfs
approach at the time.  Note that sysfs has a different methodology for
defining attributes now.
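
The general shape of the pattern, as a userspace sketch: the generic
attribute is embedded in a typed wrapper that carries the methods, and the
dispatch code gets from one to the other with container_of(). All names
below (generic_attr, my_item, my_item_attr, show_attribute) are stand-ins
invented for illustration; this is not what CONFIGFS_ATTR_STRUCT literally
expands to.

```c
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel macro of the same name. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct generic_attr {		/* plays the role of configfs_attribute */
	const char *name;
};

struct my_item {
	int threshold;
};

struct my_item_attr {		/* typed wrapper carrying the methods */
	struct generic_attr attr;
	int (*show)(const struct my_item *item, char *page, size_t len);
};

int threshold_show(const struct my_item *item, char *page, size_t len)
{
	return snprintf(page, len, "%d\n", item->threshold);
}

struct my_item_attr threshold_attr = {
	.attr = { .name = "threshold" },
	.show = threshold_show,
};

/* Generic dispatch: only the embedded attribute is passed around. */
int show_attribute(const struct my_item *item, struct generic_attr *attr,
		   char *page, size_t len)
{
	struct my_item_attr *wrapper =
		container_of(attr, struct my_item_attr, attr);

	return wrapper->show ? wrapper->show(item, page, len) : -1;
}
```

Keeping show/store out of the generic struct is what lets each wrapper give
them a signature typed for its own item.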

Joel

-- 

Life's Little Instruction Book #252

"Take good care of those you love."

http://www.jlbec.org/
jl...@evilplan.org


Re: [PATCH v2] arch/x86/tools/gen-insn-attr-x86.awk: remove duplicate const

2012-12-07 Thread H. Peter Anvin

On 12/07/2012 03:17 PM, Cong Ding wrote:
> On Fri, Dec 07, 2012 at 03:06:13PM -0800, H. Peter Anvin wrote:
>> On 12/07/2012 03:03 PM, Cong Ding wrote:
>>> On Fri, Dec 07, 2012 at 02:56:16PM -0800, H. Peter Anvin wrote:
>>>> On 12/07/2012 02:49 PM, Cong Ding wrote:
>>>>> On Fri, Dec 07, 2012 at 02:45:43PM -0800, H. Peter Anvin wrote:
>>>>>> Patch description please?
>>>>> there are 2 consts in the definition of one variable
>>>>
>>>> Please put in an actual patch description.  The first line (subject
>>>> line) is a title; the patch should make sense without it.
>>> sorry for that. so like this is fine?
>>
>> Well, except that typically you should explain which variable it is.
>> Yes, it is obvious if you look at the patch, but you're making the
>> reader spend a few more moments than necessary.
>>
>> Also, you should explain what the harm is -- if it breaks anything
>> or is just a cosmetic issue.
> sorry again for lacking of experience...
> and I missed another same error, so send version 2.

And one final complaint (I'll fix this one, but for the future):

git automation wants you to put commentary *after* the patch (after the 
line with three dashes) rather than before.


-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



[PATCH v3 1/1] Input: add driver for Cypress APA I2C Trackpad

2012-12-07 Thread Benson Leung
This patch introduces a driver for Cypress All Points Addressable
I2C Trackpad, including the ones in 2012 Samsung Chromebooks.

This device is compatible with MT protocol type B, providing identifiable
contacts.

Signed-off-by: Dudley Du 
Signed-off-by: Daniel Kurtz 
Signed-off-by: Benson Leung 
---
Version history :
v3 : * Handle pointer emulation and unused slots in core
 * Fix compile fail on 3.7 for input_mt_init_slots (Thanks Rydberg)
 * Set BUTTONPAD property only if is a buttonpad based on capabilities.
 * Removed __devinit/__devexit/__devexit_p
 * Set to OFF power mode on cyapa_remove
 * Got rid of kasprintf'd physical location string for static one.

v2 : * Removed firmware update.
 * Removed sysfs properties related to firmware update and power mode.
 * Folded cyapa_detect into cyapa_probe.
 * Added support for middle and right mechanical buttons, if they exist.
 * Rearranged disable_irq/enable_irq in suspend and resume to prevent
 a power mode change from colliding with a read of tracking data.
 * Made cyapa_get_state more reliable.
 * Use IRQF_ONESHOT for threaded irq
 * Simplified cyapa_set_power_mode.
 * Removed extra kernel-doc style comments
 * Removed dev_dbg messages.
 * Cleaned up unused includes.
 * Cleaned up unused #defines

v1 : Initial
---
 drivers/input/mouse/Kconfig  |   12 +
 drivers/input/mouse/Makefile |1 +
 drivers/input/mouse/cyapa.c  |  803 ++
 3 files changed, 816 insertions(+), 0 deletions(-)
 create mode 100644 drivers/input/mouse/cyapa.c

diff --git a/drivers/input/mouse/Kconfig b/drivers/input/mouse/Kconfig
index cd6268c..23db30a 100644
--- a/drivers/input/mouse/Kconfig
+++ b/drivers/input/mouse/Kconfig
@@ -193,6 +193,18 @@ config MOUSE_BCM5974
  To compile this driver as a module, choose M here: the
  module will be called bcm5974.
 
+config MOUSE_CYAPA
+   tristate "Cypress APA I2C Trackpad support"
+   depends on I2C
+   help
+ This driver adds support for Cypress All Points Addressable (APA)
+ I2C Trackpads, including the ones used in 2012 Samsung Chromebooks.
+
+ Say Y here if you have a Cypress APA I2C Trackpad.
+
+ To compile this driver as a module, choose M here: the module will be
+ called cyapa.
+
 config MOUSE_INPORT
tristate "InPort/MS/ATIXL busmouse"
depends on ISA
diff --git a/drivers/input/mouse/Makefile b/drivers/input/mouse/Makefile
index 46ba755..10b4773 100644
--- a/drivers/input/mouse/Makefile
+++ b/drivers/input/mouse/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_MOUSE_AMIGA)   += amimouse.o
 obj-$(CONFIG_MOUSE_APPLETOUCH) += appletouch.o
 obj-$(CONFIG_MOUSE_ATARI)  += atarimouse.o
 obj-$(CONFIG_MOUSE_BCM5974) += bcm5974.o
+obj-$(CONFIG_MOUSE_CYAPA)  += cyapa.o
 obj-$(CONFIG_MOUSE_GPIO)   += gpio_mouse.o
 obj-$(CONFIG_MOUSE_INPORT) += inport.o
 obj-$(CONFIG_MOUSE_LOGIBM) += logibm.o
diff --git a/drivers/input/mouse/cyapa.c b/drivers/input/mouse/cyapa.c
new file mode 100644
index 000..0964583
--- /dev/null
+++ b/drivers/input/mouse/cyapa.c
@@ -0,0 +1,803 @@
+/*
+ * Cypress APA trackpad with I2C interface
+ *
+ * Author: Dudley Du 
+ * Further cleanup and restructuring by:
+ *   Daniel Kurtz 
+ *   Benson Leung 
+ *
+ * Copyright (C) 2011-2012 Cypress Semiconductor, Inc.
+ * Copyright (C) 2011-2012 Google, Inc.
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file COPYING in the main directory of this archive for
+ * more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* APA trackpad firmware generation */
+#define CYAPA_GEN3   0x03   /* support MT-protocol B with tracking ID. */
+
+#define CYAPA_NAME   "Cypress APA Trackpad (cyapa)"
+
+/* commands for read/write registers of Cypress trackpad */
+#define CYAPA_CMD_SOFT_RESET   0x00
+#define CYAPA_CMD_POWER_MODE   0x01
+#define CYAPA_CMD_DEV_STATUS   0x02
+#define CYAPA_CMD_GROUP_DATA   0x03
+#define CYAPA_CMD_GROUP_CMD    0x04
+#define CYAPA_CMD_GROUP_QUERY  0x05
+#define CYAPA_CMD_BL_STATUS    0x06
+#define CYAPA_CMD_BL_HEAD  0x07
+#define CYAPA_CMD_BL_CMD   0x08
+#define CYAPA_CMD_BL_DATA  0x09
+#define CYAPA_CMD_BL_ALL   0x0a
+#define CYAPA_CMD_BLK_PRODUCT_ID   0x0b
+#define CYAPA_CMD_BLK_HEAD 0x0c
+
+/* report data start reg offset address. */
+#define DATA_REG_START_OFFSET  0x
+
+#define BL_HEAD_OFFSET 0x00
+#define BL_DATA_OFFSET 0x10
+
+/*
+ * Operational Device Status Register
+ *
+ * bit 7: Valid interrupt source
+ * bit 6 - 4: Reserved
+ * bit 3 - 2: Power status
+ * bit 1 - 0: Device status
+ */
+#define REG_OP_STATUS 0x00
+#define OP_STATUS_SRC 0x80
+#define OP_STATUS_POWER   0x0c
+#define OP_STATUS_DEV 0x03
+#define 
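
As an aside on the register layout documented in the comment above, the three
masks split the status byte as sketched below. The mask values are copied
from the patch; struct op_status and decode_op_status are illustrative names
that do not appear in the driver.

```c
#include <stdint.h>

/* Mask values copied from the patch. */
#define OP_STATUS_SRC   0x80
#define OP_STATUS_POWER 0x0c
#define OP_STATUS_DEV   0x03

/* Illustrative decoded view of the register; not part of the driver. */
struct op_status {
	int valid_src;		/* bit 7: valid interrupt source */
	uint8_t power;		/* bits 3-2, shifted down to 0..3 */
	uint8_t dev;		/* bits 1-0 */
};

struct op_status decode_op_status(uint8_t reg)
{
	struct op_status s;

	s.valid_src = (reg & OP_STATUS_SRC) != 0;
	s.power = (reg & OP_STATUS_POWER) >> 2;
	s.dev = reg & OP_STATUS_DEV;
	return s;
}
```

For example, a raw value of 0x8b decodes to a valid source, power status 2
and device status 3.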

[PATCH v3 0/1] Input: add driver for Cypress APA I2C Trackpad

2012-12-07 Thread Benson Leung
Thanks Dmitry, Jean, and Henrik.

Here's V3, which I have confirmed compiles on input-next, and works
well on my Chromebook.

v3 : * Handle pointer emulation and unused slots in core
 * Fix compile fail on 3.7 for input_mt_init_slots (Thanks Rydberg)
 * Set BUTTONPAD property only if is a buttonpad based on capabilities.
 * Removed __devinit/__devexit/__devexit_p
 * Set to OFF power mode on cyapa_remove
 * Got rid of kasprintf'd physical location string for static one.



Re: [PATCH] fs/configfs: allow to create groups on demand

2012-12-07 Thread Joel Becker
On Thu, Nov 29, 2012 at 05:41:23PM +0100, Sebastian Andrzej Siewior wrote:
> This patch adds a function to add a group to an existing one and its
> counterpart. The newly created group behaves as if it were created via
> default_groups[] which means the user can't rmdir it.
> This should be used by the upcoming USB gadget interface in order to
> add the currently available UDCs as a child of the UDC node. The UDC
> itself will appear once the hardware driver is loaded and can appear
> later.

I've now responded to the original thread.  Sorry I didn't notice last
week.  Please see my thoughts there and comment.

Joel

> 
> Signed-off-by: Sebastian Andrzej Siewior 
> ---
>  fs/configfs/dir.c|   63 
> ++
>  include/linux/configfs.h |4 +++
>  2 files changed, 51 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
> index 7414ae2..50ee2bd 100644
> --- a/fs/configfs/dir.c
> +++ b/fs/configfs/dir.c
> @@ -1663,19 +1663,13 @@ const struct file_operations configfs_dir_operations 
> = {
>   .readdir= configfs_readdir,
>  };
>  
> -int configfs_register_subsystem(struct configfs_subsystem *subsys)
> +static int __create_group(struct config_group *group, struct dentry *root)
>  {
>   int err;
> - struct config_group *group = &subsys->su_group;
>   struct qstr name;
>   struct dentry *dentry;
> - struct dentry *root;
>   struct configfs_dirent *sd;
>  
> - root = configfs_pin_fs();
> - if (IS_ERR(root))
> - return PTR_ERR(root);
> -
>   if (!group->cg_item.ci_name)
>   group->cg_item.ci_name = group->cg_item.ci_namebuf;
>  
> @@ -1708,25 +1702,48 @@ int configfs_register_subsystem(struct 
> configfs_subsystem *subsys)
>  
>   mutex_unlock(&root->d_inode->i_mutex);
>  
> - if (err) {
> + if (err)
>   unlink_group(group);
> - configfs_release_fs();
> +
> + return err;
> +}
> +
> +int configfs_create_group(struct config_group *parent, struct config_group 
> *new)
> +{
> + int ret;
> +
> + ret = __create_group(new, parent->cg_item.ci_dentry);
> + if (!ret) {
> + struct configfs_dirent *sd = new->cg_item.ci_dentry->d_fsdata;
> +
> + sd->s_type |= CONFIGFS_USET_DEFAULT;
>   }
>  
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(configfs_create_group);
> +
> +int configfs_register_subsystem(struct configfs_subsystem *subsys)
> +{
> + int err;
> + struct dentry *root;
> +
> + root = configfs_pin_fs();
> + if (IS_ERR(root))
> + return PTR_ERR(root);
> +
> + err = __create_group(&subsys->su_group, root);
> + if (err)
> + configfs_release_fs();
> +
>   return err;
>  }
>  
> -void configfs_unregister_subsystem(struct configfs_subsystem *subsys)
> +void configfs_remove_group(struct config_group *group)
>  {
> - struct config_group *group = &subsys->su_group;
>   struct dentry *dentry = group->cg_item.ci_dentry;
>   struct dentry *root = dentry->d_sb->s_root;
>  
> - if (dentry->d_parent != root) {
> - printk(KERN_ERR "configfs: Tried to unregister 
> non-subsystem!\n");
> - return;
> - }
> -
>   mutex_lock_nested(&root->d_inode->i_mutex,
> I_MUTEX_PARENT);
>   mutex_lock_nested(&dentry->d_inode->i_mutex, I_MUTEX_CHILD);
> @@ -1749,6 +1766,20 @@ void configfs_unregister_subsystem(struct 
> configfs_subsystem *subsys)
>   dput(dentry);
>  
>   unlink_group(group);
> +}
> +EXPORT_SYMBOL_GPL(configfs_remove_group);
> +
> +void configfs_unregister_subsystem(struct configfs_subsystem *subsys)
> +{
> + struct config_group *group = &subsys->su_group;
> + struct dentry *dentry = group->cg_item.ci_dentry;
> + struct dentry *root = dentry->d_sb->s_root;
> +
> + if (dentry->d_parent != root) {
> + pr_err("configfs: Tried to unregister non-subsystem!\n");
> + return;
> + }
> + configfs_remove_group(group);
>   configfs_release_fs();
>  }
>  
> diff --git a/include/linux/configfs.h b/include/linux/configfs.h
> index 34025df..660c25d 100644
> --- a/include/linux/configfs.h
> +++ b/include/linux/configfs.h
> @@ -249,6 +249,10 @@ static inline struct configfs_subsystem 
> *to_configfs_subsystem(struct config_gro
>   NULL;
>  }
>  
> +int configfs_create_group(struct config_group *parent,
> + struct config_group *new);
> +void configfs_remove_group(struct config_group *group);
> +
>  int configfs_register_subsystem(struct configfs_subsystem *subsys);
>  void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
>  
> -- 
> 1.7.10.4
> 

-- 

"Friends may come and go, but enemies accumulate." 
 

Re: [RFC][PATCH] fs: configfs: programmatically create config groups

2012-12-07 Thread Joel Becker
Hey Guys,
Sorry I missed this for a while.  I'll make a couple of inline
comments, and then I'll summarize my (incomplete) thoughts at the
bottom.

On Wed, Nov 28, 2012 at 02:50:13PM +0100, Sebastian Andrzej Siewior wrote:
> On 11/28/2012 02:05 PM, Michal Nazarewicz wrote:
> >>On 11/27/2012 05:23 PM, Michal Nazarewicz wrote:
> >>>How should a generic tool know what kind of actions are needed for given
> >>>function to be removed?  If you ask me, there should be a way to unbind
> >>>gadget and unload all modules without any specific knowledge of the
> >>>functions.  If there is no such mechanism, then it's a bad user
> >>>interface.

Please remember that configfs is not a "user" interface, it's a
userspace<->kernelspace interface.  Like sysfs, it's not required to be
convenient for someone at a bash prompt.  My goal is that it is *usable*
from a bash prompt.  So it must be that you can
create/destroy/configure objects via mkdir/rmdir/cat/echo, but you
might have a lot of those mkdir/echo combos to configure something.
When it comes to the "user" interface, a wrapper script or library
should be converting a user intention into all that boilerplate.

> >
> >On Wed, Nov 28 2012, Sebastian Andrzej Siewior wrote:
> >>Well. You need only to remove the directories you created.
> >
> >My point is that there should be a way to write a script that is unaware
> >of the way function is configured, ie. which directories were created
> >and which were not.

As I stated above, I expect that tools will know which is which.
But having said that, a major goal of configfs is that it is discoverable and
transparent.  So you have to be able to distinguish between default
groups and created directories.  When you rmdir a configfs directory,
EACCES means you don't have permission, ENOTEMPTY means there are children,
EBUSY means there is a depend or a link, and EPERM means it is a default
group.
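
A tool could turn those codes back into something readable along these
lines. This is only a sketch: configfs_rmdir_reason is a name invented here,
and the meanings are simply the ones listed in the paragraph above.

```c
#include <errno.h>

/*
 * Maps the errno left by a failed rmdir() on a configfs directory to the
 * meaning described above. Illustrative helper, not an existing API.
 */
const char *configfs_rmdir_reason(int err)
{
	switch (err) {
	case EACCES:
		return "no permission";
	case ENOTEMPTY:
		return "directory has children";
	case EBUSY:
		return "item has a dependency or a link";
	case EPERM:
		return "default group, cannot be removed";
	default:
		return "other error";
	}
}
```

A cleanup script that walks the tree can use the EPERM case to skip default
groups instead of treating them as failures.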

> I get this. If you recursively rmdir each directory then you clean it
> up.
> 
> >Besides, if you rmdir lun0, is the function still supposed to work with
> >all LUNs present?  In my opinion, while gadget is bound, it should not
> >be possible to modify such things.
> 
> That is correct. The configuration should remain frozen as long as the
> gadget is active because in most cases you can't propagate the change.
> 
> >>An unbind would be  simply an unlink of the gadget  which is linked to
> >>the udc.   All configurations  remain so  you can link  it at  a later
> >>point without touching the configuration because it is as it was.
> >
> >Yes, but that's not my concern.  My concern is that I should be able to
> >put a relatively simple code in my shutdown script (or whatever) which
> >unbinds all gadgets, without knowing what kind of functions are used.
> >
> >And I'm proposing that this could be done by allowing user to just do:
> >
> > cd /cfs/...
> > rmdir gadgets/* # unbind and remove all gadgets
> > rmdir functions/*/* # unbind and remove all function instances
> > rmdir functions/*   # unload all functions
> 
> Yes, you push for simple rmdir API. That would avoid the need for an
> user land tool at some point and you end up in shell scripts.
> I'm not against it but others do have user tools to handle such things.

Yeah, user tools are expected (and should be).

> Anyway, for this to work we have to go through Joel.
> 
> > rmdir udcs/*# unload all UDCs
> 
> No, for this you still have to rmmod :)
> 
> 
> >>>I think the question is of information flow direction.  If user gives
> >>>some information to the kernel, she should be the one creating any
> >>>necessary directories.  But if the information comes from kernel to the
> >>>user, the kernel should create the structure.

This last paragraph actually describes the distinction between
configfs and sysfs.  More specifically, if userspace wants to create an
object in the kernel, it should use configfs.  If the kernel has created
an object on its own, it exposes it via sysfs.  It is a deliberate
non-goal for configfs to replicate sysfs behavior.

[General Thoughts]

First let me restate your problem to see if I understand it.
You want to expose e.g. five LUNs.  They should eventually appear
in configfs as five items to configure (.../{lun0,lun1,...lun4}/).  The
current configfs way to do this is to have your setup script do:

cd /cfg/.../mass_storage
mkdir lun0
echo blah >lun0/attr1
echo blahh >lun0/attr2
mkdir lun1
echo blag >lun1/attr1
echo blagg >lun1/attr2
...

I think the primary concern expressed by Andrzej is that a random user
could come along and say "mkdir lun8", even though the gadget cannot
support it.  A secondary concern from Michal is that userspace has to
run all of those mkdirs.  The thread has described varying solutions.
If these original directories were default_groups, you could
disallow 

[PATCH v2] arch/x86/tools/gen-insn-attr-x86.awk: remove duplicate const

2012-12-07 Thread Cong Ding
On Fri, Dec 07, 2012 at 03:06:13PM -0800, H. Peter Anvin wrote:
> On 12/07/2012 03:03 PM, Cong Ding wrote:
> >On Fri, Dec 07, 2012 at 02:56:16PM -0800, H. Peter Anvin wrote:
> >>On 12/07/2012 02:49 PM, Cong Ding wrote:
> >>>On Fri, Dec 07, 2012 at 02:45:43PM -0800, H. Peter Anvin wrote:
> Patch description please?
> >>>there are 2 consts in the definition of one variable
> >>>
> >>
> >>Please put in an actual patch description.  The first line (subject
> >>line) is a title; the patch should make sense without it.
> >sorry for that. so like this is fine?
> >
> 
> Well, except that typically you should explain which variable it is.
> Yes, it is obvious if you look at the patch, but you're making the
> reader spend a few more moments than necessary.
> 
> Also, you should explain what the harm is -- if it breaks anything
> or is just a cosmetic issue.
sorry again for lacking of experience...
and I missed another same error, so send version 2.

- cong
---
From 6cf729b913287a6fc06325ca75ccf0efff9274e8 Mon Sep 17 00:00:00 2001
From: Cong Ding 
Date: Fri, 7 Dec 2012 23:14:32 +
Subject: [PATCH] arch/x86/tools/gen-insn-attr-x86.awk: remove duplicate const

fix the following sparse warning:
arch/x86/lib/inat-tables.c:1080:25: warning: duplicate const
arch/x86/lib/inat-tables.c:1095:25: warning: duplicate const
arch/x86/lib/inat-tables.c:1118:25: warning: duplicate const

for variable inat_escape_tables, inat_group_tables, and inat_avx_tables

Signed-off-by: Cong Ding 
---
 arch/x86/tools/gen-insn-attr-x86.awk |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/tools/gen-insn-attr-x86.awk b/arch/x86/tools/gen-insn-attr-x86.awk
index ddcf39b..987c7b2 100644
--- a/arch/x86/tools/gen-insn-attr-x86.awk
+++ b/arch/x86/tools/gen-insn-attr-x86.awk
@@ -356,7 +356,7 @@ END {
exit 1
# print escape opcode map's array
print "/* Escape opcode map array */"
-   print "const insn_attr_t const *inat_escape_tables[INAT_ESC_MAX + 1]" \
+   print "const insn_attr_t *inat_escape_tables[INAT_ESC_MAX + 1]" \
  "[INAT_LSTPFX_MAX + 1] = {"
for (i = 0; i < geid; i++)
for (j = 0; j < max_lprefix; j++)
@@ -365,7 +365,7 @@ END {
print "};\n"
# print group opcode map's array
print "/* Group opcode map array */"
-   print "const insn_attr_t const *inat_group_tables[INAT_GRP_MAX + 1]"\
+   print "const insn_attr_t *inat_group_tables[INAT_GRP_MAX + 1]"\
  "[INAT_LSTPFX_MAX + 1] = {"
for (i = 0; i < ggid; i++)
for (j = 0; j < max_lprefix; j++)
@@ -374,7 +374,7 @@ END {
print "};\n"
# print AVX opcode map's array
print "/* AVX opcode map array */"
-   print "const insn_attr_t const *inat_avx_tables[X86_VEX_M_MAX + 1]"\
+   print "const insn_attr_t *inat_avx_tables[X86_VEX_M_MAX + 1]"\
  "[INAT_LSTPFX_MAX + 1] = {"
for (i = 0; i < gaid; i++)
for (j = 0; j < max_lprefix; j++)
-- 
1.7.4.5
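
For anyone wondering what the warning is about: in C, "const T const *p" says
the same thing twice (the pointed-to data is const), which is different from
"const T * const p" (the pointer itself is const too). A small standalone
sketch of the difference; the typedef is a stand-in and the table names are
made up, only the qualifier placement matters here.

```c
typedef unsigned int insn_attr_t;	/* stand-in for the kernel typedef */

const insn_attr_t table_a[2] = { 1, 2 };
const insn_attr_t table_b[2] = { 3, 4 };

/*
 * "const T *p": the data is read-only, the pointer is not.
 * "const T const *p" would mean exactly the same; the second const is the
 * duplicate that sparse reports.
 */
const insn_attr_t *p = table_a;

/* "const T * const q": the pointer itself is read-only as well. */
const insn_attr_t * const q = table_a;

insn_attr_t retarget_and_read(void)
{
	p = table_b;		/* allowed: p itself is not const */
	/* q = table_b; */	/* would not compile: q is const */
	return p[0] + q[0];	/* 3 + 1 */
}
```

The patch above only drops the duplicate; it makes no statement about whether
the table pointers themselves should also be const.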



Re: [PATCH v3 3/4] vxlan: capture inner headers during encapsulation

2012-12-07 Thread Jeff Kirsher
On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> Allow VXLAN to make use of Tx checksum offloading and Tx scatter-gather.
> The advantage to these two changes is that it also allows the VXLAN to
> make use of GSO.
>
> Signed-off-by: Joseph Gasparakis 
> Signed-off-by: Peter P Waskiewicz Jr 
> Signed-off-by: Alexander Duyck 
> ---
>  drivers/net/vxlan.c |   10 +-
>  1 files changed, 9 insertions(+), 1 deletions(-)
Acked-by: Jeff Kirsher 





Re: [PATCH 2/2] ARM: net: bpf_jit_32: fix sp-relative load/stores offsets.

2012-12-07 Thread Mircea Gherzan
Am 06.12.2012 15:38, schrieb Nicolas Schichan:
> The offset must be multiplied by 4 to be sure to access the correct
> 32bit word in the stack scratch space.
> 
> For instance, a store at scratch memory cell #1 was generating the
> following:
> 
>   str r4, [sp, #1]
> 
> While the correct code for this is:
> 
>   str r4, [sp, #4]
> 
> To reproduce the bug (assuming your system has a NIC with the mac
> address 52:54:00:12:34:56):
> 
> echo 0 > /proc/sys/net/core/bpf_jit_enable
> tcpdump -ni eth0 "ether[1] + ether[2] - ether[3] * ether[4] - ether[5] \
>   == -0x3AA" # this will capture packets as expected
> 
> echo 1 > /proc/sys/net/core/bpf_jit_enable
> tcpdump -ni eth0 "ether[1] + ether[2] - ether[3] * ether[4] - ether[5] \
>   == -0x3AA" # this will not.
> 
> This bug was present since the original inclusion of bpf_jit for ARM
> (ddecdfce: ARM: 7259/3: net: JIT compiler for packet filters).
> 
> Signed-off-by: Nicolas Schichan 
> ---
>  arch/arm/net/bpf_jit_32.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
> index a64d349..b6f305e 100644
> --- a/arch/arm/net/bpf_jit_32.c
> +++ b/arch/arm/net/bpf_jit_32.c
> @@ -42,7 +42,7 @@
>  #define r_skb_hl ARM_R8
>  
>  #define SCRATCH_SP_OFFSET0
> -#define SCRATCH_OFF(k)   (SCRATCH_SP_OFFSET + (k))
> +#define SCRATCH_OFF(k)   (SCRATCH_SP_OFFSET + 4 * (k))
>  
>  #define SEEN_MEM ((1 << BPF_MEMWORDS) - 1)
>  #define SEEN_MEM_WORD(k) (1 << (k))

Acked-by: Mircea Gherzan 
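
The effect of the missing multiplication can be reproduced in userspace. This
is a minimal sketch, assuming a byte array as a stand-in for the stack scratch
area; the helper names and the _OLD/_NEW macro suffixes are made up for the
illustration. With the old macro, cell #1 starts at byte 1 and overlaps cell #0.

```c
#include <stdint.h>
#include <string.h>

#define SCRATCH_SP_OFFSET	0
#define SCRATCH_OFF_OLD(k)	(SCRATCH_SP_OFFSET + (k))	/* before the fix */
#define SCRATCH_OFF_NEW(k)	(SCRATCH_SP_OFFSET + 4 * (k))	/* after the fix */

/* Hypothetical stand-in for the scratch area: 16 cells of 32 bits each. */
static uint8_t scratch[64];

/* Store a 32-bit word at a byte offset, like "str rX, [sp, #off]". */
void store_word(unsigned int byte_off, uint32_t v)
{
	memcpy(&scratch[byte_off], &v, sizeof(v));
}

/* Load a 32-bit word from a byte offset, like "ldr rX, [sp, #off]". */
uint32_t load_word(unsigned int byte_off)
{
	uint32_t v;

	memcpy(&v, &scratch[byte_off], sizeof(v));
	return v;
}

/* Write cell 0, then cell 1, and report whether cell 0 kept its value. */
int cell0_survives(unsigned int off0, unsigned int off1)
{
	store_word(off0, 0x11111111u);
	store_word(off1, 0x22222222u);
	return load_word(off0) == 0x11111111u;
}
```

Calling cell0_survives() with the _NEW offsets keeps cell 0 intact; with the
_OLD offsets the second store overwrites three bytes of it.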


Re: [PATCH v3 2/4] net: Handle encapsulated offloads before fragmentation or handing to lower dev

2012-12-07 Thread Jeff Kirsher
On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> From: Alexander Duyck 
>
> This change allows the VXLAN to enable Tx checksum offloading even on
> devices that do not support encapsulated checksum offloads. The
> advantage to this is that it allows for the lower device to change due
> to routing table changes without impacting features on the VXLAN itself.
>
> Signed-off-by: Alexander Duyck 
> ---
>  net/core/dev.c   |   15 +--
>  net/ipv4/ip_output.c |4 
>  2 files changed, 17 insertions(+), 2 deletions(-)
Acked-by: Jeff Kirsher 





Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation

2012-12-07 Thread Jeff Kirsher
On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> This patch adds support in the kernel for offloading in the NIC Tx and Rx
> checksumming for encapsulated packets (such as VXLAN and IP GRE).
>
> Signed-off-by: Joseph Gasparakis 
> Signed-off-by: Peter P Waskiewicz Jr 
> Signed-off-by: Alexander Duyck 
> ---
>  include/linux/ip.h|5 ++
>  include/linux/ipv6.h  |5 ++
>  include/linux/netdevice.h |2 +
>  include/linux/skbuff.h|   90 
> -
>  include/linux/tcp.h   |   10 +
>  include/linux/udp.h   |5 ++
>  net/core/skbuff.c |9 
>  7 files changed, 125 insertions(+), 1 deletions(-)
Acked-by: Jeff Kirsher 





Re: [PATCH net-next v3 0/3] Multiqueue support in virtio-net

2012-12-07 Thread Stephen Hemminger
On Fri, 07 Dec 2012 15:35:56 -0500 (EST)
David Miller  wrote:

> From: Jason Wang 
> Date: Sat,  8 Dec 2012 01:04:54 +0800
> 
> > This series is an updated version (hopefully the final one) of multiqueue
> > (VIRTIO_NET_F_MQ) support in the virtio-net driver. All previous comments
> > were addressed. The work is based on Krishna Kumar's work to let virtio-net
> > use multiple rx/tx queues for packet reception and transmission.
> > Performance tests show that the aggregate latency was increased greatly,
> > but there may be some regression in small packet transmission. Because of
> > this, multiqueue is disabled by default. If a user wants to benefit from
> > multiqueue, ethtool -L can be used to enable the feature.
> > 
> > Please review and comment.
> > 
> > A prototype implementation of qemu-kvm support can be found in
> > git://github.com/jasowang/qemu-kvm-mq.git. To start a guest with two
> > queues, you can specify the queues parameter to both tap and virtio-net
> > like:
> > 
> > ./qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,queues=2,...
> > 
> > then enable the multiqueue through ethtool by:
> > 
> > ethtool -L eth0 combined 2
> 
> It seems like most, if not all, of the feedback given for this series
> has been addressed by Jason.
> 
> Can I get some ACKs?

Other than the minor style nit in the first patch, I see no issues.
This is really needed by Virtual Routers.

Acked-by: Stephen Hemminger 



Re: [PATCH net-next v3 1/3] virtio-net: separate fields of sending/receiving queue from virtnet_info

2012-12-07 Thread Stephen Hemminger
Minor style issue reported by checkpatch which can be fixed after merge.
Although sizeof is actually an operator in C, it is considered correct
style to treat it as a function.


WARNING: sizeof hdr->hdr should be sizeof(hdr->hdr)
#293: FILE: drivers/net/virtio_net.c:395:
+   sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);

WARNING: sizeof hdr->mhdr should be sizeof(hdr->mhdr)
#552: FILE: drivers/net/virtio_net.c:641:
+   sg_set_buf(sq->sg, &hdr->mhdr, sizeof hdr->mhdr);

WARNING: sizeof hdr->hdr should be sizeof(hdr->hdr)
#555: FILE: drivers/net/virtio_net.c:643:
+   sg_set_buf(sq->sg, &hdr->hdr, sizeof hdr->hdr);
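
To illustrate the point about sizeof being an operator: both spellings flagged
by checkpatch compile to the same thing when the operand is an expression, and
parentheses are only required by the language when the operand is a type name.
A small sketch follows; struct fake_hdr and its field sizes are invented for
the example and are not the virtio_net header layout.

```c
#include <stddef.h>

/* Hypothetical layout, for illustration only. */
struct fake_hdr {
	unsigned char hdr[10];
	unsigned char mhdr[12];
};

/* Operator form: sizeof applied directly to an expression. */
size_t size_operator_form(const struct fake_hdr *h)
{
	return sizeof h->hdr;
}

/* Function-like form preferred by kernel style; same operator, same result. */
size_t size_function_form(const struct fake_hdr *h)
{
	return sizeof(h->hdr);
}

/* With a type name the parentheses are mandatory. */
size_t size_of_type(void)
{
	return sizeof(struct fake_hdr);
}
```

The operand of sizeof is not evaluated in either form, so the choice is purely
one of style.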


[PATCH] gpio/gpio-generic: Add OF bindings

2012-12-07 Thread Jason Gunthorpe
Allow the platform driver to bind through OF. The existing
OF machinery for setting the resource names through OF is used to
configure the device, so the change is minimally intrusive and
fully featured.

Signed-off-by: Jason Gunthorpe 
---
 .../devicetree/bindings/gpio/gpio-generic.txt  |   28 
 drivers/gpio/gpio-generic.c|   18 -
 2 files changed, 45 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/gpio/gpio-generic.txt

diff --git a/Documentation/devicetree/bindings/gpio/gpio-generic.txt b/Documentation/devicetree/bindings/gpio/gpio-generic.txt
new file mode 100644
index 000..12b4989
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/gpio-generic.txt
@@ -0,0 +1,28 @@
+* General purpose MMIO GPIO controller
+
+Required properties:
+- compatible: "linux,basic-mmio-gpio" or "linux,basic-mmio-gpio-be",
+  the choice determines which bit is considered GPIO #0
+- reg and reg-names: An array of named register ranges describing the windows,
+  in one of these combinations:
+   * 'dat' - Single input/output data register.
+   * 'dat', 'set' and 'clr' - 'dat' is the input and drive 1 writes high to 'set'
+  and drive 0 writes high to 'clr'
+   * 'dat' and 'set' - 'dat' is the input and drive 1 write high to 'set' and
+   drive 0 writes low to set
+  Additionally one of these may be specified:
+   * dirout - Write 1 to set as output, 0 to set as input
+   * dirin - Write 1 to set as input, 0 to set as output
+
+  The size of the registers should be 1, 4 or 8.
+- #gpio-cells: Should be two.
+- gpio-controller: Marks the device node as a GPIO controller.
+
+Example:
+   gpio0: gpio@8 {
+   #gpio-cells = <2>;
+   compatible = "linux,basic-mmio-gpio";
+   gpio-controller;
+   reg-names = "dat", "set", "dirin";
+   reg = <0x8 4>, <0xc 4>, <0x10 4>;
+   };
diff --git a/drivers/gpio/gpio-generic.c b/drivers/gpio/gpio-generic.c
index 82e2e4f..f71a917 100644
--- a/drivers/gpio/gpio-generic.c
+++ b/drivers/gpio/gpio-generic.c
@@ -458,6 +458,7 @@ static int __devinit bgpio_pdev_probe(struct platform_device *pdev)
int err;
struct bgpio_chip *bgc;
struct bgpio_pdata *pdata = dev_get_platdata(dev);
+   const char *name;
 
r = platform_get_resource_byname(pdev, IORESOURCE_MEM, "dat");
if (!r)
@@ -485,7 +486,13 @@ static int __devinit bgpio_pdev_probe(struct platform_device *pdev)
if (err)
return err;
 
-   if (!strcmp(platform_get_device_id(pdev)->name, "basic-mmio-gpio-be"))
+   name = platform_get_device_id(pdev)->name;
+   if (name && !strcmp(name, "basic-mmio-gpio-be"))
+   flags |= BGPIOF_BIG_ENDIAN;
+
+   if (pdev->dev.of_node &&
+   of_device_is_compatible(pdev->dev.of_node,
+   "linux,basic-mmio-gpio-be"))
flags |= BGPIOF_BIG_ENDIAN;
 
bgc = devm_kzalloc(&pdev->dev, sizeof(*bgc), GFP_KERNEL);
@@ -521,9 +528,18 @@ static const struct platform_device_id bgpio_id_table[] = {
 };
 MODULE_DEVICE_TABLE(platform, bgpio_id_table);
 
+static const struct of_device_id bgpio_ofid_table[] __devinitdata = {
+   {.compatible = "linux,basic-mmio-gpio"},
+   {.compatible = "linux,basic-mmio-gpio-be"},
+   {},
+};
+MODULE_DEVICE_TABLE(of, bgpio_ofid_table);
+
 static struct platform_driver bgpio_driver = {
.driver = {
.name = "basic-mmio-gpio",
+   .owner = THIS_MODULE,
+   .of_match_table = of_match_ptr(bgpio_ofid_table),
},
.id_table = bgpio_id_table,
.probe = bgpio_pdev_probe,
-- 
1.7.5.4
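
The register semantics described in the binding text can be modelled in a few
lines. This is a sketch only, assuming that the "-be" compatible numbers GPIO
#0 from the most significant bit and that 'set'/'clr' act on the bits written
as 1; the struct and function names are hypothetical and are not the driver's.

```c
#include <stdint.h>

/* Hypothetical software model of a 'dat'/'set'/'clr' register bank. */
struct fake_gpio_bank {
	uint32_t dat;		/* current pin levels, as read from 'dat' */
	unsigned int bits;	/* register width in bits */
};

/* Default numbering: GPIO #0 is the least significant bit. */
uint32_t pin2mask(unsigned int pin)
{
	return (uint32_t)1 << pin;
}

/* Assumed "-be" numbering: GPIO #0 is the most significant bit. */
uint32_t pin2mask_be(unsigned int bits, unsigned int pin)
{
	return (uint32_t)1 << (bits - 1 - pin);
}

/* Writing a 1 bit to 'set' drives the corresponding pin high. */
void write_set(struct fake_gpio_bank *b, uint32_t v)
{
	b->dat |= v;
}

/* Writing a 1 bit to 'clr' drives the corresponding pin low. */
void write_clr(struct fake_gpio_bank *b, uint32_t v)
{
	b->dat &= ~v;
}
```

For a 32-bit bank, GPIO #0 maps to 0x00000001 under the default compatible and
to 0x80000000 under the assumed big-endian numbering.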



Re: [PATCH] arch/x86/tools/gen-insn-attr-x86.awk: remove duplicate const

2012-12-07 Thread H. Peter Anvin

On 12/07/2012 03:03 PM, Cong Ding wrote:
> On Fri, Dec 07, 2012 at 02:56:16PM -0800, H. Peter Anvin wrote:
>> On 12/07/2012 02:49 PM, Cong Ding wrote:
>>> On Fri, Dec 07, 2012 at 02:45:43PM -0800, H. Peter Anvin wrote:
>>>> Patch description please?
>>> there are 2 consts in the definition of one variable
>>
>> Please put in an actual patch description.  The first line (subject
>> line) is a title; the patch should make sense without it.
> sorry for that. so like this is fine?

Well, except that typically you should explain which variable it is.
Yes, it is obvious if you look at the patch, but you're making the
reader spend a few more moments than necessary.

Also, you should explain what the harm is -- if it breaks anything or is
just a cosmetic issue.


-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH 000/493] remove CONFIG_HOTPLUG as an option

2012-12-07 Thread Grant Likely
On Fri, Dec 7, 2012 at 5:16 PM, Greg KH  wrote:
> On Fri, Dec 07, 2012 at 01:47:48PM +, Grant Likely wrote:
>> On Wed, 5 Dec 2012 16:39:23 -0800, Greg KH  
>> wrote:
>> > On Thu, Dec 06, 2012 at 12:27:42AM +, Grant Likely wrote:
>> > > On Wed, 21 Nov 2012 20:07:23 -0500, wf...@viridian.itc.virginia.edu 
>> > > (Bill Pemberton) wrote:
>> > > > Grant Likely writes:
>> > > > >
>> > > > > You mean this series wasn't created with a script? You did this by
>> > > > > hand? If so then I must say kudos on your dedication!
>> > > > >
>> > > > > But it makes me more nervous about the series. Too easy to fat
>> > > > > finger many things when touching that many files.
>> > > > >
>> > > >
>> > > > No, I didn't do them by hand, it was a script.  Originally, it was a
>> > > > couple, all basically the same, but removing each __dev*.  Then I'd do
>> > > > a word diff to eyeball them to make sure the script didn't do
>> > > > something goofy.
>> > > >
>> > > > The whack-a-mole part came along because I was working against
>> > > > linux-next and whatever patch series was right for one day wouldn't be
>> > > > right for the next day because of some of the faster moving trees.
>> > > >
>> > > >
>> > > > > Please do write a script and post that for review.
>> > > > >
>> > > >
>> > > > The all-in-one version of the script:
>> > > >
>> > > > #! /usr/bin/perl
>> > > >
>> > > > use strict;
>> > > > use IO::InSitu;
>> > > >
>> > > > sub processfile
>> > > > {
>> > > > my $fn = shift;
>> > > >
>> > > > my ($in, $out) = open_rw($fn, $fn);
>> > > >
>> > > > while (<$in>) {
>> > > > s|__devexit_p\(([^)]+)\)|$1|;
>> > > > s|\s__devexit\b||;
>> > > > s|\s__devinitconst\b||;
>> > > > s|\s__devinitdata\b||;
>> > > > s|\s__devinit\b||;
>> > >
>> > > Pretty straight forward, and works against the files I tried.  :-)
>> > >
>> > > Greg, I'd much rather see the change applied all at once in this manner.
>> > > If that isn't possible, then at the least I'll use the script against
>> > > the code that I maintain and push the result out to Linus.
>> >
>> > Given that there are a lot of patches already in linux-next from Bill
>> > due to this work, I'm not going to do this for all files right now,
>> > sorry.
>> >
>> > But, if you want to use this for the files you maintain and push that
>> > out for 3.8-rc1, that would be great.  I'll be walking the tree after
>> > 3.8-rc1 is out to catch the stragglers with a script like this.
>>
>> Okay. Can you drop any commits you have against drivers/{spi,gpio,of}?
>
> Hm, I only applied the gpio ones to my tree, you got an email when that
> happened.  I didn't apply the spi or of ones.
>
>> Or are they in a tree that you will not rebase?
>
> They are in my driver-core.git tree, the driver-core-next branch, which
> will not be rebased, and has been in linux-next for a while now.
>
> I can revert the 5 gpio patches if you want me to, just let me know.

no. Don't revert. I won't have the gpio changes in my tree, but I'll
do spi since they aren't in your tree yet.

g.


--
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.


Re: [PATCH 1/2] ARM: net: bpf_jit_32: fix kzalloc gfp/size mismatch.

2012-12-07 Thread Mircea Gherzan
Am 06.12.2012 15:38, schrieb Nicolas Schichan:
> Official prototype for kzalloc is:
> 
> void *kzalloc(size_t, gfp_t);
> 
> The ARM bpf_jit code was having the assumption that it was:
> 
> void *kzalloc(gfp_t, size);
> 
> This resulted in the use of some random GFP flags depending on the
> size requested and some random overflows once the really needed size
> was more than the value of GFP_KERNEL.
> 
> This bug was present since the original inclusion of bpf_jit for ARM
> (ddecdfce: ARM: 7259/3: net: JIT compiler for packet filters).
> 
> Signed-off-by: Nicolas Schichan 
> ---
>  arch/arm/net/bpf_jit_32.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
> index c641fb6..a64d349 100644
> --- a/arch/arm/net/bpf_jit_32.c
> +++ b/arch/arm/net/bpf_jit_32.c
> @@ -845,7 +845,7 @@ void bpf_jit_compile(struct sk_filter *fp)
>   ctx.skf = fp;
>   ctx.ret0_fp_idx = -1;
>  
> - ctx.offsets = kzalloc(GFP_KERNEL, 4 * (ctx.skf->len + 1));
> + ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
>   if (ctx.offsets == NULL)
>   return;
>  
> @@ -864,7 +864,7 @@ void bpf_jit_compile(struct sk_filter *fp)
>  
>   ctx.idx += ctx.imm_count;
>   if (ctx.imm_count) {
> - ctx.imms = kzalloc(GFP_KERNEL, 4 * ctx.imm_count);
> + ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
>   if (ctx.imms == NULL)
>   goto out;
>   }

Acked-by: Mircea Gherzan 
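
The reason the swapped call compiled at all is that both parameters are
integer types as far as the C compiler is concerned. A userspace sketch of the
failure mode follows; fake_kzalloc() is a stand-in built on calloc(), and
FAKE_GFP_KERNEL is an illustrative value chosen for the example, not taken
from the kernel headers.

```c
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int gfp_t;			/* stand-in for the kernel type */
#define FAKE_GFP_KERNEL	((gfp_t)0xd0u)		/* illustrative value only */

static size_t last_size;	/* size requested by the most recent call */

/* Userspace stand-in with the argument order of kzalloc(size, flags). */
void *fake_kzalloc(size_t size, gfp_t flags)
{
	(void)flags;
	last_size = size;
	return calloc(1, size);
}

/* Correct order: requests 4 * (len + 1) bytes. */
size_t alloc_correct(size_t len)
{
	void *p = fake_kzalloc(4 * (len + 1), FAKE_GFP_KERNEL);

	free(p);
	return last_size;
}

/* Swapped order: the flag value becomes the allocation size. */
size_t alloc_swapped(size_t len)
{
	void *p = fake_kzalloc(FAKE_GFP_KERNEL, 4 * (len + 1));

	free(p);
	return last_size;
}
```

With the swapped order the buffer size no longer depends on the filter length,
so any filter needing more bytes than the flag value writes past the end of
the allocation.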


Re: [PATCH] arch/x86/tools/gen-insn-attr-x86.awk: remove duplicate const

2012-12-07 Thread Cong Ding
On Fri, Dec 07, 2012 at 02:56:16PM -0800, H. Peter Anvin wrote:
> On 12/07/2012 02:49 PM, Cong Ding wrote:
> >On Fri, Dec 07, 2012 at 02:45:43PM -0800, H. Peter Anvin wrote:
> >>Patch description please?
> >there are 2 consts in the definition of one variable
> >
> 
> Please put in an actual patch description.  The first line (subject
> line) is a title; the patch should make sense without it.
sorry for that. so like this is fine?

-cong

---
>From 1abfab824ed2dc0af6e283ea0b7a6c45541d4fd1 Mon Sep 17 00:00:00 2001
From: Cong Ding 
Date: Fri, 7 Dec 2012 22:41:09 +
Subject: [PATCH] arch/x86/tools/gen-insn-attr-x86.awk: remove duplicate const

there are two const in the definition of one variable, we should delete one.

Signed-off-by: Cong Ding 
---
 arch/x86/tools/gen-insn-attr-x86.awk |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/tools/gen-insn-attr-x86.awk b/arch/x86/tools/gen-insn-attr-x86.awk
index ddcf39b..d1d9cfa 100644
--- a/arch/x86/tools/gen-insn-attr-x86.awk
+++ b/arch/x86/tools/gen-insn-attr-x86.awk
@@ -356,7 +356,7 @@ END {
exit 1
# print escape opcode map's array
print "/* Escape opcode map array */"
-   print "const insn_attr_t const *inat_escape_tables[INAT_ESC_MAX + 1]" \
+   print "const insn_attr_t *inat_escape_tables[INAT_ESC_MAX + 1]" \
  "[INAT_LSTPFX_MAX + 1] = {"
for (i = 0; i < geid; i++)
for (j = 0; j < max_lprefix; j++)
@@ -365,7 +365,7 @@ END {
print "};\n"
# print group opcode map's array
print "/* Group opcode map array */"
-   print "const insn_attr_t const *inat_group_tables[INAT_GRP_MAX + 1]"\
+   print "const insn_attr_t *inat_group_tables[INAT_GRP_MAX + 1]"\
  "[INAT_LSTPFX_MAX + 1] = {"
for (i = 0; i < ggid; i++)
for (j = 0; j < max_lprefix; j++)
-- 
1.7.4.5



Re: [PATCH] Debugging: Keep track of page owners

2012-12-07 Thread Dave Hansen
On 12/07/2012 02:44 PM, Andrew Morton wrote:
> AFAICT that difference was undescribed.  I can see that the new version
> uses the stack-tracing infrastructure, but the change to
> pagetypeinfo_showmixedcount_print() is a mystery.

Ahhh, I assume you're talking about this hunk:

>> @@ -976,10 +976,7 @@ static void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>>  
>> pagetype = allocflags_to_migratetype(page->gfp_mask);
>> if (pagetype != mtype) {
>> -   if (is_migrate_cma(pagetype))
>> -   count[MIGRATE_MOVABLE]++;
>> -   else
>> -   count[mtype]++;
>> +   count[mtype]++;
>> break;
>> }

That was to fix the comment that Laura Abbott made about it miscounting
MIGRATE_CMA pages.

My patch-sending scripts were choking a bit on the text description in
your patch.  I'm using a long-ago-forked copy of your patch-utils and
the DESC/EDESC in the patch I imported is giving them fits when I send
via email and stripping large parts of the description.  I'm happy to
resend via email, too, but here is the raw patch (with the full description):

https://www.sr71.net/~dave/linux/pageowner.patch

The important description that the scripts managed to strip out when
emailed was this:

Updated 12/4/2012 - should apply to 3.7 kernels.  I did a quick
sniff-test to make sure that this boots and produces some sane
output, but it's not been exhaustively tested.

 * Moved file over to debugfs (no reason to keep polluting /proc)
 * Now using generic stack tracking infrastructure
 * Added check for MIGRATE_CMA pages to explicitly count them
   as movable.

The new snprint_stack_trace() probably belongs in its own patch
if this were to get merged, but it won't kill anyone as it stands.

-



Re: [PATCH] arch/x86/tools/gen-insn-attr-x86.awk: remove duplicate const

2012-12-07 Thread H. Peter Anvin

On 12/07/2012 02:49 PM, Cong Ding wrote:
> On Fri, Dec 07, 2012 at 02:45:43PM -0800, H. Peter Anvin wrote:
>> Patch description please?
> there are 2 consts in the definition of one variable

Please put in an actual patch description.  The first line (subject
line) is a title; the patch should make sense without it.


-hpa



--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



[PATCH] ARM: Orion: Hoist bridge interrupt handling out of the timer

2012-12-07 Thread Jason Gunthorpe
The intent of this patch is to expose the other bridge cause
interrupts to users in the kernel.

- Add orion_bridge_irq_init to create a new edge triggered interrupt
  chip based on the bridge cause register
- Remove all interrupt register code from time.c and use normal
  interrupt functions instead
- Update the machines that use orion_time_init to call
  orion_bridge_irq_init and use the new signature

Tested on a Kirkwood platform.

NOTE: I'm skeptical that MV78xx0 has a bridge interrupt cause/mask
register. However, it was setup so the timer code would touch those
registers, so I've preserved that. Unfortunately prior to this patch
the 'bridge cause register' was only written to, never read. If it is
wired-to-zero on MV78xx0 because it doesn't exist then the timer will
fail to function. The fix is easy, but I need someone with the
manual/system to tell me which is right.

Signed-off-by: Jason Gunthorpe 
---
 arch/arm/mach-dove/common.c   |5 +-
 arch/arm/mach-dove/include/mach/bridge-regs.h |2 +-
 arch/arm/mach-dove/include/mach/irqs.h|9 +++-
 arch/arm/mach-kirkwood/common.c   |6 ++-
 arch/arm/mach-kirkwood/include/mach/bridge-regs.h |2 -
 arch/arm/mach-kirkwood/include/mach/irqs.h|   14 +-
 arch/arm/mach-mv78xx0/common.c|8 +++-
 arch/arm/mach-mv78xx0/include/mach/bridge-regs.h  |2 +-
 arch/arm/mach-mv78xx0/include/mach/irqs.h |9 +++-
 arch/arm/mach-orion5x/common.c|6 ++-
 arch/arm/mach-orion5x/include/mach/bridge-regs.h  |2 -
 arch/arm/mach-orion5x/include/mach/irqs.h |9 +++-
 arch/arm/plat-orion/include/plat/irq.h|3 +
 arch/arm/plat-orion/include/plat/time.h   |4 +-
 arch/arm/plat-orion/irq.c |   51 +
 arch/arm/plat-orion/time.c|   46 ---
 16 files changed, 120 insertions(+), 58 deletions(-)

My immediate need is to have the kernel panic on watchdog timer
expiry, but I also think this might be the first step to clean up this
item:

/*
 * Disable propagation of mbus errors to the CPU local bus,
 * as this causes mbus errors (which can occur for example
 * for PCI aborts) to throw CPU aborts, which we're not set
 * up to deal with.
 */
writel(readl(CPU_CONFIG) & ~CPU_CONFIG_ERROR_PROP, CPU_CONFIG);

Regards,
Jason
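
As a rough picture of what a chained handler for the bridge cause register
does, here is a sketch under stated assumptions: one cause bit per interrupt,
a mask register where 1 means enabled, and acknowledgement by writing 0 to the
handled bit (suggested by the old BRIDGE_INT_TIMER1_CLR value of ~0x0004). The
names are hypothetical and this is not the code in plat-orion/irq.c.

```c
#include <stdint.h>

#define FAKE_BRIDGE_IRQ_START	64	/* mirrors IRQ_DOVE_BRIDGE_START */
#define FAKE_NR_BRIDGE_IRQS	6	/* mirrors NR_BRIDGE_IRQS */

/*
 * Translate pending, unmasked cause bits into Linux IRQ numbers.
 * Stores the numbers in out[] and returns how many were found.
 */
unsigned int demux_bridge_cause(uint32_t cause, uint32_t mask,
				unsigned int out[FAKE_NR_BRIDGE_IRQS])
{
	uint32_t pending = cause & mask;
	unsigned int n = 0;
	unsigned int bit;

	for (bit = 0; bit < FAKE_NR_BRIDGE_IRQS; bit++)
		if (pending & ((uint32_t)1 << bit))
			out[n++] = FAKE_BRIDGE_IRQ_START + bit;
	return n;
}

/* Value written to the cause register to acknowledge a single bit. */
uint32_t ack_value(unsigned int bit)
{
	return ~((uint32_t)1 << bit);
}
```

With these assumptions the timer interrupt (bit 2) becomes IRQ 66 on Dove,
which matches IRQ_DOVE_BRIDGE_TIMER1 in the diff below.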

diff --git a/arch/arm/mach-dove/common.c b/arch/arm/mach-dove/common.c
index f723fe1..9107808 100644
--- a/arch/arm/mach-dove/common.c
+++ b/arch/arm/mach-dove/common.c
@@ -243,8 +243,9 @@ static int __init dove_find_tclk(void)
 static void __init dove_timer_init(void)
 {
dove_tclk = dove_find_tclk();
-   orion_time_init(BRIDGE_VIRT_BASE, BRIDGE_INT_TIMER1_CLR,
-   IRQ_DOVE_BRIDGE, dove_tclk);
+   orion_bridge_irq_init(IRQ_DOVE_BRIDGE, IRQ_DOVE_BRIDGE_START,
+ BRIDGE_CAUSE);
+   orion_time_init(IRQ_DOVE_BRIDGE_TIMER1, dove_tclk);
 }
 
 struct sys_timer dove_timer = {
diff --git a/arch/arm/mach-dove/include/mach/bridge-regs.h b/arch/arm/mach-dove/include/mach/bridge-regs.h
index 99f259e..3bd4656 100644
--- a/arch/arm/mach-dove/include/mach/bridge-regs.h
+++ b/arch/arm/mach-dove/include/mach/bridge-regs.h
@@ -26,7 +26,7 @@
 #define SYSTEM_SOFT_RESET  (BRIDGE_VIRT_BASE + 0x010c)
 #define  SOFT_RESET0x0001
 
-#define  BRIDGE_INT_TIMER1_CLR (~0x0004)
+#define BRIDGE_CAUSE(BRIDGE_VIRT_BASE + 0x0110)
 
 #define IRQ_VIRT_BASE  (BRIDGE_VIRT_BASE + 0x0200)
 #define IRQ_CAUSE_LOW_OFF  0x
diff --git a/arch/arm/mach-dove/include/mach/irqs.h b/arch/arm/mach-dove/include/mach/irqs.h
index 03d401d..53c7d82 100644
--- a/arch/arm/mach-dove/include/mach/irqs.h
+++ b/arch/arm/mach-dove/include/mach/irqs.h
@@ -78,9 +78,16 @@
 #define IRQ_DOVE_SATA  62
 
 /*
+ * Bridge Interrupt Controller
+ */
+#define IRQ_DOVE_BRIDGE_START   64
+#define IRQ_DOVE_BRIDGE_TIMER1  (IRQ_DOVE_BRIDGE_START + 2)
+#define NR_BRIDGE_IRQS  6
+
+/*
  * DOVE General Purpose Pins
  */
-#define IRQ_DOVE_GPIO_START64
+#define IRQ_DOVE_GPIO_START(IRQ_DOVE_BRIDGE_START + NR_BRIDGE_IRQS)
 #define NR_GPIO_IRQS   64
 
 /*
diff --git a/arch/arm/mach-kirkwood/common.c b/arch/arm/mach-kirkwood/common.c
index 906c22e..6ec60b8 100644
--- a/arch/arm/mach-kirkwood/common.c
+++ b/arch/arm/mach-kirkwood/common.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "common.h"
 
 /*
@@ -534,8 +535,9 @@ static void __init kirkwood_timer_init(void)
 {
kirkwood_tclk = kirkwood_find_tclk();
 
-   orion_time_init(BRIDGE_VIRT_BASE, BRIDGE_INT_TIMER1_CLR,
-   IRQ_KIRKWOOD_BRIDGE, kirkwood_tclk);
+   orion_bridge_irq_init(IRQ_KIRKWOOD_BRIDGE, 

Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Eric Sandeen
On 12/7/12 3:57 PM, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 02:49:04PM -0700, Ric Wheeler wrote:
>> On 12/07/2012 04:43 PM, Chris Mason wrote:
>>> On Fri, Dec 07, 2012 at 02:27:43PM -0700, Theodore Ts'o wrote:
>>>> On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
>>>>> Persistent trim is what I had in mind, but there are other ideas that do
>>>>> imply a change in behavior as well.  Can we safely assume this feature
>>>>> won't matter on spinning media?  New features like persistent
>>>>> trim do make it much easier to solve securely, and using a bit for it
>>>>> means we can toss back an error to the app if the underlying storage
>>>>> isn't safe.
>>>> We originally implemented no hide stale for spinning media.  Some
>>>> folks have claimed that for XFS their superior technology means that
>>>> no hide stale doesn't buy them anything for HDD's.  I'm not entirely
>>>> sure I buy this, since if you need to update metadata, it means at
>>>> least one extra seek for each random write into 4k preallocated space,
>>>> and 7200 RPM disks only have about 200 seeks per second.
>>> True, 7200 RPM disks are slow, but even allowing them to expose stale
>>> data just makes them a little less slow.
>>>
>>> I know it's against the rules to pretend that disks don't matter.  But
>>> really, once you're doing random IO into a spindle you've given up on
>>> performance anyway.
>>>
>>> -chris
>>
>> That's right.
>>
>> And equally true, once you have moved the disk heads to that track, you can
>> write a lot as cheaply as a little (i.e., do 1MB instead of 4KB). That will
>> also avoid fragmentation of the extents.
> 
> When you do a 4K write, you have to remember that you've written just
> those 4K.  When you do a 1MB write, you have to remember that you've
> written just that 1MB.  It's the same operation, except with the 1MB
> you've also had to setup all the bios and send down the zeros, and do
> the proper locking to make sure you're not sending zeros down over
> some concurrent IO.
> 
> The 1MB setup is actually more work, but it does greatly reduce the
> amount of time the workload needs to run before it goes into a steady
> state.  For smaller files it may work well, but for larger ones I don't
> think it will be enough.

Ext4 already does this, actually, I think - see s_extent_max_zeroout_kb
and how it's used.

/* If extent is less than s_max_zeroout_kb, zeroout directly */

It's not a tunable (*gasp* ;)) but it's currently set to "32" as in
32 kb.  Would be fun to bump that up and see how your test goes.
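
A sketch of the kind of check that comment describes, assuming the threshold
is kept in KiB and converted to filesystem blocks before comparing against the
extent length. The function names are made up, and the strict "less than"
follows the wording of the quoted comment rather than the actual ext4 source.

```c
#include <stdint.h>

/*
 * Convert a threshold in KiB to a block count.
 * blocksize_bits is log2 of the block size and must be at least 10 (1 KiB).
 */
uint32_t max_zeroout_blocks(uint32_t max_zeroout_kb, unsigned int blocksize_bits)
{
	return max_zeroout_kb >> (blocksize_bits - 10);
}

/* Return 1 if the extent is small enough to be zeroed out directly. */
int should_zeroout(uint32_t extent_blocks, uint32_t max_zeroout_kb,
		   unsigned int blocksize_bits)
{
	return extent_blocks < max_zeroout_blocks(max_zeroout_kb, blocksize_bits);
}
```

With 4 KiB blocks and the 32 kb setting, the threshold works out to 8 blocks,
so bumping the value changes how large an extent can be before the split path
is taken instead.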

-Eric

> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


