Re: [PATCH v2] zswap: Zero-filled pages handling

2017-07-02 Thread Seth Jennings
On Sun, Jul 2, 2017 at 9:19 AM, Srividya Desireddy
 wrote:
> From: Srividya Desireddy 
> Date: Sun, 2 Jul 2017 19:15:37 +0530
> Subject: [PATCH v2] zswap: Zero-filled pages handling
>
> Zswap is a cache which compresses the pages that are being swapped out
> and stores them into a dynamically allocated RAM-based memory pool.
> Experiments have shown that around 10-20% of pages stored in zswap
> are zero-filled pages (i.e. contents of the page are all zeros), but
> these pages are handled as normal pages by compressing and allocating
> memory in the pool.

I am somewhat surprised that this many anon pages are zero filled.

If this is true, then maybe we should consider solving this at the
swap level in general, as we can de-dup zero pages in all swap
devices, not just zswap.

That being said, this is a fairly small change and I don't see anything
objectionable.  However, I do think the better solution would be to do
this at a higher level.

Thanks,
Seth


>
> This patch adds a check in zswap_frontswap_store() to identify a zero-filled
> page before compressing it. If the page is zero-filled, set
> zswap_entry.zeroflag and skip compressing the page and allocating
> memory in zpool. In zswap_frontswap_load(), check whether the zeroflag is
> set for the page in zswap_entry. If the flag is set, memset the page with
> zero. This saves the decompression time during load.
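
A minimal userspace sketch of the store/load handling described above, assuming 4 KiB pages. The names demo_entry, demo_page_is_zero_filled(), demo_store() and demo_load() are hypothetical stand-ins, not code from the patch, and compression and zpool allocation are left as comments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096 /* assumption: 4 KiB pages */
#define DEMO_PAGE_WORDS (DEMO_PAGE_SIZE / sizeof(unsigned long))

/* Hypothetical, cut-down stand-in for struct zswap_entry. */
struct demo_entry {
	bool zeroflag;  /* page was all zeros; nothing was stored */
	size_t length;  /* compressed length; 0 for a zero page */
};

/* Scan one machine word at a time and stop at the first non-zero word. */
static bool demo_page_is_zero_filled(const unsigned long *page)
{
	size_t i;

	for (i = 0; i < DEMO_PAGE_WORDS; i++)
		if (page[i])
			return false;
	return true;
}

/* Store path: returns true when compression and allocation were skipped. */
static bool demo_store(struct demo_entry *entry, const unsigned long *page)
{
	if (demo_page_is_zero_filled(page)) {
		entry->zeroflag = true;
		entry->length = 0;
		return true;
	}
	entry->zeroflag = false;
	/* ...compress the page and allocate zpool memory here... */
	return false;
}

/* Load path: a zero page is rebuilt with memset() instead of decompression. */
static void demo_load(const struct demo_entry *entry, unsigned long *page)
{
	if (entry->zeroflag) {
		memset(page, 0, DEMO_PAGE_SIZE);
		return;
	}
	/* ...decompress the stored data into the page here... */
}
```

The check returns at the first non-zero word, whereas a page that really is zero-filled always costs a full scan before compression can be skipped.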
>
> On an Ubuntu PC with 2GB RAM, while executing a kernel build and other test
> scripts, ~15% of pages in zswap were zero pages. With a multimedia workload,
> more than 20% of zswap pages were found to be zero pages.
>
> On an ARM quad-core 32-bit device with 1.5GB RAM, an average of 10% of the
> pages in zswap were zero pages (an average of 5000 zero pages found out of
> ~5 pages stored in zswap) on launching and relaunching 15 applications.
> The launch time of the applications improved by ~3%.
>
> Test Parameters       Baseline   With patch   Improvement
> ----------------------------------------------------------
> Total RAM             1343MB     1343MB
> Available RAM         451MB      445MB        -6MB
> Avg. Memfree          69MB       70MB         1MB
> Avg. Swap Used        226MB      215MB        -11MB
> Avg. App entry time   644msec    623msec      3%
>
> With the patch, every page swapped to zswap is checked to see whether it is
> a zero page, and for all zero pages the compression and memory allocation
> operations are skipped. Overall, there is an improvement of 30% in zswap
> store time.
>
> For non-zero pages there is no overhead during zswap page load. For
> zero pages there is an improvement of more than 60% in the zswap load time,
> as the zero page decompression is avoided.
> The table below shows the execution time profiling of the patch.
>
> Zswap Store Operation         Baseline   With patch   % Improvement
> ---------------------------------------------------------------------
> * Zero page check             --         22.5ms
>   (for non-zero pages)
> * Zero page check             --         24ms
>   (for zero pages)
> * Compression time            55ms       --
>   (of zero pages)
> * Allocation time             14ms       --
>   (to store compressed
>    zero pages)
> ---------------------------------------------------------------------
> Total                         69ms       46.5ms       32%
>
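> Zswap Load Operation          Baseline   With patch   % Improvement
> ---------------------------------------------------------------------
> * Decompression time          30.4ms     --
>   (of zero pages)
> * Zero page check +           --         10.04ms
>   memset operation
>   (of zero pages)
> ---------------------------------------------------------------------
> Total                         30.4ms     10.04ms      66%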
>
> *The execution times may vary with the test device used.
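
The improvement percentages above can be rechecked with the usual relative-improvement formula, (baseline - patched) / baseline * 100. A small sketch; percent_improvement() is a hypothetical helper, not part of the patch.

```c
#include <assert.h>

/* Relative improvement in percent: (baseline - patched) / baseline * 100. */
static double percent_improvement(double baseline, double patched)
{
	assert(baseline > 0.0); /* a zero baseline has no meaningful ratio */
	return (baseline - patched) / baseline * 100.0;
}
```

With the table values, percent_improvement(69.0, 46.5) gives about 32.6, percent_improvement(30.4, 10.04) about 66.97 and percent_improvement(644.0, 623.0) about 3.26, so the reported 32%, 66% and 3% appear to be truncated rather than rounded.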
>
> Signed-off-by: Srividya Desireddy 
> ---
>  mm/zswap.c |   46 ++
>  1 file changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index eedc278..edc584b 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -49,6 +49,8 @@
>  static u64 zswap_pool_total_size;
>  /* The number of compressed pages currently stored in zswap */
>  static atomic_t zswap_stored_pages = ATOMIC_INIT(0);
> +/* The number of zero filled pages swapped out to zswap */
> +static atomic_t zswap_zero_pages = ATOMIC_INIT(0);
>
>  /*
>   * The statistics below are not protected from concurrent access for
> @@ -145,7 +147,7 @@ struct zswap_pool {
>   *be held while changing the refcount.  Since the lock must
>   *be held, there is no reason to also make refcount atomic.
>   * length - the length in bytes of the compressed page data.  Needed during
> - *  decompression
> + *  decompression. For a zero page length is 0.
>   * pool - the zswap_pool the entry's data is in
>   * handle - zpool allocation handle that stores the compressed page data
>   */
> @@ -320,8 +322,12 @@ static void zswap_rb_erase(stru

Re: [RFC PATCH v1 1/1] mm: zswap - Add crypto acomp/scomp framework support

2017-02-14 Thread Seth Jennings
On Tue, Feb 14, 2017 at 9:40 AM, Mahipal Challa
 wrote:
> This adds support for the kernel's new crypto acomp/scomp framework
> to zswap.
>
> Signed-off-by: Mahipal Challa 
> Signed-off-by: Vishnu Nair 
> ---
>  mm/zswap.c | 129 +++--
>  1 file changed, 99 insertions(+), 30 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 067a0d6..d08631b 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -33,6 +33,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>
> @@ -114,7 +116,8 @@ static int zswap_compressor_param_set(const char *,
>
>  struct zswap_pool {
> struct zpool *zpool;
> -   struct crypto_comp * __percpu *tfm;
> +   struct crypto_acomp * __percpu *acomp;
> +   struct acomp_req * __percpu *acomp_req;
> struct kref kref;
> struct list_head list;
> struct work_struct work;
> @@ -379,30 +382,49 @@ static int zswap_dstmem_dead(unsigned int cpu)
>  static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  {
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> -   struct crypto_comp *tfm;
> +   struct crypto_acomp *acomp;
> +   struct acomp_req *acomp_req;
>
> -   if (WARN_ON(*per_cpu_ptr(pool->tfm, cpu)))
> +   if (WARN_ON(*per_cpu_ptr(pool->acomp, cpu)))
> return 0;
> +   if (WARN_ON(*per_cpu_ptr(pool->acomp_req, cpu)))
> +   return 0;
> +
> +   acomp = crypto_alloc_acomp(pool->tfm_name, 0, 0);
> +   if (IS_ERR_OR_NULL(acomp)) {
> +   pr_err("could not alloc crypto acomp %s : %ld\n",
> +  pool->tfm_name, PTR_ERR(acomp));
> +   return -ENOMEM;
> +   }
> +   *per_cpu_ptr(pool->acomp, cpu) = acomp;
>
> -   tfm = crypto_alloc_comp(pool->tfm_name, 0, 0);
> -   if (IS_ERR_OR_NULL(tfm)) {
> -   pr_err("could not alloc crypto comp %s : %ld\n",
> -  pool->tfm_name, PTR_ERR(tfm));
> +   acomp_req = acomp_request_alloc(acomp);
> +   if (IS_ERR_OR_NULL(acomp_req)) {
> +   pr_err("could not alloc crypto acomp %s : %ld\n",
> +  pool->tfm_name, PTR_ERR(acomp));
> return -ENOMEM;
> }
> -   *per_cpu_ptr(pool->tfm, cpu) = tfm;
> +   *per_cpu_ptr(pool->acomp_req, cpu) = acomp_req;
> +
> return 0;
>  }
>
>  static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>  {
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> -   struct crypto_comp *tfm;
> +   struct crypto_acomp *acomp;
> +   struct acomp_req *acomp_req;
> +
> +   acomp_req = *per_cpu_ptr(pool->acomp_req, cpu);
> +   if (!IS_ERR_OR_NULL(acomp_req))
> +   acomp_request_free(acomp_req);
> +   *per_cpu_ptr(pool->acomp_req, cpu) = NULL;
> +
> +   acomp = *per_cpu_ptr(pool->acomp, cpu);
> +   if (!IS_ERR_OR_NULL(acomp))
> +   crypto_free_acomp(acomp);
> +   *per_cpu_ptr(pool->acomp, cpu) = NULL;
>
> -   tfm = *per_cpu_ptr(pool->tfm, cpu);
> -   if (!IS_ERR_OR_NULL(tfm))
> -   crypto_free_comp(tfm);
> -   *per_cpu_ptr(pool->tfm, cpu) = NULL;
> return 0;
>  }
>
> @@ -503,8 +525,14 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> pr_debug("using %s zpool\n", zpool_get_type(pool->zpool));
>
> strlcpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
> -   pool->tfm = alloc_percpu(struct crypto_comp *);
> -   if (!pool->tfm) {
> +   pool->acomp = alloc_percpu(struct crypto_acomp *);
> +   if (!pool->acomp) {
> +   pr_err("percpu alloc failed\n");
> +   goto error;
> +   }
> +
> +   pool->acomp_req = alloc_percpu(struct acomp_req *);
> +   if (!pool->acomp_req) {
> pr_err("percpu alloc failed\n");
> goto error;
> }
> @@ -526,7 +554,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> return pool;
>
>  error:
> -   free_percpu(pool->tfm);
> +   free_percpu(pool->acomp_req);
> +   free_percpu(pool->acomp);
> if (pool->zpool)
> zpool_destroy_pool(pool->zpool);
> kfree(pool);
> @@ -566,7 +595,8 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
> zswap_pool_debug("destroying", pool);
>
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> -   free_percpu(pool->tfm);
> +   free_percpu(pool->acomp_req);
> +   free_percpu(pool->acomp);
> zpool_destroy_pool(pool->zpool);
> kfree(pool);
>  }
> @@ -763,7 +793,8 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
> pgoff_t offset;
> struct zswap_entry *entry;
> struct page *page;
> -   struct crypto_comp *tfm;
> +   
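
The prepare/dead callbacks above allocate a per-CPU transform plus a request bound to it, and free them in the reverse order. The following is a userspace model of that pattern only: struct demo_acomp and struct demo_req are hypothetical stand-ins for struct crypto_acomp and struct acomp_req, calloc()/free() replace the crypto allocation calls, and plain arrays replace per-CPU pointers. Unlike the posted prepare function, which returns -ENOMEM without freeing the transform it has already stored, this sketch releases the transform when the request allocation fails; that is a choice of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEMO_NR_CPUS 4 /* assumption: fixed CPU count for the model */

/* Hypothetical stand-ins for struct crypto_acomp and struct acomp_req. */
struct demo_acomp { int unused; };
struct demo_req { struct demo_acomp *tfm; };

/* Plain arrays model the pool's per-CPU pointers. */
struct demo_pool {
	struct demo_acomp *acomp[DEMO_NR_CPUS];
	struct demo_req *acomp_req[DEMO_NR_CPUS];
};

/* Models zswap_cpu_comp_prepare(): transform first, then a request bound to it. */
static int demo_cpu_prepare(struct demo_pool *pool, unsigned int cpu)
{
	struct demo_acomp *acomp;
	struct demo_req *req;

	assert(cpu < DEMO_NR_CPUS);
	if (pool->acomp[cpu] || pool->acomp_req[cpu])
		return 0; /* already prepared */

	acomp = calloc(1, sizeof(*acomp));
	if (!acomp)
		return -ENOMEM;

	req = calloc(1, sizeof(*req));
	if (!req) {
		free(acomp); /* model's choice: undo the first allocation */
		return -ENOMEM;
	}
	req->tfm = acomp;

	pool->acomp[cpu] = acomp;
	pool->acomp_req[cpu] = req;
	return 0;
}

/* Models zswap_cpu_comp_dead(): free the request before its transform. */
static void demo_cpu_dead(struct demo_pool *pool, unsigned int cpu)
{
	assert(cpu < DEMO_NR_CPUS);
	free(pool->acomp_req[cpu]);
	pool->acomp_req[cpu] = NULL;
	free(pool->acomp[cpu]);
	pool->acomp[cpu] = NULL;
}
```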

Re: [PATCHv2] MAINTAINERS: add Dan Streetman to zswap maintainers

2017-01-24 Thread Seth Jennings
On Tue, Jan 24, 2017 at 3:22 PM, Dan Streetman  wrote:
>
> Add myself as zswap maintainer.
>
> Cc: Seth Jennings 
> Signed-off-by: Dan Streetman 

Acked-by: Seth Jennings 

Very yes to this.  I've had almost no kernel time in my new position :(
Dan, if you wanted to add yourself to the zbud maintainers too, feel free!

Thanks,
Seth

>
> ---
> You'd think I could get this simple patch right.  oops!
>
> Since v1: fixed Seth's email in Cc: line
>
> Seth, I'd meant to send this last year, I assume you're still ok
> adding me.  Did you want to stay on as maintainer also?
>
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 741f35f..e5575d5 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -13736,6 +13736,7 @@ F:      Documentation/vm/zsmalloc.txt
>
>  ZSWAP COMPRESSED SWAP CACHING
>  M: Seth Jennings 
> +M: Dan Streetman 
>  L: linux...@kvack.org
>  S: Maintained
>  F: mm/zswap.c
> --
> 2.9.3
>


Re: [PATCH v2] z3fold: the 3-fold allocator for compressed pages

2016-04-25 Thread Seth Jennings
On Mon, Apr 25, 2016 at 2:28 AM, Vlastimil Babka  wrote:
> On 04/22/2016 01:22 AM, Andrew Morton wrote:
>>
>> On Tue, 19 Apr 2016 11:48:45 +0200 Vitaly Wool 
>> wrote:
>>
>>> This patch introduces z3fold, a special purpose allocator for storing
>>> compressed pages. It is designed to store up to three compressed pages
>>> per physical page. It is a ZBUD derivative which allows for a higher
>>> compression ratio while keeping the simplicity and determinism of its
>>> predecessor.
>>>
>>> The main differences between z3fold and zbud are:
>>> * unlike zbud, z3fold allows for up to PAGE_SIZE allocations
>>> * z3fold can hold up to 3 compressed pages in its page
>>>
>>> This patch comes as a follow-up to the discussions at the Embedded Linux
>>> Conference in San Diego related to the talk [1]. The outcome of these
>>> discussions was that it would be good to have a compressed page allocator
>>> as stable and deterministic as zbud with a higher compression ratio.
>>>
>>> To keep the determinism and simplicity, z3fold, just like zbud, always
>>> stores an integral number of compressed pages per page, but it can store
>>> up to 3 pages unlike zbud which can store at most 2. Therefore the
>>> compression ratio goes to around 2.5x while zbud's one is around 1.7x.
>>>
>>> The patch is based on the latest linux.git tree.
>>>
>>> This version of the patch has updates related to various concurrency
>>> fixes made after intensive testing on SMP/HMP platforms.
>>>
>>>
>>> [1]https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
>>>
>>
>> So...  why don't we just replace zbud with z3fold?  (Update the changelog
>> to answer this rather obvious question, please!)
>
>
> There was discussion between Seth and Vitaly on v1. Without me knowing the
> details myself, it looked like Seth's objections were addressed, but then
> the thread died. I think there should first be a more clear answer from Seth
> whether z3fold really looks like a clear win (i.e. not workload-dependent)
> over zbud, in which case zbud could be extended?

(sorry for the dup Vlastimil, didn't reply-to-all)

It seems like it could be, in the case where most of the pages in your
system compress to 1/3 of their original size (on average).  In my
original research, I found that, using lzo, 1/2 a page was more
typical.  However, if you used deflate, you might be able to push the
average down.
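
A rough way to see why the benefit is workload-dependent: if every compressed object had the same size, the number of objects in a page holding a whole number of them, capped at two (zbud-style) or three (z3fold-style), would be the effective compression ratio. The sketch below assumes 4 KiB pages and ignores headers, chunk rounding and mixed object sizes; demo_objs_per_page() is a hypothetical helper.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/*
 * Simplified model: objects of one uniform compressed size per page when
 * the allocator stores a whole number of objects per page, capped at
 * max_objs (2 for a zbud-style page, 3 for a z3fold-style page).
 * Headers, chunk rounding and mixed sizes are ignored.
 */
static unsigned long demo_objs_per_page(unsigned long csize, unsigned long max_objs)
{
	unsigned long fit;

	assert(csize > 0 && csize <= DEMO_PAGE_SIZE);
	fit = DEMO_PAGE_SIZE / csize;
	return fit < max_objs ? fit : max_objs;
}
```

Objects of about 1/2 page (2048 bytes) give 2 per page either way; only at about 1/3 page (1365 bytes) does the cap of 3 raise the ratio from 2 to 3. A single max_objs parameter is also one way to picture the suggested merged allocator with a 2- or 3-objects-per-page mode.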

IMO, we should try to merge zbud and z3fold, with zbud being
the default mode (2 objects per page) and an option to enable the
3 objects per page logic.  IIRC that 3rd object logic seemed to be
fairly contained.  Keeping them separate would duplicate a lot of very
similar code.

However, if Andrew is ok with yet another z- allocator, it can just be
another zpool backend.  I'm fine either way.  Just my two cents.

Seth

>


Re: [PATCH] z3fold: the 3-fold allocator for compressed pages

2016-04-14 Thread Seth Jennings
On Thu, Apr 14, 2016 at 12:45 PM, Vitaly Wool  wrote:
> On Thu, Apr 14, 2016 at 5:53 PM, Seth Jennings  wrote:
>> On Thu, Apr 14, 2016 at 4:06 AM, Vitaly Wool  wrote:
>>>
>>>
>>> On Thu, Apr 14, 2016 at 10:48 AM, Vlastimil Babka  wrote:
>>>>
>>>> On 04/14/2016 10:05 AM, Vitaly Wool wrote:
>>>>>
>>>>> This patch introduces z3fold, a special purpose allocator for storing
>>>>> compressed pages. It is designed to store up to three compressed pages
>>>>> per physical page. It is a ZBUD derivative which allows for a higher
>>>>> compression ratio while keeping the simplicity and determinism of its
>>>>> predecessor.
>>>>
>>>>
>>>> So the obvious question is, why a separate allocator and not extend zbud?
>>>
>>>
>>> Well, as far as I recall Seth was very much for keeping zbud as simple as
>>> possible. I am fine either way but if we have zpool API, why not have
>>> another zpool API user?
>>>
>>>>
>>>> I didn't study the code, nor notice a design/algorithm overview doc, but
>>>> it seems z3fold keeps the idea of one compressed page at the beginning, one
>>>> at the end of page frame, but it adds another one in the middle? Also how
>>>> is the buddy-matching done?
>>
>> Yes, as soon as you introduce a 3rd object in the page, zpage
>> fragmentation becomes an issue.  Having a middle object partitions
>> that zpage, blocking allocations that are larger than either
>> partition, even though the combined size of the partitions could have
>> accommodated the object.
>
> Yes, but this situation is easy to track down and work around by
> moving the middle object to either the beginning or the end. In case
> of the current implementation it is the beginning.
>
>> This also means that the unbuddied list is broken in this
>> implementation.  num_free_chunks() is calculating the _total_ free
>> space in the page.  But that is not the _usable_ free space for a
>> single object, if the middle object has partitioned that free space.
>
> Once again, there is the code in z3fold_free() that makes sure the
> free space within the page is contiguous so I don't think the
> unbuddied list is, or will be, broken.

Didn't see the relocation before.  However, that brings up another
question.  How is the code moving objects when the location of that
object is encoded in the handle that has already been given to the
user?
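
One way to frame this question: if a handle encodes an object's offset, moving the object leaves every handle already given out pointing at stale data. A hypothetical alternative, modeled below in userspace, is a handle that names only the page and the slot (first, middle or last) and looks up the current offset in a per-page header each time it is mapped, similar in spirit to the start_middle variable Vitaly mentions. The encoding and all demo_* names are assumptions; this is not z3fold's actual handle format.

```c
#include <assert.h>

#define DEMO_NCHUNKS 64u /* assumption: 64 chunks per page */
#define DEMO_SLOT_BITS 2u
#define DEMO_SLOT_MASK ((1ul << DEMO_SLOT_BITS) - 1)

enum demo_slot { DEMO_FIRST = 0, DEMO_MIDDLE = 1, DEMO_LAST = 2 };

/* Hypothetical per-page header: where objects currently live, in chunks. */
struct demo_page {
	unsigned int start_middle; /* first chunk of the middle object */
	unsigned int last_chunks;  /* size of the object at the end */
};

/* The handle names a page and a slot, but not an offset. */
static unsigned long demo_encode(unsigned long page_idx, enum demo_slot slot)
{
	return (page_idx << DEMO_SLOT_BITS) | (unsigned long)slot;
}

/* Map time: turn a handle into the object's current chunk offset. */
static unsigned int demo_map(const struct demo_page *pages, unsigned long handle)
{
	const struct demo_page *p = &pages[handle >> DEMO_SLOT_BITS];

	assert((handle & DEMO_SLOT_MASK) <= DEMO_LAST);
	switch (handle & DEMO_SLOT_MASK) {
	case DEMO_FIRST:
		return 0;
	case DEMO_MIDDLE:
		return p->start_middle;
	default: /* DEMO_LAST */
		return DEMO_NCHUNKS - p->last_chunks;
	}
}

/* Relocation rewrites the header; handles already handed out stay valid. */
static void demo_move_middle(struct demo_page *p, unsigned int new_start)
{
	/* (the object's bytes would be memmove()d here) */
	p->start_middle = new_start;
}
```

The cost is a header lookup on every map, and any relocation would have to be serialized against concurrent maps of the same page.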

Seth

>
> ~vitaly


Re: [PATCH] z3fold: the 3-fold allocator for compressed pages

2016-04-14 Thread Seth Jennings
On Thu, Apr 14, 2016 at 4:06 AM, Vitaly Wool  wrote:
>
>
> On Thu, Apr 14, 2016 at 10:48 AM, Vlastimil Babka  wrote:
>>
>> On 04/14/2016 10:05 AM, Vitaly Wool wrote:
>>>
>>> This patch introduces z3fold, a special purpose allocator for storing
>>> compressed pages. It is designed to store up to three compressed pages
>>> per physical page. It is a ZBUD derivative which allows for a higher
>>> compression ratio while keeping the simplicity and determinism of its
>>> predecessor.
>>
>>
>> So the obvious question is, why a separate allocator and not extend zbud?
>
>
> Well, as far as I recall Seth was very much for keeping zbud as simple as
> possible. I am fine either way but if we have zpool API, why not have
> another zpool API user?
>
>>
>> I didn't study the code, nor notice a design/algorithm overview doc, but
>> it seems z3fold keeps the idea of one compressed page at the beginning, one
>> at the end of page frame, but it adds another one in the middle? Also how is
>> the buddy-matching done?

Yes, as soon as you introduce a 3rd object in the page, zpage
fragmentation becomes an issue.  Having a middle object partitions
that zpage, blocking allocations that are larger than either
partition, even though the combined size of the partitions could have
accommodated the object.

This also means that the unbuddied list is broken in this
implementation.  num_free_chunks() is calculating the _total_ free
space in the page.  But that is not the _usable_ free space for a
single object, if the middle object has partitioned that free space.
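
The point can be made concrete with a toy model of one page: total free space versus the largest contiguous free run once a middle object sits between the first and last objects. The 64-chunk page and the demo_* names are assumptions for illustration, not z3fold's layout code.

```c
#include <assert.h>

#define DEMO_NCHUNKS 64u /* assumption: a page split into 64 chunks */

/* Hypothetical page layout, all values in chunks. */
struct demo_layout {
	unsigned int first;        /* object at the start of the page */
	unsigned int middle;       /* object in the middle, 0 if none */
	unsigned int last;         /* object at the end of the page */
	unsigned int start_middle; /* first chunk of the middle object */
};

/* Total free chunks, wherever they are. */
static unsigned int demo_total_free(const struct demo_layout *l)
{
	return DEMO_NCHUNKS - l->first - l->middle - l->last;
}

/* Largest contiguous free run, i.e. what one new object can really use. */
static unsigned int demo_usable_free(const struct demo_layout *l)
{
	unsigned int before, after;

	if (l->middle == 0)
		return demo_total_free(l);

	assert(l->start_middle >= l->first);
	assert(l->start_middle + l->middle + l->last <= DEMO_NCHUNKS);
	before = l->start_middle - l->first;
	after = DEMO_NCHUNKS - l->last - (l->start_middle + l->middle);
	return before > after ? before : after;
}
```

With a 10-chunk middle object starting at chunk 27, 34 chunks are free but no object larger than 17 chunks fits; moving the middle object up against the first one (the relocation Vitaly describes in the follow-up discussion) makes all 34 chunks usable.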

Seth

>
>
> Basically yes. There is a 'start_middle' variable which points to the start of
> the middle page, if any. The matching is done based on the buddy number.
>
> ~vitaly


Re: [PATCH] livepatch: Update maintainers

2016-03-20 Thread Seth Jennings
On Wed, Mar 16, 2016 at 10:03 AM, Josh Poimboeuf  wrote:
> Seth and Vojtech are no longer active maintainers of livepatch, so
> remove them in favor of Jessica and Miroslav.
>
> Also add Petr as a designated reviewer.
>
> Signed-off-by: Josh Poimboeuf 
> ---
>  MAINTAINERS | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 860e306..e04e0a5 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6583,9 +6583,10 @@ F:   drivers/platform/x86/hp_accel.c
>
>  LIVE PATCHING
>  M: Josh Poimboeuf 
> -M: Seth Jennings 
> +M: Jessica Yu 
>  M: Jiri Kosina 
> -M: Vojtech Pavlik 
> +M: Miroslav Benes 
> +R: Petr Mladek 
>  S: Maintained
>  F: kernel/livepatch/
>  F: include/linux/livepatch.h

Acked-by: Seth Jennings 

> --
> 2.4.3
>


[PATCH] MAINTAINERS: update Seth email

2016-01-29 Thread Seth Jennings
Update/unify my contact info.  The old email address
will no longer work soon.

Signed-off-by: Seth Jennings 
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 30aca4a..9778aab 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12133,7 +12133,7 @@ F:  drivers/net/hamradio/*scc.c
 F: drivers/net/hamradio/z8530.h
 
 ZBUD COMPRESSED PAGE ALLOCATOR
-M: Seth Jennings 
+M: Seth Jennings 
 L: linux...@kvack.org
 S: Maintained
 F: mm/zbud.c
@@ -12188,7 +12188,7 @@ F:  include/linux/zsmalloc.h
 F: Documentation/vm/zsmalloc.txt
 
 ZSWAP COMPRESSED SWAP CACHING
-M: Seth Jennings 
+M: Seth Jennings 
 L: linux...@kvack.org
 S: Maintained
 F: mm/zswap.c
-- 
2.5.0



[PATCH 2/3] drivers: memory: rename remove_memory_block() to remove_memory_section()

2015-12-02 Thread Seth Jennings
The function removes a section, not a block.  Rename to reflect
actual functionality.

Signed-off-by: Seth Jennings 
---
 drivers/base/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index ca2ce02..dd30744 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -688,7 +688,7 @@ unregister_memory(struct memory_block *memory)
device_unregister(&memory->dev);
 }
 
-static int remove_memory_block(unsigned long node_id,
+static int remove_memory_section(unsigned long node_id,
   struct mem_section *section, int phys_device)
 {
struct memory_block *mem;
@@ -712,7 +712,7 @@ int unregister_memory_section(struct mem_section *section)
if (!present_section(section))
return -EINVAL;
 
-   return remove_memory_block(0, section, 0);
+   return remove_memory_section(0, section, 0);
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
-- 
2.5.0
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] drivers: memory: clean up section counting

2015-12-02 Thread Seth Jennings
Right now, section_count is calculated in add_memory_block().
However, init_memory_block() increments section_count as well,
which, at first, seems like it would lead to an off-by-one error.
There is no harm done because add_memory_block() immediately overwrites
the mem->section_count, but it is messy.

This commit moves the increment out of the common init_memory_block()
(called by both add_memory_block() and register_new_memory()) and
adds it to register_new_memory().

Signed-off-by: Seth Jennings 
---
 drivers/base/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2804aed..ca2ce02 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -614,7 +614,6 @@ static int init_memory_block(struct memory_block **memory,
base_memory_block_id(scn_nr) * sections_per_block;
mem->end_section_nr = mem->start_section_nr + sections_per_block - 1;
mem->state = state;
-   mem->section_count++;
start_pfn = section_nr_to_pfn(mem->start_section_nr);
mem->phys_device = arch_get_memory_phys_device(start_pfn);
 
@@ -668,6 +667,7 @@ int register_new_memory(int nid, struct mem_section *section)
ret = init_memory_block(&mem, section, MEM_OFFLINE);
if (ret)
goto out;
+   mem->section_count++;
}
 
if (mem->section_count == sections_per_block)
-- 
2.5.0
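
A toy userspace model of the accounting after this patch: init only starts the block from zero, add_memory_block() assigns a precomputed count, and register_new_memory() adds one per section in both of its branches. struct demo_block and the demo_* functions are hypothetical and omit locking, device registration and node linkage.

```c
#include <stdbool.h>

/* Hypothetical model of one memory block's section accounting. */
struct demo_block {
	bool initialized;
	unsigned int section_count;
};

/* After the patch, init leaves section_count at zero. */
static void demo_init_block(struct demo_block *mem)
{
	mem->initialized = true;
	mem->section_count = 0;
}

/* Models add_memory_block(): the caller counts present sections and assigns. */
static void demo_add_block(struct demo_block *mem, unsigned int present)
{
	demo_init_block(mem);
	mem->section_count = present; /* any increment in init would be lost here */
}

/*
 * Models register_new_memory() after the patch: each call accounts for
 * exactly one section. Returns true once the block is complete.
 */
static bool demo_register_section(struct demo_block *mem,
				  unsigned int sections_per_block)
{
	if (mem->initialized) {
		mem->section_count++;
	} else {
		demo_init_block(mem);
		mem->section_count++; /* the increment this patch moves here */
	}
	return mem->section_count == sections_per_block;
}
```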


[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with missing sections

2015-12-02 Thread Seth Jennings
bdee237c and 982792c7 introduced large block sizes for x86.
This made it possible to have multiple sections per memory
block where previously there was only ever one section
per block.

Since blocks consist of contiguous ranges of sections, there
can be holes in the blocks where sections are not present.
If one attempts to offline such a block, a crash occurs since
the code is not designed to deal with this.

This patch is a quick fix to guard against the crash by
not allowing blocks with non-present sections to be offlined.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107781
Reported-by: Andrew Banman 
Signed-off-by: Seth Jennings 
---
 drivers/base/memory.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index dd30744..6d7b14c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -303,6 +303,10 @@ static int memory_subsys_offline(struct device *dev)
if (mem->state == MEM_OFFLINE)
return 0;
 
+   /* Can't offline block with non-present sections */
+   if (mem->section_count != sections_per_block)
+   return -EINVAL;
+
return memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
 }
 
-- 
2.5.0
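
A userspace sketch of the new check: section_count counts the present sections in a block, and a block whose count falls short of sections_per_block has a hole and is refused. DEMO_SECTIONS_PER_BLOCK = 16 assumes 2GB blocks built from 128MB sections, and all demo_* names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_SECTIONS_PER_BLOCK 16 /* assumption: 2GB blocks of 128MB sections */

/* Hypothetical model of one memory block and which of its sections exist. */
struct demo_block {
	bool present[DEMO_SECTIONS_PER_BLOCK];
	unsigned int section_count;
};

/* Count the sections that are actually present in the block. */
static void demo_count_sections(struct demo_block *mem)
{
	unsigned int i;

	mem->section_count = 0;
	for (i = 0; i < DEMO_SECTIONS_PER_BLOCK; i++)
		if (mem->present[i])
			mem->section_count++;
}

/* Same test as the patch: a block with a hole cannot be offlined. */
static int demo_offline(const struct demo_block *mem)
{
	if (mem->section_count != DEMO_SECTIONS_PER_BLOCK)
		return -EINVAL;
	/* ...the real code would now change the block state to offline... */
	return 0;
}
```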


[PATCHv3] x86: mm: clean up probe_memory_block_size()

2015-11-30 Thread Seth Jennings
v2: remove bz local var, remove unneeded block size message (Ingo)
v3: restore bz local var, unify block size message (Ingo)

The cumulative effect of bdee237c and 982792c7 is some pretty convoluted
code.  This commit has no (intended) functional change; just seeks to
simplify and make the code more understandable.

The whole section with the "tail size" doesn't seem to be reachable,
since both the >= 64GB and < 64GB cases return, so it was removed.

This commit also adds code back for the UV case since it seemed to just
go away without reason in bdee237c and might lead to unexpected change
in behavior.

Signed-off-by: Seth Jennings 
---
 arch/x86/mm/init_64.c | 24 ++--
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ec081fe..31c41c5 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "mm_internal.h"
@@ -1194,26 +1195,13 @@ int kern_addr_valid(unsigned long addr)
 
 static unsigned long probe_memory_block_size(void)
 {
-   /* start from 2g */
-   unsigned long bz = 1UL<<31;
+   unsigned long bz = MIN_MEMORY_BLOCK_SIZE;
 
-   if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
-   pr_info("Using 2GB memory block size for large-memory system\n");
-   return 2UL * 1024 * 1024 * 1024;
-   }
-
-   /* less than 64g installed */
-   if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
-   return MIN_MEMORY_BLOCK_SIZE;
-
-   /* get the tail size */
-   while (bz > MIN_MEMORY_BLOCK_SIZE) {
-   if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))
-   break;
-   bz >>= 1;
-   }
+   /* if system is UV or has 64GB of RAM or more, use large blocks */
+   if (is_uv_system() || ((max_pfn << PAGE_SHIFT) >= (64UL << 30)))
+   bz = 2UL << 30; /* 2GB */
 
-   printk(KERN_DEBUG "memory block size : %ldMB\n", bz >> 20);
+   pr_info("x86/mm: Memory block size: %ldMB\n", bz >> 20);
 
return bz;
 }
-- 
2.5.0
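
The v3 logic reduces to a single comparison. A userspace model, assuming 4 KiB pages and a 128MB MIN_MEMORY_BLOCK_SIZE; the demo_* names are hypothetical, and max_pfn and the UV check become parameters instead of kernel state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for the model: 4 KiB pages and 128MB minimum blocks. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_MIN_MEMORY_BLOCK_SIZE (UINT64_C(1) << 27)

/*
 * Userspace model of the v3 logic; max_pfn and the UV check are passed
 * in instead of read from kernel globals.
 */
static uint64_t demo_probe_memory_block_size(uint64_t max_pfn, bool is_uv)
{
	uint64_t bz = DEMO_MIN_MEMORY_BLOCK_SIZE;

	/* UV systems, or a max_pfn at or above 64GB, use 2GB blocks */
	if (is_uv || (max_pfn << DEMO_PAGE_SHIFT) >= (UINT64_C(64) << 30))
		bz = UINT64_C(2) << 30;

	return bz;
}
```

Because every path reaches the single return, the size can be printed unconditionally as Ingo asked: with these assumptions a machine whose max_pfn reaches 64GB reports 2048MB, and a smaller non-UV one reports 128MB.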


Re: [PATCHv2] x86: mm: clean up probe_memory_block_size()

2015-11-30 Thread Seth Jennings
On Fri, Nov 27, 2015 at 08:39:32AM +0100, Ingo Molnar wrote:
> 
> * Seth Jennings  wrote:
> 
> > v2:
> > remove local bz variable (Ingo) and debug message since, if
> > the 2GB message doesn't print, there is only one possible
> > block size.
> 
> I'd not remove the info message, it would print the memory block size regardless
> of memory size. Yes, one could decode the 'no message' case as 'the kernel used
> the default value' - but that's very version dependent and obscure in any case.
> Please keep the debug message in both code paths, like the original code had it.
> 
> But, on a second thought, I'd definitely harmonize the messages, instead of:
> 
> > pr_info("Using 2GB memory block size for large-memory system\n");
> > printk(KERN_DEBUG "memory block size : %ldMB\n", bz >> 20);
> 
> I'd print:
> 
> > pr_info("x86/mm: Memory block size: 2GB, large-memory system\n");
> > pr_info("x86/mm: Memory block size: %ldMB\n", bz >> 20);
> 
> Also note how I changed both printouts to pr_info(), so that we have the memory
> block size information printed unconditionally.
> 
> (And btw., doing this printout means that we should keep the 'bz' local variable.)

Just sent out v3.

Thanks,
Seth

> 
> Thanks,
> 
>   Ingo


Re: [PATCH] x86: mm: clean up probe_memory_block_size()

2015-11-26 Thread Seth Jennings
On Thu, Nov 26, 2015 at 10:12:01AM +0100, Ingo Molnar wrote:
> 
> * Seth Jennings  wrote:
> 
> > The cumulative effect of bdee237c and 982792c7 is some pretty convoluted
> > code.  This commit has no (intended) functional change; just seeks to
> > simplify and make the code more understandable.
> > 
> > The whole section with the "tail size" doesn't seem to be reachable,
> > since both the >= 64GB and < 64GB cases return, so it was removed.
> > 
> > This commit also adds code back for the UV case since it seemed to just
> > go away without reason in bdee237c and might lead to unexpected change
> > in behavior.
> > 
> > Signed-off-by: Seth Jennings 
> > ---
> >  arch/x86/mm/init_64.c | 22 ++
> >  1 file changed, 6 insertions(+), 16 deletions(-)
> > 
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index ec081fe..a83c470 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -52,6 +52,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  
> >  #include "mm_internal.h"
> > @@ -1194,26 +1195,15 @@ int kern_addr_valid(unsigned long addr)
> >  
> >  static unsigned long probe_memory_block_size(void)
> >  {
> > +   unsigned long bz = MIN_MEMORY_BLOCK_SIZE;
> >  
> > +   /* if system is UV or has 64GB of RAM or more, use large blocks */
> > +   if (is_uv_system() || ((max_pfn << PAGE_SHIFT) >= (64UL << 30))) {
> > pr_info("Using 2GB memory block size for large-memory system\n");
> > +   bz = 2UL << 30; /* 2GB */
> > }
> >  
> > +   pr_debug("memory block size : %ldMB\n", bz >> 20);
> >  
> > return bz;
> >  }
> 
> So why keep 'bz' at all? Just return with the right value and be done with it.
> 'bz' is just an unnecessary confusion factor.

Good point.  Just sent out v2.

Thanks,
Seth

> 
> Thanks,
> 
>   Ingo


[PATCHv2] x86: mm: clean up probe_memory_block_size()

2015-11-26 Thread Seth Jennings
v2:
remove local bz variable (Ingo) and debug message since, if
the 2GB message doesn't print, there is only one possible
block size.

The cumulative effect of bdee237c and 982792c7 is some pretty convoluted
code.  This commit has no (intended) functional change; just seeks to
simplify and make the code more understandable.

The whole section with the "tail size" doesn't seem to be reachable,
since both the >= 64GB and < 64GB cases return, so it was removed.

This commit also adds code back for the UV case since it seemed to just
go away without reason in bdee237c and might lead to unexpected change
in behavior.

Signed-off-by: Seth Jennings 
---
 arch/x86/mm/init_64.c | 24 +---
 1 file changed, 5 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ec081fe..b05df4f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "mm_internal.h"
@@ -1194,28 +1195,13 @@ int kern_addr_valid(unsigned long addr)
 
 static unsigned long probe_memory_block_size(void)
 {
-   /* start from 2g */
-   unsigned long bz = 1UL<<31;
-
-   if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
+   /* if system is UV or has 64GB of RAM or more, use large blocks */
+   if (is_uv_system() || ((max_pfn << PAGE_SHIFT) >= (64UL << 30))) {
pr_info("Using 2GB memory block size for large-memory system\n");
-   return 2UL * 1024 * 1024 * 1024;
+   return 2UL << 30; /* 2GB */
}
 
-   /* less than 64g installed */
-   if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
-   return MIN_MEMORY_BLOCK_SIZE;
-
-   /* get the tail size */
-   while (bz > MIN_MEMORY_BLOCK_SIZE) {
-   if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))
-   break;
-   bz >>= 1;
-   }
-
-   printk(KERN_DEBUG "memory block size : %ldMB\n", bz >> 20);
-
-   return bz;
+   return MIN_MEMORY_BLOCK_SIZE;
 }
 
 static unsigned long memory_block_size_probed;
-- 
2.5.0


[PATCH] x86: mm: clean up probe_memory_block_size()

2015-11-24 Thread Seth Jennings
The cumulative effect of bdee237c and 982792c7 is some pretty convoluted
code.  This commit has no (intended) functional change; just seeks to
simplify and make the code more understandable.

The whole section with the "tail size" doesn't seem to be reachable,
since both the >= 64GB and < 64GB cases return, so it was removed.

This commit also adds code back for the UV case since it seemed to just
go away without reason in bdee237c and might lead to an unexpected change
in behavior.

Signed-off-by: Seth Jennings 
---
 arch/x86/mm/init_64.c | 22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ec081fe..a83c470 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "mm_internal.h"
@@ -1194,26 +1195,15 @@ int kern_addr_valid(unsigned long addr)
 
 static unsigned long probe_memory_block_size(void)
 {
-   /* start from 2g */
-   unsigned long bz = 1UL<<31;
+   unsigned long bz = MIN_MEMORY_BLOCK_SIZE;
 
-   if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
+   /* if system is UV or has 64GB of RAM or more, use large blocks */
+   if (is_uv_system() || ((max_pfn << PAGE_SHIFT) >= (64UL << 30))) {
pr_info("Using 2GB memory block size for large-memory system\n");
-   return 2UL * 1024 * 1024 * 1024;
+   bz = 2UL << 30; /* 2GB */
}
 
-   /* less than 64g installed */
-   if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
-   return MIN_MEMORY_BLOCK_SIZE;
-
-   /* get the tail size */
-   while (bz > MIN_MEMORY_BLOCK_SIZE) {
-   if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))
-   break;
-   bz >>= 1;
-   }
-
-   printk(KERN_DEBUG "memory block size : %ldMB\n", bz >> 20);
+   pr_debug("memory block size : %ldMB\n", bz >> 20);
 
return bz;
 }
-- 
2.5.0
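For reference, the control flow of the simplified probe_memory_block_size() can be exercised in userspace. This is a sketch under stated assumptions, not the kernel code: max_pfn and the UV check are passed in as parameters instead of coming from the kernel global and is_uv_system(), PAGE_SHIFT is 12, MIN_MEMORY_BLOCK_SIZE is assumed to be 128MB (1UL << 27, i.e. a SECTION_SIZE_BITS of 27 on x86_64), and unsigned long is assumed to be 64 bits wide. The function name is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins; x86_64 values assumed, not taken from the patch. */
#define PAGE_SHIFT 12
#define MIN_MEMORY_BLOCK_SIZE (1UL << 27) /* 128MB, assumes SECTION_SIZE_BITS == 27 */

/*
 * Mirrors the control flow of the simplified probe_memory_block_size():
 * max_pfn and is_uv are parameters here instead of the kernel global
 * and is_uv_system(). Assumes a 64-bit unsigned long.
 */
static unsigned long probe_memory_block_size_sketch(unsigned long max_pfn,
						     bool is_uv)
{
	unsigned long bz = MIN_MEMORY_BLOCK_SIZE;

	/* UV systems, or systems whose top PFN maps at or above 64GB, get 2GB blocks */
	if (is_uv || ((max_pfn << PAGE_SHIFT) >= (64UL << 30)))
		bz = 2UL << 30;

	return bz;
}
```

Under these assumptions, a machine whose top PFN maps to 64GB gets 2GB blocks, a 32GB non-UV machine falls back to MIN_MEMORY_BLOCK_SIZE, and a UV system gets 2GB blocks regardless of its size.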


Re: [PATCH v2] zbud: allow up to PAGE_SIZE allocations

2015-09-23 Thread Seth Jennings
On Wed, Sep 23, 2015 at 10:59:00PM +0200, Vitaly Wool wrote:
> Okay, how about this? It's gotten smaller BTW :)
> 
> zbud: allow up to PAGE_SIZE allocations
> 
> Currently zbud is only capable of allocating not more than
> PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE. This is okay as
> long as only zswap is using it, but other users of zbud may
> (and likely will) want to allocate up to PAGE_SIZE. This patch
> addresses that by skipping the creation of zbud internal
> structure in the beginning of an allocated page. As a zbud page
> is no longer guaranteed to contain zbud header, the following
> changes have to be applied throughout the code:
> * page->lru to be used for zbud page lists
> * page->private to hold 'under_reclaim' flag
> 
> page->private will also be used to indicate if this page contains
> a zbud header in the beginning or not ('headless' flag).
> 
> Signed-off-by: Vitaly Wool 
> ---
>  mm/zbud.c | 167 ++
>  1 file changed, 113 insertions(+), 54 deletions(-)
> 
> diff --git a/mm/zbud.c b/mm/zbud.c
> index fa48bcdf..3946fba 100644
> --- a/mm/zbud.c
> +++ b/mm/zbud.c
> @@ -105,18 +105,20 @@ struct zbud_pool {
>  
>  /*
>   * struct zbud_header - zbud page metadata occupying the first chunk of each
> - *   zbud page.
> + *   zbud page, except for HEADLESS pages
>   * @buddy:   links the zbud page into the unbuddied/buddied lists in the pool
> - * @lru: links the zbud page into the lru list in the pool
>   * @first_chunks:the size of the first buddy in chunks, 0 if free
>   * @last_chunks: the size of the last buddy in chunks, 0 if free
>   */
>  struct zbud_header {
>   struct list_head buddy;
> - struct list_head lru;
>   unsigned int first_chunks;
>   unsigned int last_chunks;
> - bool under_reclaim;
> +};
> +
> +enum zbud_page_flags {
> + UNDER_RECLAIM = 0,

Don't need the "= 0"

> + PAGE_HEADLESS,

Also, I think we should prefix the enum values here, perhaps with ZPF_?

>  };
>  
>  /*
> @@ -221,6 +223,7 @@ MODULE_ALIAS("zpool-zbud");
>  */
>  /* Just to make the code easier to read */
>  enum buddy {
> + HEADLESS,
>   FIRST,
>   LAST
>  };
> @@ -238,11 +241,14 @@ static int size_to_chunks(size_t size)
>  static struct zbud_header *init_zbud_page(struct page *page)
>  {
>   struct zbud_header *zhdr = page_address(page);
> +
> + INIT_LIST_HEAD(&page->lru);
> + clear_bit(UNDER_RECLAIM, &page->private);
> + clear_bit(HEADLESS, &page->private);

I know we are using private as a set of bit flags, but maybe we
should just initialize it with page->private = 0.

> +
>   zhdr->first_chunks = 0;
>   zhdr->last_chunks = 0;
>   INIT_LIST_HEAD(&zhdr->buddy);
> - INIT_LIST_HEAD(&zhdr->lru);
> - zhdr->under_reclaim = 0;
>   return zhdr;
>  }
>  
> @@ -267,11 +273,22 @@ static unsigned long encode_handle(struct zbud_header *zhdr, enum buddy bud)
>* over the zbud header in the first chunk.
>*/
>   handle = (unsigned long)zhdr;
> - if (bud == FIRST)
> + switch (bud) {
> + case FIRST:
>   /* skip over zbud header */
>   handle += ZHDR_SIZE_ALIGNED;
> - else /* bud == LAST */
> + break;
> + case LAST:
>   handle += PAGE_SIZE - (zhdr->last_chunks  << CHUNK_SHIFT);
> + break;
> + case HEADLESS:
> + break;
> + default:
> + /* this should never happen */
> + pr_err("zbud: invalid buddy value %d\n", bud);
> + handle = 0;
> + break;
> + }

Don't need this default case since we have a case for each valid value
of the enum.
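Here is a sketch of encode_handle() with one case per enumerator and no default label. It uses hypothetical userspace stand-ins for the zbud header and constants: 4KB pages, a CHUNK_SHIFT of 6, and a header padded to one 64-byte chunk. Leaving out the default is what lets GCC's -Wswitch (enabled by -Wall) warn if another enum buddy value is added later without a matching case.

```c
/* Hypothetical userspace stand-ins for the zbud constants (4KB pages assumed). */
#define PAGE_SIZE 4096UL
#define CHUNK_SHIFT 6
#define ZHDR_SIZE_ALIGNED (1UL << CHUNK_SHIFT)

enum buddy { HEADLESS, FIRST, LAST };

struct zbud_header_sketch {
	unsigned int first_chunks;
	unsigned int last_chunks;
};

/*
 * Every enumerator has a case and there is no default, so -Wswitch
 * flags this switch if enum buddy grows a value that is not handled.
 */
static unsigned long encode_handle_sketch(struct zbud_header_sketch *zhdr,
					  enum buddy bud)
{
	unsigned long handle = (unsigned long)zhdr;

	switch (bud) {
	case FIRST:
		handle += ZHDR_SIZE_ALIGNED;	/* skip over the zbud header */
		break;
	case LAST:
		handle += PAGE_SIZE - ((unsigned long)zhdr->last_chunks << CHUNK_SHIFT);
		break;
	case HEADLESS:
		break;	/* no header: the handle is the start of the page */
	}
	return handle;
}
```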

Also, I think we want to add some code to free_zbud_page() to clear
page->private and init page->lru so we don't leave dangling pointers.
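The page->private suggestions in this review can be sketched in userspace as shown below. They are: ZPF_-prefixed flag names, resetting page->private to 0 on init, and clearing it and re-initializing page->lru on free. struct page_sketch and the helper names are hypothetical stand-ins rather than kernel types, and plain bit operations replace the kernel's atomic set_bit()/clear_bit().

Distinct prefixes also appear to matter here. In the quoted patch, clear_bit(HEADLESS, &page->private) uses HEADLESS from enum buddy, whose value is 0. It therefore clears the same bit as UNDER_RECLAIM, not PAGE_HEADLESS.

```c
/* Hypothetical userspace stand-ins for the struct page fields zbud would use. */
struct list_head_sketch {
	struct list_head_sketch *next, *prev;
};

struct page_sketch {
	unsigned long private;
	struct list_head_sketch lru;
};

/* Prefixed flag bit numbers, as suggested in the review (ZPF_ = zbud page flag). */
enum zbud_page_flags {
	ZPF_UNDER_RECLAIM,
	ZPF_HEADLESS,
};

static void init_list_head_sketch(struct list_head_sketch *l)
{
	l->next = l;
	l->prev = l;
}

static void set_flag(struct page_sketch *p, enum zbud_page_flags f)
{
	p->private |= 1UL << f;
}

static int test_flag(const struct page_sketch *p, enum zbud_page_flags f)
{
	return (int)((p->private >> f) & 1UL);
}

/* On init, reset every flag at once instead of clearing bits one by one. */
static void init_zbud_page_sketch(struct page_sketch *p)
{
	init_list_head_sketch(&p->lru);
	p->private = 0;
}

/* On free, clear the flags and re-init lru so no stale pointers are left behind. */
static void free_zbud_page_sketch(struct page_sketch *p)
{
	p->private = 0;
	init_list_head_sketch(&p->lru);
	/* the real code would then hand the page back to the page allocator */
}
```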

Looks good though :)

Thanks,
Seth

>   return handle;
>  }
>  
> @@ -287,6 +304,7 @@ static int num_free_chunks(struct zbud_header *zhdr)
>   /*
>* Rather than branch for different situations, just use the fact that
>* free buddies have a length of zero to simplify everything.
> +  * NB: can't be used with HEADLESS pages.
>*/
>   return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks;
>  }
> @@ -353,31 +371,39 @@ void zbud_destroy_pool(struct zbud_pool *pool)
>  int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp,
>   unsigned long *handle)
>  {
> - int chunks, i, freechunks;
> + int chunks = 0, i, freechunks;
>   struct zbud_header *zhdr = NULL;
>   enum buddy bud;
>   struct page *page;
>  
>   if (!size || (gfp & __GFP_HIGHMEM))
>   return -EINVAL;
> - if (size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE)
> +
> + if (size > PAGE_SIZE)
>   return -ENOSPC;
> - chunks = size_to_chunks(size);
> - spin_lock(&pool->lock);
>  
> - /* First, try to find an un

Re: [PATCH v2] zbud: allow up to PAGE_SIZE allocations

2015-09-23 Thread Seth Jennings
On Wed, Sep 23, 2015 at 09:54:02AM +0200, Vitaly Wool wrote:
> On Wed, Sep 23, 2015 at 5:18 AM, Seth Jennings  wrote:
> > On Tue, Sep 22, 2015 at 02:17:33PM +0200, Vitaly Wool wrote:
> >> Currently zbud is only capable of allocating not more than
> >> PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE. This is okay as
> >> long as only zswap is using it, but other users of zbud may
> >> (and likely will) want to allocate up to PAGE_SIZE. This patch
> >> addresses that by skipping the creation of zbud internal
> >> structure in the beginning of an allocated page (such pages are
> >> then called 'headless').
> >
> > I guess I'm having trouble with this.  If you store a PAGE_SIZE
> > allocation in zbud, then the zpage can only have one allocation as there
> > is no room for a buddy.  So... we have an allocator for that: the
> > page allocator.
> >
> > zbud doesn't support this by design because, if you are only storing one
> > allocation per page, you don't gain anything.
> >
> > This functionality creates many new edge cases for the code.
> >
> > What is this use case you envision?  I think we need to discuss
> > whether the use case exists and if it justifies the added complexity.
> 
> The use case is to use zram with zbud as allocator via the common
> zpool api. Sometimes determinism and better worst-case time are more
> important than high compression ratio.
> As far as I can see, I'm not the only one who wants this case
> supported in mainline.

Ok, I can see that having the allocator backends for zpool 
have the same set of constraints is nice.

I'll look at your latest patch.

Thanks,
Seth

> 
> > We are crossing a boundary into zsmalloc style complexity with storing
> > stuff in the struct page, something I really didn't want to do in zbud.
> 
> Well, the thing is we need PAGE_SIZE allocations supported to use zram
> with zbud. I can of course add the code handling this in zpool but I
> am quite sure doing that in zbud directly is a better idea. I'm very
> keen on keeping the complexity down as much as possible though.
> 
> > zbud is the simple one, zsmalloc is the complex one.  I'd hate to have
> > two complex ones :-/
> 
> Who am I to disagree :) Keeping zbud simple is my goal, too, but once
> again, I'd really like it to support PAGE_SIZE allocations. And if it
> doesn't, the whole zpool thing for it becomes useless, since there
> will hardly be any zbud users other than zswap.
> 
> ~vitaly


Re: [PATCH v2] zbud: allow up to PAGE_SIZE allocations

2015-09-22 Thread Seth Jennings
On Tue, Sep 22, 2015 at 02:17:33PM +0200, Vitaly Wool wrote:
> Currently zbud is only capable of allocating not more than
> PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE. This is okay as
> long as only zswap is using it, but other users of zbud may
> (and likely will) want to allocate up to PAGE_SIZE. This patch
> addresses that by skipping the creation of zbud internal
> structure in the beginning of an allocated page (such pages are
> then called 'headless').

I guess I'm having trouble with this.  If you store a PAGE_SIZE
allocation in zbud, then the zpage can only have one allocation as there
is no room for a buddy.  So... we have an allocator for that: the
page allocator.

zbud doesn't support this by design because, if you are only storing one
allocation per page, you don't gain anything.
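To put numbers on the point above, assume 4KB pages and the chunk parameters mm/zbud.c used at the time: an NCHUNKS_ORDER of 6, giving 64-byte chunks, with the header padded to one chunk. The sketch below mirrors those macros in userspace; buddies_fit() is a hypothetical helper. Under these assumptions:

- A zbud page has 63 usable chunks.
- The pre-patch limit works out to 4096 - 64 - 64 = 3968 bytes.
- A PAGE_SIZE allocation needs 64 chunks, so it cannot share a page with a header, let alone a buddy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed values: 4KB pages and NCHUNKS_ORDER 6, mirroring mm/zbud.c of that era. */
#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)
#define NCHUNKS_ORDER 6
#define CHUNK_SHIFT (PAGE_SHIFT - NCHUNKS_ORDER)
#define CHUNK_SIZE (1UL << CHUNK_SHIFT)
#define ZHDR_SIZE_ALIGNED CHUNK_SIZE
#define NCHUNKS ((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT)

/* Round a size up to whole chunks, like zbud's size_to_chunks(). */
static size_t size_to_chunks(size_t size)
{
	return (size + CHUNK_SIZE - 1) >> CHUNK_SHIFT;
}

/* Largest size the pre-patch zbud_alloc() accepted. */
static const size_t zbud_max_alloc = PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE;

/* Can two allocations share one zbud page alongside the header? */
static bool buddies_fit(size_t a, size_t b)
{
	return size_to_chunks(a) + size_to_chunks(b) <= NCHUNKS;
}
```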

This functionality creates many new edge cases for the code.

What is this use case you envision?  I think we need to discuss
whether the use case exists and if it justifies the added complexity.

We are crossing a boundary into zsmalloc style complexity with storing
stuff in the struct page, something I really didn't want to do in zbud.

zbud is the simple one, zsmalloc is the complex one.  I'd hate to have
two complex ones :-/

Seth

> 
> As a zbud page is no longer guaranteed to contain zbud header, the
> following changes had to be applied throughout the code:
> * page->lru to be used for zbud page lists
> * page->private to hold 'under_reclaim' flag
> 
> page->private will also be used to indicate if this page contains
> a zbud header in the beginning or not ('headless' flag).
> 
> Signed-off-by: Vitaly Wool 
> ---
>  mm/zbud.c | 194 +-
>  1 file changed, 128 insertions(+), 66 deletions(-)
> 
> diff --git a/mm/zbud.c b/mm/zbud.c
> index fa48bcdf..7b51eb6 100644
> --- a/mm/zbud.c
> +++ b/mm/zbud.c
> @@ -105,18 +105,25 @@ struct zbud_pool {
>  
>  /*
>   * struct zbud_header - zbud page metadata occupying the first chunk of each
> - *   zbud page.
> + *   zbud page, except for HEADLESS pages
>   * @buddy:   links the zbud page into the unbuddied/buddied lists in the pool
> - * @lru: links the zbud page into the lru list in the pool
>   * @first_chunks:the size of the first buddy in chunks, 0 if free
>   * @last_chunks: the size of the last buddy in chunks, 0 if free
>   */
>  struct zbud_header {
>   struct list_head buddy;
> - struct list_head lru;
>   unsigned int first_chunks;
>   unsigned int last_chunks;
> - bool under_reclaim;
> +};
> +
> +/*
> + * struct zbud_page_priv - zbud flags to be stored in page->private
> + * @under_reclaim: if a zbud page is under reclaim
> + * @headless: indicates a page where zbud header didn't fit
> + */
> +struct zbud_page_priv {
> + bool under_reclaim:1;
> + bool headless:1;
>  };
>  
>  /*
> @@ -221,6 +228,7 @@ MODULE_ALIAS("zpool-zbud");
>  */
>  /* Just to make the code easier to read */
>  enum buddy {
> + HEADLESS,
>   FIRST,
>   LAST
>  };
> @@ -237,12 +245,15 @@ static int size_to_chunks(size_t size)
>  /* Initializes the zbud header of a newly allocated zbud page */
>  static struct zbud_header *init_zbud_page(struct page *page)
>  {
> + struct zbud_page_priv *ppriv = (struct zbud_page_priv *)page->private;
>   struct zbud_header *zhdr = page_address(page);
> +
> + INIT_LIST_HEAD(&page->lru);
> + ppriv->under_reclaim = 0;
> +
>   zhdr->first_chunks = 0;
>   zhdr->last_chunks = 0;
>   INIT_LIST_HEAD(&zhdr->buddy);
> - INIT_LIST_HEAD(&zhdr->lru);
> - zhdr->under_reclaim = 0;
>   return zhdr;
>  }
>  
> @@ -267,11 +278,22 @@ static unsigned long encode_handle(struct zbud_header *zhdr, enum buddy bud)
>* over the zbud header in the first chunk.
>*/
>   handle = (unsigned long)zhdr;
> - if (bud == FIRST)
> + switch (bud) {
> + case FIRST:
>   /* skip over zbud header */
>   handle += ZHDR_SIZE_ALIGNED;
> - else /* bud == LAST */
> + break;
> + case LAST:
>   handle += PAGE_SIZE - (zhdr->last_chunks  << CHUNK_SHIFT);
> + break;
> + case HEADLESS:
> + break;
> + default:
> + /* this should never happen */
> + pr_err("zbud: invalid buddy value %d\n", bud);
> + handle = 0;
> + break;
> + }
>   return handle;
>  }
>  
> @@ -287,6 +309,7 @@ static int num_free_chunks(struct zbud_header *zhdr)
>   /*
>* Rather than branch for different situations, just use the fact that
>* free buddies have a length of zero to simplify everything.
> +  * NB: can't be used with HEADLESS pages.
>*/
>   return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks;
>  }
> @@ -353,31 +376,40 @@ void zbud_destroy_pool(struct zbud_pool *pool)
>  int zbud_alloc(struct zbud_pool *po

Re: [PATCH 1/3] zpool: add zpool_has_pool()

2015-08-06 Thread Seth Jennings
On Thu, Aug 06, 2015 at 04:50:23PM -0500, Seth Jennings wrote:
> On Wed, Aug 05, 2015 at 03:06:59PM -0700, Andrew Morton wrote:
> > On Wed, 5 Aug 2015 18:00:26 -0400 Dan Streetman  wrote:
> > 
> > > >
> > > > If there's some reason why this can't happen, can we please have a code
> > > > comment which reveals that reason?
> > > 
> > > zpool_create_pool() should work if this returns true, unless as you
> > > say the module is rmmod'ed *and* removed from the system - since
> > > zpool_create_pool() will call request_module() just as this function
> > > does.  I can add a comment explaining that.
> > 
> > I like comments ;)
> > 
> > Seth, I'm planning on sitting on these patches until you've had a
> > chance to review them.
> 
> Thanks Andrew.  I'm reviewing now.  Patch 2/3 is pretty huge.  I've got
> the gist of the changes now.  I'm also building and testing for myself
> as this creates a lot more surface area for issues, alternating between
> compressors and allocating new compression transforms on the fly.
> 
> I'm kinda with Sergey on this in that it adds yet another complexity to
> an already complex feature.  This adds more locking, more RCU, more
> refcounting.  It's becoming harder to review, test, and verify.
> 
> I should have results tomorrow.

So I gave it a test run, turning all the knobs (compressor, enabled,
max_pool_percent, and zpool) like a crazy person, and it was stable;
all the adjustments had the expected result.

Dan, you might follow up with an update to Documentation/vm/zswap.txt
noting that these parameters are now adjustable at runtime.
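With this series the parameters become writable at runtime, so they can be changed by writing to the module's sysfs parameter files. Below is a minimal C sketch. It assumes the usual /sys/module/zswap/parameters/ location, root privileges, and a kernel that includes this patchset. set_module_param() is a hypothetical helper, and the target directory is an argument so the helper can also be pointed at an ordinary directory for testing.

```c
#include <stdio.h>

/*
 * Write a value to <param_dir>/<name>, e.g. /sys/module/zswap/parameters/compressor.
 * Hypothetical helper; returns 0 on success and -1 on any failure.
 */
static int set_module_param(const char *param_dir, const char *name,
			    const char *value)
{
	char path[256];
	FILE *f;
	int n = snprintf(path, sizeof(path), "%s/%s", param_dir, name);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path would be truncated */

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}
	/* sysfs reports write errors at flush time, so check fclose() too */
	return fclose(f) == 0 ? 0 : -1;
}
```

For example, set_module_param("/sys/module/zswap/parameters", "compressor", "lz4") should make new stores use lz4, assuming that compressor is available.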

The growing complexity is a concern, but it is nice to have the
flexibility.  Thanks for the good work!

To patchset:

Acked-by: Seth Jennings 

> 
> Thanks,
> Seth


Re: [PATCH 1/3] zpool: add zpool_has_pool()

2015-08-06 Thread Seth Jennings
On Wed, Aug 05, 2015 at 03:06:59PM -0700, Andrew Morton wrote:
> On Wed, 5 Aug 2015 18:00:26 -0400 Dan Streetman  wrote:
> 
> > >
> > > If there's some reason why this can't happen, can we please have a code
> > > comment which reveals that reason?
> > 
> > zpool_create_pool() should work if this returns true, unless as you
> > say the module is rmmod'ed *and* removed from the system - since
> > zpool_create_pool() will call request_module() just as this function
> > does.  I can add a comment explaining that.
> 
> I like comments ;)
> 
> Seth, I'm planning on sitting on these patches until you've had a
> chance to review them.

Thanks Andrew.  I'm reviewing now.  Patch 2/3 is pretty huge.  I've got
the gist of the changes now.  I'm also building and testing for myself
as this creates a lot more surface area for issues, alternating between
compressors and allocating new compression transforms on the fly.

I'm kinda with Sergey on this in that it adds yet another complexity to
an already complex feature.  This adds more locking, more RCU, more
refcounting.  It's becoming harder to review, test, and verify.

I should have results tomorrow.

Thanks,
Seth


[PATCH] sb_edac: fix TAD presence check for sbridge_mci_bind_devs()

2015-08-05 Thread Seth Jennings
In 7d375bff, NUM_CHANNELS was changed to 8 and the channel space was
renumbered to handle EN, EP, and EX configurations.

The *_mci_bind_devs functions, except for sbridge_mci_bind_devs(), got a
new device presence check in the form of saw_chan_mask.  However,
sbridge_mci_bind_devs() still checks presence with a for loop over
NUM_CHANNELS.

With the increase in NUM_CHANNELS, this loop fails at index 4 since
SB only has 4 TADs.  This results in the following error on SB machines:

EDAC sbridge: Some needed devices are missing
EDAC sbridge: Couldn't find mci handler
EDAC sbridge: Couldn't find mci handle

This patch adapts the saw_chan_mask logic for sbridge_mci_bind_devs() as
well.

After this patch:

EDAC MC0: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#0: DEV :3f:0e.0 (POLLED)
EDAC MC1: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#1: DEV :7f:0e.0 (POLLED)

Signed-off-by: Seth Jennings 
---
 drivers/edac/sb_edac.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index ca78311..91cf710 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -1648,6 +1648,7 @@ static int sbridge_mci_bind_devs(struct mem_ctl_info *mci,
 {
struct sbridge_pvt *pvt = mci->pvt_info;
struct pci_dev *pdev;
+   u8 saw_chan_mask = 0;
int i;
 
for (i = 0; i < sbridge_dev->n_devs; i++) {
@@ -1681,6 +1682,7 @@ static int sbridge_mci_bind_devs(struct mem_ctl_info *mci,
{
int id = pdev->device - PCI_DEVICE_ID_INTEL_SBRIDGE_IMC_TAD0;
pvt->pci_tad[id] = pdev;
+   saw_chan_mask |= 1 << id;
}
break;
case PCI_DEVICE_ID_INTEL_SBRIDGE_IMC_DDRIO:
@@ -1701,10 +1703,8 @@ static int sbridge_mci_bind_devs(struct mem_ctl_info *mci,
!pvt-> pci_tad || !pvt->pci_ras  || !pvt->pci_ta)
goto enodev;
 
-   for (i = 0; i < NUM_CHANNELS; i++) {
-   if (!pvt->pci_tad[i])
-   goto enodev;
-   }
+   if (saw_chan_mask != 0x0f)
+   goto enodev;
return 0;
 
 enodev:
-- 
2.4.3
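The difference between the two presence checks can be modeled in userspace. The function names are hypothetical. NUM_CHANNELS is 8 and Sandy Bridge exposes four TADs, as stated in the commit message. The old loop requires all eight slots to be populated and so fails at slot 4. The mask check only requires TAD0 through TAD3 to have been seen.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_CHANNELS 8	/* value after 7d375bff, per the commit message */

/*
 * Pre-patch style check: every one of NUM_CHANNELS slots must be populated.
 * With only 4 TADs on Sandy Bridge, slot 4 is NULL and this fails.
 */
static bool tad_check_old(const void *const pci_tad[NUM_CHANNELS])
{
	for (int i = 0; i < NUM_CHANNELS; i++)
		if (!pci_tad[i])
			return false;
	return true;
}

/*
 * Post-patch style check: record each TAD id seen while binding devices
 * and require exactly TAD0..TAD3 (mask 0x0f).
 */
static bool tad_check_new(const int *tad_ids, size_t n)
{
	unsigned int saw_chan_mask = 0;

	for (size_t i = 0; i < n; i++)
		saw_chan_mask |= 1u << tad_ids[i];
	return saw_chan_mask == 0x0f;
}
```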



Re: [PATCH] zswap: dynamic pool creation

2015-06-18 Thread Seth Jennings
On Thu, Jun 11, 2015 at 01:51:45PM -0500, Seth Jennings wrote:
> On Wed, Jun 10, 2015 at 04:54:24PM -0400, Dan Streetman wrote:
> > On Thu, Jun 4, 2015 at 8:13 AM, Dan Streetman  wrote:
> > > On Thu, Jun 4, 2015 at 8:02 AM, Dan Streetman  wrote:
> > >> Add dynamic creation of pools.  Move the static crypto compression
> > >> per-cpu transforms into each pool.  Add a pointer to zswap_entry to
> > >> the pool it's in.
> > >
> > > Seth, as far as the design, from your previous comments I assume you
> > > were thinking of maintaining separate lists of zpools and compressors?
> > >  I do see how that will reduce duplication of zpools and compressors,
> > > but it also requires adding a new pointer to each zswap_entry, and
> > > increasing the amount of code to manage each list separately.  And the
> > > most common case in zswap will be just a single zpool and compressor,
> > > not repeatedly changing params.  What do you think?
> > 
> > Any opinion on this patch?  If you want, I can break it up so there's
> > a list of zpools and a list of compressors.  Either the combined way
> > (this patch) or separate lists works for me, as long as the params are
> > changeable at runtime :-)
> 
> I'm still reviewing the code.  I was going to test it too but it doesn't
> compile for me:
> 
>   CC  mm/zswap.o
> mm/zswap.c: In function ‘__zswap_pool_create_fallback’:
> mm/zswap.c:605:10: warning: argument to ‘sizeof’ in ‘strncpy’ call is the same expression as the destination; did you mean to provide an explicit length? [-Wsizeof-pointer-memaccess]
> sizeof(zswap_compressor));
>   ^
> mm/zswap.c:607:7: error: implicit declaration of function ‘zpool_has_pool’ [-Werror=implicit-function-declaration]
>   if (!zpool_has_pool(zswap_zpool_type)) {
>^
> mm/zswap.c:611:10: warning: argument to ‘sizeof’ in ‘strncpy’ call is the same expression as the destination; did you mean to provide an explicit length? [-Wsizeof-pointer-memaccess]
> sizeof(zswap_zpool_type));
>   ^
> mm/zswap.c: At top level:
> mm/zswap.c:664:1: error: expected identifier or ‘(’ before ‘}’ token
>  }
>  ^
> mm/zswap.c:99:22: warning: ‘zswap_pool’ defined but not used [-Wunused-variable]
>  static struct zpool *zswap_pool;
>   ^
> mm/zswap.c:531:27: warning: ‘zswap_pool_find_get’ defined but not used [-Wunused-function]
>  static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
>^

Dan, I never heard back from you on this, but I figured it out.  PATCH
1/5 from your original patchset wasn't pulled in, but you didn't roll it
into this new patch either.

Seth


> 
> Seth
> 
> > 
> > 
> > >
> > >>
> > >> This is required by a separate patch which enables changing the
> > >> zswap zpool and compressor params at runtime.
> > >>
> > >> Signed-off-by: Dan Streetman 
> > >> ---
> > >>  mm/zswap.c | 550 +
> > >>  1 file changed, 408 insertions(+), 142 deletions(-)
> > >>
> > >> diff --git a/mm/zswap.c b/mm/zswap.c
> > >> index 2d5727b..fc93770 100644
> > >> --- a/mm/zswap.c
> > >> +++ b/mm/zswap.c
> > >> @@ -99,66 +99,19 @@ module_param_named(zpool, zswap_zpool_type, charp, 0444);
> > >>  static struct zpool *zswap_pool;
> > >>
> > >>  /*
> > >> -* compression functions
> > >> +* data structures
> > >>  **/
> > >> -/* per-cpu compression transforms */
> > >> -static struct crypto_comp * __percpu *zswap_comp_pcpu_tfms;
> > >>
> > >> -enum comp_op {
> > >> -   ZSWAP_COMPOP_COMPRESS,
> > >> -   ZSWAP_COMPOP_DECOMPRESS
> > >> +struct zswap_pool {
> > >> +   struct zpool *zpool;
> > >> +   struct kref kref;
> > >> +   struct list_head list;
> > >> +   struct rcu_head rcu_head;
> > >> +   struct notifier_block notifier;
> > >> +   char tfm_name[CRYPTO_MAX_ALG_NAME];
> > >> +   struct crypto_comp * __percpu *tfm;
> > >>  };
> > >>
> > >> -static int zswap_comp_op(enum comp_op op, const u8 *src, unsigned int slen,
> > >> -   u8 *dst, unsigned int *dlen)
> >

Re: [PATCH] zswap: dynamic pool creation

2015-06-18 Thread Seth Jennings
On Wed, Jun 17, 2015 at 07:13:31PM -0400, Dan Streetman wrote:
> On Wed, Jun 10, 2015 at 4:54 PM, Dan Streetman  wrote:
> > On Thu, Jun 4, 2015 at 8:13 AM, Dan Streetman  wrote:
> >> On Thu, Jun 4, 2015 at 8:02 AM, Dan Streetman  wrote:
> >>> Add dynamic creation of pools.  Move the static crypto compression
> >>> per-cpu transforms into each pool.  Add a pointer to zswap_entry to
> >>> the pool it's in.
> >>
> >> Seth, as far as the design, from your previous comments I assume you
> >> were thinking of maintaining separate lists of zpools and compressors?
> >>  I do see how that will reduce duplication of zpools and compressors,
> >> but it also requires adding a new pointer to each zswap_entry, and
> >> increasing the amount of code to manage each list separately.  And the
> >> most common case in zswap will be just a single zpool and compressor,
> >> not repeatedly changing params.  What do you think?
> >
> > Any opinion on this patch?  If you want, I can break it up so there's
> > a list of zpools and a list of compressors.  Either the combined way
> > (this patch) or separate lists works for me, as long as the params are
> > changeable at runtime :-)
> 
> You on vacation Seth?  Let me know what direction you prefer for this...

I think it is as good as it can be.  Complicated, but I don't know of a
way to make it less so if we want to be able to do this (which is a
separate discussion).

Seth

> 
> >
> >
> >>
> >>>
> >>> This is required by a separate patch which enables changing the
> >>> zswap zpool and compressor params at runtime.
> >>>
> >>> Signed-off-by: Dan Streetman 
> >>> ---
> >>>  mm/zswap.c | 550 +
> >>>  1 file changed, 408 insertions(+), 142 deletions(-)
> >>>
> >>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>> index 2d5727b..fc93770 100644
> >>> --- a/mm/zswap.c
> >>> +++ b/mm/zswap.c
> >>> @@ -99,66 +99,19 @@ module_param_named(zpool, zswap_zpool_type, charp, 0444);
> >>>  static struct zpool *zswap_pool;
> >>>
> >>>  /*
> >>> -* compression functions
> >>> +* data structures
> >>>  **/
> >>> -/* per-cpu compression transforms */
> >>> -static struct crypto_comp * __percpu *zswap_comp_pcpu_tfms;
> >>>
> >>> -enum comp_op {
> >>> -   ZSWAP_COMPOP_COMPRESS,
> >>> -   ZSWAP_COMPOP_DECOMPRESS
> >>> +struct zswap_pool {
> >>> +   struct zpool *zpool;
> >>> +   struct kref kref;
> >>> +   struct list_head list;
> >>> +   struct rcu_head rcu_head;
> >>> +   struct notifier_block notifier;
> >>> +   char tfm_name[CRYPTO_MAX_ALG_NAME];
> >>> +   struct crypto_comp * __percpu *tfm;
> >>>  };
> >>>
> >>> -static int zswap_comp_op(enum comp_op op, const u8 *src, unsigned int slen,
> >>> -   u8 *dst, unsigned int *dlen)
> >>> -{
> >>> -   struct crypto_comp *tfm;
> >>> -   int ret;
> >>> -
> >>> -   tfm = *per_cpu_ptr(zswap_comp_pcpu_tfms, get_cpu());
> >>> -   switch (op) {
> >>> -   case ZSWAP_COMPOP_COMPRESS:
> >>> -   ret = crypto_comp_compress(tfm, src, slen, dst, dlen);
> >>> -   break;
> >>> -   case ZSWAP_COMPOP_DECOMPRESS:
> >>> -   ret = crypto_comp_decompress(tfm, src, slen, dst, dlen);
> >>> -   break;
> >>> -   default:
> >>> -   ret = -EINVAL;
> >>> -   }
> >>> -
> >>> -   put_cpu();
> >>> -   return ret;
> >>> -}
> >>> -
> >>> -static int __init zswap_comp_init(void)
> >>> -{
> >>> -   if (!crypto_has_comp(zswap_compressor, 0, 0)) {
> >>> -   pr_info("%s compressor not available\n", zswap_compressor);
> >>> -   /* fall back to default compressor */
> >>> -   zswap_compressor = ZSWAP_COMPRESSOR_DEFAULT;
> >>> -   if (!crypto_has_comp(zswap_compressor, 0, 0))
> >>> -   /* can't even load the default compressor */
> >>> -   return -ENODEV;
> >>> -   }
> >>> -   pr_info("using %s compressor\n", zswap_compressor);
> >>> -
> >>> -   /* alloc percpu transforms */
> >>> -   zswap_comp_pcpu_tfms = alloc_percpu(struct crypto_comp *);
> >>> -   if (!zswap_comp_pcpu_tfms)
> >>> -   return -ENOMEM;
> >>> -   return 0;
> >>> -}
> >>> -
> >>> -static void __init zswap_comp_exit(void)
> >>> -{
> >>> -   /* free percpu transforms */
> >>> -   free_percpu(zswap_comp_pcpu_tfms);
> >>> -}
> >>> -
> >>> -/*
> >>> -* data structures
> >>> -**/
> >>>  /*
> >>>   * struct zswap_entry
> >>>   *
> >>> @@ -166,22 +119,24 @@ static void __init zswap_comp_exit(void)
> >>>   * page within zswap.
> >>>   *
> >>>   * rbnode - links the entry into red-black tree for the appropriate swap type
> >>> + * offset - the swap offset for the entry.  Index into the red-black tree.
> >>>   * refcount - the n

Re: [PATCH] zswap: dynamic pool creation

2015-06-11 Thread Seth Jennings
On Wed, Jun 10, 2015 at 04:54:24PM -0400, Dan Streetman wrote:
> On Thu, Jun 4, 2015 at 8:13 AM, Dan Streetman  wrote:
> > On Thu, Jun 4, 2015 at 8:02 AM, Dan Streetman  wrote:
> >> Add dynamic creation of pools.  Move the static crypto compression
> >> per-cpu transforms into each pool.  Add a pointer to zswap_entry to
> >> the pool it's in.
> >
> > Seth, as far as the design, from your previous comments I assume you
> > were thinking of maintaining separate lists of zpools and compressors?
> >  I do see how that will reduce duplication of zpools and compressors,
> > but it also requires adding a new pointer to each zswap_entry, and
> > increasing the amount of code to manage each list separately.  And the
> > most common case in zswap will be just a single zpool and compressor,
> > not repeatedly changing params.  What do you think?
> 
> Any opinion on this patch?  If you want, I can break it up so there's
> a list of zpools and a list of compressors.  Either the combined way
> (this patch) or separate lists works for me, as long as the params are
> changeable at runtime :-)

I'm still reviewing the code.  I was going to test it too but it doesn't
compile for me:

  CC  mm/zswap.o
mm/zswap.c: In function ‘__zswap_pool_create_fallback’:
mm/zswap.c:605:10: warning: argument to ‘sizeof’ in ‘strncpy’ call is the same expression as the destination; did you mean to provide an explicit length? [-Wsizeof-pointer-memaccess]
sizeof(zswap_compressor));
  ^
mm/zswap.c:607:7: error: implicit declaration of function ‘zpool_has_pool’ [-Werror=implicit-function-declaration]
  if (!zpool_has_pool(zswap_zpool_type)) {
   ^
mm/zswap.c:611:10: warning: argument to ‘sizeof’ in ‘strncpy’ call is the same expression as the destination; did you mean to provide an explicit length? [-Wsizeof-pointer-memaccess]
sizeof(zswap_zpool_type));
  ^
mm/zswap.c: At top level:
mm/zswap.c:664:1: error: expected identifier or ‘(’ before ‘}’ token
 }
 ^
mm/zswap.c:99:22: warning: ‘zswap_pool’ defined but not used [-Wunused-variable]
 static struct zpool *zswap_pool;
  ^
mm/zswap.c:531:27: warning: ‘zswap_pool_find_get’ defined but not used [-Wunused-function]
 static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
   ^
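The -Wsizeof-pointer-memaccess warnings in this log flag strncpy() calls whose length is sizeof of the destination expression itself. When that destination is a char pointer, as charp module parameters such as zswap_zpool_type are, sizeof yields the size of the pointer rather than of any buffer. Below is a minimal sketch of one way to bound such a copy by passing the real buffer size explicitly. copy_param() is a hypothetical helper and is not necessarily how the patch was eventually fixed.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst, bounded by the real size of dst. With a pointer
 * destination such as "static char *param", sizeof(param) would be the
 * pointer size (8 on LP64), which is what the warning is about.
 */
static void copy_param(char *dst, size_t dst_size, const char *src)
{
	if (dst_size == 0)
		return;
	strncpy(dst, src, dst_size - 1);
	dst[dst_size - 1] = '\0';	/* strncpy does not terminate on truncation */
}
```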

Seth

> 
> 
> >
> >>
> >> This is required by a separate patch which enables changing the
> >> zswap zpool and compressor params at runtime.
> >>
> >> Signed-off-by: Dan Streetman 
> >> ---
> >>  mm/zswap.c | 550 +
> >>  1 file changed, 408 insertions(+), 142 deletions(-)
> >>
> >> diff --git a/mm/zswap.c b/mm/zswap.c
> >> index 2d5727b..fc93770 100644
> >> --- a/mm/zswap.c
> >> +++ b/mm/zswap.c
> >> @@ -99,66 +99,19 @@ module_param_named(zpool, zswap_zpool_type, charp, 0444);
> >>  static struct zpool *zswap_pool;
> >>
> >>  /*
> >> -* compression functions
> >> +* data structures
> >>  **/
> >> -/* per-cpu compression transforms */
> >> -static struct crypto_comp * __percpu *zswap_comp_pcpu_tfms;
> >>
> >> -enum comp_op {
> >> -   ZSWAP_COMPOP_COMPRESS,
> >> -   ZSWAP_COMPOP_DECOMPRESS
> >> +struct zswap_pool {
> >> +   struct zpool *zpool;
> >> +   struct kref kref;
> >> +   struct list_head list;
> >> +   struct rcu_head rcu_head;
> >> +   struct notifier_block notifier;
> >> +   char tfm_name[CRYPTO_MAX_ALG_NAME];
> >> +   struct crypto_comp * __percpu *tfm;
> >>  };
> >>
> >> -static int zswap_comp_op(enum comp_op op, const u8 *src, unsigned int slen,
> >> -   u8 *dst, unsigned int *dlen)
> >> -{
> >> -   struct crypto_comp *tfm;
> >> -   int ret;
> >> -
> >> -   tfm = *per_cpu_ptr(zswap_comp_pcpu_tfms, get_cpu());
> >> -   switch (op) {
> >> -   case ZSWAP_COMPOP_COMPRESS:
> >> -   ret = crypto_comp_compress(tfm, src, slen, dst, dlen);
> >> -   break;
> >> -   case ZSWAP_COMPOP_DECOMPRESS:
> >> -   ret = crypto_comp_decompress(tfm, src, slen, dst, dlen);
> >> -   break;
> >> -   default:
> >> -   ret = -EINVAL;
> >> -   }
> >> -
> >> -   put_cpu();
> >> -   return ret;
> >> -}
> >> -
> >> -static int __init zswap_comp_init(void)
> >> -{
> >> -   if (!crypto_has_comp(zswap_compressor, 0, 0)) {
> >> -   pr_info("%s compressor not available\n", zswap_compressor);
> >> -   /* fall back to default compressor */
> >> -   zswap_compressor = ZSWAP_COMPRESSOR_DEFAULT;
> >> -   if (!crypto_has_comp(zswap_compressor, 0, 0))
> >> -   /* can't even load the default compressor */
> >> -   return -ENODEV;
> >> -   }
> >> -   pr_info("using %s compressor\n", zswap_compressor);
> >> -
> >> -   /* alloc percpu transforms */
> >> - 

Re: [PATCH 0/5] zswap: make params runtime changeable

2015-06-02 Thread Seth Jennings
On Tue, Jun 02, 2015 at 11:11:52AM -0400, Dan Streetman wrote:
> This patch series allows setting all zswap params at runtime, instead
> of only being settable at boot-time.
> 
> The changes to zswap are rather large, due to the creation of zswap pools,
> which contain both a compressor function as well as a zpool.  When either
> the compressor or zpool param is changed at runtime, a new zswap pool is
> created with the new compressor and zpool, and used for all new compressed
> pages.  Any old zswap pools that still contain pages are retained only to
> load pages from, and destroyed once they become empty.
> 
> One notable change required for this to work is to split the currently
> global kernel param mutex into a global mutex only for built-in params,
> and a per-module mutex for loadable module params.  The reason this change
> is required is because zswap's compressor and zpool param handler callback
> functions attempt to load, via crypto_has_comp() and the new zpool_has_pool()
> functions, any required compressor or zpool modules.  The problem there is
> that the zswap param callback functions run while the global param mutex is
> locked, but when they attempt to load another module, if the loading module
> has any params set e.g. via /etc/modprobe.d/*.conf, modprobe will also try
> to take the global param mutex, and a deadlock will result, with the mutex
> held by the zswap param callback which is waiting for modprobe, but modprobe
> waiting for the mutex to change the loading module's param.  Using a
> per-module mutex for all loadable modules prevents this, since each module
> will take its own mutex and never conflict with another module's param
> changes.

Nice work Dan :)

I'm trying to look at this as three different efforts. In order of
increasing difficulty:
- Enabling/disabling zswap at runtime
- Changing the compressor at runtime, which doesn't involve the zpool layer
- Changing the allocator (type) at runtime which does involve the zpool layer.

In other words, entries that use different compressors can be stored in
the same zpool, but entries from different allocators cannot.

Enabling zswap at runtime is very straightforward, especially if you
aren't going to attempt to flush out all the pages on a disable; only
prevent new stores.  I like that.

Changing the compressor at runtime is the next easiest one, since you
have to allocate new compressor transforms, but not a new zpool.  You
just store which compressor was used on a per-entry basis.

Changing the allocator (type) is the hardest since it involves a new
zpool, and all the code for managing multiple zpools in zswap.

This is a lot of change all at once.  Maybe we could just do the runtime
enable/disable of zswap and the runtime change of compressors first?  I
think those two alone would be a lot less invasive.  Then we can look at
runtime change of the allocator as a separate thing.

Thanks,
Seth

> 
> 
> Dan Streetman (5):
>   zpool: add zpool_has_pool()
>   module: add per-module params lock
>   zswap: runtime enable/disable
>   zswap: dynamic pool creation
>   zswap: change zpool/compressor at runtime
> 
>  arch/um/drivers/hostaudio_kern.c |  20 +-
>  drivers/net/ethernet/myricom/myri10ge/myri10ge.c |   6 +-
>  drivers/net/wireless/libertas_tf/if_usb.c|   6 +-
>  drivers/usb/atm/ueagle-atm.c |   4 +-
>  drivers/video/fbdev/vt8623fb.c   |   4 +-
>  include/linux/module.h   |   1 +
>  include/linux/moduleparam.h  |  67 +--
>  include/linux/zpool.h|   2 +
>  kernel/module.c  |   1 +
>  kernel/params.c  |  45 +-
>  mm/zpool.c   |  25 +
>  mm/zswap.c   | 696 
> +--
>  net/mac80211/rate.c  |   4 +-
>  13 files changed, 640 insertions(+), 241 deletions(-)
> 
> -- 
> 2.1.0
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/5] zswap: runtime enable/disable

2015-06-02 Thread Seth Jennings
On Tue, Jun 02, 2015 at 11:11:55AM -0400, Dan Streetman wrote:
> Change the "enabled" parameter to be configurable at runtime.  Remove
> the enabled check from init(), and move it to the frontswap store()
> function; when enabled, pages will be stored, and when disabled, pages
> won't be stored.

I like this one. So much so I wrote it about 2 years ago :)

http://lkml.iu.edu/hypermail/linux/kernel/1307.2/04289.html

It didn't go in though and I forgot about it.

We need to update the documentation too (see my patch).

Thanks,
Seth

> 
> Signed-off-by: Dan Streetman 
> ---
>  mm/zswap.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 4249e82..e070b10 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -75,9 +75,10 @@ static u64 zswap_duplicate_entry;
>  /*
>  * tunables
>  **/
> -/* Enable/disable zswap (disabled by default, fixed at boot for now) */
> -static bool zswap_enabled __read_mostly;
> -module_param_named(enabled, zswap_enabled, bool, 0444);
> +
> +/* Enable/disable zswap (disabled by default) */
> +static bool zswap_enabled;
> +module_param_named(enabled, zswap_enabled, bool, 0644);
>  
>  /* Compressor to be used by zswap (fixed at boot for now) */
>  #define ZSWAP_COMPRESSOR_DEFAULT "lzo"
> @@ -648,6 +649,9 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>   u8 *src, *dst;
>   struct zswap_header *zhdr;
>  
> + if (!zswap_enabled)
> + return -EPERM;
> +
>   if (!tree) {
>   ret = -ENODEV;
>   goto reject;
> @@ -901,9 +905,6 @@ static int __init init_zswap(void)
>  {
>   gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN;
>  
> - if (!zswap_enabled)
> - return 0;
> -
>   pr_info("loading zswap\n");
>  
>   zswap_pool = zpool_create_pool(zswap_zpool_type, "zswap", gfp,
> -- 
> 2.1.0
> 


Re: [RFC PATCH 6/9] livepatch: create per-task consistency model

2015-02-10 Thread Seth Jennings
On Mon, Feb 09, 2015 at 11:31:18AM -0600, Josh Poimboeuf wrote:
> Add a basic per-task consistency model.  This is the foundation which
> will eventually enable us to patch those ~10% of security patches which
> change function prototypes and/or data semantics.
> 
> When a patch is enabled, livepatch enters into a transition state where
> tasks are converging from the old universe to the new universe.  If a
> given task isn't using any of the patched functions, it's switched to
> the new universe.  Once all the tasks have been converged to the new
> universe, patching is complete.
> 
> The same sequence occurs when a patch is disabled, except the tasks
> converge from the new universe to the old universe.
> 
> The /sys/kernel/livepatch//transition file shows whether a patch
> is in transition.  Only a single patch (the topmost patch on the stack)
> can be in transition at a given time.  A patch can remain in the
> transition state indefinitely, if any of the tasks are stuck in the
> previous universe.
> 
> A transition can be reversed and effectively canceled by writing the
> opposite value to the /sys/kernel/livepatch//enabled file while
> the transition is in progress.  Then all the tasks will attempt to
> converge back to the original universe.
> 
> Signed-off-by: Josh Poimboeuf 
> ---
>  include/linux/livepatch.h |  18 ++-
>  include/linux/sched.h |   3 +
>  kernel/fork.c |   2 +
>  kernel/livepatch/Makefile |   2 +-
>  kernel/livepatch/core.c   |  71 ++
>  kernel/livepatch/patch.c  |  34 -
>  kernel/livepatch/patch.h  |   1 +
>  kernel/livepatch/transition.c | 300 
> ++
>  kernel/livepatch/transition.h |  16 +++
>  kernel/sched/core.c   |   2 +
>  10 files changed, 423 insertions(+), 26 deletions(-)
>  create mode 100644 kernel/livepatch/transition.c
>  create mode 100644 kernel/livepatch/transition.h
> 

> diff --git a/kernel/livepatch/transition.h b/kernel/livepatch/transition.h
> new file mode 100644
> index 000..ba9a55c
> --- /dev/null
> +++ b/kernel/livepatch/transition.h
> @@ -0,0 +1,16 @@
> +#include 
> +
> +enum {
> + KLP_UNIVERSE_UNDEFINED = -1,
> + KLP_UNIVERSE_OLD,
> + KLP_UNIVERSE_NEW,
> +};
> +
> +extern struct mutex klp_mutex;

klp_mutex isn't defined in transition.c.  Maybe this extern should be in
the transition.c file or in a core.h file, since core.c provides the
definition?

Thanks,
Seth

> +extern struct klp_patch *klp_transition_patch;
> +
> +extern void klp_init_transition(struct klp_patch *patch, int universe);
> +extern void klp_start_transition(int universe);
> +extern void klp_reverse_transition(void);
> +extern void klp_try_complete_transition(void);
> +extern void klp_complete_transition(void);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 78d91e6..7b877f4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -74,6 +74,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -4601,6 +4602,7 @@ void init_idle(struct task_struct *idle, int cpu)
>  #if defined(CONFIG_SMP)
>   sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
>  #endif
> + klp_update_task_universe(idle);
>  }
>  
>  int cpuset_cpumask_can_shrink(const struct cpumask *cur,
> -- 
> 2.1.0
> 


Re: [RFC PATCH 0/9] mm/zbud: support highmem pages

2015-01-27 Thread Seth Jennings
On Tue, Nov 04, 2014 at 10:33:43AM -0600, Seth Jennings wrote:
> On Tue, Oct 14, 2014 at 08:59:19PM +0900, Heesub Shin wrote:
> > zbud is a memory allocator for storing compressed data pages. It keeps
> > two data objects of arbitrary size on a single page. This simple design
> > provides very deterministic behavior on reclamation, which is one of
> > reasons why zswap selected zbud as a default allocator over zsmalloc.
> > 
> > Unlike zsmalloc, however, zbud does not support highmem. This is
> > problematic especially on 32-bit machines having relatively small
> > lowmem. Compressing anonymous pages from highmem and storing them into
> > lowmem could eat up lowmem spaces.
> > 
> > This limitation is due to the fact that zbud manages its internal data
> > structures on zbud_header which is kept in the head of zbud_page. For
> > example, zbud_pages are tracked by several lists and have some status
> > information, which are being referenced at any time by the kernel. Thus,
> > zbud_pages should be allocated on a memory region directly mapped,
> > lowmem.
> > 
> > After some digging out, I found that internal data structures of zbud
> > can be kept in the struct page, the same way as zsmalloc does. So, this
> > series moves out all fields in zbud_header to struct page. Though it
> > alters quite a lot, it does not add any functional differences except
> > highmem support. I am afraid that this kind of modification abusing
> > several fields in struct page would be ok.
> 
> Hi Heesub,
> 
> Sorry for the very late reply.  The end of October was very busy for me.
> 
> A little history on zbud.  I didn't put the metadata in the struct
> page, even though I knew that was an option since we had done it with
> zsmalloc. At the time, Andrew Morton had concerns about memmap walkers
> getting messed up with unexpected values in the struct page fields.  In
> order to smooth zbud's acceptance, I decided to store the metadata
> inline in the page itself.
> 
> Later, zsmalloc eventually got accepted, which basically gave the
> impression that putting the metadata in the struct page was acceptable.
> 
> I have recently been looking at implementing compaction for zsmalloc,
> but having the metadata in the struct page and having the handle
> directly encode the PFN and offset of the data block prevents
> transparent relocation of the data. zbud has a similar issue as it
> currently encodes the page address in the handle returned to the user
> (also the limitation that is preventing use of highmem pages).
> 
> I would like to implement compaction for zbud too and moving the
> metadata into the struct page is going to work against that. In fact,
> I'm looking at the option of converting the current zbud_header into a
> per-allocation metadata structure, which would provide a layer of
> indirection between zbud and the user, allowing for transparent
> relocation and compaction.

I had some downtime and started thinking about this again today (after
3 months).

Upon further reflection, I really like this and don't think that it
inhibits introducing compaction later.

There are just a few places that look messy or problematic to me:

1. the use of page->private and masking the number of chunks for both
buddies into it (see suggestion for overlay struct below)
2. the use of the second double word &page->index to store a list_head

#2 might be problematic because, IIRC, memmap walkers will check _count
(or _mapcount).  I think we ran into this in zsmalloc.

Initially, when working on zsmalloc, I just created a structure that
overlaid the struct page in the memmap, reserving the flags and _count
areas, so that I wouldn't have to be bound by the field names/boundaries
in the struct page.

IIRC, Andrew was initially against that, but he was also against the
whole idea of using the struct page fields for random stuff... and that
ended up being accepted.

This code looks really good!  I think with a little cleanup and finding
a way to steer clear of using the _count part of the structure, this
will be great.

Sorry for dismissing it earlier.  Didn't give it enough credit.

Thanks,
Seth

> 
> However, I do like the part about letting zbud use highmem pages.
> 
> I have something in mind that would allow highmem pages _and_ move
> toward something that would support compaction.  I'll see if I can put
> it into code today.
> 
> Thanks,
> Seth
> 
> > 
> > Heesub Shin (9):
> >   mm/zbud: tidy up a bit
> >   mm/zbud: remove buddied list from zbud_pool
> >   mm/zbud: remove lru from zbud_header
> >   mm/zbud: remove first|last_chunks from zbud_header
> >   mm/zbud: encode zbud handle using struct page
>

Re: [PATCH 1/2] livepatch: Revert "livepatch: enforce patch stacking semantics"

2015-01-21 Thread Seth Jennings
On Wed, Jan 21, 2015 at 03:06:38PM +0100, Jiri Kosina wrote:
> On Wed, 21 Jan 2015, Li Bin wrote:
> 
> > This reverts commit 83a90bb1345767f0cb96d242fd8b9db44b2b0e17.
> > 
> > The method that only allowing the topmost patch on the stack to be
> > enabled or disabled is unreasonable. Such as the following case:
> > 
> > - do live patch1
> > - disable patch1
> > - do live patch2 //error
> >
> > Now, we will never be able to do new live patch unless disabling the
> > patch1 although there is no dependencies.
> 
> Unregistering disabled patch still works and removes it from the list no 
> matter the position.
> 
> So what exactly is the problem?

From a quick glance, it seems that this set only enforces the stacking
requirements when two patches patch the same function.

I'm not yet sure whether that is logically correct, or whether these
patches implement it correctly.

Seth

> 
> -- 
> Jiri Kosina
> SUSE Labs


Re: [PATCH v2] mm/zsmalloc: add statistics support

2015-01-12 Thread Seth Jennings
On Tue, Dec 23, 2014 at 11:40:45AM +0900, Minchan Kim wrote:
> Hi Ganesh,
> 
> On Tue, Dec 23, 2014 at 10:26:12AM +0800, Ganesh Mahendran wrote:
> > Hello Minchan
> > 
> > 2014-12-20 10:25 GMT+08:00 Minchan Kim :
> > > Hey Ganesh,
> > >
> > > On Sat, Dec 20, 2014 at 09:43:34AM +0800, Ganesh Mahendran wrote:
> > >> 2014-12-20 8:23 GMT+08:00 Minchan Kim :
> > >> > On Fri, Dec 19, 2014 at 04:17:56PM -0800, Andrew Morton wrote:
> > >> >> On Sat, 20 Dec 2014 09:10:43 +0900 Minchan Kim  
> > >> >> wrote:
> > >> >>
> > >> >> > > It involves rehashing a lengthy argument with Greg.
> > >> >> >
> > >> >> > Okay. Then, Ganesh,
> > >> >> > please add warn message about duplicated name possibility although
> > >> >> > it's unlikely as it is.
> > >> >>
> > >> >> Oh, getting EEXIST is easy with this patch.  Just create and destroy a
> > >> >> pool 2^32 times and the counter wraps ;) It's hardly a serious issue
> > >> >> for a debugging patch.
> > >> >
> > >> > I meant that I wanted to change from index to name passed from caller 
> > >> > like this
> > >> >
> > >> > zram:
> > >> > zs_create_pool(GFP_NOIO | __GFP_HIGHMEM, 
> > >> > zram->disk->first_minor);
> > >> >
> > >> > So, duplication should be rare. :)
> > >>
> > >> We still can not know whether the name is duplicated if we do not
> > >> change the debugfs API.
> > >> The API does not return the errno to us.
> > >>
> > >> How about just zsmalloc decides the name of the pool-id, like pool-x.
> > >> When the pool-id reaches
> > >> 0x., we print warn message about duplicated name, and stop
> > >> creating the debugfs entry
> > >> for the user.
> > >
> > > The idea is from the developer point of view to implement thing easy
> > > but my point is we should take care of user(ie, admin) rather than
> > > developer(ie, we).
> > 
> > Yes. I got it.
> > 
> > >
> > > For user, /sys/kernel/debug/zsmalloc/zram0 would be more
> > > straightforward and even it doesn't need zram to export
> > > /sys/block/zram0/pool-id.
> > 
> > BTW, If we add a new argument in zs_create_pool(). It seems we also need to
> > add argument in zs_zpool_create(). So, zpool/zswap/zbud will be
> > modified to support
> > the new API.
> > Is that acceptable?
> 
> I think it's doable.
> The zpool_create_pool has already zswap_zpool_type.
> Ccing maintainers for double check.

Late response, but fine by me.

Seth

> 
> Many thanks.
> 
> 
> > 
> > Thanks.
> > 
> > >
> > > Thanks.
> > >
> > >>
> > >> Thanks.
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majord...@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
> 
> -- 
> Kind regards,
> Minchan Kim
> 


Re: [RFC 0/6] zsmalloc support compaction

2014-12-17 Thread Seth Jennings
On Tue, Dec 02, 2014 at 11:49:41AM +0900, Minchan Kim wrote:
> Recently, there was issue about zsmalloc fragmentation and
> I got a report from Juno that new fork failed although there
> are plenty of free pages in the system.
> His investigation revealed zram is one of the culprit to make
> heavy fragmentation so there was no more contiguous 16K page
> for pgd to fork in the ARM.
> 
> This patchset implement *basic* zsmalloc compaction support
> and zram utilizes it so admin can do
>   "echo 1 > /sys/block/zram0/compact"
> 
> Actually, ideal is that mm migrate code is aware of zram pages and
> migrate them out automatically without admin's manual operation
> when system is out of contiguous page. However, we need more thinking
> before adding more hooks to migrate.c. Even though we implement it,
> we need manual trigger mode, too so I hope we could enhance
> zram migration stuff based on this primitive functions in future.
> 
> I just tested it on only x86 so need more testing on other arches.
> Additionally, I should have a number for zsmalloc regression
> caused by indirect layering. Unfortunately, I don't have any
> ARM test machine on my desk. I will get it soon and test it.
> Anyway, before further work, I'd like to hear opinion.
> 
> Patchset is based on v3.18-rc6-mmotm-2014-11-26-15-45.

Hey Minchan, sorry it has taken a while for me to look at this.

I have prototyped this for zbud too, and I see you face some of the same
issues, some of them much worse for zsmalloc, such as the large number of
objects that must be moved to reclaim a page (with zbud, the max is 1).

I see you are using zsmalloc itself for allocating the handles.  Why not
kmalloc()?  Then you wouldn't need to track the handle_class stuff and
adjust the class sizes (just in the interest of changing only what is
needed to achieve the functionality).

I used kmalloc(), but that is not without issues: the handles can be
allocated from many slabs, and any slab that contains a handle can't be
freed.  The handles themselves would then need to be compacted, which
isn't possible because the user's handle is a pointer to them.

One way to fix this, though it would take some work, is to have the
user (zswap/zbud) provide the space for the handle to zbud/zsmalloc.
The zswap/zbud layer knows the size of the device (i.e. handle space)
and could allocate a statically sized vmalloc area for holding handles
so they don't get spread all over memory.  I haven't fully explored this
idea yet.

Having the user trigger the compaction is pretty limiting.  Can we
have a work item that periodically does some amount of compaction?
Maybe also something analogous to direct reclaim, so that when
zs_malloc fails to secure a new page, it tries to compact to get one?
I understand this is a first step.  Maybe too much.

It's also worth pointing out that the fullness groups are very coarse.
Combining the objects from a ZS_ALMOST_EMPTY zspage and a ZS_ALMOST_FULL
zspage might not result in very tight packing.  In the worst case, the
destination zspage would be slightly over 1/4 full (see
fullness_threshold_frac).

It also seems that you start with the smallest size classes first.
If we started with the biggest first, it seems we would move fewer
objects and reclaim more pages.

It does add a lot of code :-/  Not sure if there is any way around that
though if we want this functionality for zsmalloc.

Seth

> 
> Thanks.
> 
> Minchan Kim (6):
>   zsmalloc: expand size class to support sizeof(unsigned long)
>   zsmalloc: add indrection layer to decouple handle from object
>   zsmalloc: implement reverse mapping
>   zsmalloc: encode alloced mark in handle object
>   zsmalloc: support compaction
>   zram: support compaction
> 
>  drivers/block/zram/zram_drv.c |  24 ++
>  drivers/block/zram/zram_drv.h |   1 +
>  include/linux/zsmalloc.h  |   1 +
>  mm/zsmalloc.c | 596 
> +-
>  4 files changed, 552 insertions(+), 70 deletions(-)
> 
> -- 
> 2.0.0
> 


Re: [PATCHv7 2/3] kernel: add support for live patching

2014-12-16 Thread Seth Jennings
On Wed, Dec 17, 2014 at 12:16:18AM +0530, Balbir Singh wrote:
> On Tue, Dec 16, 2014 at 11:28 PM, Seth Jennings  wrote:
> > This commit introduces code for the live patching core.  It implements
> > an ftrace-based mechanism and kernel interface for doing live patching
> > of kernel and kernel module functions.
> >
> > It represents the greatest common functionality set between kpatch and
> > kgraft and can accept patches built using either method.
> >
> > This first version does not implement any consistency mechanism that
> > ensures that old and new code do not run together.  In practice, ~90% of
> > CVEs are safe to apply in this way, since they simply add a conditional
> > check.  However, any function change that can not execute safely with
> > the old version of the function can _not_ be safely applied in this
> > version.
> >
> > Signed-off-by: Seth Jennings 
> > Signed-off-by: Josh Poimboeuf 
> 
> snip
> 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index ce8dcdf..5c57181 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -17,6 +17,7 @@ config X86_64
> > depends on 64BIT
> > select X86_DEV_DMA_OPS
> > select ARCH_USE_CMPXCHG_LOCKREF
> > +   select ARCH_HAVE_LIVE_PATCHING
> >
> >  ### Arch settings
> >  config X86
> > @@ -1986,6 +1987,8 @@ config CMDLINE_OVERRIDE
> >   This is used to work around broken boot loaders.  This should
> >   be set to 'N' under normal conditions.
> >
> > +source "kernel/livepatch/Kconfig"
> > +
> >  endmenu
> >
> >  config ARCH_ENABLE_MEMORY_HOTPLUG
> > diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h
> > new file mode 100644
> > index 000..c2ae592
> > --- /dev/null
> > +++ b/arch/x86/include/asm/livepatch.h
> > @@ -0,0 +1,36 @@
> > +/*
> > + * livepatch.h - x86-specific Kernel Live Patching Core
> > + *
> > + * Copyright (C) 2014 Seth Jennings 
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef _ASM_X86_LIVEPATCH_H
> > +#define _ASM_X86_LIVEPATCH_H
> > +
> > +#include 
> > +
> > +#ifdef CONFIG_LIVE_PATCHING
> > +#ifndef CC_USING_FENTRY
> > +#error Your compiler must support -mfentry for live patching to work
> > +#endif
> > +extern int klp_write_module_reloc(struct module *mod, unsigned long type,
> > + unsigned long loc, unsigned long value);
> > +
> > +#else
> > +#error Live patching support is disabled; check CONFIG_LIVE_PATCHING
> > +#endif
> > +
> > +#endif /* _ASM_X86_LIVEPATCH_H */
> > diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> > index 5d4502c..316b34e 100644
> > --- a/arch/x86/kernel/Makefile
> > +++ b/arch/x86/kernel/Makefile
> > @@ -63,6 +63,7 @@ obj-$(CONFIG_X86_MPPARSE) += mpparse.o
> >  obj-y  += apic/
> >  obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o
> >  obj-$(CONFIG_DYNAMIC_FTRACE)   += ftrace.o
> > +obj-$(CONFIG_LIVE_PATCHING)+= livepatch.o
> >  obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
> >  obj-$(CONFIG_FTRACE_SYSCALLS)  += ftrace.o
> >  obj-$(CONFIG_X86_TSC)  += trace_clock.o
> > diff --git a/arch/x86/kernel/livepatch.c b/arch/x86/kernel/livepatch.c
> > new file mode 100644
> > index 000..4b0ed7b
> > --- /dev/null
> > +++ b/arch/x86/kernel/livepatch.c
> > @@ -0,0 +1,89 @@
> > +/*
> > + * livepatch.c - x86-specific Kernel Live Patching Core
> > + *
> > + * Copyright (C) 2014 Seth Jennings 
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; e

Re: [PATCHv7 0/3] Kernel Live Patching

2014-12-16 Thread Seth Jennings
On Tue, Dec 16, 2014 at 11:45:12PM +0530, Balbir Singh wrote:
> On Tue, Dec 16, 2014 at 11:28 PM, Seth Jennings  wrote:
> >
> > Changelog:
> >
> > Thanks for all the feedback!
> >
> 
> Could you describe what this does to signing? I presume the patched
> module should cause a taint on module signing?

The patch module can be signed to avoid the taint of being unsigned,
assuming you have the signing key for the kernel you are running.
However, we do taint with a new taint flag (see 1/3) to indicate
that the kernel has been patched.

Seth

> 
> Balbir Singh


[PATCHv7 0/3] Kernel Live Patching

2014-12-16 Thread Seth Jennings
/sys/kernel/livepatch
/sys/kernel/livepatch/
/sys/kernel/livepatch//enabled
/sys/kernel/livepatch//
/sys/kernel/livepatch///

The old function is located using one of two methods: it is either provided by
the patch module (only possible for a function in vmlinux) or kallsyms lookup.
Symbol ambiguity results in a failure.

The core takes a reference on the patch module itself to keep it from
unloading.  This is because, without a mechanism to ensure that no thread is
currently executing in the patched function, we can not determine whether it is
safe to unload the patch module.  For this reason, unloading patch modules is
currently not allowed.

Disabling patches can be done using the "enabled" attribute of the patch:

echo 0 > /sys/kernel/livepatch//enabled

If a patch module contains a patch for a module that is not currently loaded,
there is nothing to patch so the core does nothing for that patch object.
However, the core registers a module notifier that looks for COMING events so
that if the module is ever loaded, it is immediately patched.  If a module with
patch code is removed, the notifier looks for GOING events and disables any
patched functions for that object before it unloads.  The notifier has a higher
priority than that of the ftrace notifier so that it runs before the ftrace
notifier for GOING events and we can cleanly unregister from ftrace.

kpatch and kGraft each have their own mechanisms for ensuring system
consistency during the patching process. This first version does not implement
any consistency mechanism that ensures that old and new code do not run
together.  In practice, ~90% of CVEs are safe to apply in this way, since they
simply add a conditional check.  However, any function change that can not
execute safely with the old version of the function can _not_ be safely applied
for now.

[1] https://github.com/dynup/kpatch
[2] https://git.kernel.org/cgit/linux/kernel/git/jirislaby/kgraft.git/
[3] https://etherpad.fr/p/LPC2014_LivePatching

Seth Jennings (3):
  kernel: add TAINT_LIVEPATCH
  kernel: add support for live patching
  samples: add sample live patching module

 Documentation/ABI/testing/sysfs-kernel-livepatch |  44 ++
 Documentation/oops-tracing.txt   |   2 +
 Documentation/sysctl/kernel.txt  |   1 +
 MAINTAINERS  |  14 +
 arch/x86/Kconfig |   3 +
 arch/x86/include/asm/livepatch.h |  36 +
 arch/x86/kernel/Makefile |   1 +
 arch/x86/kernel/livepatch.c  |  89 +++
 include/linux/kernel.h   |   1 +
 include/linux/livepatch.h| 132 
 kernel/Makefile  |   1 +
 kernel/livepatch/Kconfig |  18 +
 kernel/livepatch/Makefile|   3 +
 kernel/livepatch/core.c  | 929 +++
 kernel/panic.c   |   2 +
 samples/Kconfig  |   7 +
 samples/Makefile |   2 +-
 samples/livepatch/Makefile   |   1 +
 samples/livepatch/livepatch-sample.c |  87 +++
 19 files changed, 1372 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-livepatch
 create mode 100644 arch/x86/include/asm/livepatch.h
 create mode 100644 arch/x86/kernel/livepatch.c
 create mode 100644 include/linux/livepatch.h
 create mode 100644 kernel/livepatch/Kconfig
 create mode 100644 kernel/livepatch/Makefile
 create mode 100644 kernel/livepatch/core.c
 create mode 100644 samples/livepatch/Makefile
 create mode 100644 samples/livepatch/livepatch-sample.c

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCHv7 3/3] samples: add sample live patching module

2014-12-16 Thread Seth Jennings
Add a sample live patching module.

Signed-off-by: Seth Jennings 
---
 MAINTAINERS  |  1 +
 samples/Kconfig  |  7 +++
 samples/Makefile |  2 +-
 samples/livepatch/Makefile   |  1 +
 samples/livepatch/livepatch-sample.c | 87 
 5 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 samples/livepatch/Makefile
 create mode 100644 samples/livepatch/livepatch-sample.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 94f9f69..15c0929d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5788,6 +5788,7 @@ F:include/linux/livepatch.h
 F: arch/x86/include/asm/livepatch.h
 F: arch/x86/kernel/livepatch.c
 F: Documentation/ABI/testing/sysfs-kernel-livepatch
+F: samples/livepatch/
 L: live-patch...@vger.kernel.org
 
 LLC (802.2)
diff --git a/samples/Kconfig b/samples/Kconfig
index 6181c2c..0aed20d 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -63,4 +63,11 @@ config SAMPLE_RPMSG_CLIENT
  to communicate with an AMP-configured remote processor over
  the rpmsg bus.
 
+config SAMPLE_LIVE_PATCHING
+   tristate "Build live patching sample -- loadable modules only"
+   depends on LIVE_PATCHING && m
+   help
+ Builds a sample live patch that replaces the procfs handler
+ for /proc/cmdline to print "this has been live patched".
+
 endif # SAMPLES
diff --git a/samples/Makefile b/samples/Makefile
index 1a60c62..f00257b 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
 # Makefile for Linux samples code
 
-obj-$(CONFIG_SAMPLES)  += kobject/ kprobes/ trace_events/ \
+obj-$(CONFIG_SAMPLES)  += kobject/ kprobes/ trace_events/ livepatch/ \
   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
diff --git a/samples/livepatch/Makefile b/samples/livepatch/Makefile
new file mode 100644
index 000..7f1cdc1
--- /dev/null
+++ b/samples/livepatch/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_SAMPLE_LIVE_PATCHING) += livepatch-sample.o
diff --git a/samples/livepatch/livepatch-sample.c 
b/samples/livepatch/livepatch-sample.c
new file mode 100644
index 000..21f159d
--- /dev/null
+++ b/samples/livepatch/livepatch-sample.c
@@ -0,0 +1,87 @@
+/*
+ * livepatch-sample.c - Kernel Live Patching Sample Module
+ *
+ * Copyright (C) 2014 Seth Jennings 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/livepatch.h>
+
+/*
+ * This (dumb) live patch overrides the function that prints the
+ * kernel boot cmdline when /proc/cmdline is read.
+ *
+ * Example:
+ * $ cat /proc/cmdline
+ * 
+ * $ insmod livepatch-sample.ko
+ * $ cat /proc/cmdline
+ * this has been live patched
+ * $ echo 0 > /sys/kernel/livepatch/livepatch_sample/enabled
+ * 
+ */
+
+#include <linux/seq_file.h>
+static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
+{
+   seq_printf(m, "%s\n", "this has been live patched");
+   return 0;
+}
+
+static struct klp_func funcs[] = {
+   {
+   .old_name = "cmdline_proc_show",
+   .new_func = livepatch_cmdline_proc_show,
+   }, { }
+};
+
+static struct klp_object objs[] = {
+   {
+   /* name being NULL means vmlinux */
+   .funcs = funcs,
+   }, { }
+};
+
+static struct klp_patch patch = {
+   .mod = THIS_MODULE,
+   .objs = objs,
+};
+
+static int livepatch_init(void)
+{
+   int ret;
+
+   ret = klp_register_patch(&patch);
+   if (ret)
+   return ret;
+   ret = klp_enable_patch(&patch);
+   if (ret) {
+   WARN_ON(klp_unregister_patch(&patch));
+   return ret;
+   }
+   return 0;
+}
+
+static void livepatch_exit(void)
+{
+   WARN_ON(klp_disable_patch(&patch));
+   WARN_ON(klp_unregister_patch(&patch));
+}
+
+module_init(livepatch_init);
+module_exit(livepatch_exit);
+MODULE_LICENSE("GPL");
-- 
2.1.0
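The sample ends its `funcs` and `objs` arrays with empty `{ }` entries and uses a NULL object name to mean vmlinux. The userspace sketch below shows how arrays terminated this way are walked. Its structs are hypothetical, trimmed stand-ins; the real `struct klp_func` and `struct klp_object` have more fields.

```c
#include <stddef.h>

/* Hypothetical, trimmed stand-ins for struct klp_func / struct klp_object. */
struct demo_func {
	const char *old_name;	/* symbol being replaced */
	void *new_func;		/* replacement implementation */
};

struct demo_object {
	const char *name;		/* NULL means vmlinux */
	struct demo_func *funcs;	/* terminated by an all-zero entry */
};

/* Name to report for an object: its module name, or "vmlinux" for NULL. */
const char *demo_object_name(const struct demo_object *obj)
{
	return obj->name ? obj->name : "vmlinux";
}

/* Count functions across all objects, stopping at the { } sentinels. */
size_t demo_count_funcs(const struct demo_object *objs)
{
	const struct demo_object *obj;
	const struct demo_func *func;
	size_t n = 0;

	for (obj = objs; obj->funcs; obj++)
		for (func = obj->funcs; func->old_name; func++)
			n++;
	return n;
}
```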



[PATCHv7 1/3] kernel: add TAINT_LIVEPATCH

2014-12-16 Thread Seth Jennings
This adds a new taint flag to indicate when the kernel or a kernel
module has been live patched.  This will provide a clean indication in
bug reports that live patching was used.

Additionally, if a crash occurs in a live patched function, the live
patch module will appear beside the patched function in the backtrace.

Signed-off-by: Seth Jennings 
---
 Documentation/oops-tracing.txt  | 2 ++
 Documentation/sysctl/kernel.txt | 1 +
 include/linux/kernel.h  | 1 +
 kernel/panic.c  | 2 ++
 4 files changed, 6 insertions(+)

diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt
index beefb9f..f3ac05c 100644
--- a/Documentation/oops-tracing.txt
+++ b/Documentation/oops-tracing.txt
@@ -270,6 +270,8 @@ characters, each representing a particular tainted value.
 
  15: 'L' if a soft lockup has previously occurred on the system.
 
+ 16: 'K' if the kernel has been live patched.
+
 The primary reason for the 'Tainted: ' string is to tell kernel
 debuggers if this is a clean kernel or if anything unusual has
 occurred.  Tainting is permanent: even if an offending module is
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 75511ef..83ab256 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -843,6 +843,7 @@ can be ORed together:
 8192 - An unsigned module has been loaded in a kernel supporting module
signature.
 16384 - A soft lockup has previously occurred on the system.
+32768 - The kernel has been live patched.
 
 ==
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5449d2f..d03e3de 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -471,6 +471,7 @@ extern enum system_states {
 #define TAINT_OOT_MODULE   12
 #define TAINT_UNSIGNED_MODULE  13
 #define TAINT_SOFTLOCKUP   14
+#define TAINT_LIVEPATCH    15
 
 extern const char hex_asc[];
 #define hex_asc_lo(x)  hex_asc[((x) & 0x0f)]
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..8136ad7 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -226,6 +226,7 @@ static const struct tnt tnts[] = {
{ TAINT_OOT_MODULE, 'O', ' ' },
{ TAINT_UNSIGNED_MODULE,'E', ' ' },
{ TAINT_SOFTLOCKUP, 'L', ' ' },
+   { TAINT_LIVEPATCH,  'K', ' ' },
 };
 
 /**
@@ -246,6 +247,7 @@ static const struct tnt tnts[] = {
  *  'O' - Out-of-tree module has been loaded.
  *  'E' - Unsigned module has been loaded.
  *  'L' - A soft lockup has previously occurred.
+ *  'K' - Kernel has been live patched.
  *
  * The string is overwritten by the next call to print_tainted().
  */
-- 
2.1.0
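The TAINT_LIVEPATCH patch uses bit 15. That bit appears as 32768 in the `Documentation/sysctl/kernel.txt` table and as 'K' in the "Tainted:" string. The following minimal userspace sketch checks for the flag in a mask read from `/proc/sys/kernel/tainted`. The helper names are hypothetical.

```c
#include <stdio.h>

/* Bit number used by the patch in include/linux/kernel.h. */
#define TAINT_LIVEPATCH 15

/* Nonzero if bit 15 (value 32768) is set in a tainted mask. */
int taint_is_livepatched(unsigned long tainted)
{
	return (int)((tainted >> TAINT_LIVEPATCH) & 1UL);
}

/* Character for this flag in the "Tainted:" string: 'K' if set, else ' '. */
char taint_livepatch_char(unsigned long tainted)
{
	return taint_is_livepatched(tainted) ? 'K' : ' ';
}

/*
 * Read a decimal taint mask, e.g. from /proc/sys/kernel/tainted.
 * Returns 0 and stores the mask in *out on success, -1 on failure.
 */
int read_taint_mask(const char *path, unsigned long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%lu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```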



[PATCHv7 2/3] kernel: add support for live patching

2014-12-16 Thread Seth Jennings
This commit introduces code for the live patching core.  It implements
an ftrace-based mechanism and kernel interface for doing live patching
of kernel and kernel module functions.

It represents the greatest common functionality set between kpatch and
kgraft and can accept patches built using either method.

This first version does not implement any consistency mechanism that
ensures that old and new code do not run together.  In practice, ~90% of
CVEs are safe to apply in this way, since they simply add a conditional
check.  However, any function change that cannot execute safely with
the old version of the function can _not_ be safely applied in this
version.

Signed-off-by: Seth Jennings 
Signed-off-by: Josh Poimboeuf 
---
 Documentation/ABI/testing/sysfs-kernel-livepatch |  44 ++
 MAINTAINERS  |  13 +
 arch/x86/Kconfig |   3 +
 arch/x86/include/asm/livepatch.h |  36 +
 arch/x86/kernel/Makefile |   1 +
 arch/x86/kernel/livepatch.c  |  89 +++
 include/linux/livepatch.h| 132 
 kernel/Makefile  |   1 +
 kernel/livepatch/Kconfig |  18 +
 kernel/livepatch/Makefile|   3 +
 kernel/livepatch/core.c  | 929 +++
 11 files changed, 1269 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-livepatch
 create mode 100644 arch/x86/include/asm/livepatch.h
 create mode 100644 arch/x86/kernel/livepatch.c
 create mode 100644 include/linux/livepatch.h
 create mode 100644 kernel/livepatch/Kconfig
 create mode 100644 kernel/livepatch/Makefile
 create mode 100644 kernel/livepatch/core.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch 
b/Documentation/ABI/testing/sysfs-kernel-livepatch
new file mode 100644
index 000..5bf42a8
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-livepatch
@@ -0,0 +1,44 @@
+What:  /sys/kernel/livepatch
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   Interface for kernel live patching
+
+   The /sys/kernel/livepatch directory contains subdirectories for
+   each loaded live patch module.
+
+What:  /sys/kernel/livepatch/
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The patch directory contains subdirectories for each kernel
+   object (vmlinux or a module) in which it patched functions.
+
+What:  /sys/kernel/livepatch//enabled
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   A writable attribute that indicates whether the patched
+   code is currently applied.  Writing 0 will disable the patch
+   while writing 1 will re-enable the patch.
+
+What:  /sys/kernel/livepatch//
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The object directory contains subdirectories for each function
+   that is patched within the object.
+
+What:  /sys/kernel/livepatch///
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The function directory contains attributes regarding the
+   properties and state of the patched function.
+
+   There are currently no such attributes.
diff --git a/MAINTAINERS b/MAINTAINERS
index 06d5b4b..94f9f69 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5777,6 +5777,19 @@ F:   Documentation/misc-devices/lis3lv02d
 F: drivers/misc/lis3lv02d/
 F: drivers/platform/x86/hp_accel.c
 
+LIVE PATCHING
+M: Josh Poimboeuf 
+M: Seth Jennings 
+M: Jiri Kosina 
+M: Vojtech Pavlik 
+S: Maintained
+F: kernel/livepatch/
+F: include/linux/livepatch.h
+F: arch/x86/include/asm/livepatch.h
+F: arch/x86/kernel/livepatch.c
+F: Documentation/ABI/testing/sysfs-kernel-livepatch
+L: live-patch...@vger.kernel.org
+
 LLC (802.2)
 M: Arnaldo Carvalho de Melo 
 S: Maintained
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ce8dcdf..5c57181 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
depends on 64BIT
select X86_DEV_DMA_OPS
select ARCH_USE_CMPXCHG_LOCKREF
+   select ARCH_HAVE_LIVE_PATCHING
 
 ### Arch settings
 config X86
@@ -1986,6 +1987,8 @@ config CMDLINE_OVERRIDE
  This is used to work around broken boot loaders.  This should
  be set to 'N' under normal conditions.
 
+source "kernel/livepatch/Kconfig"
+
 endmenu
 
 config ARCH_ENABLE_MEMORY_HOTPLUG
diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livep

[PATCHv5 0/3] Kernel Live Patching

2014-12-04 Thread Seth Jennings
ns for that object before it unloads.  The notifier has a higher
priority than that of the ftrace notifier so that it runs before the ftrace
notifier for GOING events and we can cleanly unregister from ftrace.

kpatch and kGraft each have their own mechanisms for ensuring system
consistency during the patching process. This first version does not implement
any consistency mechanism that ensures that old and new code do not run
together.  In practice, ~90% of CVEs are safe to apply in this way, since they
simply add a conditional check.  However, any function change that cannot
execute safely with the old version of the function can _not_ be safely applied
for now.

[1] https://github.com/dynup/kpatch
[2] https://git.kernel.org/cgit/linux/kernel/git/jirislaby/kgraft.git/
[3] https://etherpad.fr/p/LPC2014_LivePatching

Seth Jennings (3):
  kernel: add TAINT_LIVEPATCH
  kernel: add support for live patching
  samples: add sample live patching module

 Documentation/ABI/testing/sysfs-kernel-livepatch |  44 ++
 Documentation/oops-tracing.txt   |   2 +
 Documentation/sysctl/kernel.txt  |   1 +
 MAINTAINERS  |  14 +
 arch/x86/Kconfig |   3 +
 arch/x86/include/asm/livepatch.h |  36 +
 arch/x86/kernel/Makefile |   1 +
 arch/x86/kernel/livepatch.c  |  89 +++
 include/linux/kernel.h   |   1 +
 include/linux/livepatch.h| 132 
 kernel/Makefile  |   1 +
 kernel/livepatch/Kconfig |  18 +
 kernel/livepatch/Makefile|   3 +
 kernel/livepatch/core.c  | 902 +++
 kernel/panic.c   |   2 +
 samples/Kconfig  |   7 +
 samples/Makefile |   2 +-
 samples/livepatch/Makefile   |   1 +
 samples/livepatch/livepatch-sample.c |  87 +++
 19 files changed, 1345 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-livepatch
 create mode 100644 arch/x86/include/asm/livepatch.h
 create mode 100644 arch/x86/kernel/livepatch.c
 create mode 100644 include/linux/livepatch.h
 create mode 100644 kernel/livepatch/Kconfig
 create mode 100644 kernel/livepatch/Makefile
 create mode 100644 kernel/livepatch/core.c
 create mode 100644 samples/livepatch/Makefile
 create mode 100644 samples/livepatch/livepatch-sample.c

-- 
1.9.3



[PATCHv5 1/3] kernel: add TAINT_LIVEPATCH

2014-12-04 Thread Seth Jennings
This adds a new taint flag to indicate when the kernel or a kernel
module has been live patched.  This will provide a clean indication in
bug reports that live patching was used.

Additionally, if a crash occurs in a live patched function, the live
patch module will appear beside the patched function in the backtrace.

Signed-off-by: Seth Jennings 
---
 Documentation/oops-tracing.txt  | 2 ++
 Documentation/sysctl/kernel.txt | 1 +
 include/linux/kernel.h  | 1 +
 kernel/panic.c  | 2 ++
 4 files changed, 6 insertions(+)

diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt
index beefb9f..f3ac05c 100644
--- a/Documentation/oops-tracing.txt
+++ b/Documentation/oops-tracing.txt
@@ -270,6 +270,8 @@ characters, each representing a particular tainted value.
 
  15: 'L' if a soft lockup has previously occurred on the system.
 
+ 16: 'K' if the kernel has been live patched.
+
 The primary reason for the 'Tainted: ' string is to tell kernel
 debuggers if this is a clean kernel or if anything unusual has
 occurred.  Tainting is permanent: even if an offending module is
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 75511ef..83ab256 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -843,6 +843,7 @@ can be ORed together:
 8192 - An unsigned module has been loaded in a kernel supporting module
signature.
 16384 - A soft lockup has previously occurred on the system.
+32768 - The kernel has been live patched.
 
 ==
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5449d2f..d03e3de 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -471,6 +471,7 @@ extern enum system_states {
 #define TAINT_OOT_MODULE   12
 #define TAINT_UNSIGNED_MODULE  13
 #define TAINT_SOFTLOCKUP   14
+#define TAINT_LIVEPATCH    15
 
 extern const char hex_asc[];
 #define hex_asc_lo(x)  hex_asc[((x) & 0x0f)]
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..8136ad7 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -226,6 +226,7 @@ static const struct tnt tnts[] = {
{ TAINT_OOT_MODULE, 'O', ' ' },
{ TAINT_UNSIGNED_MODULE,'E', ' ' },
{ TAINT_SOFTLOCKUP, 'L', ' ' },
+   { TAINT_LIVEPATCH,  'K', ' ' },
 };
 
 /**
@@ -246,6 +247,7 @@ static const struct tnt tnts[] = {
  *  'O' - Out-of-tree module has been loaded.
  *  'E' - Unsigned module has been loaded.
  *  'L' - A soft lockup has previously occurred.
+ *  'K' - Kernel has been live patched.
  *
  * The string is overwritten by the next call to print_tainted().
  */
-- 
1.9.3



[PATCHv5 2/3] kernel: add support for live patching

2014-12-04 Thread Seth Jennings
This commit introduces code for the live patching core.  It implements
an ftrace-based mechanism and kernel interface for doing live patching
of kernel and kernel module functions.

It represents the greatest common functionality set between kpatch and
kgraft and can accept patches built using either method.

This first version does not implement any consistency mechanism that
ensures that old and new code do not run together.  In practice, ~90% of
CVEs are safe to apply in this way, since they simply add a conditional
check.  However, any function change that cannot execute safely with
the old version of the function can _not_ be safely applied in this
version.

Signed-off-by: Seth Jennings 
---
 Documentation/ABI/testing/sysfs-kernel-livepatch |  44 ++
 MAINTAINERS  |  13 +
 arch/x86/Kconfig |   3 +
 arch/x86/include/asm/livepatch.h |  36 +
 arch/x86/kernel/Makefile |   1 +
 arch/x86/kernel/livepatch.c  |  89 +++
 include/linux/livepatch.h| 132 
 kernel/Makefile  |   1 +
 kernel/livepatch/Kconfig |  18 +
 kernel/livepatch/Makefile|   3 +
 kernel/livepatch/core.c  | 902 +++
 11 files changed, 1242 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-livepatch
 create mode 100644 arch/x86/include/asm/livepatch.h
 create mode 100644 arch/x86/kernel/livepatch.c
 create mode 100644 include/linux/livepatch.h
 create mode 100644 kernel/livepatch/Kconfig
 create mode 100644 kernel/livepatch/Makefile
 create mode 100644 kernel/livepatch/core.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch 
b/Documentation/ABI/testing/sysfs-kernel-livepatch
new file mode 100644
index 000..5bf42a8
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-livepatch
@@ -0,0 +1,44 @@
+What:  /sys/kernel/livepatch
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   Interface for kernel live patching
+
+   The /sys/kernel/livepatch directory contains subdirectories for
+   each loaded live patch module.
+
+What:  /sys/kernel/livepatch/
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The patch directory contains subdirectories for each kernel
+   object (vmlinux or a module) in which it patched functions.
+
+What:  /sys/kernel/livepatch//enabled
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   A writable attribute that indicates whether the patched
+   code is currently applied.  Writing 0 will disable the patch
+   while writing 1 will re-enable the patch.
+
+What:  /sys/kernel/livepatch//
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The object directory contains subdirectories for each function
+   that is patched within the object.
+
+What:  /sys/kernel/livepatch///
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The function directory contains attributes regarding the
+   properties and state of the patched function.
+
+   There are currently no such attributes.
diff --git a/MAINTAINERS b/MAINTAINERS
index 4861577..7985293 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5715,6 +5715,19 @@ F:   Documentation/misc-devices/lis3lv02d
 F: drivers/misc/lis3lv02d/
 F: drivers/platform/x86/hp_accel.c
 
+LIVE PATCHING
+M: Josh Poimboeuf 
+M: Seth Jennings 
+M: Jiri Kosina 
+M: Vojtech Pavlik 
+S: Maintained
+F: kernel/livepatch/
+F: include/linux/livepatch.h
+F: arch/x86/include/asm/livepatch.h
+F: arch/x86/kernel/livepatch.c
+F: Documentation/ABI/testing/sysfs-kernel-livepatch
+L: live-patch...@vger.kernel.org
+
 LLC (802.2)
 M: Arnaldo Carvalho de Melo 
 S: Maintained
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ec21dfd..78715fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
depends on 64BIT
select X86_DEV_DMA_OPS
select ARCH_USE_CMPXCHG_LOCKREF
+   select ARCH_HAVE_LIVE_PATCHING
 
 ### Arch settings
 config X86
@@ -1991,6 +1992,8 @@ config CMDLINE_OVERRIDE
  This is used to work around broken boot loaders.  This should
  be set to 'N' under normal conditions.
 
+source "kernel/livepatch/Kconfig"
+
 endmenu
 
 config ARCH_ENABLE_MEMORY_HOTPLUG
diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h
new file mode 100644
in

[PATCHv5 3/3] samples: add sample live patching module

2014-12-04 Thread Seth Jennings
Add a sample live patching module.

Signed-off-by: Seth Jennings 
---
 MAINTAINERS  |  1 +
 samples/Kconfig  |  7 +++
 samples/Makefile |  2 +-
 samples/livepatch/Makefile   |  1 +
 samples/livepatch/livepatch-sample.c | 87 
 5 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 samples/livepatch/Makefile
 create mode 100644 samples/livepatch/livepatch-sample.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7985293..9d3f9d9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5726,6 +5726,7 @@ F:include/linux/livepatch.h
 F: arch/x86/include/asm/livepatch.h
 F: arch/x86/kernel/livepatch.c
 F: Documentation/ABI/testing/sysfs-kernel-livepatch
+F: samples/livepatch/
 L: live-patch...@vger.kernel.org
 
 LLC (802.2)
diff --git a/samples/Kconfig b/samples/Kconfig
index 6181c2c..0aed20d 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -63,4 +63,11 @@ config SAMPLE_RPMSG_CLIENT
  to communicate with an AMP-configured remote processor over
  the rpmsg bus.
 
+config SAMPLE_LIVE_PATCHING
+   tristate "Build live patching sample -- loadable modules only"
+   depends on LIVE_PATCHING && m
+   help
+ Builds a sample live patch that replaces the procfs handler
+ for /proc/cmdline to print "this has been live patched".
+
 endif # SAMPLES
diff --git a/samples/Makefile b/samples/Makefile
index 1a60c62..f00257b 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
 # Makefile for Linux samples code
 
-obj-$(CONFIG_SAMPLES)  += kobject/ kprobes/ trace_events/ \
+obj-$(CONFIG_SAMPLES)  += kobject/ kprobes/ trace_events/ livepatch/ \
   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
diff --git a/samples/livepatch/Makefile b/samples/livepatch/Makefile
new file mode 100644
index 000..7f1cdc1
--- /dev/null
+++ b/samples/livepatch/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_SAMPLE_LIVE_PATCHING) += livepatch-sample.o
diff --git a/samples/livepatch/livepatch-sample.c 
b/samples/livepatch/livepatch-sample.c
new file mode 100644
index 000..21f159d
--- /dev/null
+++ b/samples/livepatch/livepatch-sample.c
@@ -0,0 +1,87 @@
+/*
+ * livepatch-sample.c - Kernel Live Patching Sample Module
+ *
+ * Copyright (C) 2014 Seth Jennings 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/livepatch.h>
+
+/*
+ * This (dumb) live patch overrides the function that prints the
+ * kernel boot cmdline when /proc/cmdline is read.
+ *
+ * Example:
+ * $ cat /proc/cmdline
+ * 
+ * $ insmod livepatch-sample.ko
+ * $ cat /proc/cmdline
+ * this has been live patched
+ * $ echo 0 > /sys/kernel/livepatch/livepatch_sample/enabled
+ * 
+ */
+
+#include <linux/seq_file.h>
+static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
+{
+   seq_printf(m, "%s\n", "this has been live patched");
+   return 0;
+}
+
+static struct klp_func funcs[] = {
+   {
+   .old_name = "cmdline_proc_show",
+   .new_func = livepatch_cmdline_proc_show,
+   }, { }
+};
+
+static struct klp_object objs[] = {
+   {
+   /* name being NULL means vmlinux */
+   .funcs = funcs,
+   }, { }
+};
+
+static struct klp_patch patch = {
+   .mod = THIS_MODULE,
+   .objs = objs,
+};
+
+static int livepatch_init(void)
+{
+   int ret;
+
+   ret = klp_register_patch(&patch);
+   if (ret)
+   return ret;
+   ret = klp_enable_patch(&patch);
+   if (ret) {
+   WARN_ON(klp_unregister_patch(&patch));
+   return ret;
+   }
+   return 0;
+}
+
+static void livepatch_exit(void)
+{
+   WARN_ON(klp_disable_patch(&patch));
+   WARN_ON(klp_unregister_patch(&patch));
+}
+
+module_init(livepatch_init);
+module_exit(livepatch_exit);
+MODULE_LICENSE("GPL");
-- 
1.9.3



Re: [PATCHv4 3/3] samples: add sample live patching module

2014-12-01 Thread Seth Jennings
On Thu, Nov 27, 2014 at 06:05:13PM +0100, Petr Mladek wrote:
> On Tue 2014-11-25 11:15:09, Seth Jennings wrote:
> > Add a sample live patching module.
> > 
> > Signed-off-by: Seth Jennings 
> > ---
> >  MAINTAINERS  |  1 +
> >  samples/Kconfig  |  7 +++
> >  samples/livepatch/Makefile   |  1 +
> >  samples/livepatch/livepatch-sample.c | 87 
> > 
> >  4 files changed, 96 insertions(+)
> >  create mode 100644 samples/livepatch/Makefile
> >  create mode 100644 samples/livepatch/livepatch-sample.c
> 
> [...] 
> 
> > diff --git a/samples/livepatch/Makefile b/samples/livepatch/Makefile
> > new file mode 100644
> > index 000..7f1cdc1
> > --- /dev/null
> > +++ b/samples/livepatch/Makefile
> > @@ -0,0 +1 @@
> > +obj-$(CONFIG_SAMPLE_LIVE_PATCHING) += livepatch-sample.o
> > diff --git a/samples/livepatch/livepatch-sample.c 
> > b/samples/livepatch/livepatch-sample.c
> > new file mode 100644
> > index 000..21f159d
> 
> The Makefile is ignored because the following hunk is missing:
> 
> 
> diff --git a/samples/Makefile b/samples/Makefile
> index 1a60c62e2045..f00257bcc5a7 100644
> --- a/samples/Makefile
> +++ b/samples/Makefile
> @@ -1,4 +1,4 @@
>  # Makefile for Linux samples code
>  
> -obj-$(CONFIG_SAMPLES)+= kobject/ kprobes/ trace_events/ \
> +obj-$(CONFIG_SAMPLES)+= kobject/ kprobes/ trace_events/ livepatch/ \
>  hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/

Applied, thanks!

Seth

> 
> 
> Best Regards,
> Petr
> --
> To unsubscribe from this list: send the line "unsubscribe live-patching" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] kernel: add support for live patching

2014-12-01 Thread Seth Jennings
On Sun, Nov 30, 2014 at 01:23:48PM +0100, Pavel Machek wrote:
> On Thu 2014-11-06 16:51:02, Jiri Slaby wrote:
> > On 11/06/2014, 03:39 PM, Seth Jennings wrote:
> > > This commit introduces code for the live patching core.  It implements
> > > an ftrace-based mechanism and kernel interface for doing live patching
> > > of kernel and kernel module functions.
> > 
> > Hi,
> > 
> > nice! So we have something to start with. Brilliant!
> > 
> > I have some comments below now. Yet, it obviously needs deeper review
> > which will take more time.
> > 
> > > --- /dev/null
> > > +++ b/include/linux/livepatch.h
> > > @@ -0,0 +1,45 @@
> > > +#ifndef _LIVEPATCH_H_
> > > +#define _LIVEPATCH_H_
> > 
> > This should follow the linux kernel naming: LINUX_LIVEPATCH_H
> > 
> > 
> > > +#include 
> > > +
> > > +struct lp_func {
> > 
> > I am not very happy with "lp" which effectively means parallel printer
> > support. What about lip?
> 
> What about "patch_"?

Hey Pavel,

We are on v4 of this patchset:
https://lkml.org/lkml/2014/11/25/868

We ended up going with klp_ (kernel live patching).

Thanks,
Seth

> 
> It is not so big subsystem that additional typing would matter much...
> 
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCHv4 0/3] Kernel Live Patching

2014-11-25 Thread Seth Jennings
On Tue, Nov 25, 2014 at 08:26:22PM +0100, Jiri Kosina wrote:
> On Tue, 25 Nov 2014, Seth Jennings wrote:
> 
> > Masami's IPMODIFY patch is heading for -next via your tree.  Once it 
> > arrives,
> > I'll rebase and make the change to set IPMODIFY.  Do not pull this for -next
> > yet.  This version (v4) is for review and gathering acks.
> 
> Thanks for sending out v4 and incorporating the feedback, I really 
> appreciate your responsiveness!
> 
> Anyway, I don't think targeting 3.19 is realistic, given we're currently 
> already past 3.18-rc6 ... even if we rush it into -next in the coming 
> days, it will get close to zero exposure in there before the merge window 
> opens.

Agreed. Sorry if I gave the impression that I was trying to rush this
into 3.19.  I just wanted to make sure that Steve was aware of the
dependency.

> 
> I'd like to do quite some more testing and still finish some pending 
> portions of code reviews on our side (especially to make sure that this 
> can be easily extended to support any consistency model in the future).

Without knowing how that consistency code will look, how can we "make
sure" that this code can be easily extended to support it?  I don't
think we should hold up this first step based on what we think the
consistency code might look like. The code is not that complex right
now. That was the point :)  We can always adapt things.

> 
> Once we start collecting Reviewed-by's / Acked-by's on this patchset, I 
> can establish a tree on git.kernel.org that we can use to collect any 
> followup patches during 3.20 development cycle and send a pull request to 
> Linus during 3.20 merge window .. if everybody agrees with this course of 
> action, obviously.

I was hoping this first step would go into -next via Steve's tree and go
upstream for 3.20 (hopefully) from there.  I would be against anything
that tries to expand the feature set before this base functionality gets
upstream.  However, if we want to have a tree to gather fixes before
3.20, which I think is what you are suggesting, that works for me.  We
would need to agree explicitly that, in this tree, patches would need
both a RH and SUSE ack to be accepted.

Thanks,
Seth

> 
> Thanks,
> 
> -- 
> Jiri Kosina
> SUSE Labs


[PATCHv4 1/3] kernel: add TAINT_LIVEPATCH

2014-11-25 Thread Seth Jennings
This adds a new taint flag to indicate when the kernel or a kernel
module has been live patched.  This will provide a clean indication in
bug reports that live patching was used.

Additionally, if the crash occurs in a live patched function, the live
patch module will appear beside the patched function in the backtrace.

Signed-off-by: Seth Jennings 
---
 Documentation/oops-tracing.txt  | 2 ++
 Documentation/sysctl/kernel.txt | 1 +
 include/linux/kernel.h          | 1 +
 kernel/panic.c                  | 2 ++
 4 files changed, 6 insertions(+)

diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt
index beefb9f..f3ac05c 100644
--- a/Documentation/oops-tracing.txt
+++ b/Documentation/oops-tracing.txt
@@ -270,6 +270,8 @@ characters, each representing a particular tainted value.
 
  15: 'L' if a soft lockup has previously occurred on the system.
 
+ 16: 'K' if the kernel has been live patched.
+
 The primary reason for the 'Tainted: ' string is to tell kernel
 debuggers if this is a clean kernel or if anything unusual has
 occurred.  Tainting is permanent: even if an offending module is
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 75511ef..83ab256 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -843,6 +843,7 @@ can be ORed together:
 8192 - An unsigned module has been loaded in a kernel supporting module
signature.
 16384 - A soft lockup has previously occurred on the system.
+32768 - The kernel has been live patched.
 
 ==
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5449d2f..d03e3de 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -471,6 +471,7 @@ extern enum system_states {
 #define TAINT_OOT_MODULE   12
 #define TAINT_UNSIGNED_MODULE  13
 #define TAINT_SOFTLOCKUP   14
+#define TAINT_LIVEPATCH    15
 
 extern const char hex_asc[];
 #define hex_asc_lo(x)  hex_asc[((x) & 0x0f)]
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..8136ad7 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -226,6 +226,7 @@ static const struct tnt tnts[] = {
{ TAINT_OOT_MODULE, 'O', ' ' },
{ TAINT_UNSIGNED_MODULE,'E', ' ' },
{ TAINT_SOFTLOCKUP, 'L', ' ' },
+   { TAINT_LIVEPATCH,  'K', ' ' },
 };
 
 /**
@@ -246,6 +247,7 @@ static const struct tnt tnts[] = {
  *  'O' - Out-of-tree module has been loaded.
  *  'E' - Unsigned module has been loaded.
  *  'L' - A soft lockup has previously occurred.
+ *  'K' - Kernel has been live patched.
  *
  * The string is overwritten by the next call to print_tainted().
  */
-- 
1.9.3
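For reference, TAINT_LIVEPATCH is bit 15, so its contribution to /proc/sys/kernel/tainted is 1 << 15 = 32768, which is why the sysctl documentation lists 32768 right after 16384 (bit 14, TAINT_SOFTLOCKUP). A minimal userspace sketch of decoding such a mask into letters follows, assuming only the four flags visible in this patch. The table and function names are hypothetical, and unlike the kernel's print_tainted() it emits letters only for set flags rather than a fixed-width string with blanks.

```c
#include <assert.h>
#include <stddef.h>

/* Bit numbers as defined in include/linux/kernel.h by this patch. */
#define TAINT_OOT_MODULE       12
#define TAINT_UNSIGNED_MODULE  13
#define TAINT_SOFTLOCKUP       14
#define TAINT_LIVEPATCH        15

/* Hypothetical subset of the kernel's tnts[] table: bit -> letter when set. */
struct demo_taint_letter {
	unsigned int bit;
	char c_true;
};

static const struct demo_taint_letter demo_taint_letters[] = {
	{ TAINT_OOT_MODULE,      'O' },
	{ TAINT_UNSIGNED_MODULE, 'E' },
	{ TAINT_SOFTLOCKUP,      'L' },
	{ TAINT_LIVEPATCH,       'K' },
};

/* Value a single flag contributes to /proc/sys/kernel/tainted. */
static unsigned long demo_taint_value(unsigned int bit)
{
	assert(bit < 8 * sizeof(unsigned long));
	return 1UL << bit;
}

/*
 * Write one letter per set flag (from the subset above) into buf and
 * NUL-terminate it. Returns the number of letters written.
 */
static size_t demo_decode_taint(unsigned long mask, char *buf, size_t len)
{
	size_t i, n = 0;
	size_t count = sizeof(demo_taint_letters) / sizeof(demo_taint_letters[0]);

	assert(len > 0);
	for (i = 0; i < count && n + 1 < len; i++)
		if (mask & demo_taint_value(demo_taint_letters[i].bit))
			buf[n++] = demo_taint_letters[i].c_true;
	buf[n] = '\0';
	return n;
}
```

For example, a mask of 49152 (16384 + 32768) decodes to "LK".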



[PATCHv4 2/3] kernel: add support for live patching

2014-11-25 Thread Seth Jennings
This commit introduces code for the live patching core.  It implements
an ftrace-based mechanism and kernel interface for doing live patching
of kernel and kernel module functions.

It represents the greatest common functionality set between kpatch and
kgraft and can accept patches built using either method.

This first version does not implement any consistency mechanism that
ensures that old and new code do not run together.  In practice, ~90% of
CVEs are safe to apply in this way, since they simply add a conditional
check.  However, any function change that cannot execute safely with
the old version of the function can _not_ be safely applied in this
version.

Signed-off-by: Seth Jennings 
---
 Documentation/ABI/testing/sysfs-kernel-livepatch |  44 ++
 MAINTAINERS                                      |  13 +
 arch/x86/Kconfig                                 |   3 +
 arch/x86/include/asm/livepatch.h                 |  40 ++
 arch/x86/kernel/Makefile                         |   1 +
 arch/x86/kernel/livepatch.c                      |  74 +++
 include/linux/livepatch.h                        | 121
 kernel/Makefile                                  |   1 +
 kernel/livepatch/Kconfig                         |  18 +
 kernel/livepatch/Makefile                        |   3 +
 kernel/livepatch/core.c                          | 807 +++
 11 files changed, 1125 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-livepatch
 create mode 100644 arch/x86/include/asm/livepatch.h
 create mode 100644 arch/x86/kernel/livepatch.c
 create mode 100644 include/linux/livepatch.h
 create mode 100644 kernel/livepatch/Kconfig
 create mode 100644 kernel/livepatch/Makefile
 create mode 100644 kernel/livepatch/core.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch
new file mode 100644
index 000..5bf42a8
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-livepatch
@@ -0,0 +1,44 @@
+What:  /sys/kernel/livepatch
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   Interface for kernel live patching
+
+   The /sys/kernel/livepatch directory contains subdirectories for
+   each loaded live patch module.
+
+What:  /sys/kernel/livepatch/
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The patch directory contains subdirectories for each kernel
+   object (vmlinux or a module) in which it patched functions.
+
+What:  /sys/kernel/livepatch//enabled
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   A writable attribute that indicates whether the patched
+   code is currently applied.  Writing 0 will disable the patch
+   while writing 1 will re-enable the patch.
+
+What:  /sys/kernel/livepatch//
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The object directory contains subdirectories for each function
+   that is patched within the object.
+
+What:  /sys/kernel/livepatch///
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The function directory contains attributes regarding the
+   properties and state of the patched function.
+
+   There are currently no such attributes.
diff --git a/MAINTAINERS b/MAINTAINERS
index 4861577..7985293 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5715,6 +5715,19 @@ F:   Documentation/misc-devices/lis3lv02d
 F: drivers/misc/lis3lv02d/
 F: drivers/platform/x86/hp_accel.c
 
+LIVE PATCHING
+M: Josh Poimboeuf 
+M: Seth Jennings 
+M: Jiri Kosina 
+M: Vojtech Pavlik 
+S: Maintained
+F: kernel/livepatch/
+F: include/linux/livepatch.h
+F: arch/x86/include/asm/livepatch.h
+F: arch/x86/kernel/livepatch.c
+F: Documentation/ABI/testing/sysfs-kernel-livepatch
+L: live-patch...@vger.kernel.org
+
 LLC (802.2)
 M: Arnaldo Carvalho de Melo 
 S: Maintained
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ec21dfd..78715fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
depends on 64BIT
select X86_DEV_DMA_OPS
select ARCH_USE_CMPXCHG_LOCKREF
+   select ARCH_HAVE_LIVE_PATCHING
 
 ### Arch settings
 config X86
@@ -1991,6 +1992,8 @@ config CMDLINE_OVERRIDE
  This is used to work around broken boot loaders.  This should
  be set to 'N' under normal conditions.
 
+source "kernel/livepatch/Kconfig"
+
 endmenu
 
 config ARCH_ENABLE_MEMORY_HOTPLUG
diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h
new file mode 100644
in

[PATCHv4 3/3] samples: add sample live patching module

2014-11-25 Thread Seth Jennings
Add a sample live patching module.

Signed-off-by: Seth Jennings 
---
 MAINTAINERS                          |  1 +
 samples/Kconfig                      |  7 +++
 samples/livepatch/Makefile           |  1 +
 samples/livepatch/livepatch-sample.c | 87
 4 files changed, 96 insertions(+)
 create mode 100644 samples/livepatch/Makefile
 create mode 100644 samples/livepatch/livepatch-sample.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7985293..9d3f9d9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5726,6 +5726,7 @@ F:include/linux/livepatch.h
 F: arch/x86/include/asm/livepatch.h
 F: arch/x86/kernel/livepatch.c
 F: Documentation/ABI/testing/sysfs-kernel-livepatch
+F: samples/livepatch/
 L: live-patch...@vger.kernel.org
 
 LLC (802.2)
diff --git a/samples/Kconfig b/samples/Kconfig
index 6181c2c..0aed20d 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -63,4 +63,11 @@ config SAMPLE_RPMSG_CLIENT
  to communicate with an AMP-configured remote processor over
  the rpmsg bus.
 
+config SAMPLE_LIVE_PATCHING
+   tristate "Build live patching sample -- loadable modules only"
+   depends on LIVE_PATCHING && m
+   help
+ Builds a sample live patch that replaces the procfs handler
+ for /proc/cmdline to print "this has been live patched".
+
 endif # SAMPLES
diff --git a/samples/livepatch/Makefile b/samples/livepatch/Makefile
new file mode 100644
index 000..7f1cdc1
--- /dev/null
+++ b/samples/livepatch/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_SAMPLE_LIVE_PATCHING) += livepatch-sample.o
diff --git a/samples/livepatch/livepatch-sample.c b/samples/livepatch/livepatch-sample.c
new file mode 100644
index 000..21f159d
--- /dev/null
+++ b/samples/livepatch/livepatch-sample.c
@@ -0,0 +1,87 @@
+/*
+ * livepatch-sample.c - Kernel Live Patching Sample Module
+ *
+ * Copyright (C) 2014 Seth Jennings 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include 
+#include 
+#include 
+
+/*
+ * This (dumb) live patch overrides the function that prints the
+ * kernel boot cmdline when /proc/cmdline is read.
+ *
+ * Example:
+ * $ cat /proc/cmdline
+ * 
+ * $ insmod livepatch-sample.ko
+ * $ cat /proc/cmdline
+ * this has been live patched
+ * $ echo 0 > /sys/kernel/livepatch/livepatch_sample/enabled
+ * 
+ */
+
+#include 
+static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
+{
+   seq_printf(m, "%s\n", "this has been live patched");
+   return 0;
+}
+
+static struct klp_func funcs[] = {
+   {
+   .old_name = "cmdline_proc_show",
+   .new_func = livepatch_cmdline_proc_show,
+   }, { }
+};
+
+static struct klp_object objs[] = {
+   {
+   /* name being NULL means vmlinux */
+   .funcs = funcs,
+   }, { }
+};
+
+static struct klp_patch patch = {
+   .mod = THIS_MODULE,
+   .objs = objs,
+};
+
+static int livepatch_init(void)
+{
+   int ret;
+
+   ret = klp_register_patch(&patch);
+   if (ret)
+   return ret;
+   ret = klp_enable_patch(&patch);
+   if (ret) {
+   WARN_ON(klp_unregister_patch(&patch));
+   return ret;
+   }
+   return 0;
+}
+
+static void livepatch_exit(void)
+{
+   WARN_ON(klp_disable_patch(&patch));
+   WARN_ON(klp_unregister_patch(&patch));
+}
+
+module_init(livepatch_init);
+module_exit(livepatch_exit);
+MODULE_LICENSE("GPL");
-- 
1.9.3



[PATCHv4 0/3] Kernel Live Patching

2014-11-25 Thread Seth Jennings
.kernel.org/cgit/linux/kernel/git/jirislaby/kgraft.git/
[3] https://etherpad.fr/p/LPC2014_LivePatching

Seth Jennings (3):
  kernel: add TAINT_LIVEPATCH
  kernel: add support for live patching
  samples: add sample live patching module

 Documentation/ABI/testing/sysfs-kernel-livepatch |  44 ++
 Documentation/oops-tracing.txt                   |   2 +
 Documentation/sysctl/kernel.txt                  |   1 +
 MAINTAINERS                                      |  14 +
 arch/x86/Kconfig                                 |   3 +
 arch/x86/include/asm/livepatch.h                 |  40 ++
 arch/x86/kernel/Makefile                         |   1 +
 arch/x86/kernel/livepatch.c                      |  74 +++
 include/linux/kernel.h                           |   1 +
 include/linux/livepatch.h                        | 121
 kernel/Makefile                                  |   1 +
 kernel/livepatch/Kconfig                         |  18 +
 kernel/livepatch/Makefile                        |   3 +
 kernel/livepatch/core.c                          | 807 +++
 kernel/panic.c                                   |   2 +
 samples/Kconfig                                  |   7 +
 samples/livepatch/Makefile                       |   1 +
 samples/livepatch/livepatch-sample.c             |  87 +++
 18 files changed, 1227 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-livepatch
 create mode 100644 arch/x86/include/asm/livepatch.h
 create mode 100644 arch/x86/kernel/livepatch.c
 create mode 100644 include/linux/livepatch.h
 create mode 100644 kernel/livepatch/Kconfig
 create mode 100644 kernel/livepatch/Makefile
 create mode 100644 kernel/livepatch/core.c
 create mode 100644 samples/livepatch/Makefile
 create mode 100644 samples/livepatch/livepatch-sample.c

-- 
1.9.3



Re: [PATCHv3 2/3] kernel: add support for live patching

2014-11-21 Thread Seth Jennings
On Fri, Nov 21, 2014 at 06:35:47PM +0100, Jiri Slaby wrote:
> On 11/21/2014, 05:40 PM, Seth Jennings wrote:
> >>> --- /dev/null
> >>> +++ b/arch/x86/include/asm/livepatch.h
> >>> @@ -0,0 +1,37 @@
> ...
> >>> +#ifndef _ASM_X86_LIVEPATCH_H
> >>> +#define _ASM_X86_LIVEPATCH_H
> >>> +
> >>> +#include 
> >>> +
> >>> +#ifdef CONFIG_LIVE_PATCHING
> >>> +extern int klp_write_module_reloc(struct module *mod, unsigned long type,
> >>> +   unsigned long loc, unsigned long value);
> >>> +
> >>> +#else
> >>> +static int klp_write_module_reloc(struct module *mod, unsigned long type,
> >>
> >> static inline?
> > 
> > I think the practice is to let the compiler handle inline determination
> > unless you are sure that the compiler isn't inlining something you think
> > it should.
> 
> Although you are right that it is correct C, the gcc documentation (6.39)
> suggests using 'static inline' for such functions; gcc will then inline them.

Fair enough.  Queued up.
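A minimal userspace sketch of the header pattern being discussed, assuming a hypothetical CONFIG_DEMO_LIVE_PATCHING macro and demo_write_module_reloc() in place of CONFIG_LIVE_PATCHING and klp_write_module_reloc() (the struct module argument is dropped so it compiles outside the kernel, and the -ENOSYS return is an assumption). A practical reason for `static inline` in a header: gcc's -Wunused-function (enabled by -Wall) warns about an unused plain `static` function in every file that includes the header, but not about unused inline ones.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for CONFIG_LIVE_PATCHING; left undefined so the stub is used. */
/* #define CONFIG_DEMO_LIVE_PATCHING 1 */

#ifdef CONFIG_DEMO_LIVE_PATCHING
/* The real implementation would live in an arch-specific .c file. */
extern int demo_write_module_reloc(unsigned long type, unsigned long loc,
				   unsigned long value);
#else
/*
 * 'static inline' rather than plain 'static': files that include this
 * header but never call the stub get no unused-function warning, and
 * callers can have the call folded away.
 */
static inline int demo_write_module_reloc(unsigned long type,
					  unsigned long loc,
					  unsigned long value)
{
	(void)type;
	(void)loc;
	(void)value;
	return -ENOSYS;	/* feature compiled out */
}
#endif
```

With the macro left undefined, a call such as demo_write_module_reloc(0, 0, 0) simply returns -ENOSYS.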

Thanks,
Seth

> 
> And if you look around in the kernel, we use that combination almost
> everywhere.
> 
> thanks,
> -- 
> js
> suse labs


Re: [PATCHv3 3/3] kernel: add sysfs documentation for live patching

2014-11-21 Thread Seth Jennings
On Fri, Nov 21, 2014 at 11:49:46AM +0900, Masami Hiramatsu wrote:
> (2014/11/21 7:29), Seth Jennings wrote:
> > Adds sysfs interface documentation to Documentation/ABI/testing/
> > 
> 
> Hmm, is there any reason to decouple this documentation from the code patch?
> I think we'd better merge this into 2/3, or move the sysfs interface code
> from 2/3 to this one.

Good point.  I'll collapse this into 2/3 for v4.

Thanks,
Seth

> 
> Thank you,
> 
> > Signed-off-by: Seth Jennings 
> > ---
> >  Documentation/ABI/testing/sysfs-kernel-livepatch | 44
> >  MAINTAINERS                                      |  1 +
> >  2 files changed, 45 insertions(+)
> >  create mode 100644 Documentation/ABI/testing/sysfs-kernel-livepatch
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch
> > new file mode 100644
> > index 000..5bf42a8
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-kernel-livepatch
> > @@ -0,0 +1,44 @@
> > +What:  /sys/kernel/livepatch
> > +Date:  Nov 2014
> > +KernelVersion: 3.19.0
> > +Contact:   live-patch...@vger.kernel.org
> > +Description:
> > +   Interface for kernel live patching
> > +
> > +   The /sys/kernel/livepatch directory contains subdirectories for
> > +   each loaded live patch module.
> > +
> > +What:  /sys/kernel/livepatch/
> > +Date:  Nov 2014
> > +KernelVersion: 3.19.0
> > +Contact:   live-patch...@vger.kernel.org
> > +Description:
> > +   The patch directory contains subdirectories for each kernel
> > +   object (vmlinux or a module) in which it patched functions.
> > +
> > +What:  /sys/kernel/livepatch//enabled
> > +Date:  Nov 2014
> > +KernelVersion: 3.19.0
> > +Contact:   live-patch...@vger.kernel.org
> > +Description:
> > +   A writable attribute that indicates whether the patched
> > +   code is currently applied.  Writing 0 will disable the patch
> > +   while writing 1 will re-enable the patch.
> > +
> > +What:  /sys/kernel/livepatch//
> > +Date:  Nov 2014
> > +KernelVersion: 3.19.0
> > +Contact:   live-patch...@vger.kernel.org
> > +Description:
> > +   The object directory contains subdirectories for each function
> > +   that is patched within the object.
> > +
> > +What:  /sys/kernel/livepatch///
> > +Date:  Nov 2014
> > +KernelVersion: 3.19.0
> > +Contact:   live-patch...@vger.kernel.org
> > +Description:
> > +   The function directory contains attributes regarding the
> > +   properties and state of the patched function.
> > +
> > +   There are currently no such attributes.
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index c7f49ae..7985293 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -5725,6 +5725,7 @@ F:kernel/livepatch/
> >  F: include/linux/livepatch.h
> >  F: arch/x86/include/asm/livepatch.h
> >  F: arch/x86/kernel/livepatch.c
> > +F: Documentation/ABI/testing/sysfs-kernel-livepatch
> >  L: live-patch...@vger.kernel.org
> >  
> >  LLC (802.2)
> > 
> 
> 
> -- 
> Masami HIRAMATSU
> Software Platform Research Dept. Linux Technology Research Center
> Hitachi, Ltd., Yokohama Research Laboratory
> E-mail: masami.hiramatsu...@hitachi.com
> 
> 


Re: [PATCHv3 2/3] kernel: add support for live patching

2014-11-21 Thread Seth Jennings
On Fri, Nov 21, 2014 at 01:22:33AM +0100, Jiri Kosina wrote:
> On Thu, 20 Nov 2014, Seth Jennings wrote:
> 
> > This commit introduces code for the live patching core.  It implements
> > an ftrace-based mechanism and kernel interface for doing live patching
> > of kernel and kernel module functions.
> > 
> > It represents the greatest common functionality set between kpatch and
> > kgraft and can accept patches built using either method.
> > 
> > This first version does not implement any consistency mechanism that
> > ensures that old and new code do not run together.  In practice, ~90% of
> > CVEs are safe to apply in this way, since they simply add a conditional
> > check.  However, any function change that can not execute safely with
> > the old version of the function can _not_ be safely applied in this
> > version.
> > 
> > Signed-off-by: Seth Jennings 
> 
> I think this is getting really close, which is awesome. A few rather minor 
> nits below.
> 
> [ ... snip ... ]
> > diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h
> > new file mode 100644
> > index 000..2ed86ec
> > --- /dev/null
> > +++ b/arch/x86/include/asm/livepatch.h
> > @@ -0,0 +1,37 @@
> > +/*
> > + * livepatch.h - x86-specific Kernel Live Patching Core
> > + *
> > + * Copyright (C) 2014 Seth Jennings 
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef _ASM_X86_LIVEPATCH_H
> > +#define _ASM_X86_LIVEPATCH_H
> > +
> > +#include 
> > +
> > +#ifdef CONFIG_LIVE_PATCHING
> > +extern int klp_write_module_reloc(struct module *mod, unsigned long type,
> > + unsigned long loc, unsigned long value);
> > +
> > +#else
> > +static int klp_write_module_reloc(struct module *mod, unsigned long type,
> 
> static inline?

I think the practice is to let the compiler handle inline determination
unless you are sure that the compiler isn't inlining something you think
it should.

All other changes are accepted and queued for v4.

Thanks,
Seth

> 
> [ ... snip ... ]
> > --- /dev/null
> > +++ b/kernel/livepatch/Kconfig
> > @@ -0,0 +1,18 @@
> > +config ARCH_HAVE_LIVE_PATCHING
> > +   boolean
> > +   help
> > + Arch supports kernel live patching
> > +
> > +config LIVE_PATCHING
> > +   boolean "Kernel Live Patching"
> > +   depends on DYNAMIC_FTRACE_WITH_REGS
> > +   depends on MODULES
> > +   depends on SYSFS
> > +   depends on KALLSYMS_ALL
> > +   depends on ARCH_HAVE_LIVE_PATCHING
> 
> We have to refuse to build on x86_64 if the compiler doesn't support 
> fentry. mcount is not really usable (well, it would be possible to use it,
> but the obstacles are too big to care).
> 
> Something like [1] should be applicable here as well I believe.
> 
> [1] 
> https://git.kernel.org/cgit/linux/kernel/git/jirislaby/kgraft.git/commit/?h=kgraft&id=bd4bc097c72937d18036f1312a4d79ed0bea9991
> 
> [ ... snip ... ]
> > --- /dev/null
> > +++ b/kernel/livepatch/core.c
> > @@ -0,0 +1,828 @@
> > +/*
> > + * core.c - Kernel Live Patching Core
> > + *
> > + * Copyright (C) 2014 Seth Jennings 
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * 

Re: [PATCHv3 2/3] kernel: add support for live patching

2014-11-21 Thread Seth Jennings
On Fri, Nov 21, 2014 at 04:46:32PM +0100, Miroslav Benes wrote:
> On Fri, 21 Nov 2014, Josh Poimboeuf wrote:
> 
> > On Fri, Nov 21, 2014 at 03:44:35PM +0100, Miroslav Benes wrote:
> > > On Fri, 21 Nov 2014, Jiri Kosina wrote:
> > > 
> > > [...]
> > > 
> > > > [ ... snip ... ]
> > > > > +static int klp_init_patch(struct klp_patch *patch)
> > > > > +{
> > > > > + int ret;
> > > > > +
> > > > > + mutex_lock(&klp_mutex);
> > > > > +
> > > > > + /* init */
> > > > > + patch->state = LPC_DISABLED;
> > > > > +
> > > > > + /* sysfs */
> > > > > + ret = kobject_init_and_add(&patch->kobj, &klp_ktype_patch,
> > > > > +klp_root_kobj, patch->mod->name);
> > > > > + if (ret)
> > > > > + return ret;
> > > > 
> > > > klp_mutex is leaked locked here.
> > > > 
> > > > > +
> > > > > + /* create objects */
> > > > > + ret = klp_init_objects(patch);
> > > > > + if (ret) {
> > > > > + kobject_put(&patch->kobj);
> > > > > + return ret;
> > > > 
> > > > And here as well.
> > > > 
> > > > All in all, this is looking very good to me. I think we are really
> > > > close to having code that all the parties would agree with. Thanks
> > > > everybody,
> > > 
> > > The leaking is my fault. I missed that somehow during rebasing.
> > > 
> > > Seth, could you please fix it in v4? 
> > 
> > Is it necessary to grab the mutex at the beginning of klp_init_patch?  I
> > think we only need it when adding it to the global list at the end of
> > the function.
> 
> I think it's not necessary now after thinking about that. Init values could
> be written twice to a patch structure if klp_register_patch were called
> twice, but that should not corrupt anything, and adding to the global list
> is protected. However, I think we should define what is protected by
> klp_mutex and document it near the mutex definition (whether only the
> klp_patches list is protected or something more in the future).

I'll fix it up for v4, moving the mutex just around the list_add() and
adding a comment about what is protected by the mutex.
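The restructuring agreed on above can be sketched in userspace, assuming pthreads in place of the kernel mutex and a hypothetical demo_patches list in place of klp_patches. Setup that only touches the caller's own structure runs unlocked, the lock covers only the list insertion, and the error path unwinds through a single label, so no return can leave the mutex held. This illustrates the locking pattern only; it is not the v4 code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct klp_patch. */
struct demo_patch {
	const char *name;
	int state;		/* 0 = disabled */
	int objects_ready;
	struct demo_patch *next;
};

/* demo_mutex protects only the demo_patches list and its next pointers. */
static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct demo_patch *demo_patches;

static int demo_init_objects(struct demo_patch *patch)
{
	if (!patch->name)
		return -EINVAL;	/* simulate a failing init step */
	patch->objects_ready = 1;
	return 0;
}

static void demo_release(struct demo_patch *patch)
{
	patch->objects_ready = 0;	/* undo partial init, like kobject_put() */
}

static int demo_init_patch(struct demo_patch *patch)
{
	int ret;

	/* Per-patch setup touches only this patch: no lock needed. */
	patch->state = 0;
	ret = demo_init_objects(patch);
	if (ret)
		goto err;

	/* The lock covers only the shared list. */
	pthread_mutex_lock(&demo_mutex);
	patch->next = demo_patches;
	demo_patches = patch;
	pthread_mutex_unlock(&demo_mutex);
	return 0;

err:
	demo_release(patch);
	return ret;
}

/* Helper for checking that no path left the lock held. */
static int demo_mutex_is_free(void)
{
	if (pthread_mutex_trylock(&demo_mutex) != 0)
		return 0;
	pthread_mutex_unlock(&demo_mutex);
	return 1;
}
```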

Seth

> 
> Mira


[PATCHv3 0/3] Kernel Live Patching

2014-11-20 Thread Seth Jennings
Changelog:

Thanks for all the feedback!

changes in v3:
- merge API and core data structures (Miroslav)
- replace lp_ and lpc_ prefixes with klp_
- add ARCH_HAVE_LIVE_PATCHING
- guard livepatch.h with IS_ENABLED(CONFIG_LIVE_PATCHING)
- smaller cleanups
- TODO: FTRACE_OPS_FL_DYNAMIC? allocate fops in core?
- TODO: kernel-doc for API structs once agreed upon

changes in v2:
- rebase to next-20141113
- add copyright/license block to livepatch.h
- add _LINUX prefix to header defines
- replace semaphore with mutex
- add LPC_ prefix to state enum
- convert BUGs to WARNs and handle properly
- change Kconfig default to n
- remove [old|new] attrs from function sysfs dir (KASLR leak, no use)
- disregard user provided old_addr if kernel uses KASLR
- s/out/err for error path labels
- s/unregister/disable for uniform terminology
- s/lp/lpc for module notifier elements
- replace module ref'ing with unload notifier + mutex protection
- adjust notifier priority to run before ftrace
- make LIVE_PATCHING boolean (about to depend on arch stuff)
- move x86-specific reloc code to arch/x86
- s/dynrela/reloc/
- add live patching sysfs documentation
- add API function kernel-doc
- TODO: kernel-doc for API structs once agreed upon

Summary:

This patchset implements an ftrace-based mechanism and kernel interface for
doing live patching of kernel and kernel module functions.  It represents the
greatest common functionality set between kpatch [1] and kGraft [2] and can
accept patches built using either method.  This solution was discussed in the
Live Patching Mini-conference at LPC 2014 [3].

The model consists of a live patching "core" that provides an interface for
other "patch" kernel modules to register patches with the core.

Patch modules contain the new function code and create a klp_patch structure
containing the required data about what functions to patch, where the new code
for each patched function resides, and in which kernel object (vmlinux or
module) the function to be patched resides.  The patch module then invokes the
klp_register_patch() function to register with the core, then klp_enable_patch()
to have the core redirect the execution paths using ftrace.
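A userspace model of that registration data, assuming simplified hypothetical demo_* types that mirror the shape of the sample module's funcs[] and objs[] tables: each array ends with an empty sentinel entry, and a NULL object name stands for vmlinux. It only shows how the nested tables are walked; it does not reproduce the kernel's real structures or the ftrace redirection.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*demo_fn)(void);

/* Hypothetical mirrors of the sample module's tables, not the kernel's definitions. */
struct demo_func {
	const char *old_name;	/* function to replace; NULL ends the array */
	demo_fn new_func;	/* replacement code in the patch module */
};

struct demo_object {
	const char *name;	/* target object; NULL means vmlinux */
	struct demo_func *funcs;	/* NULL ends the array */
};

struct demo_patch_desc {
	struct demo_object *objs;
};

static int demo_new_cmdline(void)
{
	return 42;	/* stands in for livepatch_cmdline_proc_show() */
}

static struct demo_func demo_funcs[] = {
	{ "cmdline_proc_show", demo_new_cmdline },
	{ NULL, NULL }		/* sentinel, like { } in the sample */
};

static struct demo_object demo_objs[] = {
	{ NULL, demo_funcs },	/* NULL name: vmlinux */
	{ NULL, NULL }		/* sentinel */
};

static struct demo_patch_desc demo_sample_patch = { demo_objs };

/* Walk objects, then functions, stopping at each sentinel. */
static demo_fn demo_find_func(const struct demo_patch_desc *patch,
			      const char *obj_name, const char *old_name)
{
	const struct demo_object *obj;
	const struct demo_func *func;

	assert(obj_name != NULL && old_name != NULL);
	for (obj = patch->objs; obj->funcs; obj++) {
		const char *name = obj->name ? obj->name : "vmlinux";

		if (strcmp(name, obj_name) != 0)
			continue;
		for (func = obj->funcs; func->old_name; func++)
			if (strcmp(func->old_name, old_name) == 0)
				return func->new_func;
	}
	return NULL;
}
```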

An example patch module can be found here:
https://github.com/spartacus06/livepatch/blob/master/patch/patch.c

The live patching core creates a sysfs hierarchy for user-level access to live
patching information.  The hierarchy is structured like this:

/sys/kernel/livepatch
/sys/kernel/livepatch/
/sys/kernel/livepatch//enabled
/sys/kernel/livepatch//
/sys/kernel/livepatch///

The old function is located using one of two methods: either its address is
provided by the patch module (only possible for a function in vmlinux) or it is
found by kallsyms lookup.  Symbol ambiguity results in a failure.
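A sketch of that lookup rule over a hypothetical in-memory symbol table (the core uses kallsyms instead). It also folds in the v2 changelog item about KASLR: a caller-supplied address is honored only for vmlinux and only when KASLR is off. Otherwise the name must match exactly once, and both "no match" and "multiple matches" fail. The specific error codes are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol table entry standing in for a kallsyms record. */
struct demo_sym {
	const char *name;
	unsigned long addr;
};

/*
 * Resolve old_name to an address.
 * old_addr:   address supplied by the patch module (0 if none)
 * is_vmlinux: nonzero if the target object is vmlinux
 * kaslr:      nonzero if the kernel was booted with KASLR
 */
static int demo_find_old_addr(const struct demo_sym *syms, size_t nsyms,
			      const char *old_name, unsigned long old_addr,
			      int is_vmlinux, int kaslr, unsigned long *out)
{
	size_t i, matches = 0;
	unsigned long found = 0;

	assert(out != NULL);
	/* A user-provided address is only trusted for vmlinux without KASLR. */
	if (old_addr && is_vmlinux && !kaslr) {
		*out = old_addr;
		return 0;
	}
	for (i = 0; i < nsyms; i++) {
		if (strcmp(syms[i].name, old_name) == 0) {
			matches++;
			found = syms[i].addr;
		}
	}
	if (matches == 0)
		return -ENOENT;	/* symbol not found */
	if (matches > 1)
		return -EINVAL;	/* ambiguous symbol: refuse to guess */
	*out = found;
	return 0;
}
```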

The core takes a reference on the patch module itself to keep it from
unloading.  This is because, without a mechanism to ensure that no thread is
currently executing in the patched function, we cannot determine whether it is
safe to unload the patch module.  For this reason, unloading patch modules is
currently not allowed.

Disabling patches can be done using the "enabled" attribute of the patch:

echo 0 > /sys/kernel/livepatch//enabled
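The same toggle can be done from C. The sketch below assumes the layout above, and its helper names are hypothetical. The base directory is a parameter so the helpers can be pointed somewhere other than /sys, and the patch name must be a single path component (the core names the directory after the patch module).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<base>/<patch>/enabled" into buf.
 * Returns 0 on success, -1 if the name is invalid or the path is truncated.
 */
static int demo_enabled_path(char *buf, size_t len, const char *base,
			     const char *patch)
{
	int n;

	assert(buf != NULL && len > 0);
	/* A module name never contains '/', so reject anything that does. */
	if (patch[0] == '\0' || strchr(patch, '/'))
		return -1;
	n = snprintf(buf, len, "%s/%s/enabled", base, patch);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" (enable) or "0" (disable); returns 0 on success, -1 on error. */
static int demo_set_enabled(const char *base, const char *patch, int enable)
{
	char path[256];
	FILE *f;
	int ret = 0;

	if (demo_enabled_path(path, sizeof(path), base, patch))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(enable ? "1\n" : "0\n", f) == EOF)
		ret = -1;
	/* stdio buffers the data, so a rejected write surfaces at fclose(). */
	if (fclose(f) == EOF)
		ret = -1;
	return ret;
}
```

Assuming the sample module is loaded (kbuild names livepatch-sample.ko as module livepatch_sample), demo_set_enabled("/sys/kernel/livepatch", "livepatch_sample", 0) would correspond to the echo command above.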

If a patch module contains a patch for a module that is not currently loaded,
there is nothing to patch, so the core does nothing for that patch object.
However, the core registers a module notifier that looks for COMING events so
that if the module is ever loaded, it is immediately patched.  If a patched
module is removed, the notifier looks for GOING events and disables any
patched functions for that object before the module unloads.  The notifier has
a higher priority than the ftrace notifier, so it runs first on GOING events
and the core can cleanly unregister from ftrace.
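Why the priority matters can be modeled in userspace. The sketch assumes a hypothetical chain that, like the kernel's notifier chains, keeps callbacks sorted by descending priority (equal priorities in registration order) and calls them in that order. The callbacks only record the order in which they ran, and the priority values are illustrative, not the ones used by the patch.

```c
#include <assert.h>
#include <stddef.h>

enum demo_event { DEMO_MODULE_COMING, DEMO_MODULE_GOING };

struct demo_notifier {
	int (*call)(enum demo_event ev);
	int priority;
	struct demo_notifier *next;
};

/* Insert so higher priority runs first; equal priorities keep registration order. */
static void demo_register(struct demo_notifier **head, struct demo_notifier *n)
{
	while (*head && (*head)->priority >= n->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

static void demo_call_chain(struct demo_notifier *head, enum demo_event ev)
{
	for (; head; head = head->next)
		head->call(ev);
}

/* Call-order log: 'L' = live patching notifier, 'F' = ftrace notifier. */
static char demo_log[8];
static size_t demo_log_len;

static void demo_log_char(char c)
{
	assert(demo_log_len + 1 < sizeof(demo_log));
	demo_log[demo_log_len++] = c;
	demo_log[demo_log_len] = '\0';
}

static int demo_klp_notify(enum demo_event ev)
{
	(void)ev;
	demo_log_char('L');	/* e.g. disable patched functions on GOING */
	return 0;
}

static int demo_ftrace_notify(enum demo_event ev)
{
	(void)ev;
	demo_log_char('F');	/* e.g. drop ftrace records on GOING */
	return 0;
}
```

Registering the ftrace callback first at priority 0 and the live patching callback at priority 1 still yields the call order L, F.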

kpatch and kGraft each have their own mechanisms for ensuring system
consistency during the patching process. This first version does not implement
any consistency mechanism that ensures that old and new code do not run
together.  In practice, ~90% of CVEs are safe to apply in this way, since they
simply add a conditional check.  However, any function change that cannot
execute safely with the old version of the function can _not_ be safely applied
for now.
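To make "simply add a conditional check" concrete, here is a hypothetical before/after pair in userspace C. A function pointer stands in for the redirection; the real core redirects calls through ftrace, not through a pointer the caller reads, and this sketch is single-threaded. Because the new version only rejects inputs the old one mishandled, code still running the old version during the switch returns the same result for every valid input, which is why such fixes are tolerable without a consistency model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_TABLE_SIZE 4

static int demo_table[DEMO_TABLE_SIZE] = { 10, 20, 30, 40 };

/* Original (hypothetical) function: missing bounds check. */
static int demo_get_old(size_t idx, int *out)
{
	*out = demo_table[idx];	/* out-of-bounds read if idx >= DEMO_TABLE_SIZE */
	return 0;
}

/* Patched version: identical except for the added conditional check. */
static int demo_get_new(size_t idx, int *out)
{
	if (idx >= DEMO_TABLE_SIZE)
		return -EINVAL;
	*out = demo_table[idx];
	return 0;
}

/* Stand-in for the redirection: callers go through this pointer. */
static int (*demo_get)(size_t, int *) = demo_get_old;

static void demo_enable_patch(void)
{
	demo_get = demo_get_new;
}

static void demo_disable_patch(void)
{
	demo_get = demo_get_old;
}
```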

[1] https://github.com/dynup/kpatch
[2] https://git.kernel.org/cgit/linux/kernel/git/jirislaby/kgraft.git/
[3] https://etherpad.fr/p/LPC2014_LivePatching

Seth Jennings (3):
  kernel: add TAINT_LIVEPATCH
  kernel: add support for live patching
  kernel: add sysfs documentation for live patching

 Documentation/ABI/testing/sysfs-kernel-livepatch |  44 ++
 Documentation/oops-tracing.txt                   |   2 +
 Documentation/sysctl/kernel.txt                  |   1 +
 MAINTAINERS                                      |  13 +
 arch/x86/Kconfig                                 |   3 +
 arch/x86/include/asm/livepatch.h                 |  37 +
 arch/x86/kernel/Makefile

[PATCHv3 1/3] kernel: add TAINT_LIVEPATCH

2014-11-20 Thread Seth Jennings
This adds a new taint flag to indicate when the kernel or a kernel
module has been live patched.  This will provide a clean indication in
bug reports that live patching was used.

Additionally, if the crash occurs in a live patched function, the live
patch module will appear beside the patched function in the backtrace.

Signed-off-by: Seth Jennings 
---
 Documentation/oops-tracing.txt  | 2 ++
 Documentation/sysctl/kernel.txt | 1 +
 include/linux/kernel.h  | 1 +
 kernel/panic.c  | 2 ++
 4 files changed, 6 insertions(+)

diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt
index beefb9f..f3ac05c 100644
--- a/Documentation/oops-tracing.txt
+++ b/Documentation/oops-tracing.txt
@@ -270,6 +270,8 @@ characters, each representing a particular tainted value.
 
  15: 'L' if a soft lockup has previously occurred on the system.
 
+ 16: 'K' if the kernel has been live patched.
+
 The primary reason for the 'Tainted: ' string is to tell kernel
 debuggers if this is a clean kernel or if anything unusual has
 occurred.  Tainting is permanent: even if an offending module is
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 75511ef..83ab256 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -843,6 +843,7 @@ can be ORed together:
 8192 - An unsigned module has been loaded in a kernel supporting module
signature.
 16384 - A soft lockup has previously occurred on the system.
+32768 - The kernel has been live patched.
 
 ==
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5449d2f..d03e3de 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -471,6 +471,7 @@ extern enum system_states {
 #define TAINT_OOT_MODULE   12
 #define TAINT_UNSIGNED_MODULE  13
 #define TAINT_SOFTLOCKUP   14
+#define TAINT_LIVEPATCH15
 
 extern const char hex_asc[];
 #define hex_asc_lo(x)  hex_asc[((x) & 0x0f)]
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..8136ad7 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -226,6 +226,7 @@ static const struct tnt tnts[] = {
{ TAINT_OOT_MODULE, 'O', ' ' },
{ TAINT_UNSIGNED_MODULE,'E', ' ' },
{ TAINT_SOFTLOCKUP, 'L', ' ' },
+   { TAINT_LIVEPATCH,  'K', ' ' },
 };
 
 /**
@@ -246,6 +247,7 @@ static const struct tnt tnts[] = {
  *  'O' - Out-of-tree module has been loaded.
  *  'E' - Unsigned module has been loaded.
  *  'L' - A soft lockup has previously occurred.
+ *  'K' - Kernel has been live patched.
  *
  * The string is overwritten by the next call to print_tainted().
  */
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
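
With this patch, live patching sets bit 15 (value 32768) of the kernel.tainted mask, which is shown as 'K' in oops reports. The following is a small sketch for checking the bit from user space. It assumes the mask is read from /proc/sys/kernel/tainted, and the helper names are hypothetical:

```c
#include <stdio.h>

#define TAINT_LIVEPATCH 15	/* bit number added by this patch */

/* Parse a kernel.tainted value, e.g. from /proc/sys/kernel/tainted. */
static int read_tainted(FILE *f, unsigned long *mask)
{
	return fscanf(f, "%lu", mask) == 1 ? 0 : -1;
}

/* Nonzero if the live patch taint ('K') is set in the mask. */
static int is_live_patched(unsigned long mask)
{
	return (int)((mask >> TAINT_LIVEPATCH) & 1UL);
}
```

On a live-patched kernel, passing the opened /proc/sys/kernel/tainted stream to read_tainted() and then calling is_live_patched() should report the flag. A value of 16384 (bit 14, the soft-lockup 'L' flag) alone does not.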


[PATCHv3 2/3] kernel: add support for live patching

2014-11-20 Thread Seth Jennings
This commit introduces code for the live patching core.  It implements
an ftrace-based mechanism and kernel interface for doing live patching
of kernel and kernel module functions.

It represents the greatest common functionality set between kpatch and
kgraft and can accept patches built using either method.

This first version does not implement any consistency mechanism that
ensures that old and new code do not run together.  In practice, ~90% of
CVEs are safe to apply in this way, since they simply add a conditional
check.  However, any function change that can not execute safely with
the old version of the function can _not_ be safely applied in this
version.

Signed-off-by: Seth Jennings 
---
 MAINTAINERS  |  12 +
 arch/x86/Kconfig |   3 +
 arch/x86/include/asm/livepatch.h |  37 ++
 arch/x86/kernel/Makefile |   1 +
 arch/x86/kernel/livepatch.c  |  74 
 include/linux/livepatch.h|  96 +
 kernel/Makefile  |   1 +
 kernel/livepatch/Kconfig |  18 +
 kernel/livepatch/Makefile|   3 +
 kernel/livepatch/core.c  | 828 +++
 10 files changed, 1073 insertions(+)
 create mode 100644 arch/x86/include/asm/livepatch.h
 create mode 100644 arch/x86/kernel/livepatch.c
 create mode 100644 include/linux/livepatch.h
 create mode 100644 kernel/livepatch/Kconfig
 create mode 100644 kernel/livepatch/Makefile
 create mode 100644 kernel/livepatch/core.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4861577..c7f49ae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5715,6 +5715,18 @@ F:   Documentation/misc-devices/lis3lv02d
 F: drivers/misc/lis3lv02d/
 F: drivers/platform/x86/hp_accel.c
 
+LIVE PATCHING
+M: Josh Poimboeuf 
+M: Seth Jennings 
+M: Jiri Kosina 
+M: Vojtech Pavlik 
+S: Maintained
+F: kernel/livepatch/
+F: include/linux/livepatch.h
+F: arch/x86/include/asm/livepatch.h
+F: arch/x86/kernel/livepatch.c
+L: live-patch...@vger.kernel.org
+
 LLC (802.2)
 M: Arnaldo Carvalho de Melo 
 S: Maintained
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ec21dfd..78715fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
depends on 64BIT
select X86_DEV_DMA_OPS
select ARCH_USE_CMPXCHG_LOCKREF
+   select ARCH_HAVE_LIVE_PATCHING
 
 ### Arch settings
 config X86
@@ -1991,6 +1992,8 @@ config CMDLINE_OVERRIDE
  This is used to work around broken boot loaders.  This should
  be set to 'N' under normal conditions.
 
+source "kernel/livepatch/Kconfig"
+
 endmenu
 
 config ARCH_ENABLE_MEMORY_HOTPLUG
diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h
new file mode 100644
index 000..2ed86ec
--- /dev/null
+++ b/arch/x86/include/asm/livepatch.h
@@ -0,0 +1,37 @@
+/*
+ * livepatch.h - x86-specific Kernel Live Patching Core
+ *
+ * Copyright (C) 2014 Seth Jennings 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _ASM_X86_LIVEPATCH_H
+#define _ASM_X86_LIVEPATCH_H
+
+#include 
+
+#ifdef CONFIG_LIVE_PATCHING
+extern int klp_write_module_reloc(struct module *mod, unsigned long type,
+ unsigned long loc, unsigned long value);
+
+#else
+static int klp_write_module_reloc(struct module *mod, unsigned long type,
+ unsigned long loc, unsigned long value)
+{
+   return -ENOSYS;
+}
+#endif
+
+#endif /* _ASM_X86_LIVEPATCH_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 5d4502c..316b34e 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -63,6 +63,7 @@ obj-$(CONFIG_X86_MPPARSE) += mpparse.o
 obj-y  += apic/
 obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o
 obj-$(CONFIG_DYNAMIC_FTRACE)   += ftrace.o
+obj-$(CONFIG_LIVE_PATCHING)+= livepatch.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
 obj-$(CONFIG_FTRACE_SYSCALLS)  += ftrace.o
 obj-$(CONFIG_X86_TSC)  += trace_clock.o
diff --git a/arch/x86/kernel/livepatch.c b/arch/x86/kernel/livepatch.c
new file mode 100644
index 000..777a4a4
--- /dev/null
+++ b/arch/x86/kernel/livepatch.c
@@ -0,0 +1,74 @@
+/*
+ * livepatch.c - x86-specific Kernel Live Patching Core
+ *
+ * Copyright (C) 2014 Seth Jennings 
+ 

[PATCHv3 3/3] kernel: add sysfs documentation for live patching

2014-11-20 Thread Seth Jennings
Adds sysfs interface documentation to Documentation/ABI/testing/

Signed-off-by: Seth Jennings 
---
 Documentation/ABI/testing/sysfs-kernel-livepatch | 44 
 MAINTAINERS  |  1 +
 2 files changed, 45 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-livepatch

diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch
new file mode 100644
index 000..5bf42a8
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-livepatch
@@ -0,0 +1,44 @@
+What:  /sys/kernel/livepatch
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   Interface for kernel live patching
+
+   The /sys/kernel/livepatch directory contains subdirectories for
+   each loaded live patch module.
+
+What:  /sys/kernel/livepatch/
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The patch directory contains subdirectories for each kernel
+   object (vmlinux or a module) in which it patched functions.
+
+What:  /sys/kernel/livepatch//enabled
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   A writable attribute that indicates whether the patched
+   code is currently applied.  Writing 0 will disable the patch
+   while writing 1 will re-enable the patch.
+
+What:  /sys/kernel/livepatch//
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The object directory contains subdirectories for each function
+   that is patched within the object.
+
+What:  /sys/kernel/livepatch///
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The function directory contains attributes regarding the
+   properties and state of the patched function.
+
+   There are currently no such attributes.
diff --git a/MAINTAINERS b/MAINTAINERS
index c7f49ae..7985293 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5725,6 +5725,7 @@ F:kernel/livepatch/
 F: include/linux/livepatch.h
 F: arch/x86/include/asm/livepatch.h
 F: arch/x86/kernel/livepatch.c
+F: Documentation/ABI/testing/sysfs-kernel-livepatch
 L: live-patch...@vger.kernel.org
 
 LLC (802.2)
-- 
1.9.3



Re: [PATCHv2 2/3] kernel: add support for live patching

2014-11-20 Thread Seth Jennings
On Thu, Nov 20, 2014 at 11:35:52AM -0600, Josh Poimboeuf wrote:
> On Thu, Nov 20, 2014 at 02:10:33PM +0100, Miroslav Benes wrote:
> > 
> > On Sun, 16 Nov 2014, Seth Jennings wrote:
> > 
> > > This commit introduces code for the live patching core.  It implements
> > > an ftrace-based mechanism and kernel interface for doing live patching
> > > of kernel and kernel module functions.
> > > 
> > > It represents the greatest common functionality set between kpatch and
> > > kgraft and can accept patches built using either method.
> > > 
> > > This first version does not implement any consistency mechanism that
> > > ensures that old and new code do not run together.  In practice, ~90% of
> > > CVEs are safe to apply in this way, since they simply add a conditional
> > > check.  However, any function change that can not execute safely with
> > > the old version of the function can _not_ be safely applied in this
> > > version.
> > > 
> > > Signed-off-by: Seth Jennings 
> > 
> > Hi,
> > 
> > below is the patch which merges the internal and external data structures 
> > (so it is only one part of our original patch for version 1). Apart from 
> > that I tried to make minimal changes to the code. Only unnecessary 
> > kobjects were removed and I renamed lpc_create_* functions to lpc_init_* 
> > as it made more sense in this approach, I think.
> > 
> > I hope this clearly shows our point of view stated previously. What do 
> > you say?
> 
> Thanks for rebasing to v2 and splitting up the patches!  Personally I'm
> ok with this patch (though I do have a few comments below).

Thanks Josh :)

Miroslav, before you send out a revision of this patch: I'm merging it
for v3 right now, and I'll fold in any trivial fixes from this email.

I'm putting the finishing touches on v3 now.  Hopefully it will make
everyone happy, or happier, with your changes merged.  Should be getting
close...

Thanks,
Seth
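
Miroslav's patch, quoted in this thread, relies on patch modules initializing the structures with ".item = value" (designated initializer) syntax. The sketch below shows why that keeps modules source-compatible when internal fields are merged into the public structures. The structures are simplified, hypothetical copies of the ones quoted in this thread, and the replacement function is made up for illustration:

```c
#include <stddef.h>

/* Simplified, hypothetical versions of the structures in this thread. */
struct lp_func {
	/* external: filled in by the patch module */
	const char *old_name;
	void *new_func;
	unsigned long old_addr;
	/* internal: owned by the core; patch modules never name it */
	int state;
};

struct lp_object {
	const char *name;		/* "vmlinux" or module name */
	struct lp_func *funcs;		/* ends with old_name == NULL */
};

/* Hypothetical replacement function living in the patch module. */
static int livepatch_cmdline_show(void)
{
	return 0;
}

/* Only the fields the module cares about are named; the rest,
 * including the internal state field, are zero-initialized. */
static struct lp_func funcs[] = {
	{ .old_name = "cmdline_proc_show",
	  .new_func = (void *)livepatch_cmdline_show },
	{ .old_name = NULL }	/* sentinel */
};

static struct lp_object objs[] = {
	{ .name = "vmlinux", .funcs = funcs },
	{ .name = NULL }	/* sentinel */
};

static size_t count_funcs(const struct lp_object *obj)
{
	size_t n = 0;

	while (obj->funcs[n].old_name)
		n++;
	return n;
}
```

Adding or reordering internal members changes nothing in the module's initializers, which is the stability argument made in the quoted commit message.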

> 
> > Next, I'll look at the three level hierarchy and sysfs directory and see 
> > if we can make it simpler yet keep its advantages.
> > 
> > Regards,
> > 
> > Miroslav Benes
> > SUSE Labs
> > 
> > -- >8 --
> > From aba839eb6b3292b193843715bfce7834969c0c17 Mon Sep 17 00:00:00 2001
> > From: Miroslav Benes 
> > Date: Wed, 19 Nov 2014 16:06:35 +0100
> > Subject: [PATCH] Remove the data duplication in internal and public structures
> > 
> > The split of internal and external structures is cleaner and makes the API
> > more stable. But it makes the code more complicated: it requires more space
> > and data copying. Also, the one-letter difference between the names (lp_
> > vs. lpc_ prefix) causes confusion.
> > 
> > The API is not a real issue for live patching. We care about neither
> > backward nor forward compatibility. The dependency between a patch and the
> > kernel is even stricter than a version match: they have to use the same
> > configuration and the same build environment.
> > 
> > This patch merges the external and internal structures into one. The
> > structures are initialized using ".item = value" syntax, so the API is
> > basically as stable as it was before. We could later even hide it under
> > some helper macros if requested.
> > 
> > For the purpose of this patch, we used the prefix "lpc". This keeps the
> > changes as small as possible and shows the real effect. If the patch is
> > accepted, it would make sense to merge it into the original patch and even
> > use another common prefix, for example the proposed "klp".
> > 
> > Signed-off-by: Miroslav Benes 
> > ---
> >  include/linux/livepatch.h |  47 +--
> >  kernel/livepatch/core.c   | 338 
> > --
> >  2 files changed, 121 insertions(+), 264 deletions(-)
> > 
> > diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> > index 8b68fef..f16de32 100644
> > --- a/include/linux/livepatch.h
> > +++ b/include/linux/livepatch.h
> > @@ -21,10 +21,23 @@
> >  #define _LINUX_LIVEPATCH_H_
> >  
> >  #include 
> > +#include 
> >  
> >  /* TODO: add kernel-doc for structures once agreed upon */
> >  
> > -struct lp_func {
> > +enum lpc_state {
> > +   LPC_DISABLED,
> > +   LPC_ENABLED
> > +};
> > +
> > +struct lpc_func {
> > +   /* internal */
> 
> Would it be

Re: [PATCHv2 2/3] kernel: add support for live patching

2014-11-20 Thread Seth Jennings
On Thu, Nov 20, 2014 at 09:19:54AM -0600, Josh Poimboeuf wrote:
> On Sun, Nov 16, 2014 at 07:29:23PM -0600, Seth Jennings wrote:
> > +static int lpc_module_notify(struct notifier_block *nb, unsigned long action,
> > +   void *data)
> > +{
> > +   struct module *mod = data;
> > +   struct lpc_patch *patch;
> > +   struct lpc_object *obj;
> > +
> > +   mutex_lock(&lpc_mutex);
> > +
> > +   if (action != MODULE_STATE_COMING && action != MODULE_STATE_GOING)
> > +   goto out;
> 
> I think we can get the mutex here instead of above so it doesn't block
> other module actions (and then you can also get rid of the "out" label).

Sure.

Thanks,
Seth

> 
> > +
> > +   list_for_each_entry(patch, &lpc_patches, list) {
> > +   if (patch->state == LPC_DISABLED)
> > +   continue;
> > +   list_for_each_entry(obj, &patch->objs, list) {
> > +   if (strcmp(obj->name, mod->name))
> > +   continue;
> > +   if (action == MODULE_STATE_COMING) {
> > +   obj->mod = mod;
> > +   lpc_module_notify_coming(patch->mod, obj);
> > +   } else /* MODULE_STATE_GOING */
> > +   lpc_module_notify_going(patch->mod, obj);
> > +   break;
> > +   }
> > +   }
> > +out:
> > +   mutex_unlock(&lpc_mutex);
> > +   return 0;
> > +}
> 
> -- 
> Josh
> --
> To unsubscribe from this list: send the line "unsubscribe live-patching" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
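
Josh's suggestion, filtering the event before taking the mutex, can be sketched in user space with a pthread mutex standing in for lpc_mutex. The names below are hypothetical, and the counter exists only to show which events reach the lock:

```c
#include <pthread.h>

/* Hypothetical stand-ins for MODULE_STATE_* and lpc_mutex. */
enum mod_state { STATE_LIVE, STATE_COMING, STATE_GOING };

static pthread_mutex_t patch_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long lock_taken;	/* counts acquisitions, for illustration */

static int module_notify(enum mod_state action)
{
	/* Filter uninteresting events first: they never wait on
	 * patch_mutex, and no "out" label is needed. */
	if (action != STATE_COMING && action != STATE_GOING)
		return 0;

	pthread_mutex_lock(&patch_mutex);
	lock_taken++;
	/* ... walk the patch list and apply or revert objects here ... */
	pthread_mutex_unlock(&patch_mutex);
	return 0;
}
```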


v3.18-rc5 build failure on ppc64 in Documentation/mic/mpssd/mpssd.o

2014-11-19 Thread Seth Jennings
One or more of the following commits from the 3.18 merge window is causing a
build failure on ppc64 with the default config:

7b345771ba921361b318e95bf21b257c65ac141c Documentation: update include path for mpssd
8c2b0dc83d9840da4d993a5dbb15c5974ad5a188 Documentation: support glibc versions without htole macros
6ab0e475f1f38b6be90aff4ef3ebf928c4a73dc8 Documentation: fix misc. warnings
adb19fb66eeebac07fe37d968725bb8906dadb8e Documentation: add makefiles for more targets

I am building v3.18-rc5 with glibc 2.17.

The problem is that on big-endian hosts, the htole* macros are not simple
compile-time transforms done by the preprocessor. Instead, they expand to
braced-group (statement expression) code that is only allowed inside a
function, so they cannot be used in static initializers.
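
One way to keep such values in static initializers is to do the byte swap with plain arithmetic selected at compile time. This is a sketch, not necessarily how mpssd.c was fixed upstream. The CONST_* macro names are hypothetical, and the approach relies on the __BYTE_ORDER__ and __ORDER_BIG_ENDIAN__ macros predefined by GCC and Clang rather than on glibc's <endian.h>:

```c
#include <stdint.h>

/* Pure-arithmetic byte swaps: valid integer constant expressions. */
#define CONST_BSWAP16(x) ((uint16_t)((((x) & 0x00ffU) << 8) | \
				     (((x) & 0xff00U) >> 8)))
#define CONST_BSWAP32(x) ((uint32_t)((((x) & 0x000000ffUL) << 24) | \
				     (((x) & 0x0000ff00UL) << 8)  | \
				     (((x) & 0x00ff0000UL) >> 8)  | \
				     (((x) & 0xff000000UL) >> 24)))

#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define CONST_HTOLE16(x) CONST_BSWAP16(x)
#define CONST_HTOLE32(x) CONST_BSWAP32(x)
#else
#define CONST_HTOLE16(x) ((uint16_t)(x))
#define CONST_HTOLE32(x) ((uint32_t)(x))
#endif

/* File-scope initializers are accepted: no statement expression is used. */
static const uint16_t vring_num = CONST_HTOLE16(0x1234);
static const uint32_t features  = CONST_HTOLE32(0x11223344UL);
```

Inside functions, the glibc htole16()/htole32() calls can stay as they are. Only file-scope initializers, like the ones in the errors below, need constant expressions.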

Build errors:
===
  HOSTCC  Documentation/mic/mpssd/mpssd.o
In file included from /usr/include/bits/byteswap.h:34:0,
                 from /usr/include/endian.h:60,
                 from /usr/include/bits/waitstatus.h:64,
                 from /usr/include/stdlib.h:42,
                 from Documentation/mic/mpssd/mpssd.c:23:
Documentation/mic/mpssd/mpssd.c:93:10: error: braced-group within expression allowed only inside a function
   .num = htole16(MIC_VRING_ENTRIES),
  ^
Documentation/mic/mpssd/mpssd.c:96:10: error: braced-group within expression allowed only inside a function
   .num = htole16(MIC_VRING_ENTRIES),
  ^
Documentation/mic/mpssd/mpssd.c:113:10: error: braced-group within expression allowed only inside a function
   .num = htole16(MIC_VRING_ENTRIES),
  ^
Documentation/mic/mpssd/mpssd.c:116:10: error: braced-group within expression allowed only inside a function
   .num = htole16(MIC_VRING_ENTRIES),
  ^
Documentation/mic/mpssd/mpssd.c:119:3: error: initializer element is not constant
   .host_features = htole32(
   ^
Documentation/mic/mpssd/mpssd.c:119:3: error: (near initialization for ‘virtnet_dev_page.host_features’)
In file included from /usr/include/bits/byteswap.h:34:0,
                 from /usr/include/endian.h:60,
                 from /usr/include/bits/waitstatus.h:64,
                 from /usr/include/stdlib.h:42,
                 from Documentation/mic/mpssd/mpssd.c:23:
Documentation/mic/mpssd/mpssd.c:146:10: error: braced-group within expression allowed only inside a function
   .num = htole16(MIC_VRING_ENTRIES),
  ^
Documentation/mic/mpssd/mpssd.c:149:3: error: initializer element is not constant
   htole32(1

Re: [PATCHv2 2/3] kernel: add support for live patching

2014-11-19 Thread Seth Jennings
On Tue, Nov 18, 2014 at 03:45:22PM +0100, Miroslav Benes wrote:
> 
> On Sun, 16 Nov 2014, Seth Jennings wrote:
> 
> [...]
> 
> > diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> > new file mode 100644
> > index 000..8b68fef
> > --- /dev/null
> > +++ b/include/linux/livepatch.h
> > @@ -0,0 +1,68 @@
> > +/*
> > + * livepatch.h - Live Kernel Patching Core
> > + *
> > + * Copyright (C) 2014 Seth Jennings 
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef _LINUX_LIVEPATCH_H_
> > +#define _LINUX_LIVEPATCH_H_
> > +
> > +#include 
> > +
> 
> I think we need something like 
> 
> #if IS_ENABLED(CONFIG_LIVE_PATCHING)
> 
> here. Otherwise kernel module with live patch itself would be built 
> even with live patching support disabled (as the structures and needed 
> functions are declared).

What do you think of this (already includes s/lp/klp/ change)?


diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 0143b73..a9821f3 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -21,6 +21,7 @@
 #define _LINUX_LIVEPATCH_H_
 
 #include 
+#include 
 
 /* TODO: add kernel-doc for structures once agreed upon */
 
@@ -58,11 +59,20 @@ struct klp_patch {
struct klp_object *objs;
 };
 
-int klp_register_patch(struct klp_patch *);
-int klp_unregister_patch(struct klp_patch *);
-int klp_enable_patch(struct klp_patch *);
-int klp_disable_patch(struct klp_patch *);
+#ifdef CONFIG_LIVE_PATCHING
 
-#include 
+extern int klp_register_patch(struct klp_patch *);
+extern int klp_unregister_patch(struct klp_patch *);
+extern int klp_enable_patch(struct klp_patch *);
+extern int klp_disable_patch(struct klp_patch *);
+
+#else /* !CONFIG_LIVE_PATCHING */
+
+static int klp_register_patch(struct klp_patch *k) { return -ENOSYS; }
+static int klp_unregister_patch(struct klp_patch *k) { return -ENOSYS; }
+static int klp_enable_patch(struct klp_patch *k) { return -ENOSYS; }
+static int klp_disable_patch(struct klp_patch *k) { return -ENOSYS; }
+
+#endif


This seems to be the way many headers handle this.  Patch modules built
against a kernel that doesn't support live patching will build cleanly,
but will always fail to load.
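
The stubs in the diff above are plain static functions, and with -Wall GCC warns "defined but not used" in every file that includes the header without calling them. The usual kernel idiom for config-dependent stubs is static inline. The following is a minimal sketch of that pattern. It assumes CONFIG_LIVE_PATCHING is left undefined so that the stub branch is compiled, and patch_init() is a hypothetical module init function:

```c
#include <errno.h>

struct klp_patch;	/* opaque here; only pointers are passed */

#ifdef CONFIG_LIVE_PATCHING
extern int klp_register_patch(struct klp_patch *);
#else
/* static inline: no unused-function warning where it is not called */
static inline int klp_register_patch(struct klp_patch *patch)
{
	(void)patch;
	return -ENOSYS;
}
#endif

/* Shape of a patch module's init: it builds, but fails to load. */
static int patch_init(struct klp_patch *patch)
{
	return klp_register_patch(patch);
}
```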

Seth

> 
> > +/* TODO: add kernel-doc for structures once agreed upon */
> > +
> > +struct lp_func {
> > +   const char *old_name; /* function to be patched */
> > +   void *new_func; /* replacement function in patch module */
> > +   /*
> > +* The old_addr field is optional and can be used to resolve
> > +* duplicate symbol names in the vmlinux object.  If this
> > +* information is not present, the symbol is located by name
> > +* with kallsyms. If the name is not unique and old_addr is
> > +* not provided, the patch application fails as there is no
> > +* way to resolve the ambiguity.
> > +*/
> > +   unsigned long old_addr;
> > +};
> > +
> > +struct lp_reloc {
> > +   unsigned long dest;
> > +   unsigned long src;
> > +   unsigned long type;
> > +   const char *name;
> > +   int addend;
> > +   int external;
> > +};
> > +
> > +struct lp_object {
> > +   const char *name; /* "vmlinux" or module name */
> > +   struct lp_func *funcs;
> > +   struct lp_reloc *relocs;
> > +};
> > +
> > +struct lp_patch {
> > +   struct module *mod; /* module containing the patch */
> > +   struct lp_object *objs;
> > +};
> > +
> > +int lp_register_patch(struct lp_patch *);
> > +int lp_unregister_patch(struct lp_patch *);
> > +int lp_enable_patch(struct lp_patch *);
> > +int lp_disable_patch(struct lp_patch *);
> > +
> > +#include 
> 
> and #endif for CONFIG_LIVE_PATCHING here.
> 
> > +
> > +#endif /* _LINUX_LIVEPATCH_H_ */
> 
> Thanks,
> --
> Miroslav Benes
> SUSE Labs
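
The old_addr rule described in the struct lp_func comment above can be sketched as a lookup over a symbol table. This is an illustration under assumptions, not the core's implementation. Here struct sym stands in for kallsyms, the error codes are arbitrary choices, and treating a nonzero old_addr as a value that must match one of the same-named symbols is an assumption:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one kallsyms entry. */
struct sym {
	const char *name;
	unsigned long addr;
};

/*
 * Resolve old_name to an address.  If old_addr is 0 the name must be
 * unique; otherwise old_addr must equal the address of one of the
 * symbols with that name (an assumption made for this sketch).
 */
static int resolve_old_func(const struct sym *syms, size_t n,
			    const char *old_name, unsigned long old_addr,
			    unsigned long *out)
{
	size_t matches = 0;
	unsigned long found = 0;

	for (size_t i = 0; i < n; i++) {
		if (strcmp(syms[i].name, old_name))
			continue;
		if (old_addr && syms[i].addr == old_addr) {
			*out = old_addr;
			return 0;
		}
		matches++;
		found = syms[i].addr;
	}

	if (old_addr || matches == 0)
		return -ENOENT;	/* unknown name, or old_addr matched nothing */
	if (matches > 1)
		return -EINVAL;	/* ambiguous name and no old_addr to decide */
	*out = found;
	return 0;
}
```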


Re: [PATCHv2 2/3] kernel: add support for live patching

2014-11-19 Thread Seth Jennings
On Wed, Nov 19, 2014 at 04:27:39PM +0100, Miroslav Benes wrote:
> 
> Hi,
> 
> while rewriting our code I came across a few more things. See below.
> 
> On Sun, 16 Nov 2014, Seth Jennings wrote:
> 
> [...]
> 
> > +/**
> > + * module notifier
> > + */
> > +
> > +static void lpc_module_notify_coming(struct module *pmod,
> > +struct lpc_object *obj)
> > +{
> > +   struct module *mod = obj->mod;
> > +   int ret;
> > +
> > +   pr_notice("applying patch '%s' to loading module '%s'\n",
> > + mod->name, pmod->name);
> 
> This looks strange. I guess the arguments should be swapped.

Indeed, you are correct :)

> 
> > +   obj->mod = mod;
> 
> And this is redundant.

True again!

> 
> > +   ret = lpc_enable_object(pmod, obj);
> > +   if (ret)
> > +   pr_warn("failed to apply patch '%s' to module '%s' (%d)\n",
> > +   pmod->name, mod->name, ret);
> > +}
> > +
> > +static void lpc_module_notify_going(struct module *pmod,
> > +   struct lpc_object *obj)
> > +{
> > +   struct module *mod = obj->mod;
> > +   int ret;
> > +
> > +   pr_notice("reverting patch '%s' on unloading module '%s'\n",
> > + pmod->name, mod->name);
> > +   ret = lpc_disable_object(obj);
> > +   if (ret)
> > +   pr_warn("failed to revert patch '%s' on module '%s' (%d)\n",
> > +   pmod->name, mod->name, ret);
> > +   obj->mod = NULL;
> > +}
> > +
> > +static int lpc_module_notify(struct notifier_block *nb, unsigned long action,
> > +   void *data)
> > +{
> > +   struct module *mod = data;
> > +   struct lpc_patch *patch;
> > +   struct lpc_object *obj;
> > +
> > +   mutex_lock(&lpc_mutex);
> > +
> > +   if (action != MODULE_STATE_COMING && action != MODULE_STATE_GOING)
> > +   goto out;
> > +
> > +   list_for_each_entry(patch, &lpc_patches, list) {
> > +   if (patch->state == LPC_DISABLED)
> > +   continue;
> > +   list_for_each_entry(obj, &patch->objs, list) {
> > +   if (strcmp(obj->name, mod->name))
> > +   continue;
> > +   if (action == MODULE_STATE_COMING) {
> > +   obj->mod = mod;
> > +   lpc_module_notify_coming(patch->mod, obj);
> > +   } else /* MODULE_STATE_GOING */
> > +   lpc_module_notify_going(patch->mod, obj);
> > +   break;
> > +   }
> > +   }
> > +out:
> > +   mutex_unlock(&lpc_mutex);
> > +   return 0;
> > +}
> 
> [...]
> 
> > +static struct lpc_object *lpc_create_object(struct kobject *root,
> > +   struct lp_object *userobj)
> > +{
> > +   struct lpc_object *obj;
> > +   int ret;
> > +
> > +   /* alloc */
> > +   obj = kzalloc(sizeof(*obj), GFP_KERNEL);
> > +   if (!obj)
> > +   return NULL;
> > +
> > +   /* init */
> > +   INIT_LIST_HEAD(&obj->list);
> > +   obj->name = userobj->name;
> > +   obj->relocs = userobj->relocs;
> > +   obj->state = LPC_DISABLED;
> > +   /* obj->mod set by lpc_object_module_get() */
> > +   INIT_LIST_HEAD(&obj->funcs);
> 
> There is nothing like lpc_object_module_get() in the code. Did you mean 
> lpc_find_object_module()?

Yes, this comment should be removed or updated.

Thanks,
Seth

> 
> Thank you,
> --
> Miroslav Benes
> SUSE Labs


Re: [PATCH] mm: frontswap: invalidate expired data on a dup-store failure

2014-11-19 Thread Seth Jennings
On Wed, Nov 19, 2014 at 09:06:41PM +0800, Weijie Yang wrote:
> On Wed, Nov 19, 2014 at 6:29 AM, Seth Jennings wrote:
> > On Tue, Nov 18, 2014 at 04:51:36PM +0800, Weijie Yang wrote:
> >> If a frontswap dup-store failed, it should invalidate the expired page
> >> in the backend, or it could trigger some data corruption issue.
> >> Such as:
> >> 1. use zswap as the frontswap backend with writeback feature
> >> 2. store a swap page(version_1) to entry A, success
> >> 3. dup-store a newer page(version_2) to the same entry A, fail
> >> 4. use __swap_writepage() write version_2 page to swapfile, success
> >> 5. zswap shrinks and writes back the version_1 page to the swapfile
> >> 6. the version_2 page is overwritten by version_1: data corruption.
> >
> > Good catch!
> >
> >>
> >> This patch fixes this issue by invalidating expired data immediately
> >> when a dup-store fails.
> >>
> >> Signed-off-by: Weijie Yang 
> >> ---
> >>  mm/frontswap.c |4 +++-
> >>  1 files changed, 3 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/mm/frontswap.c b/mm/frontswap.c
> >> index c30eec5..f2a3571 100644
> >> --- a/mm/frontswap.c
> >> +++ b/mm/frontswap.c
> >> @@ -244,8 +244,10 @@ int __frontswap_store(struct page *page)
> >> the (older) page from frontswap
> >>*/
> >>   inc_frontswap_failed_stores();
> >> - if (dup)
> >> + if (dup) {
> >>   __frontswap_clear(sis, offset);
> >> + frontswap_ops->invalidate_page(type, offset);
> >
> > Looking at __frontswap_invalidate_page(), should we do
> > inc_frontswap_invalidates() too?  If so, maybe we should just call
> > __frontswap_invalidate_page().
> 
> frontswap_invalidate_page() is meant for swap_entry_free(), while this
> is an internal operation for dup-store, so I think there is no need for
> inc_frontswap_invalidates().

On reflection, I agree we shouldn't call __frontswap_invalidate_page(),
just to keep things separate.

Andrew has already pulled it in, and it isn't a big deal: it only affects
a statistic, counting a rare event (dup) along with a very frequent one
(normal invalidate).  That makes me think we may want to count
dup-invalidates as a separate stat.  But that would be a separate patch
too :)

Thanks,
Seth

> 
> > Thanks,
> > Seth
> >
> >> + }
> >>   }
> >>   if (frontswap_writethrough_enabled)
> >>   /* report failure so swap also writes to swap device */
> >> --
> >> 1.7.0.4
> >>
> >>
> >> --
> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> the body to majord...@kvack.org.  For more info on Linux MM,
> >> see: http://www.linux-mm.org/ .
> >> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 


Re: [PATCH] mm: frontswap: invalidate expired data on a dup-store failure

2014-11-18 Thread Seth Jennings
On Tue, Nov 18, 2014 at 04:51:36PM +0800, Weijie Yang wrote:
> If a frontswap dup-store failed, it should invalidate the expired page
> in the backend, or it could trigger some data corruption issue.
> Such as:
> 1. use zswap as the frontswap backend with writeback feature
> 2. store a swap page(version_1) to entry A, success
> 3. dup-store a newer page(version_2) to the same entry A, fail
> 4. use __swap_writepage() write version_2 page to swapfile, success
> 5. zswap shrinks and writes back the version_1 page to the swapfile
> 6. the version_2 page is overwritten by version_1: data corruption.

Good catch!

> 
> This patch fixes this issue by invalidating expired data immediately
> when a dup-store fails.
> 
> Signed-off-by: Weijie Yang 
> ---
>  mm/frontswap.c |4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/frontswap.c b/mm/frontswap.c
> index c30eec5..f2a3571 100644
> --- a/mm/frontswap.c
> +++ b/mm/frontswap.c
> @@ -244,8 +244,10 @@ int __frontswap_store(struct page *page)
> the (older) page from frontswap
>*/
>   inc_frontswap_failed_stores();
> - if (dup)
> + if (dup) {
>   __frontswap_clear(sis, offset);
> + frontswap_ops->invalidate_page(type, offset);

Looking at __frontswap_invalidate_page(), should we do
inc_frontswap_invalidates() too?  If so, maybe we should just call
__frontswap_invalidate_page().

Thanks,
Seth

> + }
>   }
>   if (frontswap_writethrough_enabled)
>   /* report failure so swap also writes to swap device */
> -- 
> 1.7.0.4
> 
> 
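
The six-step scenario above can be reduced to a toy model in which each swap slot holds a version number instead of page data. The struct and functions are hypothetical. Passing invalidate=true corresponds to the frontswap_ops->invalidate_page() call added by the patch:

```c
#include <stdbool.h>

/* One swap slot, modeled with version numbers instead of page data. */
struct slot {
	int cached;	/* version held by the zswap-like backend, 0 = none */
	int on_disk;	/* version in the swapfile, 0 = none */
};

/* Step 2: a successful store into the backend. */
static void store_ok(struct slot *s, int v)
{
	s->cached = v;
}

/* Steps 3-4: a dup-store of version v fails, so swap writes v to disk. */
static void dup_store_fails(struct slot *s, int v, bool invalidate)
{
	if (invalidate)
		s->cached = 0;	/* the fix: drop the stale backend copy */
	s->on_disk = v;		/* the __swap_writepage() path */
}

/* Step 5: the backend shrinks and writes back whatever it still holds. */
static void writeback(struct slot *s)
{
	if (s->cached) {
		s->on_disk = s->cached;
		s->cached = 0;
	}
}
```

Without invalidation, the writeback in step 5 puts version 1 on disk over version 2. With it, version 2 survives.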


Re: [PATCH 1/1] mm/zswap: Deletion of an unnecessary check before the function call "free_percpu"

2014-11-18 Thread Seth Jennings
On Mon, Nov 17, 2014 at 06:40:18PM +0100, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Mon, 17 Nov 2014 18:33:33 +0100
> 
> The free_percpu() function tests whether its argument is NULL and then
> returns immediately. Thus the test around the call is not needed.
> 
> This issue was detected by using the Coccinelle software.

Thanks for the cleanup!

Acked-by: Seth Jennings 

> 
> Signed-off-by: Markus Elfring 
> ---
>  mm/zswap.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index ea064c1..35629f0 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -152,8 +152,7 @@ static int __init zswap_comp_init(void)
>  static void zswap_comp_exit(void)
>  {
>   /* free percpu transforms */
> - if (zswap_comp_pcpu_tfms)
> - free_percpu(zswap_comp_pcpu_tfms);
> + free_percpu(zswap_comp_pcpu_tfms);
>  }
>  
>  /*
> -- 
> 2.1.3
> 


Re: [PATCHv2 0/3] Kernel Live Patching

2014-11-18 Thread Seth Jennings
On Tue, Nov 18, 2014 at 03:23:49PM +0100, Jiri Slaby wrote:
> On 11/17/2014, 03:54 PM, Seth Jennings wrote:
> > On Mon, Nov 17, 2014 at 02:33:02PM +0900, Masami Hiramatsu wrote:
> >> Hmm, btw, "LP" and "LPC" remind me of line printers and the LPC bus :(
> >> Can we use LKP (Live Kernel Patching) or KLP (Kernel Live Patching)
> >> instead?
> > 
> > Jiri S also mentioned this so I guess it is a common sentiment :)  He
> > suggested "lip" but I think I like "klp" better?  Jiri S sound good?
> 
> Definitely, I like both lkp and klp.

Great.  I'll make the change.

Thanks,
Seth

> 
> -- 
> js
> suse labs
> --
> To unsubscribe from this list: send the line "unsubscribe live-patching" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 2/3] kernel: add support for live patching

2014-11-18 Thread Seth Jennings
On Tue, Nov 18, 2014 at 03:11:43PM +0100, Miroslav Benes wrote:
> 
> Hi,
> 
> thank you for the revision. I'll rebase our patches on top of that. Anyway 
> there is a small bug in a header file. See below.
> 
> On Sun, 16 Nov 2014, Seth Jennings wrote:
> 
> [...]
> 
> > diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h
> > new file mode 100644
> > index 000..c5fab45
> > --- /dev/null
> > +++ b/arch/x86/include/asm/livepatch.h
> > @@ -0,0 +1,38 @@
> > +/*
> > + * livepatch.h - x86-specific Live Kernel Patching Core
> > + *
> > + * Copyright (C) 2014 Seth Jennings 
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef _ASM_X86_LIVEPATCH_H
> > +#define _ASM_X86_LIVEPATCH_H
> > +
> > +#include 
> > +
> > +#ifdef CONFIG_LIVE_PATCHING
> > +extern int lpc_write_module_reloc(struct module *mod, unsigned long type,
> > +   unsigned long loc, unsigned long value);
> > +
> > +#else
> > +static int lpc_write_module_reloc(struct module *mod, unsigned long type,
> > +   unsigned long loc, unsigned long value);
> > +{
> > + pr_err("Kernel does not support live patching\n");
> > + return -ENOSYS;
> > +}
> > +#endif
> > +
> > +#endif /* _ASM_X86_LIVEPATCH_H */
> 
> There should not be a semicolon at the end of the function header in the
> #else branch.

Good catch.

Also, I'm adding "build with LIVE_PATCHING=n" as a test step :-/
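For reference, a userspace sketch of the #else stub with the stray semicolon removed. `pr_err()` is replaced by `fprintf()` so it builds outside the kernel, and making the stub `static inline` is a suggestion (a common kernel idiom for header stubs, avoiding "defined but not used" warnings), not something taken from the patch. Leaving `CONFIG_LIVE_PATCHING` undefined exercises the LIVE_PATCHING=n build.

```c
#include <errno.h>
#include <stdio.h>

struct module;  /* opaque here; only a pointer is passed */

/* CONFIG_LIVE_PATCHING is deliberately left undefined to exercise the stub. */
#ifdef CONFIG_LIVE_PATCHING
extern int lpc_write_module_reloc(struct module *mod, unsigned long type,
				  unsigned long loc, unsigned long value);
#else
/* No semicolon after the parameter list: this is a definition, not a prototype. */
static inline int lpc_write_module_reloc(struct module *mod, unsigned long type,
					 unsigned long loc, unsigned long value)
{
	(void)mod; (void)type; (void)loc; (void)value;
	fprintf(stderr, "Kernel does not support live patching\n");
	return -ENOSYS;
}
#endif
```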

Thanks,
Seth

> 
> Regards,
> 
> ---
> Miroslav Benes
> SUSE Labs


Re: [PATCHv2 2/3] kernel: add support for live patching

2014-11-17 Thread Seth Jennings
On Mon, Nov 17, 2014 at 10:45:58AM -0800, Greg KH wrote:
> On Sun, Nov 16, 2014 at 07:29:23PM -0600, Seth Jennings wrote:
> > +#ifdef CONFIG_X86_32
> > +int lpc_write_module_reloc(struct module *mod, unsigned long type,
> > +  unsigned long loc, unsigned long value)
> > +{
> > +   pr_err("Live patching not supported on 32-bit x86\n");
> > +   return -ENOSYS;
> > +}
> 
> Why not just prevent the code from being built on x86_32 instead of
> putting this in the file?

Yep. Masami saw this too and recommended an ARCH_HAVE_LIVE_PATCHING flag
set by the archs that support it.  I'll make the change.
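In Kconfig this kind of gating is normally a `depends on` against a symbol that each architecture selects, so x86_32 would never build arch/x86/kernel/livepatch.c at all. The C preprocessor sketch below only models that dependency; using `__x86_64__` as the arch opting in, and the exact symbol names, are assumptions for illustration.

```c
/* Hypothetical model of the Kconfig dependency: an arch opts in by defining
 * ARCH_HAVE_LIVE_PATCHING, and LIVE_PATCHING may only be enabled on top of it. */
#if defined(__x86_64__)
#define ARCH_HAVE_LIVE_PATCHING 1
#endif

#if defined(CONFIG_LIVE_PATCHING) && !defined(ARCH_HAVE_LIVE_PATCHING)
#error "LIVE_PATCHING selected on an architecture without ARCH_HAVE_LIVE_PATCHING"
#endif

/* Reports whether this build's architecture advertises live patching support. */
static int arch_has_live_patching(void)
{
#ifdef ARCH_HAVE_LIVE_PATCHING
	return 1;
#else
	return 0;
#endif
}
```

With the check at configuration time, the 32-bit `-ENOSYS` stub becomes unnecessary.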

Thanks for the review!

Seth

> 
> thanks,
> 
> greg k-h


Re: [PATCHv2 0/3] Kernel Live Patching

2014-11-17 Thread Seth Jennings
On Mon, Nov 17, 2014 at 02:33:02PM +0900, Masami Hiramatsu wrote:
> Hi Seth,
> 
> (2014/11/17 10:29), Seth Jennings wrote:
> > Changelog:
> > 
> > Thanks for all the feedback!
> > 
> > changes in v2:
> > - rebase to next-20141113
> > - add copyright/license block to livepatch.h
> > - add _LINUX prefix to header defines
> > - replace semaphore with mutex
> > - add LPC_ prefix to state enum
> > - convert BUGs to WARNs and handle properly
> > - change Kconfig default to n
> > - remove [old|new] attrs from function sysfs dir (KASLR leak, no use)
> > - disregard user provided old_addr if kernel uses KASLR
> > - s/out/err for error path labels
> > - s/unregister/disable for uniform terminology
> > - s/lp/lpc for module notifier elements
> 
> Hmm, btw, "LP" and "LPC" remind me of line printers and the LPC bus :(
> Can we use LKP (Live Kernel Patching) or KLP (Kernel Live Patching) instead?

Jiri S also mentioned this so I guess it is a common sentiment :)  He
suggested "lip" but I think I like "klp" better?  Jiri S sound good?

> 
> > - replace module ref'ing with unload notifier + mutex protection
> > - adjust notifier priority to run before ftrace
> > - make LIVE_PATCHING boolean (about to depend on arch stuff)
> 
> To handle x86-32 better, we should introduce ARCH_HAVE_LIVE_PATCHING and
> avoid enabling LIVE_PATCHING on x86_32; then we can simplify
> arch/x86/kernel/livepatch.c.

Will do.

Thanks for the review!

Seth

> 
> Thank you,
> 
> > - move x86-specific reloc code to arch/x86
> > - s/dynrela/reloc/
> > - add live patching sysfs documentation
> > - add API function kernel-doc
> > - TODO: kernel-doc for API structs once agreed upon
> > 
> > Summary:
> > 
> > This patchset implements an ftrace-based mechanism and kernel interface for
> > doing live patching of kernel and kernel module functions.  It represents the
> > greatest common functionality set between kpatch [1] and kGraft [2] and can
> > accept patches built using either method.  This solution was discussed in the
> > Live Patching Mini-conference at LPC 2014 [3].
> > 
> > The model consists of a live patching "core" that provides an interface for
> > other "patch" kernel modules to register patches with the core.
> > 
> > Patch modules contain the new function code and create an lp_patch structure
> > containing the required data about what functions to patch, where the new code
> > for each patched function resides, and in which kernel object (vmlinux or
> > module) the function to be patched resides.  The patch module then invokes the
> > lp_register_patch() function to register with the core, then lp_enable_patch()
> > to have the core redirect the execution paths using ftrace.
> > 
> > An example patch module can be found here:
> > https://github.com/spartacus06/livepatch/blob/master/patch/patch.c
> > 
> > The live patching core creates a sysfs hierarchy for user-level access to live
> > patching information.  The hierarchy is structured like this:
> > 
> > /sys/kernel/livepatch
> > /sys/kernel/livepatch/
> > /sys/kernel/livepatch//enabled
> > /sys/kernel/livepatch//
> > /sys/kernel/livepatch///
> > 
> > The old function is located using one of two methods: it is either provided by
> > the patch module (only possible for a function in vmlinux) or found via
> > kallsyms lookup.  Symbol ambiguity results in a failure.
> > 
> > The core takes a reference on the patch module itself to keep it from
> > unloading.  This is because, without a mechanism to ensure that no thread is
> > currently executing in the patched function, we cannot determine whether it is
> > safe to unload the patch module.  For this reason, unloading patch modules is
> > currently not allowed.
> > Disabling patches can be done using the "enabled" attribute of the patch:
> > 
> > echo 0 > /sys/kernel/livepatch//enabled
> > 
> > If a patch module contains a patch for a module that is not currently loaded,
> > there is nothing to patch so the core does nothing for that patch object.
> > However, the core registers a module notifier that looks for COMING events so
> > that if the module is ever loaded, it is immediately patched.  If a module with
> > patch code is removed, the notifier looks for GOING events and disables any
> > patched functions for that object before it unloads.

[PATCHv2 1/3] kernel: add TAINT_LIVEPATCH

2014-11-16 Thread Seth Jennings
This adds a new taint flag to indicate when the kernel or a kernel
module has been live patched.  This will provide a clean indication in
bug reports that live patching was used.

Additionally, if the crash occurs in a live patched function, the live
patch module will appear beside the patched function in the backtrace.
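The documentation values below follow from the bit number: TAINT_LIVEPATCH is bit 15, so its kernel.tainted sysctl value is 1 << 15 = 32768, and it prints as 'K'. A minimal userspace sketch of that mapping, using a hypothetical two-entry table modelled on `tnts[]` in kernel/panic.c (the struct and function names here are illustrative):

```c
#include <string.h>

#define TAINT_SOFTLOCKUP 14
#define TAINT_LIVEPATCH  15  /* bit number added by this patch */

struct toy_tnt { int bit; char true_char; char false_char; };

/* Trimmed-down table modelled on kernel/panic.c; only two entries for brevity. */
static const struct toy_tnt toy_tnts[] = {
	{ TAINT_SOFTLOCKUP, 'L', ' ' },
	{ TAINT_LIVEPATCH,  'K', ' ' },
};

/* Writes one flag character per table entry for the given taint mask. */
static void toy_taint_chars(unsigned long mask, char *out)
{
	size_t i, n = sizeof(toy_tnts) / sizeof(toy_tnts[0]);

	for (i = 0; i < n; i++)
		out[i] = (mask & (1UL << toy_tnts[i].bit)) ? toy_tnts[i].true_char
							    : toy_tnts[i].false_char;
	out[n] = '\0';
}
```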

Signed-off-by: Seth Jennings 
---
 Documentation/oops-tracing.txt  | 2 ++
 Documentation/sysctl/kernel.txt | 1 +
 include/linux/kernel.h          | 1 +
 kernel/panic.c                  | 2 ++
 4 files changed, 6 insertions(+)

diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt
index beefb9f..f3ac05c 100644
--- a/Documentation/oops-tracing.txt
+++ b/Documentation/oops-tracing.txt
@@ -270,6 +270,8 @@ characters, each representing a particular tainted value.
 
  15: 'L' if a soft lockup has previously occurred on the system.
 
+ 16: 'K' if the kernel has been live patched.
+
 The primary reason for the 'Tainted: ' string is to tell kernel
 debuggers if this is a clean kernel or if anything unusual has
 occurred.  Tainting is permanent: even if an offending module is
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 75511ef..83ab256 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -843,6 +843,7 @@ can be ORed together:
 8192 - An unsigned module has been loaded in a kernel supporting module
signature.
 16384 - A soft lockup has previously occurred on the system.
+32768 - The kernel has been live patched.
 
 ==
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5449d2f..d03e3de 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -471,6 +471,7 @@ extern enum system_states {
 #define TAINT_OOT_MODULE   12
 #define TAINT_UNSIGNED_MODULE  13
 #define TAINT_SOFTLOCKUP   14
+#define TAINT_LIVEPATCH    15
 
 extern const char hex_asc[];
 #define hex_asc_lo(x)  hex_asc[((x) & 0x0f)]
diff --git a/kernel/panic.c b/kernel/panic.c
index 4d8d6f9..8136ad7 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -226,6 +226,7 @@ static const struct tnt tnts[] = {
    { TAINT_OOT_MODULE,      'O', ' ' },
    { TAINT_UNSIGNED_MODULE, 'E', ' ' },
    { TAINT_SOFTLOCKUP,      'L', ' ' },
+   { TAINT_LIVEPATCH,       'K', ' ' },
 };
 
 /**
@@ -246,6 +247,7 @@ static const struct tnt tnts[] = {
  *  'O' - Out-of-tree module has been loaded.
  *  'E' - Unsigned module has been loaded.
  *  'L' - A soft lockup has previously occurred.
+ *  'K' - Kernel has been live patched.
  *
  * The string is overwritten by the next call to print_tainted().
  */
-- 
1.9.3



[PATCHv2 3/3] kernel: add sysfs documentation for live patching

2014-11-16 Thread Seth Jennings
Adds sysfs interface documentation to Documentation/ABI/testing/
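The `enabled` attribute documented in this patch takes 0 (disable) or 1 (re-enable). A minimal userspace sketch of how a store handler might validate that input, assuming `strtoul()` in place of a kernel helper such as `kstrtoul()`; the handler name, the error codes, and the accepted trailing newline (as written by `echo`) are illustrative choices, not the patch's code.

```c
#include <errno.h>
#include <stdlib.h>

static int toy_patch_enabled;

/* Hypothetical store handler: accepts "0" or "1", optionally newline-terminated. */
static long toy_enabled_store(const char *buf, size_t count)
{
	char *end;
	unsigned long val = strtoul(buf, &end, 10);

	if (end == buf || (*end != '\0' && *end != '\n'))
		return -EINVAL;          /* not a plain decimal number */
	if (val > 1)
		return -EINVAL;          /* only 0 (disable) and 1 (enable) are meaningful */
	toy_patch_enabled = (int)val;
	return (long)count;              /* sysfs stores return the bytes consumed */
}
```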

Signed-off-by: Seth Jennings 
---
 Documentation/ABI/testing/sysfs-kernel-livepatch | 44 
 MAINTAINERS                                      |  1 +
 2 files changed, 45 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-livepatch

diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch
new file mode 100644
index 000..5bf42a8
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-livepatch
@@ -0,0 +1,44 @@
+What:  /sys/kernel/livepatch
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   Interface for kernel live patching
+
+   The /sys/kernel/livepatch directory contains subdirectories for
+   each loaded live patch module.
+
+What:  /sys/kernel/livepatch/
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The patch directory contains subdirectories for each kernel
+   object (vmlinux or a module) in which it patched functions.
+
+What:  /sys/kernel/livepatch//enabled
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   A writable attribute that indicates whether the patched
+   code is currently applied.  Writing 0 will disable the patch
+   while writing 1 will re-enable the patch.
+
+What:  /sys/kernel/livepatch//
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The object directory contains subdirectories for each function
+   that is patched within the object.
+
+What:  /sys/kernel/livepatch///
+Date:  Nov 2014
+KernelVersion: 3.19.0
+Contact:   live-patch...@vger.kernel.org
+Description:
+   The function directory contains attributes regarding the
+   properties and state of the patched function.
+
+   There are currently no such attributes.
diff --git a/MAINTAINERS b/MAINTAINERS
index c7f49ae..7985293 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5725,6 +5725,7 @@ F:kernel/livepatch/
 F: include/linux/livepatch.h
 F: arch/x86/include/asm/livepatch.h
 F: arch/x86/kernel/livepatch.c
+F: Documentation/ABI/testing/sysfs-kernel-livepatch
 L: live-patch...@vger.kernel.org
 
 LLC (802.2)
-- 
1.9.3



[PATCHv2 0/3] Kernel Live Patching

2014-11-16 Thread Seth Jennings
Changelog:

Thanks for all the feedback!

changes in v2:
- rebase to next-20141113
- add copyright/license block to livepatch.h
- add _LINUX prefix to header defines
- replace semaphore with mutex
- add LPC_ prefix to state enum
- convert BUGs to WARNs and handle properly
- change Kconfig default to n
- remove [old|new] attrs from function sysfs dir (KASLR leak, no use)
- disregard user provided old_addr if kernel uses KASLR
- s/out/err for error path labels
- s/unregister/disable for uniform terminology
- s/lp/lpc for module notifier elements
- replace module ref'ing with unload notifier + mutex protection
- adjust notifier priority to run before ftrace
- make LIVE_PATCHING boolean (about to depend on arch stuff)
- move x86-specific reloc code to arch/x86
- s/dynrela/reloc/
- add live patching sysfs documentation
- add API function kernel-doc
- TODO: kernel-doc for API structs once agreed upon

Summary:

This patchset implements an ftrace-based mechanism and kernel interface for
doing live patching of kernel and kernel module functions.  It represents the
greatest common functionality set between kpatch [1] and kGraft [2] and can
accept patches built using either method.  This solution was discussed in the
Live Patching Mini-conference at LPC 2014 [3].

The model consists of a live patching "core" that provides an interface for
other "patch" kernel modules to register patches with the core.

Patch modules contain the new function code and create an lp_patch structure
containing the required data about what functions to patch, where the new code
for each patched function resides, and in which kernel object (vmlinux or
module) the function to be patched resides.  The patch module then invokes the
lp_register_patch() function to register with the core, then lp_enable_patch()
to have the core redirect the execution paths using ftrace.
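The register-then-enable flow can be sketched with a patch -> object -> function hierarchy. The structure and field names below are hypothetical (they are not the proposed lp_patch API), and the "enable" step only counts the functions the core would redirect instead of installing ftrace handlers.

```c
#include <stddef.h>

/* Hypothetical shapes; the funcs array ends with a NULL old_name,
 * the objs array ends with a NULL funcs pointer. */
struct toy_func {
	const char *old_name;     /* function being replaced */
	void (*new_func)(void);   /* replacement code in the patch module */
};

struct toy_object {
	const char *name;         /* NULL for vmlinux, else the module name */
	struct toy_func *funcs;
};

struct toy_patch {
	struct toy_object *objs;
	int registered, enabled;
};

static int toy_register_patch(struct toy_patch *p)
{
	p->registered = 1;
	return 0;
}

/* Returns how many functions would be redirected; a real core would resolve
 * each old function and register an ftrace handler for it here. */
static int toy_enable_patch(struct toy_patch *p)
{
	int n = 0;
	struct toy_object *obj;
	struct toy_func *f;

	if (!p->registered)
		return -1;                /* must register before enabling */
	for (obj = p->objs; obj->funcs; obj++)
		for (f = obj->funcs; f->old_name; f++)
			n++;
	p->enabled = 1;
	return n;
}
```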

An example patch module can be found here:
https://github.com/spartacus06/livepatch/blob/master/patch/patch.c

The live patching core creates a sysfs hierarchy for user-level access to live
patching information.  The hierarchy is structured like this:

/sys/kernel/livepatch
/sys/kernel/livepatch/
/sys/kernel/livepatch//enabled
/sys/kernel/livepatch//
/sys/kernel/livepatch///

The old function is located using one of two methods: it is either provided by
the patch module (only possible for a function in vmlinux) or found via kallsyms lookup.
Symbol ambiguity results in a failure.
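A minimal sketch of the "ambiguity is a failure" rule, assuming a hypothetical kallsyms-like table of (object, symbol, address) triples. Two static functions with the same name in one object are the typical ambiguous case. The error codes are illustrative choices.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical kallsyms-like table entry. */
struct toy_sym { const char *obj; const char *name; unsigned long addr; };

/* Finds 'name' in 'obj'.  Returns 0 on a unique match, -ENOENT if missing,
 * -EINVAL if ambiguous; *addr is only meaningful when 0 is returned. */
static int toy_lookup(const struct toy_sym *tab, size_t n, const char *obj,
		      const char *name, unsigned long *addr)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++) {
		if (strcmp(tab[i].obj, obj) || strcmp(tab[i].name, name))
			continue;
		*addr = tab[i].addr;
		hits++;
	}
	if (hits == 0)
		return -ENOENT;
	return hits == 1 ? 0 : -EINVAL;
}
```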

The core takes a reference on the patch module itself to keep it from
unloading.  This is because, without a mechanism to ensure that no thread is
currently executing in the patched function, we cannot determine whether it is
safe to unload the patch module.  For this reason, unloading patch modules is
currently not allowed.

Disabling patches can be done using the "enabled" attribute of the patch:

echo 0 > /sys/kernel/livepatch//enabled
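The same write can be done from a program. A minimal sketch using only stdio; the caller supplies the full attribute path, and `set_patch_enabled()` is a hypothetical helper name. Errors from a sysfs store handler typically surface when the buffered data is flushed, so the result of `fclose()` is checked.

```c
#include <stdio.h>

/* Writes "0\n" or "1\n" to the given attribute path, like echo does.
 * Returns 0 on success, -1 on any open/write/close failure. */
static int set_patch_enabled(const char *path, int on)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", on ? 1 : 0) < 0)
		ret = -1;
	if (fclose(f) != 0)   /* flush happens here; store errors can show up now */
		ret = -1;
	return ret;
}
```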

If a patch module contains a patch for a module that is not currently loaded,
there is nothing to patch so the core does nothing for that patch object.
However, the core registers a module notifier that looks for COMING events so
that if the module is ever loaded, it is immediately patched.  If a module with
patch code is removed, the notifier looks for GOING events and disables any
patched functions for that object before it unloads.  The notifier has a higher
priority than that of the ftrace notifier so that it runs before the ftrace
notifier for GOING events and we can cleanly unregister from ftrace.
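Why a higher priority helps: notifier chains call higher-priority callbacks first, so the live patching GOING handler can unregister its ftrace hooks before ftrace's own GOING handler runs. A minimal model of a priority-ordered chain; the `toy_*` names and the priority values 1 and 0 are assumptions for illustration, not the values in the patch.

```c
#define TOY_MAX_NB 4

struct toy_nb { int priority; void (*call)(int event); };

static struct toy_nb *toy_chain[TOY_MAX_NB];
static int toy_chain_len;

/* Insert keeping the chain sorted by descending priority; equal priorities
 * keep registration order. */
static int toy_register(struct toy_nb *nb)
{
	int i;

	if (toy_chain_len == TOY_MAX_NB)
		return -1;
	for (i = toy_chain_len; i > 0 && toy_chain[i - 1]->priority < nb->priority; i--)
		toy_chain[i] = toy_chain[i - 1];
	toy_chain[i] = nb;
	toy_chain_len++;
	return 0;
}

static void toy_call_chain(int event)
{
	int i;

	for (i = 0; i < toy_chain_len; i++)
		toy_chain[i]->call(event);
}

/* Recorders so the call order can be observed. */
enum { TOY_MODULE_GOING = 2 };
static char toy_order[TOY_MAX_NB + 1];
static int toy_order_len;
static void toy_ftrace_cb(int ev) { (void)ev; toy_order[toy_order_len++] = 'F'; }
static void toy_livepatch_cb(int ev) { (void)ev; toy_order[toy_order_len++] = 'L'; }
static struct toy_nb toy_ftrace_nb = { 0, toy_ftrace_cb };
static struct toy_nb toy_livepatch_nb = { 1, toy_livepatch_cb };
```

Even if ftrace registers first, the higher-priority live patching handler is placed ahead of it and runs first on GOING.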

kpatch and kGraft each have their own mechanisms for ensuring system
consistency during the patching process. This first version does not implement
any consistency mechanism that ensures that old and new code do not run
together.  In practice, ~90% of CVEs are safe to apply in this way, since they
simply add a conditional check.  However, any function change that cannot
execute safely with the old version of the function can _not_ be safely applied
for now.

[1] https://github.com/dynup/kpatch
[2] https://git.kernel.org/cgit/linux/kernel/git/jirislaby/kgraft.git/
[3] https://etherpad.fr/p/LPC2014_LivePatching

Seth Jennings (3):
  kernel: add TAINT_LIVEPATCH
  kernel: add support for live patching
  kernel: add sysfs documentation for live patching

 Documentation/ABI/testing/sysfs-kernel-livepatch |  44 +
 Documentation/oops-tracing.txt                   |   2 +
 Documentation/sysctl/kernel.txt                  |   1 +
 MAINTAINERS                                      |  13 +
 arch/x86/Kconfig                                 |   2 +
 arch/x86/include/asm/livepatch.h                 |  38 +
 arch/x86/kernel/Makefile                         |   1 +
 arch/x86/kernel/livepatch.c                      |  83 ++
 include/linux/kernel.h                           |   1 +
 include/linux/livepatch.h                        |  68 ++
 kernel/Makefile                                  |   1 +
 kernel/livepatch/Kconfig 

[PATCHv2 2/3] kernel: add support for live patching

2014-11-16 Thread Seth Jennings
This commit introduces code for the live patching core.  It implements
an ftrace-based mechanism and kernel interface for doing live patching
of kernel and kernel module functions.

It represents the greatest common functionality set between kpatch and
kGraft and can accept patches built using either method.

This first version does not implement any consistency mechanism that
ensures that old and new code do not run together.  In practice, ~90% of
CVEs are safe to apply in this way, since they simply add a conditional
check.  However, any function change that cannot execute safely with
the old version of the function can _not_ be safely applied in this
version.

Signed-off-by: Seth Jennings 
---
 MAINTAINERS                      |  12 +
 arch/x86/Kconfig                 |   2 +
 arch/x86/include/asm/livepatch.h |  38 ++
 arch/x86/kernel/Makefile         |   1 +
 arch/x86/kernel/livepatch.c      |  83
 include/linux/livepatch.h        |  68 +++
 kernel/Makefile                  |   1 +
 kernel/livepatch/Kconfig         |   9 +
 kernel/livepatch/Makefile        |   3 +
 kernel/livepatch/core.c          | 999 +++
 10 files changed, 1216 insertions(+)
 create mode 100644 arch/x86/include/asm/livepatch.h
 create mode 100644 arch/x86/kernel/livepatch.c
 create mode 100644 include/linux/livepatch.h
 create mode 100644 kernel/livepatch/Kconfig
 create mode 100644 kernel/livepatch/Makefile
 create mode 100644 kernel/livepatch/core.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4861577..c7f49ae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5715,6 +5715,18 @@ F:   Documentation/misc-devices/lis3lv02d
 F: drivers/misc/lis3lv02d/
 F: drivers/platform/x86/hp_accel.c
 
+LIVE PATCHING
+M: Josh Poimboeuf 
+M: Seth Jennings 
+M: Jiri Kosina 
+M: Vojtech Pavlik 
+S: Maintained
+F: kernel/livepatch/
+F: include/linux/livepatch.h
+F: arch/x86/include/asm/livepatch.h
+F: arch/x86/kernel/livepatch.c
+L: live-patch...@vger.kernel.org
+
 LLC (802.2)
 M: Arnaldo Carvalho de Melo 
 S: Maintained
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ec21dfd..0cf27e8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1991,6 +1991,8 @@ config CMDLINE_OVERRIDE
  This is used to work around broken boot loaders.  This should
  be set to 'N' under normal conditions.
 
+source "kernel/livepatch/Kconfig"
+
 endmenu
 
 config ARCH_ENABLE_MEMORY_HOTPLUG
diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h
new file mode 100644
index 000..c5fab45
--- /dev/null
+++ b/arch/x86/include/asm/livepatch.h
@@ -0,0 +1,38 @@
+/*
+ * livepatch.h - x86-specific Live Kernel Patching Core
+ *
+ * Copyright (C) 2014 Seth Jennings 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _ASM_X86_LIVEPATCH_H
+#define _ASM_X86_LIVEPATCH_H
+
+#include 
+
+#ifdef CONFIG_LIVE_PATCHING
+extern int lpc_write_module_reloc(struct module *mod, unsigned long type,
+ unsigned long loc, unsigned long value);
+
+#else
+static int lpc_write_module_reloc(struct module *mod, unsigned long type,
+ unsigned long loc, unsigned long value);
+{
+   pr_err("Kernel does not support live patching\n");
+   return -ENOSYS;
+}
+#endif
+
+#endif /* _ASM_X86_LIVEPATCH_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 5d4502c..316b34e 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -63,6 +63,7 @@ obj-$(CONFIG_X86_MPPARSE) += mpparse.o
 obj-y  += apic/
 obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o
 obj-$(CONFIG_DYNAMIC_FTRACE)   += ftrace.o
+obj-$(CONFIG_LIVE_PATCHING)    += livepatch.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
 obj-$(CONFIG_FTRACE_SYSCALLS)  += ftrace.o
 obj-$(CONFIG_X86_TSC)  += trace_clock.o
diff --git a/arch/x86/kernel/livepatch.c b/arch/x86/kernel/livepatch.c
new file mode 100644
index 000..56813ae
--- /dev/null
+++ b/arch/x86/kernel/livepatch.c
@@ -0,0 +1,83 @@
+/*
+ * livepatch.c - x86-specific Live Kernel Patching Core
+ *
+ * Copyright (C) 2014 Seth Jennings 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public 

Re: [PATCH] zsmalloc: correct fragile [kmap|kunmap]_atomic use

2014-11-14 Thread Seth Jennings
On Fri, Nov 14, 2014 at 10:11:01AM +0900, Minchan Kim wrote:
> kunmap_atomic should be passed the virtual address returned by kmap_atomic.
> However, some pieces of code in zsmalloc pass a modified address,
> not the one returned by kmap_atomic, to kunmap_atomic.
> 
> This works because zsmalloc only modifies the address within
> the PAGE_SIZE boundary, so it is fine with the current kmap_atomic
> implementation. But it is fragile against potential changes
> to kmap_atomic, so let's correct it.

Seems like you could just use PAGE_MASK to get the base page address
from link like this:

---
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b3b57ef..d6ca05a 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -654,7 +654,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
 */
next_page = get_next_page(page);
link->next = obj_location_to_handle(next_page, 0);
-   kunmap_atomic(link);
+   kunmap_atomic((void *)((unsigned long)link & PAGE_MASK));
page = next_page;
off %= PAGE_SIZE;
}
@@ -1087,7 +1087,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
m_offset / sizeof(*link);
first_page->freelist = link->next;
memset(link, POISON_INUSE, sizeof(*link));
-   kunmap_atomic(link);
+   kunmap_atomic((void *)((unsigned long)link & PAGE_MASK));
 
first_page->inuse++;
/* Now move the zspage to another fullness group, if required */
@@ -1124,7 +1124,7 @@ void zs_free(struct zs_pool *pool, unsigned long obj)
link = (struct link_free *)((unsigned char *)kmap_atomic(f_page)
+ f_offset);
link->next = first_page->freelist;
-   kunmap_atomic(link);
+   kunmap_atomic((void *)((unsigned long)link & PAGE_MASK));
first_page->freelist = (void *)obj;
 
first_page->inuse--;
---

This seems cleaner, but, at the same time, it isn't obvious that we are
passing the same value to kunmap_atomic() that we got from
kmap_atomic().  Just a thought.
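Both variants rely on the same fact: masking any address inside a page with PAGE_MASK yields the page's base, which is what kmap_atomic() returned. A userspace sketch assuming 4 KiB pages, with `aligned_alloc()` standing in for a mapped page; the `TOY_*` macros and `toy_link_free` are hypothetical stand-ins.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumes 4 KiB pages; the kernel takes PAGE_SIZE/PAGE_MASK from the arch. */
#define TOY_PAGE_SIZE 4096UL
#define TOY_PAGE_MASK (~(TOY_PAGE_SIZE - 1))

struct toy_link_free { void *next; };

/* Recovers the page-aligned base from a pointer somewhere inside that page. */
static void *toy_page_base(const void *p)
{
	return (void *)((uintptr_t)p & TOY_PAGE_MASK);
}
```

Keeping the original `vaddr`, as in the patch, avoids relying on this arithmetic at all, which is what makes the pairing with kmap_atomic() obvious to a reader.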

Either way:

Reviewed-by: Seth Jennings 

> 
> Signed-off-by: Minchan Kim 
> ---
>  mm/zsmalloc.c | 21 -
>  1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index b3b57ef85830..85e14f584048 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -629,6 +629,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
>   struct page *next_page;
>   struct link_free *link;
>   unsigned int i = 1;
> + void *vaddr;
>  
>   /*
>* page->index stores offset of first object starting
> @@ -639,8 +640,8 @@ static void init_zspage(struct page *first_page, struct size_class *class)
>   if (page != first_page)
>   page->index = off;
>  
> - link = (struct link_free *)kmap_atomic(page) +
> - off / sizeof(*link);
> + vaddr = kmap_atomic(page);
> + link = (struct link_free *)vaddr + off / sizeof(*link);
>  
>   while ((off += class->size) < PAGE_SIZE) {
>   link->next = obj_location_to_handle(page, i++);
> @@ -654,7 +655,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
>*/
>   next_page = get_next_page(page);
>   link->next = obj_location_to_handle(next_page, 0);
> - kunmap_atomic(link);
> + kunmap_atomic(vaddr);
>   page = next_page;
>   off %= PAGE_SIZE;
>   }
> @@ -1055,6 +1056,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
>   unsigned long obj;
>   struct link_free *link;
>   struct size_class *class;
> + void *vaddr;
>  
>   struct page *first_page, *m_page;
>   unsigned long m_objidx, m_offset;
> @@ -1083,11 +1085,11 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
>   obj_handle_to_location(obj, &m_page, &m_objidx);
>   m_offset = obj_idx_to_offset(m_page, m_objidx, class->size);
>  
> - link = (struct link_free *)kmap_atomic(m_page) +
> - m_offset / sizeof(*link);
> + vaddr = kmap_atomic(m_page);
> + link = (struct link_free *)vaddr + m_offset / sizeof(*link);
>   first_page->freelist = link->next;
>   memset(link, POISON_INUSE, sizeof(*link));
> - kunmap_atomic(link);
> + kunmap_atomic(vaddr);
>  
>   first_page->inus

Re: [PATCH 2/2] kernel: add support for live patching

2014-11-13 Thread Seth Jennings
On Thu, Nov 13, 2014 at 11:16:00AM +0100, Miroslav Benes wrote:
> 
> Hi,
> 
> thank you for the first version of the united live patching core.
> 
> The patch below implements some of our review objections. Changes are 
> described in the commit log. It simplifies the hierarchy of data 
> structures, removes data duplication (lp_ and lpc_ structures) and 
> simplifies sysfs directory.
> 
> I did not try to repair other stuff (races, function names, function 
> prefix, api symmetry etc.). It should serve as a demonstration of our 
> point of view.
> 
> There are some problems with this. try_module_get and module_put may be 
> called several times for each kernel module where some function is 
> patched in. This should be fixed with module going notifier as suggested 
> by Petr. 
> 
> The modified core was tested with modified testing live patch originally 
> from Seth's github. It worked as expected.
> 
> Please take a look at these changes, so we can discuss them in more 
> detail.

Thanks Miroslav.

The functional changes are a little hard to break out from the
formatting changes like s/disable/unregister, s/lp_/lpc_/, and adding the
LPC_ prefix to the enum, most (all?) of which I have included for v2.

A problem with getting rid of the object layer is that there are
operations we do that are object-level operations.  For example,
module lookup and deferred module patching.  Also, the dynamic
relocations need to be associated with an object, not a patch, as not
all relocations will be able to be applied at patch load time for
patches that apply to modules that aren't loaded.  I understand that you
can walk the patch-level dynrela table and skip dynrela entries that
don't match the target object, but why do that when you can cleanly
express the relationship with a data structure hierarchy?

One example is the call to is_object_loaded() (renamed and reworked in
v2 btw) per function rather than per object.  That is duplicate work and
information that could be more cleanly expressed through an object
layer.

I also understand that the sysfs/kobject stuff adds code length.  However,
the new "funcs" attribute is procfs style, not sysfs style.  A sysfs
attribute should convey _one_ value.

From Documentation/filesystems/sysfs.txt:
==
Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only one
value per file, so it is socially acceptable to express an array of
values of the same type. 

Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon. Doing these things may get
you publicly humiliated and your code rewritten without notice. 
==
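In practice the one-value rule makes a show handler trivial: format a single value, newline-terminated, into the page-sized buffer sysfs provides. A minimal userspace sketch; `toy_enabled_show()` and `TOY_PAGE_SIZE` are hypothetical stand-ins.

```c
#include <stdio.h>

#define TOY_PAGE_SIZE 4096  /* sysfs hands show() a buffer of PAGE_SIZE bytes */

/* One value per file: the "enabled" state as a single decimal digit. */
static int toy_enabled_show(int enabled, char *buf)
{
	return snprintf(buf, TOY_PAGE_SIZE, "%d\n", enabled);
}
```

A per-function list, by contrast, fits sysfs better as one directory per function than as one multi-line file.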

Also the function list would have object ambiguity.  If there was a
patched function my_func() in both vmlinux and a module, it would just
appear on the list twice. You can fix this by using the mod:func syntax
like kallsyms, but it isn't as clean as expressing it in a hierarchy.

As far as unifying the API structures with the internal structures, I
have two points.  First, IMHO, we should assume that the structures
coming from the user are const.  In kpatch, for example, we pass through
some structures that are created not in the code but by the patch
generation tool, and stored in a (read-only) ELF section.  Second, I am
really against exposing the internal fields.  Commenting them as
"internal" is just messy, and we have to change the .h file every time
we want to add a field for internal use.
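One way to keep user structures const while hiding internal state is sketched below with hypothetical names: api_func stands for an lp_-style API structure and core_func for an lpc_-style internal wrapper. The core allocates its own wrapper that points at the caller's read-only data, so internal fields never appear in the public header. This illustrates the approach argued for above; it is not the patch's actual code.

```c
#include <stdlib.h>

/* API structure as a patch module would provide it; never written by the core. */
struct api_func {
	const char *old_name;	/* function to be patched */
	void *new_func;		/* replacement function */
	unsigned long old_addr;	/* optional disambiguation hint */
};

enum func_state { FUNC_DISABLED, FUNC_ENABLED };

/* Core-private wrapper: internal fields live here, not in the public header. */
struct core_func {
	const struct api_func *user;	/* points at the caller's const data */
	unsigned long resolved_addr;	/* filled in by the core */
	enum func_state state;
};

static struct core_func *core_func_create(const struct api_func *user)
{
	struct core_func *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	f->user = user;
	/* real code would resolve the address (e.g. via kallsyms) here */
	f->resolved_addr = user->old_addr;
	f->state = FUNC_DISABLED;
	return f;
}
```

Because the wrapper only holds a const pointer, the API structure itself can stay in a read-only section, as in the kpatch case.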

It seems that the primary purpose of this patch is to reduce the lines
of code.  However, I think that the object layer of the data structure
cleanly expresses the object<->function relationship and makes code like
the deferred patching much more straightforward since you already have
the functions/dynrelas organized by object.  You don't have to do the
nasty "if (strcmp(func->obj_name, objname)) continue;" business over the
entire patch every time.

Be advised, I have also done away with the new_addr/old_addr attributes
for v2 and replaced the reference-taking on patched modules with a GOING
notifier, protected by lpc_mutex.

Thanks,
Seth

> 
> Best regards,
> --
> Miroslav Benes
> SUSE Labs
> 
> 
> 
> From f659a18a630de27b47d375119d793e28ee50da04 Mon Sep 17 00:00:00 2001
> From: Miroslav Benes 
> Date: Thu, 13 Nov 2014 10:25:48 +0100
> Subject: [PATCH] lpc: simplification of structure and sysfs hierarchy
> 
> Original code has several issues this patch tries to remove.
> 
> First, there is only the lpc_func structure for a patched function and lpc_patch
> for the patch as a whole. Therefore the lpc_object structure, the middle step of
> the hierarchy, is removed. A patched function is still associated with some object
> (vmlinux or module) through obj_name. Dynrelas are now in the lpc_patch structure,
> and the object identifier (obj_name) is in the lpc_dynrela to preserve the connection.
> 
> Second, sysfs structure is simplified. We d

[PATCH] zbud, zswap: change module author email

2014-11-12 Thread Seth Jennings
Old email no longer viable.

Signed-off-by: Seth Jennings 
---
 mm/zbud.c  | 2 +-
 mm/zswap.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/zbud.c b/mm/zbud.c
index db8de74..4e387be 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -619,5 +619,5 @@ module_init(init_zbud);
 module_exit(exit_zbud);
 
 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Seth Jennings ");
+MODULE_AUTHOR("Seth Jennings ");
 MODULE_DESCRIPTION("Buddy Allocator for Compressed Pages");
diff --git a/mm/zswap.c b/mm/zswap.c
index ea064c1..c154306 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -951,5 +951,5 @@ error:
 late_initcall(init_zswap);
 
 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Seth Jennings ");
+MODULE_AUTHOR("Seth Jennings ");
 MODULE_DESCRIPTION("Compressed cache for swap pages");
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/zswap: add __init to some functions in zswap

2014-11-12 Thread Seth Jennings
On Sun, Nov 09, 2014 at 08:23:52PM +0800, Mahendran Ganesh wrote:
> zswap_cpu_init/zswap_comp_exit/zswap_entry_cache_create are only
> called by __init init_zswap()

Thanks for the cleanup!

Acked-by: Seth Jennings 

> 
> Signed-off-by: Mahendran Ganesh 
> ---
>  mm/zswap.c |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 51a2c45..2e621fa 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -149,7 +149,7 @@ static int __init zswap_comp_init(void)
>   return 0;
>  }
>  
> -static void zswap_comp_exit(void)
> +static void __init zswap_comp_exit(void)
>  {
>   /* free percpu transforms */
>   if (zswap_comp_pcpu_tfms)
> @@ -206,7 +206,7 @@ static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
>  **/
>  static struct kmem_cache *zswap_entry_cache;
>  
> -static int zswap_entry_cache_create(void)
> +static int __init zswap_entry_cache_create(void)
>  {
>   zswap_entry_cache = KMEM_CACHE(zswap_entry, 0);
>   return zswap_entry_cache == NULL;
> @@ -389,7 +389,7 @@ static struct notifier_block zswap_cpu_notifier_block = {
>   .notifier_call = zswap_cpu_notifier
>  };
>  
> -static int zswap_cpu_init(void)
> +static int __init zswap_cpu_init(void)
>  {
>   unsigned long cpu;
>  
> -- 
> 1.7.9.5
> 


Re: [PATCH] mm/zswap: unregister zswap_cpu_notifier_block in cleanup procedure

2014-11-12 Thread Seth Jennings
On Sun, Nov 09, 2014 at 07:22:23PM +0800, Mahendran Ganesh wrote:
> In zswap_cpu_init(), the code does not unregister *zswap_cpu_notifier_block*
> during the cleanup procedure.

This is not needed.  If we are in the cleanup code, we never got to the
__register_cpu_notifier() call.
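A userspace model of the error path under discussion, assuming (as described above) that the notifier is registered only after every per-CPU setup step has succeeded. prepare_cpu(), cpu_init() and the two globals are hypothetical stand-ins for the zswap code. Because nothing jumps to cleanup after registration, the cleanup label never has a notifier to unregister.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical state recording what has been set up so far. */
static int cpus_prepared;
static bool notifier_registered;

/* Stand-in for per-CPU allocation; fails on fail_cpu to simulate ENOMEM. */
static int prepare_cpu(int cpu, int fail_cpu)
{
	if (cpu == fail_cpu)
		return -1;
	cpus_prepared++;
	return 0;
}

static int cpu_init(int ncpus, int fail_cpu)
{
	int cpu;

	for (cpu = 0; cpu < ncpus; cpu++)
		if (prepare_cpu(cpu, fail_cpu))
			goto cleanup;

	/* Registration is the last step; no jump to cleanup follows it. */
	notifier_registered = true;
	return 0;

cleanup:
	/* Only per-CPU work can have happened, so only that is undone. */
	cpus_prepared = 0;
	return -ENOMEM;
}
```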

Thanks,
Seth

> 
> This patch fixes this issue.
> 
> Signed-off-by: Mahendran Ganesh 
> ---
>  mm/zswap.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index ea064c1..51a2c45 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -404,6 +404,7 @@ static int zswap_cpu_init(void)
>  cleanup:
>   for_each_online_cpu(cpu)
>   __zswap_cpu_notifier(CPU_UP_CANCELED, cpu);
> + __unregister_cpu_notifier(&zswap_cpu_notifier_block);
>   cpu_notifier_register_done();
>   return -ENOMEM;
>  }
> -- 
> 1.7.9.5
> 


Re: module notifier: was Re: [PATCH 2/2] kernel: add support for live patching

2014-11-11 Thread Seth Jennings
On Tue, Nov 11, 2014 at 11:17:39PM +0100, Jiri Kosina wrote:
> On Tue, 11 Nov 2014, Seth Jennings wrote:
> 
> > It will be in v2 (hopefully out in the next couple of days).
> 
> FWIW we are also working on a few patches on top of v1 to back some of the 
> proposals we've made during the first round of review, so maybe it might 
> make sense to wait with v2 a little bit more, so that it incorporates as 
> much v1 feedback as possible ... ?

What proposals in particular?  I've already made many of the changes
that we agreed upon.

Thanks,
Seth

> 
> Thanks,
> 
> -- 
> Jiri Kosina
> SUSE Labs


Re: module notifier: was Re: [PATCH 2/2] kernel: add support for live patching

2014-11-11 Thread Seth Jennings
On Fri, Nov 07, 2014 at 07:40:11PM +0100, Petr Mladek wrote:
> On Fri 2014-11-07 12:07:11, Seth Jennings wrote:
> > On Fri, Nov 07, 2014 at 06:13:07PM +0100, Petr Mladek wrote:
> > > On Thu 2014-11-06 08:39:08, Seth Jennings wrote:
[...]
> > > > +   up(&lpc_mutex);
> > > > +   WARN("failed to apply patch '%s' to module '%s'\n",
> > > > +   patch->mod->name, mod->name);
> > > > +   return 0;
> > > > +}
> > > > +
> > > > +static struct notifier_block lp_module_nb = {
> > > > +   .notifier_call = lp_module_notify,
> > > > +   .priority = INT_MIN, /* called last */
> > > 
> > > The handler for MODULE_STATE_COMING would need to have higher priority,
> > > if we want to cleanly unregister the ftrace handlers.
> > 
> > Yes, we might need two handlers at different priorities if we decide to
> > go that direction: one for MODULE_STATE_GOING at high/max and one for
> > MODULE_STATE_COMING at low/min.
> 
> kGraft has notifier only for the going state. The initialization is
> called directly from load_module() after ftrace_module_init()
> and complete_formation() before it is executed by parse_args().
> 
> I need to investigate if the notifier is more elegant and safe or not.

I looked it up and having a COMING notifier with priority INT_MIN is
effectively the same as having a call between complete_formation() and
parse_args() since the notifiers are called as the last thing in
complete_formation().
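Here is a sketch of why INT_MIN means "called last". Kernel notifier chains are kept sorted so that higher-priority blocks run first, and blocks of equal priority run in registration order. The struct nb type, LP_COMING_PRIO and order_chain() below are hypothetical and model only that ordering, not the kernel's notifier API. The claim that the chain is invoked at the end of complete_formation() comes from the text above; this code does not check it.

```c
#include <limits.h>
#include <stddef.h>

/* Priority used by the quoted lp_module_nb ("called last"). */
#define LP_COMING_PRIO INT_MIN

/* Hypothetical, simplified notifier block: a name and a priority. */
struct nb {
	const char *name;
	int priority;
};

/*
 * Put a chain into call order: higher priority first. The insertion sort
 * is stable, so equal-priority blocks keep their registration order.
 */
static void order_chain(struct nb *chain, size_t n)
{
	size_t i, j;

	for (i = 1; i < n; i++) {
		struct nb cur = chain[i];

		for (j = i; j > 0 && chain[j - 1].priority < cur.priority; j--)
			chain[j] = chain[j - 1];
		chain[j] = cur;
	}
}
```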

I think I've found a clean way to avoid taking references on the patched
modules, using only the notifier and lpc_mutex.  It will be in v2
(hopefully out in the next couple of days).

Thanks,
Seth

> 
> Best Regards,
> Petr


Re: [PATCH 1/2] kernel: add TAINT_LIVEPATCH

2014-11-11 Thread Seth Jennings
On Sun, Nov 09, 2014 at 12:19:22PM -0800, Greg KH wrote:
> On Thu, Nov 06, 2014 at 08:39:07AM -0600, Seth Jennings wrote:
> > This adds a new taint flag to indicate when the kernel or a kernel
> > module has been live patched.  This will provide a clean indication in
> > bug reports that live patching was used.
> > 
> > Additionally, if the crash occurs in a live patched function, the live
> > patch module will appear beside the patched function in the backtrace.
> > 
> > Signed-off-by: Seth Jennings 
> > ---
> >  Documentation/oops-tracing.txt  | 2 ++
> >  Documentation/sysctl/kernel.txt | 1 +
> >  include/linux/kernel.h  | 1 +
> >  kernel/panic.c  | 2 ++
> >  4 files changed, 6 insertions(+)
> > 
> > diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt
> > index beefb9f..f3ac05c 100644
> > --- a/Documentation/oops-tracing.txt
> > +++ b/Documentation/oops-tracing.txt
> > @@ -270,6 +270,8 @@ characters, each representing a particular tainted value.
> >  
> >   15: 'L' if a soft lockup has previously occurred on the system.
> >  
> > + 16: 'K' if the kernel has been live patched.
> > +
> >  The primary reason for the 'Tainted: ' string is to tell kernel
> >  debuggers if this is a clean kernel or if anything unusual has
> >  occurred.  Tainting is permanent: even if an offending module is
> > diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> > index d7fc4ab..085f73b 100644
> > --- a/Documentation/sysctl/kernel.txt
> > +++ b/Documentation/sysctl/kernel.txt
> > @@ -831,6 +831,7 @@ can be ORed together:
> >  8192 - An unsigned module has been loaded in a kernel supporting module
> > signature.
> >  16384 - A soft lockup has previously occurred on the system.
> > +32768 - The kernel has been live patched.
> >  
> >  ==
> >  
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index 446d76a..a6aa2df 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -473,6 +473,7 @@ extern enum system_states {
> >  #define TAINT_OOT_MODULE   12
> >  #define TAINT_UNSIGNED_MODULE  13
> >  #define TAINT_SOFTLOCKUP   14
> > +#define TAINT_LIVEPATCH    15
> >  
> >  extern const char hex_asc[];
> >  #define hex_asc_lo(x)  hex_asc[((x) & 0x0f)]
> 
> Note, this conflicts with a taint value that others are proposing for
> something else, so be aware you might run into problems when you hit
> linux-next.

Thanks for the notice.  I'll continue rebasing the patchset against the
latest -next, and if the other proposal(s) get in first, I'll change the value.
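For reference, the bit number added to kernel.h and the value added to sysctl/kernel.txt are related by value = 1 << bit, so bit 15 corresponds to 32768. Here is a minimal sketch; taint_mask() and is_live_patched() are hypothetical helpers, and the bit number may change if another taint flag lands first.

```c
/* Bit number proposed in the patch; subject to change on conflict. */
#define TAINT_LIVEPATCH 15

/* The sysctl value for a taint flag is 1 << its bit number. */
static unsigned long taint_mask(int bit)
{
	return 1UL << bit;
}

/* True if the tainted bitmask has the live-patch flag set. */
static int is_live_patched(unsigned long tainted)
{
	return (tainted & taint_mask(TAINT_LIVEPATCH)) != 0;
}
```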

Seth

> 
> thanks,
> 
> greg k-h


Re: [PATCH 2/2] kernel: add support for live patching

2014-11-07 Thread Seth Jennings
On Fri, Nov 07, 2014 at 11:40:38AM -0800, Andy Lutomirski wrote:
> On 11/06/2014 06:39 AM, Seth Jennings wrote:
> > This commit introduces code for the live patching core.  It implements
> > an ftrace-based mechanism and kernel interface for doing live patching
> > of kernel and kernel module functions.
> > 
> > It represents the greatest common functionality set between kpatch and
> > kgraft and can accept patches built using either method.
> > 
> > This first version does not implement any consistency mechanism that
> > ensures that old and new code do not run together.  In practice, ~90% of
> > CVEs are safe to apply in this way, since they simply add a conditional
> > check.  However, any function change that can not execute safely with
> > the old version of the function can _not_ be safely applied in this
> > version.
> > 
> 
> [...]
> 
> > +/
> > + * Sysfs Interface
> > + ***/
> > +/*
> > + * /sys/kernel/livepatch
> > + * /sys/kernel/livepatch/
> > + * /sys/kernel/livepatch//enabled
> > + * /sys/kernel/livepatch//
> > + * /sys/kernel/livepatch///
> > + * /sys/kernel/livepatchnew_addr
> > + * /sys/kernel/livepatchold_addr
> > + */
> 
> Letting anyone read new_addr and old_addr is a kASLR leak, and I would
> argue that showing this information to non-root at all is probably a bad
> idea.

It's also worth noting that this live patching implementation currently
doesn't support kASLR.  The patch module can supply an old_addr for a
function, determined at generation time by pulling from
vmlinux/System.map/etc., to resolve symbol ambiguity in a kallsyms
lookup.  Obviously, that old_addr would be wrong for a kernel using
kASLR.
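A minimal arithmetic model of why a build-time old_addr breaks under kASLR. It assumes the simple case where the whole kernel image is shifted by a single random per-boot offset. runtime_addr() and old_addr_matches() are hypothetical. A runtime kallsyms lookup does not have this problem, because it reports the relocated address.

```c
/* Assumed model: every text address moves by the same per-boot offset. */
static unsigned long runtime_addr(unsigned long build_addr,
				  unsigned long kaslr_offset)
{
	return build_addr + kaslr_offset;
}

/* Does a System.map-derived old_addr still point at the function? */
static int old_addr_matches(unsigned long old_addr, unsigned long build_addr,
			    unsigned long kaslr_offset)
{
	return old_addr == runtime_addr(build_addr, kaslr_offset);
}
```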

Thanks,
Seth

> 
> Can you make new_addr and old_addr have mode 0600 and
> /sys/kernel/livepatch itself have mode 0500?  For the latter, an admin
> who wants unprivileged users to be able to see it can easily chmod it.
> 
> --Andy
> --
> To unsubscribe from this list: send the line "unsubscribe live-patching" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] kernel: add support for live patching

2014-11-07 Thread Seth Jennings
On Fri, Nov 07, 2014 at 11:40:38AM -0800, Andy Lutomirski wrote:
> On 11/06/2014 06:39 AM, Seth Jennings wrote:
> > This commit introduces code for the live patching core.  It implements
> > an ftrace-based mechanism and kernel interface for doing live patching
> > of kernel and kernel module functions.
> > 
> > It represents the greatest common functionality set between kpatch and
> > kgraft and can accept patches built using either method.
> > 
> > This first version does not implement any consistency mechanism that
> > ensures that old and new code do not run together.  In practice, ~90% of
> > CVEs are safe to apply in this way, since they simply add a conditional
> > check.  However, any function change that can not execute safely with
> > the old version of the function can _not_ be safely applied in this
> > version.
> > 
> 
> [...]
> 
> > +/
> > + * Sysfs Interface
> > + ***/
> > +/*
> > + * /sys/kernel/livepatch
> > + * /sys/kernel/livepatch/
> > + * /sys/kernel/livepatch//enabled
> > + * /sys/kernel/livepatch//
> > + * /sys/kernel/livepatch///
> > + * /sys/kernel/livepatchnew_addr
> > + * /sys/kernel/livepatchold_addr
> > + */
> 
> Letting anyone read new_addr and old_addr is a kASLR leak, and I would
> argue that showing this information to non-root at all is probably a bad
> idea.
> 
> Can you make new_addr and old_addr have mode 0600 and
> /sys/kernel/livepatch itself have mode 0500?  For the latter, an admin
> who wants unprivileged users to be able to see it can easily chmod it.

Good call.  Will do.
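To spell out what the requested modes mean, here is a small helper that renders permission bits the way ls -l does. mode_str() is a hypothetical name. 0600 gives the owner (root) read/write and nobody else any access. 0500 on the directory gives the owner read and search (x) permission, so unprivileged users cannot list or traverse it until an admin chmods it.

```c
/* Render the nine permission bits of a mode as an "ls -l" style string. */
static void mode_str(unsigned mode, char out[10])
{
	const char *rwx = "rwxrwxrwx";
	int i;

	/* bit masks run 0400 (owner read) down to 01 (other execute) */
	for (i = 0; i < 9; i++)
		out[i] = (mode & (0400u >> i)) ? rwx[i] : '-';
	out[9] = '\0';
}
```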

Thanks,
Seth

> 
> --Andy


Re: module notifier: was Re: [PATCH 2/2] kernel: add support for live patching

2014-11-07 Thread Seth Jennings
On Fri, Nov 07, 2014 at 07:40:11PM +0100, Petr Mladek wrote:
> On Fri 2014-11-07 12:07:11, Seth Jennings wrote:
> > On Fri, Nov 07, 2014 at 06:13:07PM +0100, Petr Mladek wrote:
> > > On Thu 2014-11-06 08:39:08, Seth Jennings wrote:
> > > > This commit introduces code for the live patching core.  It implements
> > > > an ftrace-based mechanism and kernel interface for doing live patching
> > > > of kernel and kernel module functions.
> > > > 
> > > > It represents the greatest common functionality set between kpatch and
> > > > kgraft and can accept patches built using either method.
> > > > 
> > > > This first version does not implement any consistency mechanism that
> > > > ensures that old and new code do not run together.  In practice, ~90% of
> > > > CVEs are safe to apply in this way, since they simply add a conditional
> > > > check.  However, any function change that can not execute safely with
> > > > the old version of the function can _not_ be safely applied in this
> > > > version.
> > > 
> > > [...]
> > >  
> > > > +/**
> > > > + * module notifier
> > > > + */
> > > > +
> > > > +static int lp_module_notify(struct notifier_block *nb, unsigned long action,
> > > > +   void *data)
> > > > +{
> > > > +   struct module *mod = data;
> > > > +   struct lpc_patch *patch;
> > > > +   struct lpc_object *obj;
> > > > +   int ret = 0;
> > > > +
> > > > +   if (action != MODULE_STATE_COMING)
> > > > +   return 0;
> > > 
> > > IMHO, we should handle also MODULE_STATE_GOING. We should unregister
> > > the ftrace handlers and update the state of the affected objects
> > > (ENABLED -> DISABLED)
> > 
> > The mechanism we use to avoid this right now is taking a reference on
> > patched module.  We only release that reference after the patch is
> > disabled, which unregisters all the patched functions from ftrace.
> 
> I see. This was actually another thing that I noticed and wanted to
> investigate :-) I think that we should not force users to disable
> the entire patch if they want to remove some module.

I agree that would be better.

> 
> 
> > However, your comment reminded me of an idea I had to use
> > MODULE_STATE_GOING and let the lpc_mutex protect against races.  I think
> > it could be cleaner, but I haven't fleshed the idea out fully.
> 
> AFAIK, the going module is no longer used when the notifier is
> called. Therefore we could remove the patch the fast way even when
> patching would otherwise require the slow path.

Yes.  Josh just brought to my attention that the notifiers are called
with GOING _after_ the module's exit function is called.

Thanks,
Seth

> 
> 
> > > 
> > > > +   down(&lpc_mutex);
> > > > +
> > > > +   list_for_each_entry(patch, &lpc_patches, list) {
> > > > +   if (patch->state == DISABLED)
> > > > +   continue;
> > > > +   list_for_each_entry(obj, &patch->objs, list) {
> > > > +   if (strcmp(obj->name, mod->name))
> > > > +   continue;
> > > > +   pr_notice("load of module '%s' detected, applying patch '%s'\n",
> > > > + mod->name, patch->mod->name);
> > > > +   obj->mod = mod;
> > > > +   ret = lpc_enable_object(patch->mod, obj);
> > > > +   if (ret)
> > > > +   goto out;
> > > > +   break;
> > > > +   }
> > > > +   }
> > > > +
> > > > +   up(&lpc_mutex);
> > > > +   return 0;
> > > > +out:
> > > 
> > > I would name this err_out or so to make it clear that it is used when
> > > something fails.
> > 
> > Just "err" good?
> 
> Fine with me.
>  
> > > > +   up(&lpc_mutex);
> > > > +   WARN("failed to apply patch '%s' to module '%s'\n",
> > > > +   patch->mod->name, mod->name);
>

Re: module notifier: was Re: [PATCH 2/2] kernel: add support for live patching

2014-11-07 Thread Seth Jennings
On Fri, Nov 07, 2014 at 06:13:07PM +0100, Petr Mladek wrote:
> On Thu 2014-11-06 08:39:08, Seth Jennings wrote:
> > This commit introduces code for the live patching core.  It implements
> > an ftrace-based mechanism and kernel interface for doing live patching
> > of kernel and kernel module functions.
> > 
> > It represents the greatest common functionality set between kpatch and
> > kgraft and can accept patches built using either method.
> > 
> > This first version does not implement any consistency mechanism that
> > ensures that old and new code do not run together.  In practice, ~90% of
> > CVEs are safe to apply in this way, since they simply add a conditional
> > check.  However, any function change that can not execute safely with
> > the old version of the function can _not_ be safely applied in this
> > version.
> 
> [...]
>  
> > +/**
> > + * module notifier
> > + */
> > +
> > +static int lp_module_notify(struct notifier_block *nb, unsigned long action,
> > +   void *data)
> > +{
> > +   struct module *mod = data;
> > +   struct lpc_patch *patch;
> > +   struct lpc_object *obj;
> > +   int ret = 0;
> > +
> > +   if (action != MODULE_STATE_COMING)
> > +   return 0;
> 
> IMHO, we should handle also MODULE_STATE_GOING. We should unregister
> the ftrace handlers and update the state of the affected objects
> (ENABLED -> DISABLED)

The mechanism we use to avoid this right now is taking a reference on
patched module.  We only release that reference after the patch is
disabled, which unregisters all the patched functions from ftrace.

However, your comment reminded me of an idea I had to use
MODULE_STATE_GOING and let the lpc_mutex protect against races.  I think
it could be cleaner, but I haven't fleshed the idea out fully.
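A minimal userspace sketch of the GOING-notifier idea. It uses a pthread mutex as a stand-in for lpc_mutex and a flat array of objects. struct obj, patch_mutex and module_going() are hypothetical. Because enable/disable and the notifier take the same lock, patch state changes cannot interleave with the handler that disables the departing module's objects.

```c
#include <pthread.h>
#include <string.h>

enum obj_state { OBJ_DISABLED, OBJ_ENABLED };

/* Hypothetical object record: the module it targets and its state. */
struct obj {
	const char *name;
	enum obj_state state;
};

/* Stand-in for lpc_mutex: also taken by enable/disable paths. */
static pthread_mutex_t patch_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * GOING handler sketch: mark every enabled object that targets the
 * departing module as disabled (real code would also unregister its
 * ftrace handlers). Returns how many objects were disabled.
 */
static int module_going(struct obj *objs, int n, const char *modname)
{
	int i, disabled = 0;

	pthread_mutex_lock(&patch_mutex);
	for (i = 0; i < n; i++) {
		if (objs[i].state != OBJ_ENABLED || strcmp(objs[i].name, modname))
			continue;
		objs[i].state = OBJ_DISABLED;
		disabled++;
	}
	pthread_mutex_unlock(&patch_mutex);
	return disabled;
}
```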

> 
> > +   down(&lpc_mutex);
> > +
> > +   list_for_each_entry(patch, &lpc_patches, list) {
> > +   if (patch->state == DISABLED)
> > +   continue;
> > +   list_for_each_entry(obj, &patch->objs, list) {
> > +   if (strcmp(obj->name, mod->name))
> > +   continue;
> > +   pr_notice("load of module '%s' detected, applying patch '%s'\n",
> > + mod->name, patch->mod->name);
> > +   obj->mod = mod;
> > +   ret = lpc_enable_object(patch->mod, obj);
> > +   if (ret)
> > +   goto out;
> > +   break;
> > +   }
> > +   }
> > +
> > +   up(&lpc_mutex);
> > +   return 0;
> > +out:
> 
> I would name this err_out or so to make it clear that it is used when
> something fails.

Just "err" good?

> 
> > +   up(&lpc_mutex);
> > +   WARN("failed to apply patch '%s' to module '%s'\n",
> > +   patch->mod->name, mod->name);
> > +   return 0;
> > +}
> > +
> > +static struct notifier_block lp_module_nb = {
> > +   .notifier_call = lp_module_notify,
> > +   .priority = INT_MIN, /* called last */
> 
> The handler for MODULE_STATE_COMING would need to have higher priority,
> if we want to cleanly unregister the ftrace handlers.

Yes, we might need two handlers at different priorities if we decide to
go that direction: one for MODULE_STATE_GOING at high/max and one for
MODULE_STATE_COMING at low/min.

Thanks,
Seth

> 
> Best Regards,
> Petr


Re: [PATCH 2/2] kernel: add support for live patching

2014-11-07 Thread Seth Jennings
On Fri, Nov 07, 2014 at 02:13:37PM +0100, Jiri Kosina wrote:
> On Fri, 7 Nov 2014, Josh Poimboeuf wrote:
> 
> > > Also, lpc_create_object(), lpc_create_func(), lpc_create_patch(), 
> > > lpc_create_objects(), lpc_create_funcs(), ... they all are pretty much 
> > > alike, and are asking for some kind of unification ... perhaps iterator 
> > > for generic structure initialization?
> > 
> > The allocation and initialization code is very simple and
> > straightforward.  I really don't see a problem there.
> 
> This really boils down to the question I had in previous mail, whether 
> three-level hierarchy (patch->object->funcs), which is why there is a lot 
> of very alike initialization code, is not a bit over-designed.

It might be right now, but we coded ourselves into a corner a couple of
times in kpatch by using optimal but inflexible data structures and
sharing those data structures with the API.  This structure layout will
give us flexibility to make changes without having to gut everything.  I
see flexibility and modularity being important going forward, as we are
both looking to extend its capabilities.

Additionally, it lets the sysfs directories correspond to data
structures, and we can use the kobject refcount to do object cleanup
cleanly (i.e.  kobject_put() with a release handler for each ktype).
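A sketch of the kobject-style cleanup described above, assuming a plain (non-atomic) counter. struct ref_obj, ref_get(), ref_put(), struct patch_obj and patch_obj_create() are hypothetical stand-ins, not the kobject API. Each level of the hierarchy would supply its own release function, and the last put frees the object.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a kobject: refcount plus a release hook. */
struct ref_obj {
	int refcount;
	void (*release)(struct ref_obj *);
};

static void ref_get(struct ref_obj *o)
{
	o->refcount++;
}

/* Drop a reference; the last put calls the type-specific release. */
static void ref_put(struct ref_obj *o)
{
	if (--o->refcount == 0)
		o->release(o);
}

/* One level of the hierarchy embeds the ref as its first member. */
struct patch_obj {
	struct ref_obj ref;
	const char *name;
};

static int nr_released;	/* counts release calls for the demo */

static void patch_obj_release(struct ref_obj *r)
{
	struct patch_obj *p = (struct patch_obj *)r;	/* ref is first member */

	nr_released++;
	free(p);
}

static struct patch_obj *patch_obj_create(const char *name)
{
	struct patch_obj *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->ref.refcount = 1;
	p->ref.release = patch_obj_release;
	p->name = name;
	return p;
}
```

The kernel's kobject/kref code uses atomic reference counts; this single-threaded sketch does not.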

As Josh said, we do have operations that apply to each level.  I think
your point is that we could do away with the object level, but we have
operations that happen on a per-object basis. lpc_enable_object() isn't
just a for loop for registering the functions with ftrace.  It also does
the dynamic relocations.  I'm sure we will find other things in the
future.  It is also nice to have a function that can be called from both
lpc_enable_patch() and lp_module_notify() to enable the object in a
common way.
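A rough sketch of the shared per-object step. The names are hypothetical (struct sk_object, enable_object(), enable_patch(), module_coming()), and the relocation and ftrace work is reduced to setting fields. Both the whole-patch path and the module-load path funnel into the same enable_object(), which is the point being made about lpc_enable_object().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified object state. */
struct sk_object {
	const char *name;	/* "vmlinux" or a module name */
	bool loaded;		/* is the target object present? */
	bool enabled;
	int nr_relocs, relocs_applied;
};

/* Shared per-object step: apply dynrelas, then register functions. */
static int enable_object(struct sk_object *obj)
{
	if (!obj->loaded)
		return 0;	/* deferred until the module is loaded */
	obj->relocs_applied = obj->nr_relocs;	/* stand-in for dynrelas */
	obj->enabled = true;	/* stand-in for ftrace registration */
	return 0;	/* real code can fail here */
}

/* Path 1: enabling a patch enables every currently loaded object. */
static int enable_patch(struct sk_object *objs, size_t n)
{
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		ret = enable_object(&objs[i]);
		if (ret)
			return ret;
	}
	return 0;
}

/* Path 2: a COMING module notifier enables just the matching object. */
static int module_coming(struct sk_object *objs, size_t n, const char *modname)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(objs[i].name, modname))
			continue;
		objs[i].loaded = true;
		return enable_object(&objs[i]);
	}
	return 0;
}
```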

Thanks,
Seth

> 
> > > I am not also really fully convinced that we need the 
> > > patch->object->funcs abstraction hierarchy (which also contributes to 
> > > the structure allocation being rather a spaghetti copy/paste code) ... 
> > > wouldn't patch->funcs be sufficient, with the "object" being made just
> > > a property of the function, for example?
> > > 
> > > > Plus, I show that kernel/kgraft.c + kernel/kgraft_files.c is
> > > > 906+193=1099.  I'd say they are about the same size :)
> > > 
> > > Which still seems to me to be a ratio worth thinking about improving :)
> > 
> > Yes, this code doesn't have a consistency model, but it does have some
> > other non-kGraft things like dynamic relocations, 
> 
> BTW we need to put those into arch/x86/ as they are unfortunately not 
> generic. But more on this later independently.
> 
> > deferred module patching,
> 
> FWIW kgraft supports that as well.
> 
> > and a unified API.  There's really no point in comparing lines of code.
> 
> Oh, sure, I didn't mean that this is any kind of metrics that should be 
> taken too seriously at all. I was just expressing my surprise that 
> unification of the API would bring so much code that it makes the result 
> comparably sized to "the whole thing" :)
> 
> Thanks,
> 
> -- 
> Jiri Kosina
> SUSE Labs


Re: [PATCH 2/2] kernel: add support for live patching

2014-11-06 Thread Seth Jennings
On Thu, Nov 06, 2014 at 03:02:04PM -0500, Steven Rostedt wrote:
> On Thu,  6 Nov 2014 08:39:08 -0600
> Seth Jennings  wrote:
> 
> > --- /dev/null
> > +++ b/kernel/livepatch/Kconfig
> > @@ -0,0 +1,11 @@
> > +config LIVE_PATCHING
> > +   tristate "Live Kernel Patching"
> > +   depends on DYNAMIC_FTRACE_WITH_REGS && MODULES && SYSFS && KALLSYMS_ALL
> > +   default m
> 
> Nuke this default. This should be default 'n', which is what kconfig
> defaults to when none is mentioned.

Ok.

Thanks,
Seth

> 
> -- Steve
> 
> > +   help
> > + Say Y here if you want to support live kernel patching.
> > + This setting has no runtime impact until a live-patch
> > + kernel module that uses the live-patch interface provided
> > + by this option is loaded, resulting in calls to patched
> > + functions being redirected to the new function code contained
> > + in the live-patch module.


Re: [PATCH 2/2] kernel: add support for live patching

2014-11-06 Thread Seth Jennings
On Thu, Nov 06, 2014 at 04:51:02PM +0100, Jiri Slaby wrote:
> On 11/06/2014, 03:39 PM, Seth Jennings wrote:
> > This commit introduces code for the live patching core.  It implements
> > an ftrace-based mechanism and kernel interface for doing live patching
> > of kernel and kernel module functions.
> 
> Hi,
> 
> nice! So we have something to start with. Brilliant!
> 
> I have some comments below now. Yet, it obviously needs deeper review
> which will take more time.
> 
> > --- /dev/null
> > +++ b/include/linux/livepatch.h
> > @@ -0,0 +1,45 @@
> > +#ifndef _LIVEPATCH_H_
> > +#define _LIVEPATCH_H_
> 
> This should follow the linux kernel naming: LINUX_LIVEPATCH_H

Didn't realize that was the convention.  Just to be sure, you meant
_LINUX_LIVEPATCH_H right (with the leading underscore)?

> 
> 
> > +#include 
> > +
> > +struct lp_func {
> 
> I am not much happy with "lp" which effectively means parallel printer
> support. What about lip?

Not sure how much clearer lip is.  It isn't for me :-/  I'm not opposed
to changing it.  I was just trying to keep the name short since it is
used many times.  Reducing the prefix from something like "livepatch_"
to "lp_" seemed to be the shortest and most straightforward way.

> 
> > +   const char *old_name; /* function to be patched */
> > +   void *new_func; /* replacement function in patch module */
> > +   /*
> > +* The old_addr field is optional and can be used to resolve
> > +* duplicate symbol names in the vmlinux object.  If this
> > +* information is not present, the symbol is located by name
> > +* with kallsyms. If the name is not unique and old_addr is
> > +* not provided, the patch application fails as there is no
> > +* way to resolve the ambiguity.
> > +*/
> > +   unsigned long old_addr;
> > +};
> >
> > +struct lp_dynrela {
> > +   unsigned long dest;
> > +   unsigned long src;
> > +   unsigned long type;
> > +   const char *name;
> > +   int addend;
> > +   int external;
> > +};
> > +
> > +struct lp_object {
> > +   const char *name; /* "vmlinux" or module name */
> > +   struct lp_func *funcs;
> > +   struct lp_dynrela *dynrelas;
> > +};
> > +
> > +struct lp_patch {
> > +   struct module *mod; /* module containing the patch */
> > +   struct lp_object *objs;
> > +};
> 
> Please document all the structures and all its members. And use
> kernel-doc format for that. (You can take an inspiration in kgraft.)

Sure.
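The old_addr comment in the quoted lp_func describes a lookup rule that a short sketch can make concrete. struct sym stands in for kallsyms, and resolve_old_func() is hypothetical. In particular, checking that old_addr actually appears under the requested name is an assumption of this sketch, not something the quoted comment specifies.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol table entry, standing in for kallsyms. */
struct sym {
	const char *name;
	unsigned long addr;
};

/*
 * A unique name resolves by itself; a duplicated name needs old_addr to
 * pick one entry, otherwise the lookup fails. Returns 0 on failure.
 */
static unsigned long resolve_old_func(const struct sym *tab, size_t n,
				      const char *name, unsigned long old_addr)
{
	unsigned long found = 0;
	size_t i, matches = 0;

	for (i = 0; i < n; i++) {
		if (strcmp(tab[i].name, name))
			continue;
		if (old_addr && tab[i].addr == old_addr)
			return old_addr;	/* caller disambiguated */
		matches++;
		found = tab[i].addr;
	}
	if (old_addr)
		return 0;	/* old_addr not found under this name */
	return matches == 1 ? found : 0;	/* ambiguous or missing: fail */
}
```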

> 
> > +int lp_register_patch(struct lp_patch *);
> > +int lp_unregister_patch(struct lp_patch *);
> > +int lp_enable_patch(struct lp_patch *);
> > +int lp_disable_patch(struct lp_patch *);
> > +
> > +#endif /* _LIVEPATCH_H_ */
> 
> ...
> 
> > --- /dev/null
> > +++ b/kernel/livepatch/Makefile
> > @@ -0,0 +1,3 @@
> > +obj-$(CONFIG_LIVE_PATCHING) += livepatch.o
> > +
> > +livepatch-objs := core.o
> > diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> > new file mode 100644
> > index 000..b32dbb5
> > --- /dev/null
> > +++ b/kernel/livepatch/core.c
> > @@ -0,0 +1,1020 @@
> 
> ...
> 
> > +/*
> > + * Core structures
> > + /
> > +
> > +/*
> > + * lp_ structs vs lpc_ structs
> > + *
> > + * For each element (patch, object, func) in the live-patching code,
> > + * there are two types with two different prefixes: lp_ and lpc_.
> > + *
> > + * Structures used by the live-patch modules to register with this core module
> > + * are prefixed with lp_ (live patching).  These structures are part of the
> > + * registration API and are defined in livepatch.h.  The structures used
> > + * internally by this core module are prefixed with lpc_ (live patching core).
> > + */
> 
> I am not sure if the separation and the allocations/kobj handling are
> worth it. It makes the code really less understandable. Can we have just
> struct lip_function (don't unnecessarily abbreviate), lip_objectfile
> (object is too generic, like Java object) and lip_patch containing all
> the needed information? It would clean up the code a lot. (Yes, we would
> have profited from c++ here.)

I looked at doing this and this is actually what we did in kpatch.  We
made one structure that had "private" members that the user wasn't
supposed to access and that were only used in the core.  This was messy
though.  Every time you wanted
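The lp_/lpc_ split described in the quoted core.c comment can be illustrated with a small user-space sketch. It assumes the internal lpc_func simply points back at the caller's lp_func and adds core-only state; the lpc_func layout, the lpc_state enum and lpc_func_init() are hypothetical and not taken from core.c. Treating old_addr == 0 as "not provided" is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Public registration type, as declared in the proposed livepatch.h. */
struct lp_func {
	const char *old_name;    /* function to be patched */
	void *new_func;          /* replacement function in patch module */
	unsigned long old_addr;  /* optional; ASSUMPTION: 0 means "look up by name" */
};

/* Hypothetical core-only state; patch modules never see it. */
enum lpc_state { LPC_DISABLED, LPC_ENABLED };

/*
 * Hypothetical internal wrapper: the core keeps its bookkeeping here, so
 * the public struct needs no "private" members.
 */
struct lpc_func {
	const struct lp_func *user; /* registration data owned by the patch module */
	unsigned long old_addr;     /* resolved address, owned by the core */
	enum lpc_state state;
};

/* Build the internal view of one registered function. */
static void lpc_func_init(struct lpc_func *f, const struct lp_func *user)
{
	f->user = user;
	f->old_addr = user->old_addr; /* 0 here would still need a kallsyms lookup */
	f->state = LPC_DISABLED;
}
```

The trade-off is an extra allocation per element in exchange for a registration API whose structs contain only fields the patch author is expected to fill in.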

Re: [PATCH 2/2] kernel: add support for live patching

2014-11-06 Thread Seth Jennings
On Thu, Nov 06, 2014 at 04:11:37PM +0100, Jiri Kosina wrote:
> On Thu, 6 Nov 2014, Seth Jennings wrote:
> 
> > This commit introduces code for the live patching core.  It implements
> > an ftrace-based mechanism and kernel interface for doing live patching
> > of kernel and kernel module functions.
> > 
> > It represents the greatest common functionality set between kpatch and
> > kgraft and can accept patches built using either method.
> > 
> > This first version does not implement any consistency mechanism that
> > ensures that old and new code do not run together.  In practice, ~90% of
> > CVEs are safe to apply in this way, since they simply add a conditional
> > check.  However, any function change that can not execute safely with
> > the old version of the function can _not_ be safely applied in this
> > version.
> 
> Thanks a lot for having started the work on this!
> 
> We will be reviewing it carefully in the coming days and will getting back 
> to you (I was surprised to see that that diffstat indicates that it's 
> actually more code than our whole kgraft implementation including the 
> consistency model :) ).

The structure allocation and sysfs stuff is a lot of (mundane) code.
Lots of boring error path handling too.

Plus, by my count kernel/kgraft.c + kernel/kgraft_files.c come to
906+193=1099 lines.  I'd say they are about the same size :)

> 
> I have one questions right away though.
> 
> > +/*
> > + * dynamic relocations (load-time linker)
> > + */
> > +
> > +/*
> > + * external symbols are located outside the parent object (where the parent
> > + * object is either vmlinux or the kmod being patched).
> > + */
> 
> I have no ideas what dynrela is, and quickly reading the source doesn't 
> really help too much.
> 
> Could you please provide some explanation / pointer to some documentation, 
> explaining what exactly it is, and why should it be part of the common 
> infrastructure?

Yes, I should explain it.

This is something that is currently only used in the kpatch approach.
It allows the patching core to do dynamic relocations on the new
function code, similar to what the kernel module linker does, but this
works for non-exported symbols as well.

This is so the patch module doesn't have to do a kallsyms lookup on
every non-exported symbol that the new functions use.

The fields of the dynrela structure are those of a normal ELF rela
entry, except for the "external" field, which conveys information about
where the core module should go looking for the symbol referenced in the
dynrela entry.
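To make the description above concrete, here is a user-space sketch of applying one dynrela entry the way a module linker applies an ELF rela: resolve the named symbol, add the addend, and write the result at dest. It is not kpatch's code. It assumes dest is the address of the location to patch, ignores src, supports only two x86-64 relocation types, and uses hypothetical symbol tables in place of kallsyms, with the external flag choosing between the parent object's symbols and everything else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field layout copied from struct lp_dynrela in the proposed livepatch.h. */
struct lp_dynrela {
	unsigned long dest;  /* ASSUMPTION: address of the location to patch */
	unsigned long src;   /* unused in this sketch */
	unsigned long type;  /* ELF relocation type */
	const char *name;    /* symbol the location refers to */
	int addend;
	int external;        /* symbol lives outside the parent object */
};

/* Same values as R_X86_64_64 and R_X86_64_PC32 in <elf.h>. */
#define DEMO_R_X86_64_64   1
#define DEMO_R_X86_64_PC32 2

/* Hypothetical symbol table entry standing in for kallsyms data. */
struct demo_sym { const char *name; uint64_t addr; };

/* Look the symbol up in the parent object, or elsewhere if external. */
static int demo_lookup(const struct demo_sym *parent, size_t nparent,
		       const struct demo_sym *other, size_t nother,
		       const struct lp_dynrela *r, uint64_t *addr)
{
	const struct demo_sym *tab = r->external ? other : parent;
	size_t n = r->external ? nother : nparent;

	for (size_t i = 0; i < n; i++) {
		if (strcmp(tab[i].name, r->name) == 0) {
			*addr = tab[i].addr;
			return 0;
		}
	}
	return -1; /* unresolved symbol */
}

/* Write the relocated value into r->dest. */
static int demo_apply(const struct lp_dynrela *r, uint64_t sym)
{
	if (r->type == DEMO_R_X86_64_64) {
		uint64_t val = sym + (int64_t)r->addend;     /* S + A */
		memcpy((void *)r->dest, &val, sizeof(val));
		return 0;
	}
	if (r->type == DEMO_R_X86_64_PC32) {
		int64_t val = (int64_t)(sym + (int64_t)r->addend) - (int64_t)r->dest; /* S + A - P */
		int32_t v32 = (int32_t)val;

		if (v32 != val)
			return -1; /* target out of 32-bit PC-relative range */
		memcpy((void *)r->dest, &v32, sizeof(v32));
		return 0;
	}
	return -1; /* unsupported type in this sketch */
}
```

Doing this in the core means the patch module only carries the symbol name; the lookup that a hand-written patch would otherwise do per symbol happens once, at load time.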

Josh was under the impression that Vojtech was ok with putting the
dynrela stuff in the core.  Is that not correct (misunderstanding)?

Thanks,
Seth

> 
> Thanks,
> 
> -- 
> Jiri Kosina
> SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] kernel: add support for live patching

2014-11-06 Thread Seth Jennings
This commit introduces code for the live patching core.  It implements
an ftrace-based mechanism and kernel interface for doing live patching
of kernel and kernel module functions.

It represents the greatest common functionality set between kpatch and
kgraft and can accept patches built using either method.

This first version does not implement any consistency mechanism that
ensures that old and new code do not run together.  In practice, ~90% of
CVEs are safe to apply in this way, since they simply add a conditional
check.  However, any function change that can not execute safely with
the old version of the function can _not_ be safely applied in this
version.

Signed-off-by: Seth Jennings 
---
 MAINTAINERS               |   10 +
 arch/x86/Kconfig          |    2 +
 include/linux/livepatch.h |   45 ++
 kernel/Makefile           |    1 +
 kernel/livepatch/Kconfig  |   11 +
 kernel/livepatch/Makefile |3 +
 kernel/livepatch/core.c   | 1020 +
 7 files changed, 1092 insertions(+)
 create mode 100644 include/linux/livepatch.h
 create mode 100644 kernel/livepatch/Kconfig
 create mode 100644 kernel/livepatch/Makefile
 create mode 100644 kernel/livepatch/core.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f98019e..02d1af7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5671,6 +5671,16 @@ F:   Documentation/misc-devices/lis3lv02d
 F: drivers/misc/lis3lv02d/
 F: drivers/platform/x86/hp_accel.c
 
+LIVE PATCHING
+M: Josh Poimboeuf 
+M: Seth Jennings 
+M: Jiri Kosina 
+M: Vojtech Pavlik 
+S: Maintained
+F: kernel/livepatch/
+F: include/linux/livepatch.h
+L: live-patch...@vger.kernel.org
+
 LLC (802.2)
 M: Arnaldo Carvalho de Melo 
 S: Maintained
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9cd2578..fb0bb59 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1982,6 +1982,8 @@ config CMDLINE_OVERRIDE
  This is used to work around broken boot loaders.  This should
  be set to 'N' under normal conditions.
 
+source "kernel/livepatch/Kconfig"
+
 endmenu
 
 config ARCH_ENABLE_MEMORY_HOTPLUG
diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
new file mode 100644
index 000..c7a415b
--- /dev/null
+++ b/include/linux/livepatch.h
@@ -0,0 +1,45 @@
+#ifndef _LIVEPATCH_H_
+#define _LIVEPATCH_H_
+
+#include 
+
+struct lp_func {
+   const char *old_name; /* function to be patched */
+   void *new_func; /* replacement function in patch module */
+   /*
+    * The old_addr field is optional and can be used to resolve
+    * duplicate symbol names in the vmlinux object.  If this
+    * information is not present, the symbol is located by name
+    * with kallsyms. If the name is not unique and old_addr is
+    * not provided, the patch application fails as there is no
+    * way to resolve the ambiguity.
+    */
+   unsigned long old_addr;
+};
+
+struct lp_dynrela {
+   unsigned long dest;
+   unsigned long src;
+   unsigned long type;
+   const char *name;
+   int addend;
+   int external;
+};
+
+struct lp_object {
+   const char *name; /* "vmlinux" or module name */
+   struct lp_func *funcs;
+   struct lp_dynrela *dynrelas;
+};
+
+struct lp_patch {
+   struct module *mod; /* module containing the patch */
+   struct lp_object *objs;
+};
+
+int lp_register_patch(struct lp_patch *);
+int lp_unregister_patch(struct lp_patch *);
+int lp_enable_patch(struct lp_patch *);
+int lp_disable_patch(struct lp_patch *);
+
+#endif /* _LIVEPATCH_H_ */
diff --git a/kernel/Makefile b/kernel/Makefile
index a59481a..616994f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -26,6 +26,7 @@ obj-y += power/
 obj-y += printk/
 obj-y += irq/
 obj-y += rcu/
+obj-y += livepatch/
 
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
diff --git a/kernel/livepatch/Kconfig b/kernel/livepatch/Kconfig
new file mode 100644
index 000..312ed81
--- /dev/null
+++ b/kernel/livepatch/Kconfig
@@ -0,0 +1,11 @@
+config LIVE_PATCHING
+   tristate "Live Kernel Patching"
+   depends on DYNAMIC_FTRACE_WITH_REGS && MODULES && SYSFS && KALLSYMS_ALL
+   default m
+   help
+ Say Y here if you want to support live kernel patching.
+ This setting has no runtime impact until a live-patch
+ kernel module that uses the live-patch interface provided
+ by this option is loaded, resulting in calls to patched
+ functions being redirected to the new function code contained
+ in the live-patch module.
diff --git a/kernel/livepatch/Makefile b/kernel/livepatch/Makefile
new file mode 100644
index 000..7c1f008
--- /dev/null
+++ b/kernel/livepatch/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_LIVE_PATCHING) += livepatch.o
+
+livepatch-objs := core.o
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/

[PATCH 0/2] Kernel Live Patching

2014-11-06 Thread Seth Jennings
This patchset implements an ftrace-based mechanism and kernel interface for
doing live patching of kernel and kernel module functions.  It represents the
greatest common functionality set between kpatch [1] and kGraft [2] and can
accept patches built using either method.  This solution was discussed in the
Live Patching Mini-conference at LPC 2014 [3].

The model consists of a live patching "core" that provides an interface for
other "patch" kernel modules to register patches with the core.

Patch modules contain the new function code and create an lp_patch
structure containing the required data about what functions to patch, where the
new code for each patched function resides, and in which kernel object (vmlinux
or module) the function to be patched resides.  The patch module then invokes
the lp_register_patch() function to register with the core module, then
lp_enable_patch() to have the core module redirect the execution paths using
ftrace.
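The lp_patch → lp_object → lp_func layout above can be sketched in user space as follows. The header carries no element counts, so this sketch ASSUMES each array ends with a zeroed sentinel entry. The module "foo", its function foo_ioctl, and the replacement functions are hypothetical, and `mod` is a plain pointer here rather than a struct module *.

```c
#include <assert.h>
#include <stddef.h>

/* User-space copies of the proposed livepatch.h types. */
struct lp_func {
	const char *old_name;
	void *new_func;
	unsigned long old_addr;
};
struct lp_dynrela; /* layout not needed for this sketch */
struct lp_object {
	const char *name;            /* "vmlinux" or module name */
	struct lp_func *funcs;
	struct lp_dynrela *dynrelas;
};
struct lp_patch {
	void *mod;                   /* struct module * in the kernel */
	struct lp_object *objs;
};

/* Hypothetical replacement functions carried by a patch module. */
static int new_cmdline_proc_show(void) { return 0; }
static int new_foo_ioctl(void) { return 0; }

/* ASSUMPTION: arrays are terminated by a zeroed sentinel entry. */
static struct lp_func vmlinux_funcs[] = {
	{ "cmdline_proc_show", (void *)new_cmdline_proc_show, 0 },
	{ NULL, NULL, 0 }
};
static struct lp_func foo_funcs[] = {
	{ "foo_ioctl", (void *)new_foo_ioctl, 0 },
	{ NULL, NULL, 0 }
};
static struct lp_object objs[] = {
	{ "vmlinux", vmlinux_funcs, NULL },
	{ "foo", foo_funcs, NULL },      /* "foo" is a hypothetical module */
	{ NULL, NULL, NULL }
};
static struct lp_patch patch = { NULL, objs };

/*
 * Count every function the patch asks the core to redirect. A real patch
 * module would instead pass &patch to lp_register_patch() and then
 * lp_enable_patch() from its init function.
 */
static size_t count_patched_funcs(const struct lp_patch *p)
{
	size_t n = 0;

	for (const struct lp_object *obj = p->objs; obj->name; obj++)
		for (const struct lp_func *f = obj->funcs; f->old_name; f++)
			n++;
	return n;
}
```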

An example patch module can be found here:
https://github.com/spartacus06/livepatch/blob/master/patch/patch.c

The live patching core creates a sysfs hierarchy for user-level access to live
patching information.  The hierarchy is structured like this:

/sys/kernel/livepatch
/sys/kernel/livepatch/
/sys/kernel/livepatch//enabled
/sys/kernel/livepatch//
/sys/kernel/livepatch///
/sys/kernel/livepatchnew_addr
/sys/kernel/livepatchold_addr

The new_addr attribute provides the location of the new version of the function
within the patch module.  The old_addr attribute provides the location of the
old function.  The old function is located using one of two methods: its
address is either provided by the patch module (only possible for a function in
vmlinux) or found by name with a kallsyms lookup.  If the name is ambiguous,
patching fails.
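The resolution rule can be sketched as a small function: an explicitly supplied old_addr wins, otherwise the name must match exactly one symbol. The symbol table is a hypothetical stand-in for kallsyms, and this sketch trusts a supplied old_addr without cross-checking it against the name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol table entry standing in for kallsyms data. */
struct demo_sym { const char *name; unsigned long addr; };

/*
 * Resolve the address of the function to patch.  Returns 0 and stores the
 * address in *out, or -1 if the name is unknown or ambiguous.
 */
static int demo_find_old_addr(const struct demo_sym *syms, size_t nsyms,
			      const char *name, unsigned long old_addr,
			      unsigned long *out)
{
	size_t matches = 0;
	unsigned long addr = 0;

	if (old_addr) {          /* patch module disambiguated explicitly */
		*out = old_addr;
		return 0;
	}
	for (size_t i = 0; i < nsyms; i++) {
		if (strcmp(syms[i].name, name) == 0) {
			matches++;
			addr = syms[i].addr;
		}
	}
	if (matches != 1)
		return -1;       /* not found, or ambiguous: fail */
	*out = addr;
	return 0;
}
```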

The core holds a reference on any kernel module that is patched to ensure it
does not unload while we are redirecting calls from it.  Also, the core takes a
reference on the patch module itself to keep it from unloading.  This is
because, without a mechanism to ensure that no thread is currently executing in
the patched function, we can not determine whether it is safe to unload the
patch module.  For this reason, unloading patch modules is currently not
allowed.

The core is able to release its reference on patched modules by disabling all
patches that patch a function in that module.  Disabling patches can be done
like this:

echo 0 > /sys/kernel/livepatch//enabled

Patches can also be re-enabled; however, the core will retake any reference on
a kernel module that contains a patched function.
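The reference rules in the last two paragraphs can be modelled with plain counters. This is a hypothetical sketch, not core.c: demo_module stands in for struct module, and the per-target `pinned` flag is an assumption that keeps disable from dropping a reference it never took.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel module and its reference count. */
struct demo_module {
	const char *name;
	int loaded;
	int refcnt;
};

/* One module the patch touches, and whether the core currently pins it. */
struct demo_target {
	struct demo_module *mod;
	int pinned;
};

struct demo_patch {
	struct demo_module *self;     /* the patch module itself */
	struct demo_target *targets;
	size_t ntargets;
	int enabled;
};

/* Registration pins the patch module for good (unloading is not allowed). */
static void demo_register(struct demo_patch *p)
{
	p->self->refcnt++;
}

/* Enabling pins every loaded module that contains a patched function. */
static void demo_enable(struct demo_patch *p)
{
	for (size_t i = 0; i < p->ntargets; i++) {
		struct demo_target *t = &p->targets[i];

		if (t->mod->loaded && !t->pinned) {
			t->mod->refcnt++;
			t->pinned = 1;
		}
	}
	p->enabled = 1;
}

/* Disabling releases patched modules, but not the patch module. */
static void demo_disable(struct demo_patch *p)
{
	for (size_t i = 0; i < p->ntargets; i++) {
		struct demo_target *t = &p->targets[i];

		if (t->pinned) {
			t->mod->refcnt--;
			t->pinned = 0;
		}
	}
	p->enabled = 0;
}
```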

If a patch module contains a patch for a module that is not currently loaded,
there is nothing to patch so the core does nothing for that object.  However,
the core registers a module notifier so that if the module is ever loaded, it
is immediately patched.
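The deferred-patching behaviour can be sketched as the work a module-load callback would do. In the kernel this would presumably run from a notifier registered with register_module_notifier(); here it is a plain function over hypothetical structures, and it ASSUMES only enabled patches are applied to the newly loaded module.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-object state kept by the core for one patch. */
struct demo_object {
	const char *modname;  /* module whose functions this object patches */
	int patched;          /* redirection currently installed? */
};

struct demo_patch {
	struct demo_object *objs;
	size_t nobjs;
	int enabled;
};

/*
 * Patch every object, in every enabled patch, that targets the module
 * that has just loaded.  Returns the number of objects patched.
 */
static size_t demo_on_module_load(struct demo_patch *patches, size_t npatches,
				  const char *modname)
{
	size_t n = 0;

	for (size_t i = 0; i < npatches; i++) {
		if (!patches[i].enabled)
			continue;
		for (size_t j = 0; j < patches[i].nobjs; j++) {
			struct demo_object *obj = &patches[i].objs[j];

			if (!obj->patched && strcmp(obj->modname, modname) == 0) {
				obj->patched = 1; /* real core: resolve symbols, arm ftrace */
				n++;
			}
		}
	}
	return n;
}
```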

kpatch and kGraft each have their own mechanisms for ensuring system
consistency during the patching process. This first version does not implement
any consistency mechanism that ensures that old and new code do not run
together.  In practice, ~90% of CVEs are safe to apply in this way, since they
simply add a conditional check.  However, any function change that can not
execute safely with the old version of the function can _not_ be safely applied
for now.

[1] https://github.com/dynup/kpatch
[2] https://git.kernel.org/cgit/linux/kernel/git/jirislaby/kgraft.git/
[3] https://etherpad.fr/p/LPC2014_LivePatching

Seth Jennings (2):
  kernel: add TAINT_LIVEPATCH
  kernel: add support for live patching

 Documentation/oops-tracing.txt  |    2 +
 Documentation/sysctl/kernel.txt |    1 +
 MAINTAINERS                     |   10 +
 arch/x86/Kconfig                |    2 +
 include/linux/kernel.h          |    1 +
 include/linux/livepatch.h       |   45 ++
 kernel/Makefile                 |    1 +
 kernel/livepatch/Kconfig        |   11 +
 kernel/livepatch/Makefile       |    3 +
 kernel/livepatch/core.c         | 1020 +++
 kernel/panic.c                  |    2 +
 11 files changed, 1098 insertions(+)
 create mode 100644 include/linux/livepatch.h
 create mode 100644 kernel/livepatch/Kconfig
 create mode 100644 kernel/livepatch/Makefile
 create mode 100644 kernel/livepatch/core.c

-- 
1.9.3



[PATCH 1/2] kernel: add TAINT_LIVEPATCH

2014-11-06 Thread Seth Jennings
This adds a new taint flag to indicate when the kernel or a kernel
module has been live patched.  This will provide a clean indication in
bug reports that live patching was used.

Additionally, if the crash occurs in a live patched function, the live
patch module will appear beside the patched function in the backtrace.
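How the new flag surfaces can be sketched with a simplified copy of the tnts[] table from the kernel/panic.c hunk below (bits 12 to 15 only, field names changed): bit 15 gives the mask value 32768 documented in sysctl/kernel.txt, and each table entry contributes one character to the taint string, 'K' when live patching has been used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TAINT_OOT_MODULE      12
#define TAINT_UNSIGNED_MODULE 13
#define TAINT_SOFTLOCKUP      14
#define TAINT_LIVEPATCH       15

/* Simplified, partial copy of the tnts[] table in kernel/panic.c. */
struct tnt { unsigned char bit; char c_true; char c_false; };
static const struct tnt demo_tnts[] = {
	{ TAINT_OOT_MODULE,      'O', ' ' },
	{ TAINT_UNSIGNED_MODULE, 'E', ' ' },
	{ TAINT_SOFTLOCKUP,      'L', ' ' },
	{ TAINT_LIVEPATCH,       'K', ' ' },
};
#define DEMO_NTNTS (sizeof(demo_tnts) / sizeof(demo_tnts[0]))

/* One character per table entry, in the style of print_tainted(). */
static void demo_taint_string(unsigned long mask, char buf[DEMO_NTNTS + 1])
{
	for (size_t i = 0; i < DEMO_NTNTS; i++)
		buf[i] = (mask & (1UL << demo_tnts[i].bit)) ?
			 demo_tnts[i].c_true : demo_tnts[i].c_false;
	buf[DEMO_NTNTS] = '\0';
}
```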

Signed-off-by: Seth Jennings 
---
 Documentation/oops-tracing.txt  | 2 ++
 Documentation/sysctl/kernel.txt | 1 +
 include/linux/kernel.h          | 1 +
 kernel/panic.c                  | 2 ++
 4 files changed, 6 insertions(+)

diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt
index beefb9f..f3ac05c 100644
--- a/Documentation/oops-tracing.txt
+++ b/Documentation/oops-tracing.txt
@@ -270,6 +270,8 @@ characters, each representing a particular tainted value.
 
  15: 'L' if a soft lockup has previously occurred on the system.
 
+ 16: 'K' if the kernel has been live patched.
+
 The primary reason for the 'Tainted: ' string is to tell kernel
 debuggers if this is a clean kernel or if anything unusual has
 occurred.  Tainting is permanent: even if an offending module is
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index d7fc4ab..085f73b 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -831,6 +831,7 @@ can be ORed together:
 8192 - An unsigned module has been loaded in a kernel supporting module
signature.
 16384 - A soft lockup has previously occurred on the system.
+32768 - The kernel has been live patched.
 
 ==
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 446d76a..a6aa2df 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -473,6 +473,7 @@ extern enum system_states {
 #define TAINT_OOT_MODULE   12
 #define TAINT_UNSIGNED_MODULE  13
 #define TAINT_SOFTLOCKUP   14
+#define TAINT_LIVEPATCH    15
 
 extern const char hex_asc[];
 #define hex_asc_lo(x)  hex_asc[((x) & 0x0f)]
diff --git a/kernel/panic.c b/kernel/panic.c
index d09dc5c..46bca3d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -225,6 +225,7 @@ static const struct tnt tnts[] = {
{ TAINT_OOT_MODULE, 'O', ' ' },
{ TAINT_UNSIGNED_MODULE,'E', ' ' },
{ TAINT_SOFTLOCKUP, 'L', ' ' },
+   { TAINT_LIVEPATCH,  'K', ' ' },
 };
 
 /**
@@ -244,6 +245,7 @@ static const struct tnt tnts[] = {
  *  'I' - Working around severe firmware bug.
  *  'O' - Out-of-tree module has been loaded.
  *  'E' - Unsigned module has been loaded.
+ *  'K' - Kernel has been live patched.
  *
  * The string is overwritten by the next call to print_tainted().
  */
-- 
1.9.3



Re: [RFC PATCH 0/9] mm/zbud: support highmem pages

2014-11-04 Thread Seth Jennings
On Tue, Oct 14, 2014 at 08:59:19PM +0900, Heesub Shin wrote:
> zbud is a memory allocator for storing compressed data pages. It keeps
> two data objects of arbitrary size on a single page. This simple design
> provides very deterministic behavior on reclamation, which is one of
> reasons why zswap selected zbud as a default allocator over zsmalloc.
> 
> Unlike zsmalloc, however, zbud does not support highmem. This is
> problomatic especially on 32-bit machines having relatively small
> lowmem. Compressing anonymous pages from highmem and storing them into
> lowmem could eat up lowmem spaces.
> 
> This limitation is due to the fact that zbud manages its internal data
> structures on zbud_header which is kept in the head of zbud_page. For
> example, zbud_pages are tracked by several lists and have some status
> information, which are being referenced at any time by the kernel. Thus,
> zbud_pages should be allocated on a memory region directly mapped,
> lowmem.
> 
> After some digging out, I found that internal data structures of zbud
> can be kept in the struct page, the same way as zsmalloc does. So, this
> series moves out all fields in zbud_header to struct page. Though it
> alters quite a lot, it does not add any functional differences except
> highmem support. I am afraid that this kind of modification abusing
> several fields in struct page would be ok.

Hi Heesub,

Sorry for the very late reply.  The end of October was very busy for me.

A little history on zbud.  I didn't put the metadata in the struct
page, even though I knew that was an option since we had done it with
zsmalloc. At the time, Andrew Morton had concerns about memmap walkers
getting messed up with unexpected values in the struct page fields.  In
order to smooth zbud's acceptance, I decided to store the metadata
inline in the page itself.

zsmalloc was eventually accepted, which gave the impression that putting
the metadata in the struct page was acceptable.

I have recently been looking at implementing compaction for zsmalloc,
but having the metadata in the struct page and having the handle
directly encode the PFN and offset of the data block prevents
transparent relocation of the data. zbud has a similar issue as it
currently encodes the page address in the handle returned to the user
(also the limitation that is preventing use of highmem pages).

I would like to implement compaction for zbud too and moving the
metadata into the struct page is going to work against that. In fact,
I'm looking at the option of converting the current zbud_header into a
per-allocation metadata structure, which would provide a layer of
indirection between zbud and the user, allowing for transparent
relocation and compaction.
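The indirection idea can be sketched in user space. This is a generic illustration, not zbud code and not the design Seth has in mind: the handle is the address of a hypothetical per-allocation metadata record rather than of the data, so a compaction step can move the data and update only that record while the user's handle stays valid. malloc() stands in for zbud's page management.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-allocation metadata; the handle points here. */
struct demo_meta {
	void *data;   /* current location of the stored object */
	size_t size;
};

/* Allocate an object and return a handle to its metadata (0 on failure). */
static unsigned long demo_alloc(size_t size)
{
	struct demo_meta *m = malloc(sizeof(*m));

	if (!m)
		return 0;
	m->data = malloc(size);
	if (!m->data) {
		free(m);
		return 0;
	}
	m->size = size;
	return (unsigned long)m;
}

/* Translate a handle to the object's current address. */
static void *demo_map(unsigned long handle)
{
	return ((struct demo_meta *)handle)->data;
}

/* Compaction step: move the object to dst and fix up only the metadata. */
static void demo_relocate(unsigned long handle, void *dst)
{
	struct demo_meta *m = (struct demo_meta *)handle;

	memcpy(dst, m->data, m->size);
	free(m->data);
	m->data = dst;
}
```

The cost is one extra lookup on every map and the metadata record per allocation; the gain is that nothing outside the allocator ever holds a raw data address.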

However, I do like the part about letting zbud use highmem pages.

I have something in mind that would allow highmem pages _and_ move
toward something that would support compaction.  I'll see if I can put
it into code today.

Thanks,
Seth

> 
> Heesub Shin (9):
>   mm/zbud: tidy up a bit
>   mm/zbud: remove buddied list from zbud_pool
>   mm/zbud: remove lru from zbud_header
>   mm/zbud: remove first|last_chunks from zbud_header
>   mm/zbud: encode zbud handle using struct page
>   mm/zbud: remove list_head for buddied list from zbud_header
>   mm/zbud: drop zbud_header
>   mm/zbud: allow clients to use highmem pages
>   mm/zswap: use highmem pages for compressed pool
> 
>  mm/zbud.c  | 244 ++---
>  mm/zswap.c |   4 +-
>  2 files changed, 121 insertions(+), 127 deletions(-)
> 
> -- 
> 1.9.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org


Re: [RFC PATCH 0/9] mm/zbud: support highmem pages

2014-10-24 Thread Seth Jennings
On Thu, Oct 23, 2014 at 07:14:15PM -0400, Dan Streetman wrote:
> On Tue, Oct 14, 2014 at 7:59 AM, Heesub Shin  wrote:
> > zbud is a memory allocator for storing compressed data pages. It keeps
> > two data objects of arbitrary size on a single page. This simple design
> > provides very deterministic behavior on reclamation, which is one of
> > reasons why zswap selected zbud as a default allocator over zsmalloc.
> >
> > Unlike zsmalloc, however, zbud does not support highmem. This is
> > problomatic especially on 32-bit machines having relatively small
> > lowmem. Compressing anonymous pages from highmem and storing them into
> > lowmem could eat up lowmem spaces.
> >
> > This limitation is due to the fact that zbud manages its internal data
> > structures on zbud_header which is kept in the head of zbud_page. For
> > example, zbud_pages are tracked by several lists and have some status
> > information, which are being referenced at any time by the kernel. Thus,
> > zbud_pages should be allocated on a memory region directly mapped,
> > lowmem.
> >
> > After some digging out, I found that internal data structures of zbud
> > can be kept in the struct page, the same way as zsmalloc does. So, this
> > series moves out all fields in zbud_header to struct page. Though it
> > alters quite a lot, it does not add any functional differences except
> > highmem support. I am afraid that this kind of modification abusing
> > several fields in struct page would be ok.
> 
> Seth, have you had a chance to review this yet?  I'm going to try to
> take a look at it next week if you haven't yet.  Letting zbud use
> highmem would be a good thing.

I have looked at it, and it looks sound to me.  I seem to remember
having a comment on something, but I'll have to look back over
it. Haven't tested it yet.

Seth

> 
> 
> >
> > Heesub Shin (9):
> >   mm/zbud: tidy up a bit
> >   mm/zbud: remove buddied list from zbud_pool
> >   mm/zbud: remove lru from zbud_header
> >   mm/zbud: remove first|last_chunks from zbud_header
> >   mm/zbud: encode zbud handle using struct page
> >   mm/zbud: remove list_head for buddied list from zbud_header
> >   mm/zbud: drop zbud_header
> >   mm/zbud: allow clients to use highmem pages
> >   mm/zswap: use highmem pages for compressed pool
> >
> >  mm/zbud.c  | 244 ++---
> >  mm/zswap.c |   4 +-
> >  2 files changed, 121 insertions(+), 127 deletions(-)
> >
> > --
> > 1.9.1
> >


Re: [PATCH] kprobes: add kprobe_is_function_probed()

2014-10-22 Thread Seth Jennings
On Tue, Oct 21, 2014 at 09:40:32PM -0500, Josh Poimboeuf wrote:
> On Tue, Oct 21, 2014 at 11:25:56PM +0200, Jiri Kosina wrote:
> > On Tue, 21 Oct 2014, Josh Poimboeuf wrote:
> > > 
> > > I'm guessing kGraft doesn't have the address + length?  I think you
> > > could call kallsyms_lookup() to get both values.
> > > 
> > > Maybe we should see what our unified live patching code ends up looking
> > > like before deciding what interface(s) we need here?
> > 
> > Yes, that probably makes sense indeed. I am talking to David Miller wrt. 
> > mailinglist creation on vger.kernel.org as we speak, hopefully it'll 
> > materialize soon.
> 
> Ok, thanks!  Seth is currently slaving away on the code :-)

Yes, I am :)  Let me know if this impacts the information we need to
pass via the *_register() call to the core module.

Currently, I pass the old function name (char *), new function pointer
(void *), and the old_addr (optional, unsigned long).  The old_addr
serves to identify the old function by address instead of name if that
information is provided.

Thanks,
Seth



Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator

2014-10-07 Thread Seth Jennings
On Tue, Oct 07, 2014 at 04:42:33PM +0900, Joonsoo Kim wrote:
> Hello, Seth.
> Sorry for late response. :)
> 
> 2014-09-30 4:53 GMT+09:00 Seth Jennings :
> > On Fri, Sep 26, 2014 at 03:53:14PM +0900, Joonsoo Kim wrote:
> >> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
> >> If you know useful place to use this allocator, please let me know.
> >>
> >> This is brand-new allocator, called anti-fragmentation memory allocator
> >> (aka afmalloc), in order to deal with arbitrary sized object allocation
> >> efficiently. zram and zswap uses arbitrary sized object to store
> >> compressed data so they can use this allocator. If there are any other
> >> use cases, they can use it, too.
> >>
> >> This work is motivated by observation of fragmentation on zsmalloc which
> >> intended for storing arbitrary sized object with low fragmentation.
> >> Although it works well on allocation-intensive workload, memory could be
> >> highly fragmented after many free occurs. In some cases, unused memory due
> >> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
> >> problem is that other subsystem cannot use these unused memory. These
> >> fragmented memory are zsmalloc specific, so most of other subsystem cannot
> >> use it until zspage is freed to page allocator.
> >
> > Yes, zsmalloc has a fragmentation issue.  This has been a topic lately.
> > I and others are looking at putting compaction logic into zsmalloc to
> > help with this.
> >
> >>
> >> I guess that there are similar fragmentation problem in zbud, but, I
> >> didn't deeply investigate it.
> >>
> >> This new allocator uses SLAB allocator to solve above problems. When
> >> request comes, it returns handle that is pointer of metatdata to point
> >> many small chunks. These small chunks are in power of 2 size and
> >> build up whole requested memory. We can easily acquire these chunks
> >> using SLAB allocator. Following is conceptual represetation of metadata
> >> used in this allocator to help understanding of this allocator.
> >>
> >> Handle A for 400 bytes
> >> {
> >>   Pointer for 256 bytes chunk
> >>   Pointer for 128 bytes chunk
> >>   Pointer for 16 bytes chunk
> >>
> >>   (256 + 128 + 16 = 400)
> >> }
> >>
> >> As you can see, 400 bytes memory are not contiguous in afmalloc so that
> >> allocator specific store/load functions are needed. These require some
> >> computation overhead and I guess that this is the only drawback this
> >> allocator has.
> >
> > One problem with using the SLAB allocator is that kmalloc caches greater
> > than 256 bytes, at least on my x86_64 machine, have slabs that require
> > high order page allocations, which are going to be really hard to come
> > by in the memory stressed environment in which zswap/zram are expected
> > to operate.  I guess you could max out at 256 byte chunks to overcome
> > this.  However, if you have a 3k object, that would require copying 12
> > chunks from potentially 12 different pages into a contiguous area at
> > mapping time and a larger metadata size.
> 
> SLUB uses high order allocation by default, but, it has fallback method. It
> uses low order allocation if failed with high order allocation. So, we don't
> need to worry about high order allocation.

Didn't know about the fallback method :)

> 
> >>
> >> For optimization, it uses another approach for power of 2 sized request.
> >> Instead of returning handle for metadata, it adds tag on pointer from
> >> SLAB allocator and directly returns this value as handle. With this tag,
> >> afmalloc can recognize whether handle is for metadata or not and do proper
> >> processing on it. This optimization can save some memory.
> >>
> >> Although afmalloc use some memory for metadata, overall utilization of
> >> memory is really good due to zero internal fragmentation by using power
> >
> > Smallest kmalloc cache is 8 bytes so up to 7 bytes of internal
> > fragmentation per object right?  If so, "near zero".
> >
> >> of 2 sized object. Although zsmalloc has many size class, there is
> >> considerable internal fragmentation in zsmalloc.
> >
> > Lets put a number on it. Internal fragmentation on objects with size >
> > ZS_MIN_ALLOC_SIZE is ZS_SIZE_CLASS_DELTA-1, which is 15 bytes with
> > PAGE_SIZE of 4k.  If the allocation is less than ZS_MIN_ALLOC_SI

Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator

2014-09-29 Thread Seth Jennings
On Fri, Sep 26, 2014 at 03:53:14PM +0900, Joonsoo Kim wrote:
> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
> If you know useful place to use this allocator, please let me know.
> 
> This is brand-new allocator, called anti-fragmentation memory allocator
> (aka afmalloc), in order to deal with arbitrary sized object allocation
> efficiently. zram and zswap uses arbitrary sized object to store
> compressed data so they can use this allocator. If there are any other
> use cases, they can use it, too.
> 
> This work is motivated by observation of fragmentation on zsmalloc which
> intended for storing arbitrary sized object with low fragmentation.
> Although it works well on allocation-intensive workload, memory could be
> highly fragmented after many free occurs. In some cases, unused memory due
> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
> problem is that other subsystem cannot use these unused memory. These
> fragmented memory are zsmalloc specific, so most of other subsystem cannot
> use it until zspage is freed to page allocator.

Yes, zsmalloc has a fragmentation issue.  This has been a topic lately.
I and others are looking at putting compaction logic into zsmalloc to
help with this.

> 
> I guess that there are similar fragmentation problem in zbud, but, I
> didn't deeply investigate it.
> 
> This new allocator uses SLAB allocator to solve above problems. When
> request comes, it returns handle that is pointer of metatdata to point
> many small chunks. These small chunks are in power of 2 size and
> build up whole requested memory. We can easily acquire these chunks
> using SLAB allocator. Following is conceptual represetation of metadata
> used in this allocator to help understanding of this allocator.
> 
> Handle A for 400 bytes
> {
>   Pointer for 256 bytes chunk
>   Pointer for 128 bytes chunk
>   Pointer for 16 bytes chunk
> 
>   (256 + 128 + 16 = 400)
> }
> 
> As you can see, 400 bytes memory are not contiguous in afmalloc so that
> allocator specific store/load functions are needed. These require some
> computation overhead and I guess that this is the only drawback this
> allocator has.

One problem with using the SLAB allocator is that kmalloc caches greater
than 256 bytes, at least on my x86_64 machine, have slabs that require
high order page allocations, which are going to be really hard to come
by in the memory stressed environment in which zswap/zram are expected
to operate.  I guess you could max out at 256 byte chunks to overcome
this.  However, if you have a 3k object, that would require copying 12
chunks from potentially 12 different pages into a contiguous area at
mapping time and a larger metadata size.
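The chunk arithmetic above can be checked with a sketch that splits a request into power-of-two chunks, largest first, capped at a maximum chunk size. It assumes sizes are first rounded up to the 8-byte minimum kmalloc cache mentioned in the thread; the greedy split mirrors the "Handle A for 400 bytes" example, not afmalloc's actual code.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MIN_CHUNK 8  /* smallest kmalloc cache, per the thread */

/*
 * Split `size` into power-of-two chunks no larger than max_chunk (itself a
 * power of two >= DEMO_MIN_CHUNK).  Stores up to max_out chunk sizes in
 * out[], the rounding waste in *waste, and returns the chunk count.
 */
static size_t demo_split(size_t size, size_t max_chunk, size_t *out,
			 size_t max_out, size_t *waste)
{
	size_t rounded = (size + DEMO_MIN_CHUNK - 1) / DEMO_MIN_CHUNK * DEMO_MIN_CHUNK;
	size_t rest = rounded, n = 0;

	*waste = rounded - size;
	while (rest) {
		size_t c = max_chunk;

		while (c > rest)   /* largest power of two that still fits */
			c >>= 1;
		if (n < max_out)
			out[n] = c;
		n++;
		rest -= c;
	}
	return n;
}
```

With a 256-byte cap, a 3k object becomes 12 separate chunks, each potentially on a different slab page, which is the copying cost at map time described above.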

> 
> For optimization, it uses another approach for power of 2 sized request.
> Instead of returning handle for metadata, it adds tag on pointer from
> SLAB allocator and directly returns this value as handle. With this tag,
> afmalloc can recognize whether handle is for metadata or not and do proper
> processing on it. This optimization can save some memory.
> 
> Although afmalloc use some memory for metadata, overall utilization of
> memory is really good due to zero internal fragmentation by using power

Smallest kmalloc cache is 8 bytes so up to 7 bytes of internal
fragmentation per object right?  If so, "near zero".

> of 2 sized object. Although zsmalloc has many size class, there is
> considerable internal fragmentation in zsmalloc.

Let's put a number on it. Internal fragmentation on objects with size >
ZS_MIN_ALLOC_SIZE is at most ZS_SIZE_CLASS_DELTA-1, which is 15 bytes with
PAGE_SIZE of 4k.  If the allocation is less than ZS_MIN_ALLOC_SIZE,
fragmentation could be as high as ZS_MIN_ALLOC_SIZE-1, which is 31 on a
64-bit system with 4k pages.  (Note: I don't think that it is possible to
compress a 4k page to less than 32 bytes, so for zswap, there will be no
allocations in this size range).

So we are looking at up to 7 vs 15 bytes of internal fragmentation per
object in the case when allocations are > ZS_MIN_ALLOC_SIZE.  Once you
take into account the per-object metadata overhead of afmalloc, I think
zsmalloc comes out ahead here.

> 
> In workloads that do many frees, memory could be fragmented as in
> zsmalloc, but there is a big difference. These unused portions of memory
> are SLAB-specific memory, so other subsystems can use them. Therefore,
> fragmented memory should not be a big problem in this allocator.

While freeing chunks back to the slab allocator does make that memory
available to other _kernel_ users, the fragmentation problem is just
moved one level down.  The fragmentation will exist in the slabs and
those fragmented slabs won't be freed to the page allocator, which would
make them available to _any_ user, not just the kernel.  Additionally,
there is little visibility into how chunks are organized in the slab,
making compaction at the afmalloc level nearly impossible.  (The only
visibility being the add

Re: [PATCH] sb_edac: avoid INTERNAL ERROR message in EDAC with unspecified channel

2014-09-22 Thread Seth Jennings
On Mon, Sep 08, 2014 at 10:18:51AM -0400, Aristeu Rozanski wrote:
> On Fri, Sep 05, 2014 at 02:28:47PM -0500, Seth Jennings wrote:
> > Intel IA32 SDM Table 15-14 defines channel 0xf as 'not specified', but
> > EDAC doesn't know about this and returns an INTERNAL ERROR when the
> > channel is greater than or equal to NUM_CHANNELS:
> > 
> > kernel: [ 1538.886456] CPU 0: Machine Check Exception: 0 Bank 1: 949f
> > kernel: [ 1538.886669] TSC 2bc68b22e7e812 ADDR 46dae7000 MISC 0 PROCESSOR 0:306e4 TIME 1390414572 SOCKET 0 APIC 0
> > kernel: [ 1538.971948] EDAC MC1: INTERNAL ERROR: channel value is out of range (15 >= 4)
> > kernel: [ 1538.972203] EDAC MC1: 0 CE memory read error on unknown memory (slot:0 page:0x46dae7 offset:0x0 grain:0 syndrome:0x0 -  area:DRAM err_code::009f socket:1 channel_mask:1 rank:0)
> > 
> > This commit changes sb_edac to forward a channel of -1 to EDAC if the
> > channel is not specified.  edac_mc_handle_error() sets the channel to -1
> > internally after the error message anyway, so this commit should have no
> > effect other than avoiding the INTERNAL ERROR message when the channel
> > is not specified.

Hey Mauro,

I wanted to make sure this was on your radar.  I don't see it in
-next and haven't gotten any feedback or confirmation that the patch has
been accepted, rejected, or needs work.

Seth

> > 
> > Signed-off-by: Seth Jennings 
> > Cc: Aristeu Rozanski 
> > Cc: linux-e...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  drivers/edac/sb_edac.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
> > index 0034c48..07efed4 100644
> > --- a/drivers/edac/sb_edac.c
> > +++ b/drivers/edac/sb_edac.c
> > @@ -283,8 +283,9 @@ static const u32 correrrthrsld[] = {
> >   * sbridge structs
> >   */
> >  
> > -#define NUM_CHANNELS   4
> > -#define MAX_DIMMS  3   /* Max DIMMS per channel */
> > +#define NUM_CHANNELS   4
> > +#define MAX_DIMMS  3   /* Max DIMMS per channel */
> > +#define CHANNEL_UNSPECIFIED 0xf /* Intel IA32 SDM 15-14 */
> >  
> >  enum type {
> > SANDY_BRIDGE,
> > @@ -1991,6 +1992,9 @@ static void sbridge_mce_output_error(struct mem_ctl_info *mci,
> >  
> > /* FIXME: need support for channel mask */
> >  
> > +   if (channel == CHANNEL_UNSPECIFIED)
> > +   channel = -1;
> > +
> > /* Call the helper to output message */
> > edac_mc_handle_error(tp_event, mci, core_err_cnt,
> >  m->addr >> PAGE_SHIFT, m->addr & ~PAGE_MASK, 0,
> 
> Acked-by: Aristeu Rozanski 
> 
> -- 
> Aristeu
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 03/10] zsmalloc: always update lru ordering of each zspage

2014-09-11 Thread Seth Jennings
On Thu, Sep 11, 2014 at 04:53:54PM -0400, Dan Streetman wrote:
> Update ordering of a changed zspage in its fullness group LRU list,
> even if it has not moved to a different fullness group.
> 
> This is needed by zsmalloc shrinking, which partially relies on each
> class fullness group list to be kept in LRU order, so the oldest can
> be reclaimed first.  Currently, LRU ordering is only updated when
> a zspage changes fullness groups.

Just something I saw.

fix_fullness_group() is called from zs_free(), which means that removing
an object from a zspage moves it to the front of the LRU.  Not sure if
that is what we want.  If anything, that makes it a _better_ candidate
for reclaim, as the zspage now contains fewer objects that we'll have
to decompress and write back.

Seth

> 
> Signed-off-by: Dan Streetman 
> Cc: Minchan Kim 
> ---
>  mm/zsmalloc.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index fedb70f..51db622 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -467,16 +467,14 @@ static enum fullness_group fix_fullness_group(struct zs_pool *pool,
>   BUG_ON(!is_first_page(page));
>  
>   get_zspage_mapping(page, &class_idx, &currfg);
> - newfg = get_fullness_group(page);
> - if (newfg == currfg)
> - goto out;
> -
>   class = &pool->size_class[class_idx];
> + newfg = get_fullness_group(page);
> + /* Need to do this even if currfg == newfg, to update lru */
>   remove_zspage(page, class, currfg);
>   insert_zspage(page, class, newfg);
> - set_zspage_mapping(page, class_idx, newfg);
> + if (currfg != newfg)
> + set_zspage_mapping(page, class_idx, newfg);
>  
> -out:
>   return newfg;
>  }
>  
> -- 
> 1.8.3.1
> 


Re: [PATCH 01/10] zsmalloc: fix init_zspage free obj linking

2014-09-11 Thread Seth Jennings
On Thu, Sep 11, 2014 at 04:53:52PM -0400, Dan Streetman wrote:
> When zsmalloc creates a new zspage, it initializes each object it contains
> with a link to the next object, so that the zspage has a singly-linked list
> of its free objects.  However, the logic that sets up the links is wrong,
> and in the case of objects that are precisely aligned with the page boundaries
> (e.g. a zspage with objects that are 1/2 PAGE_SIZE) the first object on the
> next page is skipped, due to incrementing the offset twice.  The logic can be
> simplified, as it doesn't need to calculate how many objects can fit on the
> current page; simply checking the offset for each object is enough.
> 
> Change zsmalloc init_zspage() logic to iterate through each object on
> each of its pages, checking the offset to verify the object is on the
> current page before linking it into the zspage.
> 
> Signed-off-by: Dan Streetman 
> Cc: Minchan Kim 

This one stands on its own as a bugfix.

Reviewed-by: Seth Jennings 

> ---
>  mm/zsmalloc.c | 14 +++++---------
>  1 file changed, 5 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index c4a9157..03aa72f 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -628,7 +628,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
>   while (page) {
>   struct page *next_page;
>   struct link_free *link;
> - unsigned int i, objs_on_page;
> + unsigned int i = 1;
>  
>   /*
>* page->index stores offset of first object starting
> @@ -641,14 +641,10 @@ static void init_zspage(struct page *first_page, struct size_class *class)
>  
>   link = (struct link_free *)kmap_atomic(page) +
>   off / sizeof(*link);
> - objs_on_page = (PAGE_SIZE - off) / class->size;
>  
> - for (i = 1; i <= objs_on_page; i++) {
> - off += class->size;
> - if (off < PAGE_SIZE) {
> - link->next = obj_location_to_handle(page, i);
> - link += class->size / sizeof(*link);
> - }
> + while ((off += class->size) < PAGE_SIZE) {
> + link->next = obj_location_to_handle(page, i++);
> + link += class->size / sizeof(*link);
>   }
>  
>   /*
> @@ -660,7 +656,7 @@ static void init_zspage(struct page *first_page, struct size_class *class)
>   link->next = obj_location_to_handle(next_page, 0);
>   kunmap_atomic(link);
>   page = next_page;
> - off = (off + class->size) % PAGE_SIZE;
> + off %= PAGE_SIZE;
>   }
>  }
>  
> -- 
> 1.8.3.1
> 

