Re: [PATCH 0/5] v2: block subsystem refcounter conversions

2017-04-21 Thread James Bottomley
On Fri, 2017-04-21 at 14:30 -0700, Kees Cook wrote:
> On Fri, Apr 21, 2017 at 2:27 PM, James Bottomley
> <james.bottom...@hansenpartnership.com> wrote:
> > On Fri, 2017-04-21 at 13:22 -0700, Kees Cook wrote:
> > > On Fri, Apr 21, 2017 at 12:55 PM, Eric Biggers <
> > > ebigge...@gmail.com>
> > > wrote:
> > > > > > Of course, having extra checks behind a debug option is 
> > > > > > fine.  But they should not be part of the base feature; the 
> > > > > > base feature should just be mitigation of reference count
> > > > > > *overflows*.  It would be nice to do more, of course; but 
> > > > > > when the extra stuff prevents people from using refcount_t 
> > > > > > for performance reasons, it defeats the point of the
> > > > > > feature in the first place.
> > > > > 
> > > > > Sure, but as I said above, I think the smaller tricks and 
> > > > > fixes won't be convincing enough, so their value is
> > > > > questionable.
> > > > 
> > > > This makes no sense, as the main point of the feature is 
> > > > supposed to be the security improvement.  As-is, the extra 
> > > > debugging stuff is actually preventing the security improvement 
> > > > from being adopted, which is unfortunate.
> > > 
> > > We've been trying to handle the conflicting desires of those 
> > > wanting very precise refcounting implementation and gaining the 
> > > security protections. Ultimately, the best way forward seemed to 
> > > be to first land the precise refcounting implementation, and 
> > > start conversion until we ran into concerns over performance. 
> > > Now, since we're here, we can move forward with getting a fast 
> > > implementation that provides the desired security protections 
> > > without too greatly messing with the refcount API.
> > 
> > But that's not what it really looks like.  What it looks like is
> > someone came up with a new API and is now intent on forcing it 
> > through the kernel in about 500 patches using security as the
> > hammer.
> 
> The intent is to split refcounting and statistical counters away from
> atomics, since they have distinct APIs. This will let us audit the
> remaining atomic uses much more easily.

But the security problem is counter overflow, right?  That problem, as
far as I can see exists in the atomics as well.  I'm sure there might
be one or two corner cases depending on overflow actually working, but
I can't think of them.

The refcount vs atomic comes on the precise meaning of decrement to
zero.  I'm not saying there's no benefit to detecting the condition,
but the security problem looks to be much more pressing which is why I
think this can be argued on the merits later.

> > If we were really concerned about security first, we'd have fixed 
> > the atomic overflow problem in the atomics and then built the 
> > precise refcounting on that and tried to persuade people of the
> > merits.
> 
> I agree, but this approach was NAKed by multiple atomics maintainers.

Overriding that decision by trying to convince all the consumers to
move to a new API doesn't seem to be going so well either.  Perhaps we
could assist you in changing the minds of the atomics maintainers ...
what was the primary problem?  The additional couple of cycles or the
fact that some use cases want overflow (or something else)?

James

> > Why can't we still do this?  It looks like the overflow protection 
> > will add only a couple of cycles to the atomics.  We can move the
> > current version to an unchecked variant which can be used either in 
> > truly performance critical regions with no security implications or 
> > if someone really needs the atomic to overflow.  From my point of 
> > view it would give us the 90% case (security) and stop this endless
> > argument over the fast paths.  Subsystems which have already moved 
> > to refcount would stay there and the rest get to evaluate a 
> > migration on the merits of the operational utility.
> 
> -Kees
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] v2: block subsystem refcounter conversions

2017-04-21 Thread James Bottomley
On Fri, 2017-04-21 at 13:22 -0700, Kees Cook wrote:
> On Fri, Apr 21, 2017 at 12:55 PM, Eric Biggers  
> wrote:
> > > > Of course, having extra checks behind a debug option is fine. 
> > > >  But they should not be part of the base feature; the base 
> > > > feature should just be mitigation of reference count
> > > > *overflows*.  It would be nice to do more, of
> > > > course; but when the extra stuff prevents people from using 
> > > > refcount_t for performance reasons, it defeats the point of the
> > > > feature in the first place.
> > > 
> > > Sure, but as I said above, I think the smaller tricks and fixes 
> > > won't be convincing enough, so their value is questionable.
> > 
> > This makes no sense, as the main point of the feature is supposed 
> > to be the security improvement.  As-is, the extra debugging stuff 
> > is actually preventing the security improvement from being adopted,
> > which is unfortunate.
> 
> We've been trying to handle the conflicting desires of those wanting
> very precise refcounting implementation and gaining the security
> protections. Ultimately, the best way forward seemed to be to first
> land the precise refcounting implementation, and start conversion
> until we ran into concerns over performance. Now, since we're here, 
> we can move forward with getting a fast implementation that provides 
> the desired security protections without too greatly messing with the
> refcount API.

But that's not what it really looks like.  What it looks like is
someone came up with a new API and is now intent on forcing it through
the kernel in about 500 patches using security as the hammer.

If we were really concerned about security first, we'd have fixed the
atomic overflow problem in the atomics and then built the precise
refcounting on that and tried to persuade people of the merits.

Why can't we still do this?  It looks like the overflow protection will
add only a couple of cycles to the atomics.  We can move the current
version to an unchecked variant which can be used either in truly
performance critical regions with no security implications or if
someone really needs the atomic to overflow.  From my point of view it
would give us the 90% case (security) and stop this endless argument
over the fast paths.  Subsystems which have already moved to refcount
would stay there and the rest get to evaluate a migration on the merits
of the operational utility.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread James Bottomley
On Mon, 2017-02-20 at 17:56 +0100, Peter Zijlstra wrote:
> On Mon, Feb 20, 2017 at 07:41:01AM -0800, James Bottomley wrote:
> > On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
> > > On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> > > > Now when new refcount_t type and API are finally merged
> > > > (see include/linux/refcount.h), the following
> > > > patches convert various refcounters in the block susystem from 
> > > > atomic_t to refcount_t. By doing this we prevent intentional or
> > > > accidental underflows or overflows that can led to use-after
> > > > -free vulnerabilities.
> > 
> > This description isn't right ... nothing is prevented; we get 
> > warnings on saturation and use after free with this.
> 
> The thing that is prevented is overflow and then a use-after-free by
> making it a leak.
> 
> Modular stuff, you put and free at: (n+1) mod n, by saturating at n-1
> we'll never get there.
> 
> So you loose use-after-free, you gain a resource leak. The general 
> idea being that use-after-free is a nice trampoline for exploits, 
> leaks are 'only' a DoS.

OK, I see the intention: it's protection from outside influence.  It
still doesn't prevent *us* from screwing up in the kernel and inducing
a use after free by doing too many puts (or too few gets) ... that's
what the message suggests to me (me coding wrongly is accidental
underflows or overflows as I read it).

James

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread James Bottomley
On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
> On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> > Now when new refcount_t type and API are finally merged
> > (see include/linux/refcount.h), the following
> > patches convert various refcounters in the block susystem from 
> > atomic_t to refcount_t. By doing this we prevent intentional or 
> > accidental underflows or overflows that can led to use-after-free
> > vulnerabilities.

This description isn't right ... nothing is prevented; we get warnings
on saturation and use after free with this.

> > The below patches are fully independent and can be cherry-picked 
> > separately. Since we convert all kernel subsystems in the same 
> > fashion, resulting in about 300 patches, we have to group them for 
> > sending at least in some fashion to be manageable. Please excuse
> > the long cc list.
> > 
> > Elena Reshetova (5):
> >   block: convert bio.__bi_cnt from atomic_t to refcount_t
> >   block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
> >   block: convert blkcg_gq.refcnt from atomic_t to refcount_t
> >   block: convert io_context.active_ref from atomic_t to refcount_t
> >   block: convert bsg_device.ref_count from atomic_t to refcount_t
> 
> I went to look at the implementation, and the size of a refcount_t.
> But the code is not there?! You say it's finally merged, where is
> it merged? Don't send code like this without the necessary
> infrastructure being in mainline.

It's one of those no discussion except notice by tipbot things because
Ingo liked it.

The size is atomic_t, but the primitives check for overflow and check
inc from zero and things, so in a true refcounting situation we get
automatic warnings of problems inside the primitives.

That said, if we were going to convert the block layer to this
semantic, surely the benefit of the conversion would be getting rid of
the now unnecessary BUG_ON and WARN_ON in the code, which do exactly
the same thing the refcount primitives now do?

James

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: LSF/MM 2017: Call for Proposals

2016-12-08 Thread James Bottomley
On Thu, 2016-12-08 at 13:26 +0100, Michal Hocko wrote:
> On Wed 07-12-16 06:57:06, James Bottomley wrote:
> [...]
> > Just on this point, since there seems to be a lot of confusion: lsf
> > -pc
> > is the list for contacting the programme committee, so you cannot
> > subscribe to it.
> > 
> > There is no -discuss equivalent, like kernel summit has, because we
> > expect you to cc the relevant existing mailing list and have the
> > discussion there instead rather than expecting people to subscribe
> > to a
> > new list.
> 
> There used to be l...@lists.linux-foundation.org. Is it not active
> anymore?

Not until we get the list of invitees sorted out.  And then it's only
for stuff actually to do with lsf (like logistics, bof or session
requests or other meetups).  Any technical stuff should still be cc'd
to the relevant mailing list.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: LSF/MM 2017: Call for Proposals

2016-12-07 Thread James Bottomley
On Thu, 2016-12-01 at 09:11 -0500, Jeff Layton wrote:
> 1) Proposals for agenda topics should be sent before January 15th, 
> 2016 to:
> 
> lsf...@lists.linux-foundation.org
> 
> and cc the Linux list or lists that are relevant for the topic in
> question:
> 
> ATA:   linux-...@vger.kernel.org
> Block: linux-bl...@vger.kernel.org
> FS:linux-fsde...@vger.kernel.org
> MM:linux...@kvack.org
> SCSI:  linux-s...@vger.kernel.org
> NVMe:  linux-n...@lists.infradead.org
> 
> Please tag your proposal with [LSF/MM TOPIC] to make it easier to 
> track.

Just on this point, since there seems to be a lot of confusion: lsf-pc
is the list for contacting the programme committee, so you cannot
subscribe to it.

There is no -discuss equivalent, like kernel summit has, because we
expect you to cc the relevant existing mailing list and have the
discussion there instead rather than expecting people to subscribe to a
new list.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: fix compile failure on parisc

2013-03-03 Thread James Bottomley
x86 seems to include vmalloc.h by default along some of its arch paths,
but most other architectures don't, leading to this compile failure:

fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
fs/btrfs/raid56.c:206: error: implicit declaration of function 'vzalloc'
fs/btrfs/raid56.c:206: warning: assignment makes pointer from integer
without a cast
fs/btrfs/raid56.c:226: error: implicit declaration of function 'vfree'
make[2]: *** [fs/btrfs/raid56.o] Error 1

Fix this by adding vmalloc.h explicitly to the includes list

Signed-off-by: James Bottomley jbottom...@parallels.com

---

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0722205..1f0f57e 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -30,6 +30,7 @@
 #include linux/raid/pq.h
 #include linux/hash.h
 #include linux/list_sort.h
+#include linux/vmalloc.h
 #include linux/raid/xor.h
 #include asm/div64.h
 #include compat.h

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs for enterprise raid arrays

2009-04-03 Thread James Bottomley
On Fri, 2009-04-03 at 06:27 -0700, Matthew Wilcox wrote:
 On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
  On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
New firmware/microcode versions are able to reclaim that space if it
sees a certain number of consecutive zero's and will reclaim that 
space to the volume pool. Are there any thoughts on writing a 
low-priority tread that zeros out those non-used blocks?
   
   Patches have been floating around to support this - see the recent 
   patches around DISCARD on linux-ide and lkml.  It would be great to 
   get access to a box that implemented the T10 proposed UNMAP commands 
   that we could test against. 
  
  We've already made btrfs support TRIM, and Matthew has patches which
  hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard
  once the dust settles on the spec.
 
 It seems like the dust has settled ... I just need to check that
 my code still conforms to the spec.  Understandably, I've been focused
 on TRIM ;-)
 
  I don't think I've seen anybody talking about deliberately writing
  zeroes instead of just issuing a discard command though. That doesn't
  seem like a massively cunning plan.
 
 Yeah, WRITE SAME with the discard bit.  A bit of a crappy way to go, to
 be sure.  I'm not exactly sure how we're supposed to be deciding whether
 to issue an UNMAP or WRITE SAME command.  Perhaps if I read the spec
 properly it'll tell me.

Actually, the point about WRITE SAME is that it's a far smaller patch to
the standards (just a couple of bits).  Plus it gets around the problem
of what does the array return when an unmapped block is requested (which
occupies pages in the UNMAP proposal), so from that point of view it
seems very logical.

 I just had a quick chat with someone from another storage vendor who
 don't yet implement UNMAP -- if you do a WRITE SAME with all zeroes,
 their device will notice that and unmap the LBAs in question.

I actually already looked at using WRITE SAME in sd.c ... it turns out
to be surprisingly little work ... the thing you'll like about it is
that there are no extents to worry about and if you plan on writing all
zeros, you can keep a static zeroed data buffer around for the
purpose ...

James


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs for enterprise raid arrays

2009-04-03 Thread James Bottomley
On Fri, 2009-04-03 at 07:43 -0400, Ric Wheeler wrote:
 Erwin van Londen wrote:
  Another thing is that some arrays have the capability to thin-provision 
  volumes. In the back-end on the physical layer the array configures, let 
  say, a 1 TB volume and virtually provisions 5TB to the host. On writes it 
  dynamically allocates more pages in the pool up to the 5TB point. Now if 
  for some reason large holes occur on the volume, maybe a couple of ISO 
  images that have been deleted, what normally happens is just some pointers 
  in the inodes get deleted so from an array perspective there is still data 
  on those locations and will never release those allocated blocks. New 
  firmware/microcode versions are able to reclaim that space if it sees a 
  certain number of consecutive zero's and will reclaim that space to the 
  volume pool. Are there any thoughts on writing a low-priority tread that 
  zeros out those non-used blocks?

 
 Patches have been floating around to support this - see the recent 
 patches around DISCARD on linux-ide and lkml.  It would be great to 
 get access to a box that implemented the T10 proposed UNMAP commands 
 that we could test against. 

So we went several times around the block in the upcoming Linux
Filesystem and Storage workshop to see if anyone from the array vendors
might be interested in discussing thin provisioning.  The general result
was no since travel is tight.  The upshot will be that most of our
discard infrastructure will be focussed on SSD TRIM, but that we'll try
to preserve the TP option for arrays ... there are still private
conversations going on with various people who know the UNMAP/WRITE SAME
requirements of the various arrays at the various vendors.

James


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html