Re: kswapd craziness in 3.7

2012-11-27 Thread Linus Torvalds
Note that in the meantime, I've also applied (through Andrew) the
patch that reverts commit c654345924f7 (see commit 82b212f40059
'Revert "mm: remove __GFP_NO_KSWAPD"').

I wonder if that revert may be bogus, and a result of this same issue.
Maybe that revert should be reverted, and replaced with your patch?

Mel? Zdenek? What's the status here?

 Linus

On Tue, Nov 27, 2012 at 12:48 PM, Johannes Weiner  wrote:
> Hi everyone,
>
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem.  Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.
>
> One problem was, as it seems, overly aggressive reclaim due to scaling
> up reclaim goals based on compaction failures.  This one was reverted
> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
> reclaim/compaction based on failures".
>
> Another one was an accounting problem where a freed higher order page
> was underreported, and so kswapd had trouble restoring watermarks.
> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
> (appears like memory leak).
>
> The third one is a problem with small zones, like the DMA zone, where
> the high watermark is lower than the low watermark plus compaction gap
> (2 * allocation size).  The zonelist reclaim in kswapd would do
> nothing because all high watermarks are met, but the compaction logic
> would find its own requirements unmet and loop over the zones again.
> Indefinitely, until some third party would free enough memory to help
> meet the higher compaction watermark.  The problematic code has been
> there since the 3.4 merge window for non-THP higher order allocations
> but has been more prominent since the 3.7 merge window, where kswapd
> is also woken up for the much more common THP allocations.
>
> The following patch should fix the third issue by making both reclaim
> and compaction code in kswapd use the same predicate to determine
> whether a zone is balanced or not.
>
> Hopefully, the sum of all three fixes should tame kswapd enough for
> 3.7.
>
> Johannes
>
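
The third problem Johannes describes above (a small zone whose high watermark sits below the low watermark plus the compaction gap) can be illustrated with a toy model. This is only a sketch: `struct toy_zone`, the function names and the pass counter are hypothetical and are not the kernel's vmscan code, and using the reclaim check as the shared predicate is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy zone, all values in pages. Names are illustrative only. */
struct toy_zone {
    unsigned long free_pages;
    unsigned long low_wmark;
    unsigned long high_wmark;
};

/* Reclaim's view: balanced once free pages reach the high watermark. */
bool reclaim_balanced(const struct toy_zone *z)
{
    return z->free_pages >= z->high_wmark;
}

/* Compaction's view: low watermark plus a gap of 2 * allocation size. */
bool compaction_balanced(const struct toy_zone *z, unsigned int order)
{
    unsigned long gap = 2UL << order;

    return z->free_pages >= z->low_wmark + gap;
}

/*
 * Two different predicates: reclaim frees one page per pass while it is
 * unhappy; once it is happy it does nothing, and the loop only ends when
 * compaction is happy too. Returns the number of passes, capped at
 * max_passes.
 */
unsigned int passes_until_agreement(struct toy_zone *z, unsigned int order,
                                    unsigned int max_passes)
{
    unsigned int passes = 0;

    while (passes < max_passes) {
        if (!reclaim_balanced(z))
            z->free_pages++;            /* pretend reclaim freed a page */
        else if (compaction_balanced(z, order))
            break;                      /* both sides agree */
        passes++;                       /* otherwise: go around again */
    }
    return passes;
}

/* One shared predicate: the loop ends as soon as that predicate holds. */
unsigned int passes_with_shared_predicate(struct toy_zone *z,
                                          unsigned int max_passes)
{
    unsigned int passes = 0;

    while (passes < max_passes && !reclaim_balanced(z)) {
        z->free_pages++;
        passes++;
    }
    return passes;
}
```

With a zone of 40 free pages, a low watermark of 30 and a high watermark of 36, an order-9 request makes the two-predicate loop run until its cap, because reclaim has nothing to do while compaction still wants 30 + 1024 pages. The shared-predicate loop returns immediately.
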
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kswapd craziness in 3.7

2012-11-27 Thread Linus Torvalds
On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner  wrote:
> On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
>>
>> Kswapd going crazy is certainly a large part of the problem.
>>
>> However, that leaves the issue of page_alloc.c waking up
>> kswapd when the system is not actually low on memory.
>>
>> Instead, kswapd is woken up because memory compaction failed,
>> potentially even due to lock contention during compaction!
>>
>> Ideally the allocation code would only wake up kswapd if
>> memory needs to be freed, or in order for kswapd to do
>> memory compaction (so the allocator does not have to).
>
> Maybe I missed something, but shouldn't this be solved with my patch?

Ok, guys. Cage fight!

The rules are simple: two men enter, one man leaves.

And the one who comes out gets to explain to me which patch(es) I
should apply, and which I should revert, if any.

My current guess is that I should apply the one Johannes just sent
("mm: vmscan: fix kswapd endless loop on higher order allocation")
after having added the cc to stable to it, and then revert the recent
revert (commit 82b212f40059).

But I await the Thunderdome. 

  Linus


Re: linux-next: unusual update of the security tree

2012-11-27 Thread Linus Torvalds
On Tue, Nov 27, 2012 at 3:28 PM, Stephen Rothwell  wrote:
>
> If that is what happened, it may be worth always using the --no-ff flag
> to git merge/pull to make sure that the top commit on your tree always
> has you as the committer (and maybe SOB).
>
> Linus, does that make sense in general for maintainers?

No. That just hides the real problem - back-merges of random points in history.

Don't do them, people. EVER.

  Linus


Re: Acpi deadlocks with 3.7.0-rc4

2012-11-28 Thread Linus Torvalds
Adding more people (and the acpi list) to this report.

I'm seeing *very* few changes to the core suspend/resume path in 3.7,
and while there are some acpica updates, they seem to be pretty mild
too.

I think the acpi_os_wait_semaphore thing is a red herring - that's
just stale on the stack.

Do you have the register state from the oops? Or at least the "Code:"
line? It would be nice to see exactly where the oops happens, and I
cannot line up your "acpi_ns_lookup  + 0xa1/0x5b9" with any code due
to different compilers (and configurations etc).

   Linus


On Thu, Nov 15, 2012 at 8:09 AM, Zdenek Kabelac  wrote:
> Hello
>
>
> I've already seen this oops twice after resuming my Lenovo T61 in its
> docking station.
>
> Since for some reason the serial line currently doesn't work correctly
> after resume (while I'm pretty sure it used to work in the past), here
> is at least a hand-written oops message from a mobile camera picture.
>
> From the trace it seems the os_wait semaphore is accessed twice.
> Unsure which device is behind it - but it seems the docking station is
> needed to hit this issue.
>
>
> kernel 3.7.0-rc4
>
> Pid:  pm-suspend
>
> RIP:   acpi_ns_lookup  + 0xa1/0x5b9
>
> Call Trace:
>
> ? acpi_os_wait_semaphore + 0x136/0x149
> acpi_ns_get_node + 0x96/0x102
> ? __lock_is_held +0x5f/0x90
> acpi_ns_evaluate +0x47/0x2de
> ? _raw_spin_lock_irqsave
> ? acpi_ut_evaluate_object
> ? sub_preempt_count
> ? pnpacpi_can_wakeup
> acpi_rs_get_method_data
> ? acpi_os_signal_semaphore
> acpi_walk_resources
> ? acpi_ut_release_mutex
> pnpacpi_build_resource_template
> ? acpi_bus_get_device
> pnpacpi_set_resources
> ? pnp_device_shutdown
> pnp_start_dev
> pnp_bus_resume
> dpm_run_callback
> device_resume
> dpm_resume
> dpm_resume_end
> ? acpi_suspend_begin_old
> suspend_devices_and_enter
> pm_suspend
> state_store
> kobj_attr_store
> sysfs_write_file
> vfs_write
> sys_write
> system_call_fastpath
>
> Zdenek
>
>
> PS: jpg on request


Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 8:22 AM, Kay Sievers  wrote:
> On Wed, Nov 28, 2012 at 2:33 PM, Michael Kerrisk wrote:
>
>> On a 2.6.31 system, immediately after SYSLOG_ACTION_READ_CLEAR, a
>> SYSLOG_ACTION_SIZE_UNREAD returns 0.
>
> Hmm, sounds like the right thing to do.

Right.

And that's the *OLD* behavior (2.6.31).

>> On 3.5, immediately after SYSLOG_ACTION_READ_CLEAR, the value returned
>> by SYSLOG_ACTION_SIZE_UNREAD is unchanged

And this is the *NEW* behavior, and as you say:

> Which sounds at least like weird behaviour, if not "broken".

So the new behavior is insane and different. Let's fix it.

It looks like it is because the new SYSLOG_ACTION_SIZE_UNREAD code
does not take the new clear_seq code into account. Hmm?
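
The suspicion above can be made concrete with a toy model of a record log addressed by sequence numbers. This is a sketch under loud assumptions: only the name `clear_seq` comes from the mail, while `struct toy_log`, `next_seq`, `syslog_seq` and the helper functions are hypothetical. It counts records rather than bytes, and it is neither the real printk code nor the eventual fix.

```c
#include <assert.h>

/* Toy log state. Only "clear_seq" is a name taken from the thread. */
struct toy_log {
    unsigned long long next_seq;    /* sequence number of the next record */
    unsigned long long syslog_seq;  /* next record a plain read would return */
    unsigned long long clear_seq;   /* first record after the last clear */
};

/* Append some records to the log. */
void toy_log_append(struct toy_log *log, unsigned int records)
{
    log->next_seq += records;
}

/* Model of a read-and-clear: everything logged so far counts as cleared. */
void toy_read_clear(struct toy_log *log)
{
    log->clear_seq = log->next_seq;
}

/* "Unread" computed without looking at clear_seq (the suspected bug). */
unsigned long long unread_ignoring_clear(const struct toy_log *log)
{
    return log->next_seq - log->syslog_seq;
}

/* "Unread" that starts counting at whichever marker is further along. */
unsigned long long unread_honoring_clear(const struct toy_log *log)
{
    unsigned long long start = log->syslog_seq > log->clear_seq ?
                               log->syslog_seq : log->clear_seq;

    assert(start <= log->next_seq);
    return log->next_seq - start;
}
```

In this model, appending five records and then calling `toy_read_clear()` leaves `unread_ignoring_clear()` at 5 (the "unchanged" value reported for 3.5), while `unread_honoring_clear()` drops to 0 (the 2.6.31 behavior).
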

 Linus


Re: Acpi deadlocks with 3.7.0-rc4

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 8:21 AM, Zdenek Kabelac  wrote:
>
> I've opened  https://bugzilla.kernel.org/show_bug.cgi?id=51071
> and attached picture there which is all I have.
>
> I'll try to decode exact code line.

Uhhuh. It's missing much of the relevant parts of the code line, in
particular the actual oopsing instruction. But what is there decodes
to

  41 b8 10 00 00 00       mov    $0x10,%r8d
  48 c7 c1 88 52 64 81    mov    $0x81645288,%rcx
  31 c0                   xor    %eax,%eax
  48 c7 c2 98 52 64 81    mov    $0x81645298,%rdx
  bf 00 04 00 0.          mov    $0x0.00400,%edi

  .. oops in here ..

  74 33                   je     0x50
  48 89 df                mov    %rbx,%rdi
  e8 4d c9 00 00          callq  ?
  48 89 d9                mov    %rbx,%rcx
  48 c7 c2 0a .. .. ..    mov    $0x..0a,%rdx

which isn't really very obvious. Do you have that kernel around (or at
least the same compiler and configuration)? Doing a

  objdump --disassemble  drivers/acpi/acpica/nsaccess.o

might help pinpoint where that is..

> It's probably not a regression from 3.6 - since this problem was there for
> much longer - but now it has just become much more visible.

Ok.

   Linus


Re: Acpi deadlocks with 3.7.0-rc4

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 9:27 AM, Zdenek Kabelac  wrote:
>
> I've attached bigger disasfun script output to BZ 51071.
> https://bugzilla.kernel.org/show_bug.cgi?id=51071#c1
>
>
> if (ACPI_GET_DESCRIPTOR_TYPE(prefix_node) !=
> 00a1  cmpb   $0xf,0x8(%rbx)
> 00a5  je   0da  
>
> seems to be going out of bounds.

The whole "prefix_node" pointer is bogus. It seems to have the value 0x1000.

I wonder how that happened. It's loaded from 'scope_info->scope.node',
and it *should* be a valid pointer.

Can you add a print-out of

  scope_info->common.descriptor_type

and check that it is ACPI_DESC_TYPE_STATE_WSCOPE (== 8). If it is not,
return early.

Or just something like the attached, which just uses the root node
(and warns once) if it's not a valid WSCOPE thing.
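
The check suggested above might look roughly like the following user-space sketch. The attached patch itself is not reproduced in the archive, so this is not that patch: the `toy_*` types, `pick_prefix_node()` and the `warned` flag are stand-ins invented for illustration, and only the descriptor value 8 for a WSCOPE state is taken from the mail.

```c
#include <assert.h>
#include <stdio.h>

#define TOY_DESC_TYPE_STATE_WSCOPE 8   /* value quoted in the mail */

struct toy_node {
    const char *name;
};

/* Every state object starts with the same header (descriptor type). */
struct toy_common_state {
    unsigned char descriptor_type;
};

struct toy_scope_state {
    unsigned char descriptor_type;
    struct toy_node *node;
};

union toy_generic_state {
    struct toy_common_state common;
    struct toy_scope_state scope;
};

/*
 * Return the scope's node only if the state really is a WSCOPE state.
 * Otherwise fall back to the root node and warn once, via *warned.
 */
struct toy_node *pick_prefix_node(union toy_generic_state *scope_info,
                                  struct toy_node *root, int *warned)
{
    if (scope_info->common.descriptor_type != TOY_DESC_TYPE_STATE_WSCOPE) {
        if (!*warned) {
            fprintf(stderr, "bad scope descriptor type %u, using root\n",
                    (unsigned int)scope_info->common.descriptor_type);
            *warned = 1;
        }
        return root;
    }
    return scope_info->scope.node;
}
```

The point of the fallback is that a bogus `scope.node` (such as the 0x1000 value seen in the oops) is never dereferenced when the descriptor type shows the state is not what the caller expected.
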

   Linus


patch.diff
Description: Binary data


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
No, this is crap.

We don't introduce random hooks like this just because the block layer
has shit-for-brains and cannot be bothered to do things right.

The fact is, the whole locking in the block layer open routine is
total and utter crap. It doesn't lock the right thing, even with your
change *anyway* (or with the change Jens had). Absolutely nothing in
"mmap_region()" cares at all about the block-size anywhere - it's
generic, after all - so locking around it is f*cking pointless. There
is no way in hell that the caller of ->mmap can *ever* care about the
block size, since it never even looks at it.

Don't do random crap like this.

Why does the code think that mmap matters so much anyway? As you say,
the mmap itself does *nothing*. It has no impact for the block size.

 Linus

On Wed, Nov 28, 2012 at 9:25 AM, Mikulas Patocka  wrote:
>
>
> On Wed, 28 Nov 2012, Jens Axboe wrote:
>
>> On 2012-11-28 04:57, Mikulas Patocka wrote:
>> >
>> > This patch is wrong because you must check if the device is mapped while
>> > holding bdev->bd_block_size_semaphore (because
>> > bdev->bd_block_size_semaphore prevents new mappings from being created)
>>
>> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
>
> Hmm, it was wrong before the patch and it is wrong after the patch too.
>
> The problem is that ->mmap method doesn't do the actual mapping, the
> caller of ->mmap (mmap_region) does it. So we must actually catch
> mmap_region and protect it with the lock, not ->mmap.


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 11:43 AM, Al Viro  wrote:
> Have a
> private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
> added to it.

That sounds more reasonable.

However, I suspect the *most* reasonable thing to do is to just remove
the whole damn thing. We really shouldn't care about mmap. If somebody
does a mmap on a block device, and somebody else then changes the
block size, why-ever should we bother to go through any contortions at
*all* to make that kind of insane behavior do anything sane at all.

Just let people mmap things. Then just let the normal page cache
invalidation work right. In fact, it is entirely possible that we
could/should just not even invalidate the page cache at all, just make
sure that the buffer heads attached to any pages get disconnected. No?

  Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 11:50 AM, Mikulas Patocka  wrote:
>
> mmap_region() doesn't care about the block size. But a lot of
> page-in/page-out code does.

That seems a bogus argument.

mmap() is in *no* way special. The exact same thing happens for
regular read/write. Yet somehow the mmap code is special-cased, while
the normal read-write code is not.

I suspect it might be *easier* to trigger some issues with mmap, but
that still isn't a good enough reason to special-case it. We don't add
locking to one place just because that one place shows some race
condition more easily. We fix the locking.

So for example, maybe the code that *actually* cares about the buffer
size (the stuff that allocates buffers in fs/buffer.c) needs to take
that new percpu read lock. Basically, any caller of
"alloc_page_buffers()/create_empty_buffers()" or whatever.

I also wonder whether we need it *at*all*. I suspect that we could
easily have multiple block-sizes these days for the same block device.
It *used* to be (millions of years ago, when dinosaurs roamed the
earth) that the block buffers were global and shared with all users of
a partition. But that hasn't been true since we started using the page
cache, and I suspect that some of the block size changing issues are
simply entirely stale.

Yeah, yeah, there could be some coherency issues if people write to
the block device through different block sizes, but I think we have
those coherency issues anyway. The page-cache is not coherent across
different mapping inodes anyway.

So I really suspect that some of this is "legacy logic". Or at least
perhaps _should_ be.

Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds
 wrote:
>
> mmap() is in *no* way special. The exact same thing happens for
> regular read/write. Yet somehow the mmap code is special-cased, while
> the normal read-write code is not.

I just double-checked, because it's been a long time since I actually
looked at the code.

But yeah, block device read/write uses the pure page cache functions.
IOW, it has the *exact* same IO engine as mmap() would have.

So here's my suggestion:

 - get rid of *all* the locking in aio_read/write and the splice paths
 - get rid of all the stupid mmap games

 - instead, add them to the functions that actually use
"blkdev_get_block()" and "blkdev_get_blocks()" and nowhere else.

   That's a fairly limited number of functions:
blkdev_{read,write}page(), blkdev_direct_IO() and
blkdev_write_{begin,end}()

Doesn't that sound simpler? And more logical: it protects the actual
places that use the block size of the device.

I dunno. Maybe there is some fundamental reason why the above is
broken, but it seems to be a much simpler approach. Sure, you need to
guarantee that the people who get the write-lock cannot possibly cause
IO while holding it, but since the only reason to get the write lock
would be to change the block size, that should be pretty simple, no?

Yeah, yeah, I'm probably missing something fundamental, but the above
sounds like the simple approach to fixing things. Aiming for having
the block size read-lock be taken by the things that pass in the
block-size itself.
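
The lock placement proposed above can be sketched with a single-threaded toy model. Everything here is hypothetical (`struct toy_rwsem`, `toy_cached_read()`, `toy_readpage()` and the acquisition counter are not kernel interfaces). The sketch only shows where the read lock is taken, not how a real rwsem or percpu rwsem behaves under contention.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u

/* Single-threaded stand-in for a reader/writer semaphore. */
struct toy_rwsem {
    int readers;
    int writer;
    unsigned long read_acquisitions;   /* for illustration only */
};

void toy_down_read(struct toy_rwsem *sem)
{
    assert(!sem->writer);              /* a real lock would block here */
    sem->readers++;
    sem->read_acquisitions++;
}

void toy_up_read(struct toy_rwsem *sem)
{
    assert(sem->readers > 0);
    sem->readers--;
}

struct toy_bdev {
    struct toy_rwsem block_size_sem;
    unsigned int block_size;
};

/* Page-cache hit: the block size is never consulted, so no lock. */
int toy_cached_read(const struct toy_bdev *bdev, const char *cached_page,
                    unsigned int offset)
{
    (void)bdev;
    return cached_page[offset];
}

/*
 * Per-page I/O setup: this is where the block size is actually used, so
 * this is where the read lock is held.
 */
unsigned long toy_readpage(struct toy_bdev *bdev, unsigned long page_index)
{
    unsigned long first_block;

    toy_down_read(&bdev->block_size_sem);
    first_block = page_index * (TOY_PAGE_SIZE / bdev->block_size);
    toy_up_read(&bdev->block_size_sem);
    return first_block;
}
```

In this model a cached read never touches the semaphore, while each page-level operation takes and releases it exactly once around its use of `block_size`.
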

It would be nice for things to be logical and straightforward.

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 12:13 PM, Linus Torvalds
 wrote:
>
> I dunno. Maybe there is some fundamental reason why the above is
> broken, but it seems to be a much simpler approach. Sure, you need to
> guarantee that the people who get the write-lock cannot possibly cause
> IO while holding it, but since the only reason to get the write lock
> would be to change the block size, that should be pretty simple, no?

Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
do unspeakable things to your family and pets.

  Linus


patch.diff
Description: Binary data


Re: [git pull] drm fixes

2012-11-28 Thread Linus Torvalds
[ Hmm. For some reason this seems to have never gone out, and was in
my drafts folder. If you get it twice, my bad ]

On Thu, Nov 22, 2012 at 12:57 AM, Dave Airlie  wrote:
>
> Doh!, yes I picked wrong place to generate report from, okay here is
> one corresponding to what you saw,

You should never even need to "pick" any place to generate the report from.

Just do something like

   git fetch upstream

(where "upstream" is a branch description for the upstream repository
- see "man git-remote" etc, although you can obviously always just
type out the whole repo details etc in full if you would want to).
Note the "fetch" - not pull - you just want to get it, not merge it.

Then you can just point git request-pull at the upstream, and git will
figure out what the latest common point is. No need for you to
manually try to figure it out.

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds
 wrote:
>
> Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
> do unspeakable things to your family and pets.

Btw, *if* this approach works, I suspect we could just switch the
bd_block_size_semaphore semaphore to be a regular rw-sem.

Why? Because now it's no longer ever gotten in the cached IO paths, we
only get it when we're doing much more expensive things (ie actual IO,
and buffer head allocations etc etc). As long as we just work with the
page cache, we never get to the whole lock at all.

Which means that the whole percpu-optimized thing is likely no longer
all that relevant.

But that's an independent thing, and it's only true *if* my patch
works. It looks fine on paper, but maybe there's something
fundamentally broken about it.

One big change my patch does is to move the sync_bdev/kill_bdev to
*after* changing the block size. It does that so that it can guarantee
that any old data (which didn't see the new block size) will be
sync'ed even if there is new IO coming in as we change the block size.

The old code locked the whole sync() region, which doesn't work with
my approach, since the sync will do IO and would thus cause potential
deadlocks while holding the rwsem for writing.
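
The ordering argument above (change the size under the write lock, sync only after dropping it) can be shown with a toy model. All names are hypothetical and the "deadlock" is represented by a -1 return instead of an actual hang. It is a sketch of the reasoning, not of the untested patch.

```c
#include <assert.h>

/* Single-threaded stand-in for a reader/writer semaphore. */
struct toy_rwsem {
    int readers;
    int writer;
};

struct toy_bdev {
    struct toy_rwsem block_size_sem;
    unsigned int block_size;
    int synced;
};

/*
 * Syncing does I/O, and I/O takes the read lock. If a writer holds the
 * lock, a real implementation would block forever; the toy returns -1.
 */
int toy_sync(struct toy_bdev *bdev)
{
    if (bdev->block_size_sem.writer)
        return -1;
    bdev->block_size_sem.readers++;
    bdev->synced = 1;
    bdev->block_size_sem.readers--;
    return 0;
}

/*
 * Change the block size. With sync_under_lock set, the sync runs while
 * the write lock is held (and "deadlocks"); otherwise it runs after the
 * write lock has been dropped.
 */
int toy_set_blocksize(struct toy_bdev *bdev, unsigned int size,
                      int sync_under_lock)
{
    int ret = 0;

    assert(bdev->block_size_sem.readers == 0 && !bdev->block_size_sem.writer);
    bdev->block_size_sem.writer = 1;    /* take the lock for writing */
    bdev->block_size = size;
    if (sync_under_lock)
        ret = toy_sync(bdev);
    bdev->block_size_sem.writer = 0;    /* drop the write lock */
    if (!sync_under_lock)
        ret = toy_sync(bdev);
    return ret;
}
```

Calling `toy_set_blocksize()` with `sync_under_lock` set reports the would-be deadlock, while the reordered variant completes and leaves the device synced with the new block size.
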

So with this patch, as the block size changes, you can actually have
some old pages with the old block size *and* some different new pages
with the new  block size all at the same time. It should all be
perfectly fine, but it's worth pointing out.

(It probably won't trigger in practice, though, since doing IO while
somebody else is changing the blocksize is fundamentally an odd thing
to do, but whatever. I also suspect that we *should* perhaps use the
inode->i_sem thing to serialize concurrent block size changes, but
that's again an independent issue)

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 1:29 PM, Mikulas Patocka  wrote:
>
> The problem with this approach is that it is very easy to miss points
> where it is assumed that the block size doesn't change - and if you miss a
> point, it results in a hidden bug that has little chance of being
> found.

Umm. Mikulas, *your* approach has resulted in bugs. So let's not throw
stones in glass houses, shall we?

The whole reason for this long thread (and several threads before it)
is that your model isn't working and is causing problems. I already
pointed out how bogus your arguments about mmap() locking were, and
then you have the gall to talk about potential bugs, when I have
pointed you to *actual* bugs, and actual mistakes.

> For example, __block_write_full_page and __block_write_begin do
>     if (!page_has_buffers(page)) { create_empty_buffers... }
> and then they do
>     WARN_ON(bh->b_size != blocksize)
>     err = get_block(inode, block, bh, 1)

Right. And none of this is new.

> ... so if the buffers were left over from some previous call to
> create_empty_buffers with a different blocksize, that WARN_ON is triggered.

None of this can happen.

> Locking the whole read/write/mmap operations is crude, but at least it can
> be done without thorough review of all the memory management code.

Umm. Which you clearly didn't do, and raised totally crap arguments for.

In contrast, I have a very simple argument for the correctness of my
patch: every single user of the "get_block[s]()" interface now takes
the lock for as long as get_block[s]() is passed off to somebody else.
And since get_block[s]() is the only way to create those empty
buffers, I think I pretty much proved exactly what you ask for.

And THAT is the whole point and advantage of making locking sane. Sane
locking you can actually *think* about!

In contrast, locking around "mmap()" is absolutely *guaranteed* to be
insane, because mmap() doesn't actually do any of the IO that the lock
is supposed to protect against!

So Mikulas, quite frankly, your arguments argue against you. When you
say "Locking the whole read/write/mmap operations is crude, but at
least it can be done without thorough", you are doubly correct: it
*is* crude, and it clearly *was* done without thought, since it's a
f*cking idiotic AND INCORRECT thing to do.

Seriously. Locking around "mmap()" is insane. It leads to insane
semantics (the whole EBUSY thing is purely because of that problem)
and it leads to bad code (your "let's add a new mmap_region hook" is
just disgusting, and while Al's idea of doing it in the existing
"->open" method is at least not nasty, it's definitely extra code and
complexity).

There are serious *CORRECTNESS* advantages to simplicity and
directness. And locking at the right point is definitely very much
part of that.

Anyway, as far as block size goes, we have exactly two cases:

 - random IO that does not care about the block size, and will just do
whatever the current block size is (ie normal anonymous accesses to
the block device).  This is the case that needs the locking - but it
only needs it around the individual page operations, ie exactly where
I put it. In fact, they can happily deal with different block sizes
for different pages, they don't really care.

 - mounted filesystems etc that require a particular block size and
set it at mount time, and they have exclusivity rules

The second case is the case that actually calls set_blocksize(), and
if "kill_bdev()" doesn't get rid of the old blocksizes, then they have
always been in trouble, and would always _continue_ to be in trouble,
regardless of locking.

Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 2:52 PM, Linus Torvalds
 wrote:
>
>> For example, __block_write_full_page and __block_write_begin do
>>     if (!page_has_buffers(page)) { create_empty_buffers... }
>> and then they do
>>     WARN_ON(bh->b_size != blocksize)
>>     err = get_block(inode, block, bh, 1)
>
> Right. And none of this is new.

.. which, btw, is not to say that *other* things aren't new. They are.

The change to actually change the block device buffer size before then
calling "sync_bdev()" is definitely a real change, and as mentioned, I
have not tested the patch in any way. If any block device driver were
to actually compare the IO size they get against the bdev->block_size
thing, they'd see very different behavior (ie they'd see the new block
size as they are asked to write out the old blocks with the old block
size).

So it does change semantics, no question about that. I don't think any
block device does it, though.

A bigger issue is for things that emulate what blkdev.c does, and
don't do the locking. I see code in md/bitmap.c that seems a bit
suspicious, for example. That said, it's not *new* breakage, and the
"lock at mmap/read/write() time" approach doesn't fix it either (since
the mapping will be different for the underlying MD device). So I do
think that we should take a look at all the users of
"alloc_page_buffers()" and "create_empty_buffers()" to see what *they*
do to protect the block-size, but I think that's an independent issue
from the raw device access case in fs/block_dev.c..

I guess I have to actually test my patch. I don't have very
interesting test-cases, though.

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
[ Sorry, I was offline for a while driving kids around ]

On Wed, Nov 28, 2012 at 4:38 PM, Mikulas Patocka  wrote:
>
> It can happen. Take your patch (the one that moves bd_block_size_semaphore
> into blkdev_readpage, blkdev_writepage and blkdev_write_begin).

Interesting. The code *has* the block size (it's in "bh->b_size"), but
it actually then uses the inode blocksize instead, and verifies the
two against each other. It could just have used the block size
directly (and then used the inode i_blkbits only when no buffers
existed), avoiding that dependency entirely..

It actually does the same thing (with the same verification) in
__block_write_full_page() and (_without_ the verification) in
__block_commit_write().

Ho humm. All of those places actually do hold the rwsem for reading,
it's just that I don't want to hold it for writing over the sync..

Need to think on this,

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 6:04 PM, Linus Torvalds
 wrote:
>
> Interesting. The code *has* the block size (it's in "bh->b_size"), but
> it actually then uses the inode blocksize instead, and verifies the
> two against each other. It could just have used the block size
> directly (and then used the inode i_blkbits only when no buffers
> existed), avoiding that dependency entirely..

Looking more at this code, that really would be the nicest solution.

There's two cases for the whole get_block() thing:

 - filesystems. The block size will not change randomly, and
"get_block()" seriously depends on the block size.

 - the raw device. The block size *will* change, but to simplify the
problem, "get_block()" is a 1:1 mapping, so it doesn't even care about
the block size because it will always return "bh->b_blocknr = nr".

So we *could* just say that all the fs/buffer.c code should use
"inode->i_blkbits" for creating buffers (because that's the size new
buffers should always use), but use "bh->b_size" for any *existing*
buffer use.

And looking at it, it's even simple. Except for one *very* annoying
thing: several users really don't want the size of the buffer, they
really do want the *shift* of the buffer size.

In fact, that single issue seems to be the reason why
"inode->i_blkbits" is really used in fs/buffer.c.

Otherwise it would be fairly trivial to just make the pattern be just a simple

    if (!page_has_buffers(page))
        create_empty_buffers(page, 1 << inode->i_blkbits, 0);
    head = page_buffers(page);
    blocksize = head->b_size;

and just use the blocksize that way, without any other games. All
done, no silly WARN_ON() to verify against some global block-size, and
the fs/buffer.c code would be perfectly simple, and would have no
problem at all with multiple different blocksizes in different pages
(the page lock serializes the buffers and thus the blocksize at the
per-page level).

But the fact that the code wants to do things like

block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);

seriously seems to be the main thing that keeps us using
'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
enough to hurt (not everywhere, but on some machines).
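
To make the shift problem concrete, here is a small user-space sketch of deriving the shift from a buffer size and then computing a page's first block number. `TOY_PAGE_SHIFT`, `toy_ilog2()` and `first_block_of_page()` are hypothetical names. The loop is a portable stand-in for an integer log2 and says nothing about how expensive the real operation is on any given machine.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumes 4096-byte pages for the example */

/* Integer log2 of a power of two, computed the slow and portable way. */
unsigned int toy_ilog2(uint32_t value)
{
    unsigned int bits = 0;

    assert(value != 0 && (value & (value - 1)) == 0);
    while (value >>= 1)
        bits++;
    return bits;
}

/*
 * First block number covered by a page, derived from the buffer size
 * found on the page instead of from a per-inode shift.
 */
uint64_t first_block_of_page(uint64_t page_index, uint32_t buffer_size)
{
    unsigned int bbits = toy_ilog2(buffer_size);

    assert(bbits <= TOY_PAGE_SHIFT);
    return page_index << (TOY_PAGE_SHIFT - bbits);
}
```

For page index 3 with 1024-byte buffers, `bbits` is 10 and the first block is 3 << 2 = 12. With 4096-byte buffers the shift is zero and the block number equals the page index.
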

Very annoying.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
 wrote:
>
> But the fact that the code wants to do things like
>
> block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
>
> seriously seems to be the main thing that keeps us using
> 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> enough to hurt (not everywhere, but on some machines).
>
> Very annoying.

Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
whole ilog2 thing, but at the same time, in other cases it actually
seems to improve code generation (ie gets rid of the whole unnecessary
two dereferences through page->mapping->host just to get the block
size, when we have it in the buffer-head that we have to touch
*anyway*).

Comments? Again, untested.

And I notice that Al Viro hasn't been cc'd, which is sad, since he's
been involved in much of fs/block_dev.c.

Al - this is an independent patch to fs/buffer.c to make
fs/block_dev.c able to change the block size of a block device while
there is IO in progress that may still use the old block size. The
discussion has been on fsdevel and lkml, but you may have missed it...

Linus


patch.diff
Description: Binary data


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 10:25 PM, Al Viro  wrote:
>
> Umm...  set_blocksize() is calling kill_bdev(), which does
> truncate_inode_pages(mapping, 0).  What's going to happen to data in
> the dirty pages?  IO in progress is not the only thing to worry about...

Hmm. Yes. I think it works by virtue of "if you change the blocksize
while there is active IO, you're insane and you deserve whatever you
get".

It shouldn't even be fundamentally hard to make it work, although I
suspect it would be more code than it would be worth. The sane model
would be to not use truncate_inode_pages(), but instead just walk the
pages and get rid of the buffer heads with the wrong size. Preferably
*combining* that with the sync_blockdev().

We have no real reason to even invalidate the page cache, it's just
the buffers we want to get rid of.
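
As a rough illustration of "drop the buffers, keep the pages", the
sketch below walks a hypothetical array of cached pages and detaches
only the buffers whose size no longer matches. The struct and function
names are invented, and it deliberately ignores dirty state, writeback
and locking, which the real thing could not:

```c
#include <stddef.h>

/* Hypothetical model of a cached page; not a kernel structure. */
struct cached_page_sketch {
	int has_buffers;
	size_t buf_size;	/* meaningful only if has_buffers is set */
	int uptodate;		/* page contents stay valid either way */
};

/* Drop only the buffers whose size no longer matches; keep the pages. */
static size_t drop_stale_buffers(struct cached_page_sketch *pages, size_t n,
				 size_t new_size)
{
	size_t dropped = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].has_buffers && pages[i].buf_size != new_size) {
			pages[i].has_buffers = 0;
			pages[i].buf_size = 0;
			dropped++;
		}
	}
	return dropped;
}
```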

But I suspect it's true that none of that is really *worth* it,
considering that nobody likely wants to do any concurrent IO. We don't
want to crash, or corrupt the data structures, but I suspect "you get
what you deserve" might actually be the right model ;)

So the current "sync_blockdev()+kill_bdev()" takes care of the *sane*
case (we flush any data that happened *before* the block size change),
and any concurrent writes with block-size changes are "good luck with
that".

  Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 10:30 PM, Al Viro  wrote:
>
> Note that sync_blockdev() a few lines prior to that is good only if we
> have no other processes doing write(2) (or dirtying the mmapped pages,
> for that matter).  The window isn't too wide, but...

So with Mikulas' patches, the write actually would block (at write
level) due to the locking. The mmap'ed pages may be around and
flushed, but the logic to not allow currently *active* mmaps (with the
rather nasty random -EBUSY return value) should mean that there is no
race.

Or rather, there's a race, but it results in that EBUSY thing.

With my simplified locking, the sync_blockdev() is right before (not a
few lines prior) to the kill_bdev(), and in a perfect world they'd
actually be one single operation ("write back and invalidate pages
with the wrong block-size"). But they aren't.

Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 2:01 PM, Mikulas Patocka  wrote:
>
> This sounds sensible. I'm sending this patch.

This looks much better.

I think I'll apply this for 3.7 (since it's too late to do anything
fancier), and then for 3.8 I will rip out all the locking entirely,
because looking at the fs/buffer.c patch I wrote up, it's all totally
unnecessary.

Adding an ACCESS_ONCE() to the read of the i_blkbits value (when
creating new buffers) simply makes the whole locking thing pointless.
Just make the page lock protect the block size, and make it per-page,
and we're done.

No RCU grace period crap, no expedited mess, no nothing.
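
A userspace approximation of the idea, assuming GCC/Clang for
__typeof__: read the shared value exactly once into a local and derive
everything else from that local. ACCESS_ONCE_SKETCH mirrors the shape
of the kernel macro, while inode_sketch, blk_geom_sketch and
snapshot_blocksize are hypothetical names used only here:

```c
#include <stddef.h>

/* Userspace approximation of the kernel's ACCESS_ONCE(): force one load. */
#define ACCESS_ONCE_SKETCH(x) (*(volatile __typeof__(x) *)&(x))

struct inode_sketch {
	unsigned int i_blkbits;	/* may be changed by another thread */
};

struct blk_geom_sketch {
	size_t size;		/* block size in bytes */
	size_t mask;		/* size - 1 */
};

/* Derive size and mask from a single read, so they always agree. */
static struct blk_geom_sketch snapshot_blocksize(struct inode_sketch *inode)
{
	unsigned int bits = ACCESS_ONCE_SKETCH(inode->i_blkbits);
	struct blk_geom_sketch g;

	g.size = (size_t)1 << bits;
	g.mask = g.size - 1;
	return g;
}
```

It does not matter which value the single read observes, only that
size and mask are computed from the same one.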

  Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason  wrote:
>
> Jumping in based on Linus original patch, which is doing something like
> this:
>
> set_blocksize() {
>         block new calls to writepage, prepare/commit_write
>         set the block size
>         unblock
>
> < --- can race in here and find bad buffers --->
>
>         sync_blockdev()
>         kill_bdev()
>
> < --- now we're safe --- >
> }
>
> We could add a second semaphore and a page_mkwrite call:

Yeah, we could be fancy, but the more I think about it, the less I can
say I care.

After all, the only things that do the whole set_blocksize() thing should be:

 - filesystems at mount-time

 - things like loop/md at block device init time.

and quite frankly, if there are any *concurrent* writes with either of
the above, I really *really* don't think we should care. I mean,
seriously.

So the _only_ real reason for the locking in the first place is to
make sure of internal kernel consistency. We do not want to oops or
corrupt memory if people do odd things. But we really *really* don't
care if somebody writes to a partition at the same time as somebody
else mounts it. Not enough to do extra work to please insane people.

It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
The whole sequence always used to be unlocked. The locking is entirely
new. There are certainly no legacy users that can possibly rely on
"I did writes at the same time as the mount with no serialization, and
it worked". It never has worked.

So I think this is a case of "perfect is the enemy of good".
Especially since I think that with the fs/buffer.c approach, we don't
actually need any locking at all at higher levels.

 Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 9:51 AM, Chris Mason  wrote:
>
> The bigger question is do we have users that expect to be able to set
> the blocksize after mmaping the block device (no writes required)?  I
> actually feel a little bad for taking up internet bandwidth asking, but
> it is a change in behaviour.

Yeah, it is. That said, I don't think people will really notice.
Nobody mmap's block devices outside of some databases, afaik, and
nobody sane mounts a partition at the same time a DB is using it. So I
think the new EBUSY check is *ugly*, but I don't realistically believe
that it is a problem.  The ugliness of the locking is why I'm not a
huge fan of it, but if it works I can live with it.

But yes, the mmap tests are new with the locking, and could in theory
be problematic if somebody reports that it breaks anything.

And like the locking, they'd just go away if we just do the
fs/buffer.c approach instead. Because doing things in fs/buffer.c
simply means that we don't care (and serialization is provided by the
page lock on a per-page basis, which is what mmap relies on anyway).

So doing the per-page fs/buffer.c approach (along with the
"ACCESS_ONCE()" on inode->i_blkbits to make sure we get *one*
consistent value, even if we don't care *which* value it is) would
basically revert to all the old semantics. The only thing it would
change is that we wouldn't see oopses.

(And in theory, it would allow us to actively mix-and-match different
block sizes for a block device, but realistically I don't think there
are any actual users of that - although I could imagine that a
filesystem would use a smaller block size for file tail-blocks etc,
and still want to use the fs/buffer.c code, so it's *possible* that it
would be useful, but filesystems have been able to do things like that
by just doing their buffers by hand anyway, so it's not really
fundamentally new, just a possible generalization of code)

Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 10:23 AM, Mikulas Patocka  wrote:
>
>
> If you remove that percpu rw lock, you also need to rewrite direct i/o
> code.
>
> In theory, block device direct i/o doesn't need buffer block size at all.
> But in practice, it shares a lot of code with filesystem direct i/o, it
> reads the block size multiple times and it crashes if it changes.

If it's a filesystem, then the size will never change while it is mounted.

So only the direct-block-device case needs to be worried about, no?

And that uses __generic_file_aio_write() and friends, which in turn
use the readpage/writepage functions.

So for block devices, it should be sufficient to make
readpage/writepage (with the writing obviously having all the
"write_begin/write_end/fullpage" variants) be safe as far as I can
see.

 Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds
 wrote:
>
> I think I'll apply this for 3.7 (since it's too late to do anything
> fancier), and then for 3.8 I will rip out all the locking entirely,
> because looking at the fs/buffer.c patch I wrote up, it's all totally
> unnecessary.
>
> Adding an ACCESS_ONCE() to the read of the i_blkbits value (when
> creating new buffers) simply makes the whole locking thing pointless.
> Just make the page lock protect the block size, and make it per-page,
> and we're done.

There's a 'block-dev' branch in my git tree, if you guys want to play
around with it.

It actually reverts fs/block_dev.c back to the 3.6 state (except for
some whitespace damage that I refused to re-introduce), so that part
of the changes should be pretty safe and well tested.

The fs/buffer.c changes, of course, are new. It's largely the same
patch I already sent out, with a small helper function to simplify it,
and to keep the whole ACCESS_ONCE() thing in just a single place.

That branch may be re-based in case I get reports or acks or whatever,
so don't rely on it (or if you do, please let me know, and I'll stop
editing it).

The fact that I could just revert the fs/block_dev.c part to a known
state makes me wonder if this might be safe for 3.7 after all (the
fs/buffer.c changes all *look* safe). That way we'd not have to worry
about any new semantics (whether they be EBUSY or any possible locking
slowdowns or RT issues). But I'll think about it, and it would be good
for people to double-check my fs/buffer.c stuff.

Mikulas, does that pass your testing?

   Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason  wrote:
>
> The fs/buffer.c part makes sense during a quick read.  But
> fs/direct-io.c plays with i_blkbits too.  The semaphore was fixing real
> bugs there.

Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be
incestuously playing games that should be in fs/buffer.c. I thought it
was doing the sane thing with the page cache.

(I now realize that Mikulas was talking about this mess, while I
thought he was talking about the AIO code which is largely sane).

Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 11:26 AM, Linus Torvalds
 wrote:
>
> (I now realize that Mikulas was talking about this mess, while I
> thought he was talking about the AIO code which is largely sane).

Oh wow.

The direct-IO code really doesn't seem to care at all. I don't think
it needs locking either (it seems to do everything with a private
buffer-head), and the problem appears solely to be that it reads
i_blkbits multiple times, so changing it just happens to confuse the
direct-io code.

If it were to read it only once, and then use that value, it looks
like it should all JustWork(tm).

And the right thing to do would seem to just add it to the
"dio_submit" structure, that we already have. And it already *has* a
blkbits field, but that's the "IO blocksize", not the "getblocks
blocksize", if I read that mess correctly.

Of course, it then *ALREADY* has that "blkfactor" thing, which is the
difference between i_blkbits and blkbits, so it effectively *does* have
i_blkbits already in the dio_submit structure. But despite it all, it
keeps re-reading i_blkbits.
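
A small sketch of that observation, using a hypothetical cut-down
struct (dio_submit_sketch is not the real struct dio_submit, and the
helper names are invented). If blkfactor is fixed once at setup as the
difference of the two shifts, the filesystem-side shift can be
recovered later without touching the inode again:

```c
/* Hypothetical cut-down version of the direct-IO submit state. */
struct dio_submit_sketch {
	unsigned int blkbits;	/* log2 of the IO block size */
	unsigned int blkfactor;	/* i_blkbits - blkbits, fixed at setup */
};

/* Read i_blkbits once, at setup time, and record it via blkfactor. */
static void dio_setup_sketch(struct dio_submit_sketch *sdio,
			     unsigned int i_blkbits, unsigned int io_blkbits)
{
	sdio->blkbits = io_blkbits;
	sdio->blkfactor = i_blkbits - io_blkbits;
}

/* Recover the value seen at setup without re-reading the inode. */
static unsigned int dio_fs_blkbits(const struct dio_submit_sketch *sdio)
{
	return sdio->blkbits + sdio->blkfactor;
}
```

The sketch assumes i_blkbits >= io_blkbits, so the subtraction cannot
wrap.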

Christ. That code is a mess.

  Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason  wrote:
>
> blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I
> really don't get why that isn't byte based instead.  DIO is already
> doing the shift & mask game.

The blkdev_get_blocks() thing is just sad.

The underlying data structure is actually byte-based (it's
"i_size_read(bdev->bd_inode)"), but we convert it to a block-based
number.

Oops.

   Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 11:55 AM, Linus Torvalds
 wrote:
>
> The blkdev_get_blocks() thing is just sad.
>
> The underlying data structure is actually byte-based (it's
> "i_size_read(bdev->bd_inode)"), but we convert it to a block-based
> number.
>
> Oops.

Oh, it's even worse than that. The DIO code ends up passing in buffer
heads that have sizes bigger than the inode block size, which can cause
problems at the end of the disk. So blkdev_get_blocks() knows about
it, and will then "fix" that and shrink them down. The games with
"max_block" are hilarious.

In a really sad way.

That whole blkdev_get_blocks() function is pure and utter shit.

  Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason  wrote:
>
> It was all a trick to get you to say the AIO code was sane.

It's only sane compared to the DIO code.

That said, I hate AIO much less these days, now that we've largely merged
the code with the regular IO. It's still a horrible interface, but at
least it is no longer a really disgusting separate implementation in
the kernel of that horrible interface.

So yeah, I guess AIO really is pretty sane these days.

> It looks like we could use the private copy of i_blkbits that DIO is
> already recording.

Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out.

I've pushed out two more commits to the 'block-dev' branch at

  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev

in case anybody wants to take a look.

It is - as usual - entirely untested. It compiles, and I *think* that
blkdev_get_blocks() makes a whole lot more sense this way - as you
said, it should be byte-based (although it actually does the block
number conversion because I worried about overflow - probably
unnecessarily).
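
A hedged sketch of what such an end-of-device check could look like in
isolation. clamp_mapping_sketch and its parameters are invented for
illustration and this is not the code in the branch; it just shows the
comparison being done in block units so that shifting the starting
block up to bytes cannot overflow:

```c
#include <stdint.h>

/*
 * Clamp a mapping request of `len` bytes starting at block `iblock`
 * so that it does not run past a device of `dev_bytes` bytes.
 * Returns the number of bytes that may be mapped (0 if past the end).
 */
static uint64_t clamp_mapping_sketch(uint64_t dev_bytes, uint64_t iblock,
				     unsigned int blkbits, uint64_t len)
{
	uint64_t end_block = dev_bytes >> blkbits;	/* whole blocks on device */
	uint64_t avail;

	if (iblock >= end_block)
		return 0;
	avail = (end_block - iblock) << blkbits;	/* never exceeds dev_bytes */
	return len < avail ? len : avail;
}
```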

Comments?

  Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason  wrote:
>
> Just reading the new blkdev_get_blocks, it looks like we're mixing
> shifts.  In direct-io.c map_bh->b_size is how much we'd like to map, and
> it has no relation at all to the actual block size of the device.  The
> interface is abusing b_size to ask for as large a mapping as possible.

Ugh. That's a big violation of how buffer-heads are supposed to work:
the block number is very much defined to be in multiples of b_size
(see for example "submit_bh()" that turns it into a sector number).
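
For illustration, the conversion amounts to something like the sketch
below (bh_sector_sketch is a made-up name; 512-byte sectors are
assumed). The same block number lands on a different sector as soon as
b_size is something other than the real block size:

```c
#include <stdint.h>

/* 512-byte sector number: block number times sectors per block. */
static uint64_t bh_sector_sketch(uint64_t b_blocknr, uint32_t b_size)
{
	return b_blocknr * (b_size >> 9);
}
```

Block 10 with a 4096-byte buffer maps to sector 80, while block 10 with
a 512-byte buffer maps to sector 10, so a caller that stuffs an
arbitrary length into b_size breaks the convention.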

But you're right. The direct-IO code really *is* violating that, and
knows that get_block() ends up being defined in i_blkbits regardless
of b_size.

What a crock. That direct-IO code is hack-upon-hack. Whoever wrote it
should be shot.

I think the only sane way to fix it is to pass in the block size to
get_blocks(). Which we admittedly should have done long ago, so that's
not a bad fix, but without actually looking at what it involves, I
think it's going to be a pretty big patch. All the filesystems that
support the interface need to update it, even if they can then ignore
it, because direct-IO does all these hacks only for the raw device.

And I think it will improve the interface, but damn, direct-IO is
still horrible for playing these kinds of games.

Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds
 wrote:
>
> But you're right. The direct-IO code really *is* violating that, and
> knows that get_block() ends up being defined in i_blkbits regardless
> of b_size.

It turns out fs/ioctl.c does the same - it fills in the buffer head
with some random bh->b_size too. I think it's not even a power of two
in that case.

And I guess it's understandable - they don't actually *use* the
buffer, they just want the offset. So the b_size field really is just
random crap to the users of the get_block interfaces, since they've
never cared before.

Ugh, this was definitely a dark and disgusting underbelly of the VFS
layer. We've not had to really touch it for a *looong* time..

   Linus


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason  wrote:
>
> I searched through filemap.c for the magic i_size check that would let
> us get away with ignoring i_blkbits in get_blocks, but its just not
> there.  The whole fallback-to-buffered scheme seems to rely on
> get_blocks checking for i_size.  I really hope I'm just missing
> something.

So generic_write_checks() limits the size to i_size for writes (and
for "isblk").

Sure, then it will do the buffered part after that, but that should
all be fine anyway, since by then we use the normal page cache.

For reads, generic_file_aio_read() will check pos < size, but doesn't
seem to actually limit the size of the iovec.

I'm not sure why it doesn't just do "iov_shorten()".
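
The kind of clamp meant here is simple enough to sketch in userspace.
clamp_count_sketch is a hypothetical helper, not a kernel function; it
only computes how many bytes may be transferred at a given position
without running past the end:

```c
#include <stdint.h>

/* Bytes that may be transferred at `pos` without passing `size`. */
static uint64_t clamp_count_sketch(uint64_t pos, uint64_t count, uint64_t size)
{
	if (pos >= size)
		return 0;
	if (count > size - pos)
		count = size - pos;
	return count;
}
```

Shortening the iovec to that count up front would mean nothing further
down has to re-check against i_size.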

Anyway, having looked at actually passing in the block size to
get_block(), I can say that is a horrible idea. There are tons of
get_block functions (for various filesystems), and *none* of them
really want the block size, because they tend to work on block
indexes. And if they do want the block size, they'll just get it from
the inode or sb, since they are filesystems and it's all stable.

So the *only* place that would want the block size is
fs/block_dev.c. And the callers really already seem to do the i_size
check, although they sometimes do it badly. And since there are fewer
callers than there are get_block() implementations, I think we should
just fix the callers and be done with it.

Those generic_file_aio_read/write() functions in fs/direct-io.c really
just seem to be badly written. The fact that they may depend on the
i_size check in get_blocks() is sad, but I think we should fix it and
just remove the check for block devices. That's going to simplify so
much..

I updated the 'block-dev' branch to have that simpler fs/block_dev.c
model instead. I'll look at the iovec shortening later. It's a
non-fast-forward thing, look out!

(I actually think we should just add the max-offset check to
rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing,
and we could make it do a max_offset check as well).

  Linus


Linux 3.9-rc3

2013-03-17 Thread Linus Torvalds
"

Josef Bacik (1):
  Btrfs: return EIO if we have extent tree corruption

Josh Boyer (1):
  serial: 8250: Keep 8250. module options functional after driver rename

Junwei Zhang (1):
  afkey: fix a typo

Kamal Mostafa (1):
  Input: cypress_ps2 - fix trackpadi found in Dell XPS12

Kees Cook (2):
  final removal of CONFIG_EXPERIMENTAL
  signal: always clear sa_restorer on execve

Kevin Cernekee (1):
  Input: ALPS - remove unused argument to alps_enter_command_mode()

Kishon Vijay Abraham I (1):
  usb: gadget: make usb functions to load before gadget driver

Konrad Rzeszutek Wilk (2):
  xen/pciback: Don't disable a PCI device that is already disabled.
  acpi: Export the acpi_processor_get_performance_info

Konstantin Khlebnikov (3):
  e1000e: fix pci-device enable-counter balance
  e1000e: fix runtime power management transitions
  e1000e: fix accessing to suspended device

Kumar Amit Mehta (3):
  staging: comedi: drivers: usbdux.c: fix DMA buffers on stack
  staging: comedi: drivers: usbduxfast.c: fix for DMA buffers on stack
  staging: comedi: drivers: usbduxsigma.c: fix DMA buffers on stack

Lars-Peter Clausen (4):
  iio:ad5064: Fix address of the second channel for ad5065/ad5045/ad5025
  iio:ad5064: Fix off by one in DAC value range check
  iio:ad5064: Initialize register cache correctly
  ext3: Fix format string issues

Laxman Dewangan (1):
  mfd: palmas: Provide irq flags through DT/platform data

Ley Foon Tan (1):
  tty/serial: Add support for Altera serial port

Li Zefan (1):
  s390: Fix a header dependencies related build error

Linus Torvalds (2):
  perf,x86: fix wrmsr_on_cpu() warning on suspend/resume
  Linux 3.9-rc3

Liu Bo (4):
  Btrfs: get better concurrency for snapshot-aware defrag work
  Btrfs: remove btrfs_try_spin_lock
  Btrfs: fix warning when creating snapshots
  Btrfs: fix warning of free_extent_map

Liu Jinsong (1):
  xen/acpi: remove redundant acpi/acpi_drivers.h include

Luis Alves (2):
  m68knommu: add CPU_NAME for 68000
  m68knommu: fix MC68328.h defines

Maarten Lankhorst (1):
  drm/nouveau: fix regression in vblanking

Malcolm Priestley (1):
  staging: vt6656: Fix oops on resume from suspend.

Marc Kleine-Budde (1):
  usb: otg: use try_module_get in all usb_get_phy functions and add missing module_put

Marcin Jurkowski (1):
  w1: fix oops when w1_search is called from netlink connector

Marcin Slusarz (2):
  drm/nouveau: idle channel before releasing notify object
  drm/nv50: use correct tiling methods for m2mf buffer moves

Marco Porsch (1):
  mac80211: fix oops on mesh PS broadcast forwarding

Marco Stornelli (1):
  hostfs: fix a not needed double check

Marek Szyprowski (1):
  ARM: DMA-mapping: add missing GFP_DMA flag for atomic buffer allocation

Mark Brown (5):
  Input: ads7864 - check return value of regulator enable
  Input: mms114 - Fix regulator enable and disable paths
  mfd: tps65912: Declare and use tps65912_irq_exit()
  mfd: twl4030-audio: Fix argument type for twl4030_audio_disable_resource()
  mfd: wm831x: Don't forward declare enum wm831x_auxadc

Mathias Krause (3):
  bridge: fix mdb info leaks
  rtnl: fix info leak on RTM_GETLINK request for VF devices
  dcbnl: fix various netlink info leaks

Mathieu Desnoyers (1):
  Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys

Matwey V. Kornilov (1):
  usb: cp210x new Vendor/Device IDs

Maxime Ripard (2):
  ARM: mxs: cfa10049: Fix fb initialisation function
  ARM: multiplatform: Sort the max gpio numbers.

Maxin B. John (1):
  tools: usb: ffs-test: Fix build failure

Michel Lespinasse (1):
  mm/fremap.c: fix possible oops on error path

Nicolas Pitre (1):
  ARM: mach-imx: move early resume code out of the .data section

Nishanth Menon (2):
  ARM: dts: remove generated .dtb files on clean
  usb: gadget: composite: fix kernel-doc warnings

Nithin Sujir (1):
  tg3: Update link_up flag for phylib devices

Oliver Neukum (1):
  USB: cdc-wdm: fix buffer overflow

Padmavathi Venna (1):
  Arm: socfpga: pl330: Add #dma-cells for generic dma binding support

Paolo Valente (6):
  pkt_sched: sch_qfq: properly cap timestamps in charge_actual_service
  pkt_sched: sch_qfq: fix the update of eligible-group sets
  pkt_sched: sch_qfq: serve activated aggregates immediately if the scheduler is empty
  pkt_sched: sch_qfq: prevent budget from wrapping around after a dequeue
  pkt_sched: sch_qfq: do not allow virtual time to jump if an aggregate is in service
  pkt_sched: sch_qfq: remove a useless invocation of qfq_update_eligible

Paul Bolle (8):
  netfilter: nfnetlink: silence warning if CONFIG_PROVE_RCU isn't set
  ARM: SPEAr13xx: Fix typo "ARCH_HAVE_CPUFREQ"
  m68k: drop "select EMAC_INC"
  

Re: [GIT PULL] arm-soc fixes for v3.9-rc3

2013-03-18 Thread Linus Torvalds
On Mon, Mar 18, 2013 at 7:22 AM, Arnd Bergmann  wrote:
>
> are available in the git repository at:
>
>   git+ssh://gitol...@ra.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc.git tags/fixes

What the heck happened to your script?

Please use the public address so that others could look at it if they
want to (and so that my merge messages make sense in a public
setting).

I fixed it up, but please fix your script for next time.

  Linus


Re: Regression: Screen turns off when booting in EFI mode

2013-03-19 Thread Linus Torvalds
This is apparently still outstanding, and Mantas hadn't cc'd the
people involved with the commit itself.

Background: with UEFI, commit f9a37be0f02a ("x86: Use PCI setup data")
apparently results in a black screen for Mantas. The commit reverts
fairly easily (there's been a trivial change to the function since due
to dev->rom now being a proper phys_addr_t), and considering that the
commit doesn't explain what the f*ck it is needed for, or what it
would help, I'm inclined to do just that.

Trusting firmware-provided values over the things we can find
ourselves is known to be fundamentally crap, so what the hell is the
point of that commit in the first place? The likelihood that firmware
messes up is pretty damn high. Why would we take idiotic "here's the
PCI ROM" data from firmware in the first place? What did this fix? We
know what it broke..

Doing things like blindly trusting the firmware data without even
validating it is a really really bad idea. The commit actually looks
seriously broken in other ways too, like blindly doing phys_to_virt()
on that, and then trusting the result.

Mantas, mind changing that "pcibios_add_device()" function so that
instead of setting dev->rom/romlen, it just prints out the values
(including the device address)? Please also make it print out the
"data->len" field in addition to the rom->xyz fields..

Linus

On Sat, Mar 9, 2013 at 1:42 PM, Mantas Mikulėnas  wrote:
> On 2013-02-22 03:03, Mantas Mikulėnas wrote:
>> On 2013-02-22 01:54, Dave Airlie wrote:

>>>> | radeon :01:00.0: No connectors reported connected with modes
>>>> | [drm] Cannot find any crtc or sizes - going 1024x768
>>>>
>>>> The connector is definitely connected, since this is a laptop with a
>>>> built-in screen...
>>>>
>>>
>>> Can you get the log with drm.debug=6 from both boots as well?
>>
>> Attached.
>
> The log is also at http://nullroute.eu.org/tmp/2013/dmesg-drm-debug.txt
>
> Not to be annoying, but I hope this can be fixed until 3.9...
>
> (I just tested v3.9-rc1-278-g8343bce, and it still does not detect any
> displays. And if I understood it correctly, "nomodeset" is going to go
> away?)
>
> --
> Mantas Mikulėnas


Re: Regression: Screen turns off when booting in EFI mode

2013-03-19 Thread Linus Torvalds
On Tue, Mar 19, 2013 at 10:09 AM, Linus Torvalds
 wrote:
>
> Doing things like blindly trusting the firmware data without even
> validating it is a really really bad idea. The commit actually looks
> seriously broken in other ways too, like blindly doing phys_to_virt()
> on that, and then trusting the result

Ok, looks like the only thing filling it in is eboot.c, and I guess it
relies on the EFI memory allocations having been mapped. Which they
hopefully have been.

Still, even that seems somewhat debatable. eboot.c does a plain
memcpy() on the pci->romimage returned by
EfiPciIoAttributeOperationGet. And I can *guarantee* that that doesn't
work on some PCI chips that end up sharing the decoder for the ROM and
the graphics aperture or other device oddities. Afaik, some Radeons do
that, for example.

So whoever wrote that eboot thing seems to assume that the world is a
lot simpler and saner than it actually is, and that everybody
magically got things right. Which they never do. The code was
presumably tested on just a couple of machines.

The problem (well, at least *one* problem) is that pci_map_rom()
actually knows about some of these issues, but if dev->rom and
dev->romlen have been set, it trusts them unconditionally. So we'd
either need to fix that, or we need to be really *really* sure that we
only set dev->rom to guaranteed-correct buffers.

At least the radeon code seems to verify that the ROM image starts
with 0x55/0xaa, but I'm guessing we can't do that in general, even if
it is the traditional PC rom pattern.

We only have a few users of "pci_map_rom()", I'm wondering if we can
move the "dev->rom/romlen" cases into the callers. Then the callers
could decide if they want to trust that "pseudo-shadowed" ROM image
(which would test that 55/aa pattern for example), or whether they
want to try to map the actual physical ROM.
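
A minimal user-space sketch of the caller-side check described above, assuming the traditional PC option ROM signature (0x55 0xAA) is a meaningful test for the device in question, which may not hold in general. The helper names rom_has_pc_signature() and trusted_shadow_rom() are hypothetical and are not existing kernel functions:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if the buffer is at least two bytes long
 * and starts with the traditional PC option ROM signature 0x55 0xAA.
 */
static int rom_has_pc_signature(const uint8_t *rom, size_t len)
{
	return rom != NULL && len >= 2 && rom[0] == 0x55 && rom[1] == 0xAA;
}

/*
 * Caller-side decision: hand back the pre-filled ("pseudo-shadowed") image
 * only if it passes the signature check. A NULL return tells the caller to
 * try mapping the physical ROM instead.
 */
static const uint8_t *trusted_shadow_rom(const uint8_t *rom, size_t len)
{
	return rom_has_pc_signature(rom, len) ? rom : NULL;
}
```

The point of keeping the check in the caller is that each driver can apply whatever validation fits its hardware, rather than a shared helper trusting the buffer unconditionally.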

  Linus


Re: Regression: Screen turns off when booting in EFI mode

2013-03-19 Thread Linus Torvalds
On Tue, Mar 19, 2013 at 12:59 PM, Matthew Garrett  wrote:
>
> Because it's the only way to get the PCI ROM in some cases, like on
> pretty much all Apples with Radeons. Only using it if we have no other
> options probably makes sense, though. Something like this (entirely
> untested)?

This looks reasonable. Mantas?

Trusting the firmware-provided image when we can't find the actual HW
image is quite reasonable. It's the "trust firmware unconditionally"
part that gets my goat.
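
The ordering argued for here (hardware image first, firmware-provided copy only as a fallback) could be sketched roughly as follows. This is an illustration in plain C, not the untested patch under discussion; struct rom_image, rom_usable() and choose_rom() are hypothetical names, and the "usable" test is deliberately simplistic:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a ROM image, for illustration only. */
struct rom_image {
	const uint8_t *data;
	size_t len;
};

enum rom_source { ROM_NONE, ROM_HARDWARE, ROM_FIRMWARE };

/* Simplistic sanity check: a usable image has a buffer and a length. */
static int rom_usable(const struct rom_image *img)
{
	return img != NULL && img->data != NULL && img->len > 0;
}

/*
 * Prefer the image read from the device itself; fall back to the
 * firmware-provided copy only when no hardware image was found.
 */
static enum rom_source choose_rom(const struct rom_image *hw,
				  const struct rom_image *fw,
				  const struct rom_image **out)
{
	if (rom_usable(hw)) {
		*out = hw;
		return ROM_HARDWARE;
	}
	if (rom_usable(fw)) {
		*out = fw;
		return ROM_FIRMWARE;
	}
	*out = NULL;
	return ROM_NONE;
}
```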

   Linus


Re: ipc,sem: sysv semaphore scalability

2013-03-20 Thread Linus Torvalds
On Wed, Mar 20, 2013 at 12:55 PM, Rik van Riel  wrote:
>
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.

The series looks sane to me, and I like how each individual step is
pretty small and makes sense.

It *would* be lovely to see this run with the actual Swingbench
numbers. The microbenchmark always looked much nicer. Do the
additional multi-semaphore scalability patches on top of Davidlohr's
patches help with the swingbench issue, or are we still totally
swamped by the ipc lock there?

Maybe there were already numbers for that, but the last swingbench
numbers I can actually recall were from before the finer-grained
locking..

And obviously, getting this tested so that there aren't any more
missed wakeups etc would be lovely. I'm assuming the plan is that this
all goes through Andrew? Do we have big semop users who could test it
on real loads? Considering that I *suspect* the main users are things
like Oracle etc, I'd assume that there's some RH lab or partner or
similar that is interested in making sure this not only helps, but
also that it doesn't break anything ;)

 Linus


Re: ipc,sem: sysv semaphore scalability

2013-03-20 Thread Linus Torvalds
On Wed, Mar 20, 2013 at 1:49 PM, Linus Torvalds
 wrote:
>
> It *would* be lovely to see this run with the actual Swingbench
> numbers. The microbenchmark always looked much nicer. Do the
> additional multi-semaphore scalability patches on top of Davidlohr's
> patches help with the swingbench issue, or are we still totally
> swamped by the ipc lock there?
>
> Maybe there were already numbers for that, but the last swingbench
> numbers I can actually recall were from before the finer-grained
> locking..

Ok, and if the spinlock is still a big deal even with the finer
granularity, it might be interesting to hear if Michel's fast locks
make a difference. I'm guessing that this series might actually make
it easier/cleaner to do the semaphore locking using another lock,
since the ipc_lock got split up and out...
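
The general idea of splitting one array-wide lock into per-semaphore locks can be sketched in user space as below. This is only an illustration of finer-grained locking, not the code from the series: the spinlock is built on C11 atomic_flag, operations that would block simply return 0 instead of sleeping, and all names (struct sem_array, sem_adjust_one(), sem_adjust_all(), NSEMS) are assumptions made for the example:

```c
#include <stdatomic.h>
#include <stddef.h>

#define NSEMS 4

struct sem {
	atomic_flag lock;	/* per-semaphore spinlock */
	int value;
};

struct sem_array {
	struct sem sems[NSEMS];
};

static void spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* busy-wait until the holder clears the flag */
}

static void spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static void sem_array_init(struct sem_array *a)
{
	for (size_t i = 0; i < NSEMS; i++) {
		atomic_flag_clear(&a->sems[i].lock);
		a->sems[i].value = 0;
	}
}

/* Single-semaphore operation: only that semaphore's lock is taken. */
static int sem_adjust_one(struct sem_array *a, size_t idx, int delta)
{
	struct sem *s = &a->sems[idx];
	int ok = 0;

	spin_lock(&s->lock);
	if (s->value + delta >= 0) {
		s->value += delta;
		ok = 1;
	}
	spin_unlock(&s->lock);
	return ok;	/* 0 means the operation would have to wait */
}

/*
 * Multi-semaphore operation: take every lock in index order (a fixed
 * order avoids deadlock between concurrent callers), then apply the
 * deltas all-or-nothing.
 */
static int sem_adjust_all(struct sem_array *a, const int deltas[NSEMS])
{
	int ok = 1;
	size_t i;

	for (i = 0; i < NSEMS; i++)
		spin_lock(&a->sems[i].lock);
	for (i = 0; i < NSEMS; i++)
		if (a->sems[i].value + deltas[i] < 0)
			ok = 0;
	if (ok)
		for (i = 0; i < NSEMS; i++)
			a->sems[i].value += deltas[i];
	for (i = 0; i < NSEMS; i++)
		spin_unlock(&a->sems[i].lock);
	return ok;
}
```

With this layout, callers touching different semaphores never contend with each other; only operations spanning several semaphores pay the cost of taking every lock.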

I think Michel did it for some socket code too. I think that was
fairly independent and was for netperf.

Linus


Re: [GIT PULL tip/core/urgent] Fix for hlist_entry_safe() regression

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 7:22 AM, Paul E. McKenney
 wrote:
> [Reposting with corrected subject line.]
>
> Hello, Ingo,
>
> This series contains a single commit that fixes a regression in
> hlist_entry_safe().  ..

You do realize that I already merged this a week ago directly? (Merge
commit f4846e52c517)

 Linus


Re: [PATCH] rwsem: steal writing sem for better performance

2013-02-06 Thread Linus Torvalds
On Wed, Feb 6, 2013 at 10:28 PM, Ingo Molnar  wrote:
>
> Linus, Andrew, what is your thinking about the patch and about
> the timing of the patch?

Not for 3.8. Queue it for 3.9, with possibly a stable tag with a big
comment "apply after much testing".

  Linus


Re: [GIT PULL] f2fs fixes for v3.8-rc7

2013-02-07 Thread Linus Torvalds
No.

You guys need to realize that I'm not taking crap like this this late.

This is not major bugfixes. I already looked away once just because
it's a new filesystem, but enough is enough. This is way way WAY too
late to start sending "enhancements". Seriously.

Send them for the next merge window. Not just before rc7.

   Linus

On Thu, Feb 7, 2013 at 11:21 AM, Jaegeuk Kim  wrote:
> Hi Linus,
>
> Here are four patches which are critical bug fixes on f2fs, three
> enhancement patches, and a number of trivial patches.
> Please pull the following tag. Sorry for the late request.
>
> Thanks,
>
> The following changes since commit
> 6abb7c25775b7fb2225ad0508236d63ca710e65f:
>
>   Merge tag 'regulator-3.8-rc5' of
> git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
> (2013-01-28 22:44:53 -0800)
>
> are available in the git repository at:
>
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
> tags/f2fs-for-v3.8
>
> for you to fetch changes up to 1efc6d3277f59b764384781c0f8dfc821f229380:
>
>   f2fs: add compat_ioctl to provide backward compatability (2013-02-06
> 17:38:59 +0900)
>
> 
> f2fs fixes for v3.8
>
> [Major bug fixes]
> o Store device file information correctly
> o Fix -EIO handling with respect to power-off-recovery
> o Allocate blocks with global locks
> o Fix wrong calculation of the SSR cost
>
> [Enhancement]
> o Support (un)freeze_fs
> o Enhance the f2fs_gc flow
> o Support 32-bit binary execution on 64-bit kernel
>
> 
> Alejandro Martinez Ruiz (1):
>   f2fs: fix disable_ext_identify option spelling
>
> Changman Lee (5):
>   f2fs: save device node number into f2fs_inode
>   f2fs: add un/freeze_fs into super_operations
>   f2fs: stop repeated checking if cp is needed
>   f2fs: remove repeated F2FS_SET_SB_DIRT call
>   f2fs: remove unnecessary gc option check and balance_fs
>
> Jaegeuk Kim (6):
>   f2fs: prevent checkpoint once any IO failure is detected
>   f2fs: cover global locks for reserve_new_block
>   f2fs: remove the use of page_cache_release
>   f2fs: avoid balanc_fs during evict_inode
>   f2fs: clarify and enhance the f2fs_gc flow
>   f2fs: fix calculation of max. gc cost in the SSR case
>
> Namjae Jeon (8):
>   f2fs: avoid redundant call to has_not_enough_free_secs in f2fs_gc
>   f2fs: reorganize code for ra_node_page
>   f2fs: fix typo mistake for data_version description
>   f2fs: name gc task as per the block device
>   f2fs: mark gc_thread as NULL when thread creation is failed
>   f2fs: make an accessor to get sections for particular block type
>   f2fs: optimize the return condition for has_not_enough_free_secs
>   f2fs: add compat_ioctl to provide backward compatability
>
> majianpeng (4):
>   f2fs: clean up the add_orphan_inode func
>   f2fs: add device name in debugfs
>   f2fs: use F2FS_BLKSIZE to judge bloksize and page_cache_size
>   f2fs: when check superblock failed, try to check another
> superblock
>
>  fs/f2fs/checkpoint.c |  63 +++---
>  fs/f2fs/debug.c      |   4 +-
>  fs/f2fs/f2fs.h       |  32 ++---
>  fs/f2fs/file.c       |  35 ---
>  fs/f2fs/gc.c         | 124 ++-
>  fs/f2fs/gc.h         |  21 -
>  fs/f2fs/inode.c      |  53 +-
>  fs/f2fs/node.c       |  14 +++---
>  fs/f2fs/recovery.c   |   4 +-
>  fs/f2fs/segment.c    |  29
>  fs/f2fs/segment.h    |  23 +++---
>  fs/f2fs/super.c      |  92 +-
>  12 files changed, 262 insertions(+), 232 deletions(-)
>
> --
> Jaegeuk Kim
> Samsung


Re: [GIT] Networking

2013-02-08 Thread Linus Torvalds
Pulled.

However, there's still the r8169 regressions (see the emails with the
subject "regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0
timed out"). It's bisected, and a revert is reported to fix things.
It's not in this pull request. Comments?

   Linus

On Sat, Feb 9, 2013 at 7:17 AM, David Miller  wrote:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master
>
> for you to fetch changes up to a1c83b054ebe1264ed9ae9d5c286f9eae68e60ea:


Re: kvmtool tree (Was: Re: [patch] config: fix make kvmconfig)

2013-02-08 Thread Linus Torvalds
On Sat, Feb 9, 2013 at 1:55 AM, Ingo Molnar  wrote:
>
> I'll remove it if Linus insists on it, but I think you guys are
> putting form before substance and utility :-(

No. Your pull requests are just illogical. I have yet to see a single
reason why it should be merged.

I *thought* "ease of use" could be a reason, and then people posted
five-liner scripts to give some of the other virtual boxes the same
kind of interface.

Avoiding five lines of shell script is not a reason to pull a project
into the kernel.

> tools/kvm/ is a useful utility to kernel development, that in
> just this past cycle was used as an incubator to:

That's total bullshit.

It could be useful whether it is merged into the kernel or not.

"git" is a hell of a lot more useful utility for kernel development,
to the point that practically we couldn't do without it any more, and
it isn't merged into the kernel. It's a separate project with a
separate life, and it is *better* for it. The same goes for all the
tools that everybody uses every day for kernel development, often
without even thinking about them.

So "utility to kernel development" is not a reason for merging it into
the kernel. NOT AT ALL.

> *Please* don't try to harm useful code just for the heck of it.

Again, total *bullshit*. Right now, the whole "attach the kvmtool to
the kernel as a remora" doesn't make any sense at all, and not merging
it doesn't harm anything AT ALL. Quite the reverse.

The fact that kvmtool isn't available as a standalone project probably
keeps people actively from using it. You can't just fetch kvmtool. You
have to fetch the kernel and kvmtool, and if you're a kernel developer
you either have to make a whole new kernel tree for it (which is
stupid) or merge it into your normal kernel tree that has development
that has nothing to do with kvmtool (which is stupid AND F*CKING
INSANE)

> Please stop this silliness, IMO it's not constructive at all ...

Ingo, it's not us being silly, it is *you*.

What the heck is the advantage of merging it into the kernel? It has
never ever been explained.

This is not like "perf", where there wasn't any alternatives, and
oprofile had shown over many many years that the situation wasn't
improving: it was only getting worse, and more disconnected from the
actual capabilities of the hardware.

But kvmtool is no "perf". There are other virtual boxes, and rather
than being limited, they are more capable! The selling point of kvmtool
was never that it did something particularly magical, it was always
that it did less, and was easier to use. But that does not in any way
mean "should be merged". You can do less and be easier to use and
stand on your own legs.

So here, let me state it very very clearly: I will not be merging
kvmtool. It's not about "useful code". It's not about the project
keeping to improve. Both of those would seem to be *better* outside
the kernel, where there isn't that artificial and actually harmful
tie-in.

In other words, I don't see *any* advantage to merging kvmtool. I
think merging it would be an active mistake, and would just tie two
projects together that just shouldn't be tied together.

So nobody is "hurting useful code", except perhaps you.

Explain to me why I'm wrong. I don't think you can. You certainly
haven't so far.

   Linus


Linux v3.8-rc7

2013-02-08 Thread Linus Torvalds
: Provide dma_mmap_coherent() and dma_get_sgtable()
  blackfin: Provide dma_mmap_coherent() and dma_get_sgtable()
  c6x: Provide dummy dma_mmap_coherent() and dma_get_sgtable()
  cris: Provide dma_mmap_coherent() and dma_get_sgtable()
  frv: Provide dummy dma_mmap_coherent() and dma_get_sgtable()
  m68k: Provide dma_mmap_coherent() and dma_get_sgtable()
  mn10300: Provide dummy dma_mmap_coherent() and dma_get_sgtable()
  parisc: Provide dummy dma_mmap_coherent() and dma_get_sgtable()
  xtensa: Provide dummy dma_mmap_coherent() and dma_get_sgtable()

Glauber Costa (1):
  memcg: fix typo in kmemcg cache walk macro

H. Peter Anvin (1):
  x86, doc: Boot protocol 2.12 is in 3.8

Hans Verkuil (1):
  [media] radio: set vfl_dir correctly to fix modulator regression

Haojian Zhuang (1):
  drivers/rtc/rtc-pl031.c: fix the missing operation on enable

Hauke Mehrtens (2):
  bcma: unregister gpios before unloading bcma
  ssb: unregister gpios before unloading ssb

Heiko Carstens (1):
  atm/iphase: rename fregt_t -> ffreg_t

Ian Campbell (3):
  xen/netback: shutdown the ring if it contains garbage.
  xen/netback: free already allocated memory on failure in
xen_netbk_get_requests
  netback: correct netbk_tx_err to handle wrap around.

Ilpo Järvinen (1):
  tcp: fix for zero packets_in_flight was too broad

Jan Beulich (2):
  x86-64: Replace left over sti/cli in ia32 audit exit code
  xen-pciback: rate limit error messages from xen_pcibk_enable_msi{,x}()

Jan Luebbe (1):
  drivers/rtc/rtc-isl1208.c: call rtc_update_irq() from the alarm
irq handler

Jan Schmidt (1):
  Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata

Jason Wang (3):
  vhost_net: correct error handling in vhost_net_set_backend()
  vhost_net: handle polling errors when setting backend
  tuntap: allow polling/writing/reading when detached

Jesse Gross (1):
  openvswitch: Move LRO check from transmit to receive.

Jiri Olsa (1):
  perf: Fix event group context move

Joe Perches (1):
  checkpatch: fix $Float creation of match variables

Johan Hedberg (1):
  Bluetooth: Fix handling of unexpected SMP PDUs

Johannes Naab (1):
  netem: fix delay calculation in rate extension

Joonsoo Kim (1):
  tools/vm: add .gitignore to ignore built binaries

Josef Bacik (3):
  Btrfs: do not merge logged extents if we've removed them from the tree
  Btrfs: fix missing i_size update
  Btrfs: fix possible stale data exposure

Kirill A. Shutemov (1):
  thp: avoid dumping huge zero page

Kukjin Kim (1):
  pinctrl: exynos: change PINCTRL_EXYNOS option

Lan Tianyu (1):
  usb: Using correct way to clear usb3.0 device's remote wakeup feature.

Larry Finger (2):
  rtlwifi: Fix the usage of the wrong variable in usb.c
  rtlwifi: Fix scheduling while atomic bug

Lars Ellenberg (1):
  drbd: fix potential protocol error and resulting disconnect/reconnect

Linus Torvalds (1):
  Linux 3.8-rc7

Liu Bo (1):
  Btrfs: fix race between snapshot deletion and getting inode

Lucas Stach (1):
  net: usb: fix regression from FLAG_NOARP code

Luis Llorente Campo (1):
  USB: add OWL CM-160 support to cp210x driver

Marcelo Ricardo Leitner (1):
  ipv6: do not create neighbor entries for local delivery

Marek Szyprowski (1):
  regulator: max8998: fix incorrect min_uV value for ldo10

Matthew Daley (1):
  xen/netback: don't leak pages on failure in xen_netbk_tx_check_gop.

Matthias Brugger (1):
  MAINTAINERS: update avr32 web ressources

Matthieu CASTET (1):
  mtd: nand: onfi don't WARN if we are in 16 bits mode

Miao Xie (2):
  Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
  Btrfs: fix missing release of the space/qgroup reservation in
start_transaction()

Michael S. Tsirkin (2):
  tun: fix carrier on/off status
  tcm_vhost: fix pr_err on early kick

Mike Marciniszyn (1):
  IB/qib: Fix for broken sparse warning fix

Mikko Tiihonen (1):
  drm/radeon: protect against div by 0 in backend setup

Milos Vyletel (1):
  bonding: unset primary slave via sysfs

Neil Horman (1):
  vmxnet3: set carrier state properly on probe

Nicholas Bellinger (5):
  target: Fix zero-length INQUIRY additional sense code regression
  target: Fix zero-length MODE_SENSE regression
  target: Fix zero-length READ_CAPACITY_16 regression
  target: Fix regression allowing unconfigured devices to fabric port link
  target: Fix divide by zero bug in fabric_max_sectors for
unconfigured devices

Nickolai Zeldovich (1):
  drivers: xhci: fix incorrect bit test

Nivedita Singhvi (1):
  tcp: Increment LISTENOVERFLOW and LISTENDROPS in tcp_v4_conn_request()

Or Gerlitz (1):
  mlx4_core: Fix advertisement of wrong PF context behaviour

Paul Gortmaker (2):
  rcu: Prevent soft-lockup complaints about no-CBs CPUs
  rcu: Make rcu_nocb_poll an early_pa

Re: kvmtool tree (Was: Re: [patch] config: fix make kvmconfig)

2013-02-08 Thread Linus Torvalds
On Sat, Feb 9, 2013 at 10:57 AM, Pekka Enberg  wrote:
>
> And yes, you are absolutely correct that living in the kernel tree is
> suboptimal for the casual user. However, it's a trade-off to make
> tools/kvm *development* easier especially when you need to touch both
> kernel and userspace code.

Quite frankly, that's just optimizing for the wrong case.

The merged case seems to make sense for you and Ingo, and nobody else.

And then you wonder why nobody else wants to merge it.

I've told you my reasons, you didn't give me *any* actual reasons for
me to merge the code. NONE of your statements made any sense at all,
since everything you talk about could have been done with a separate
project.

The only thing the lock-step does is to generate the kind of
dependency that I ABSOLUTELY DETEST, where one version of kvmtools
goes along with one version of the kernel. That's a huge disadvantage
(and we've actually seen signs of that in the perf tool too, where old
versions of the tools have been broken, because the people working on
them have been way too much in lock-step with the kernel it is used
on).

And if it's not the case that they have to be used in lockstep, then
clearly kvmtool developers could just as easily just have a separate
git repository.

So you can't have it both ways. What's so wrong with just making it a
separate project?

  Linus


Re: kvmtool tree (Was: Re: [patch] config: fix make kvmconfig)

2013-02-09 Thread Linus Torvalds
You do realize that none of your arguments touched the "why should
Linus merge the tree" question at all?

Everything you said was about how it's more convenient for you and
Ingo, not at all about why it should be better for anybody else. You
haven't bothered to even try making it an external project, so it
doesn't compile that way. You're the only one working on it, so being
convenient for you is the primary issue. Arguments like that actively
make me not want to merge it, because they are only arguments for you
continuing to work the way you have, not arguments for why the project
would make sense to merge into the main kernel repository.

So I think we should just remove this from linux-next, and be done
with the fantasy that it makes sense to merge this. You're not even
trying to convince anybody else about the merge making sense.

You might as well continue to work the way you do, and I don't mind -
but none of it argues for me merging it into the kernel. There's no
reason why kvmtool couldn't be external the way all the other
virtualization projects are.

 Linus

On Feb 9, 2013 2:01 AM, "Pekka Enberg"  wrote:
>
> On Sat, Feb 9, 2013 at 2:45 AM, Linus Torvalds
>  wrote:
> > Quite frankly, that's just optimizing for the wrong case.
>
> I obviously don't agree. I'm fairly sure there wouldn't be a kvmtool
> that supports x86, PPC64, ARM, and all the virtio drivers had we not
> optimized for making development for kernel folks easy.
>
> In fact that's something Ingo pushed for pretty hard early on and we
> also worked hard just to make the code 'feel familiar' to kernel folks.
> The assumption was that if we did that, we'd see contributions from
> people who would normally not write userspace code.
>
> On Sat, Feb 9, 2013 at 2:45 AM, Linus Torvalds
>  wrote:
> > The merged case seems to make sense for you and Ingo, and nobody else.
>
> That's hardly surprising. I'm the only person who was crazy enough to
> listen to Ingo and follow through with the damn thing. So I either have
> the same experience and perspective as Ingo does on the matter - or I'm
> just as full of 'bullshit' as he is.
>
> On Sat, Feb 9, 2013 at 2:45 AM, Linus Torvalds
>  wrote:
> > The only thing the lock-step does is to generate the kind of
> > dependency that I ABSOLUTELY DETEST, where one version of kvmtools
> > goes along with one version of the kernel.
>
> That is simply NOT TRUE. We have never done such a thing with 'kvmtool'
> nor do I have any evidence that 'perf' has done that either. I regularly
> run old versions to make sure that we stay that way.
>
> On Sat, Feb 9, 2013 at 2:45 AM, Linus Torvalds
>  wrote:
> > So you can't have it both ways. What's so wrong with just making it a
> > separate project?
>
> Do you really think it's an option I have not considered several times?
>
> There are the immediate practical problems:
>
>   - What code should we take with us from the Linux repository. If I take
> just tools/kvm, it won't even build.
>
>   - Where do we do our development? Right now we are using the KVM list
> and are part of tip tree workflow. As a separate project, we need to
> build the kind of infrastructure we already are relying on now.
>
> Then there are the long term issues:
>
>   - How do we keep up with KVM and virtio improvements?
>
>   - How do we ensure we get improvements that happened in the kernel
> tree to our codebase for the code we share?
>
>   - How do we make it easy for future KVM and virtio developers to
> access our code?
>
> If you want perspective on this just ask Ingo sometime how he feels
> about making tools/perf a separate project (which I have actually done).
> Much of the *practical* aspects applies to tools/kvm.
>
> And really, I'm a practical kind of guy. Why do you think I'm willing to
> bang my head against the wall if spinning off as a separate project would
> be as simple as you seem to think it is?
>
> Pekka


Re: kvmtool tree (Was: Re: [patch] config: fix make kvmconfig)

2013-02-09 Thread Linus Torvalds
On Sun, Feb 10, 2013 at 6:39 AM, Pekka Enberg  wrote:
>
> The main argument for merging into the main kernel repository has always been
> that (we think) it improves the kernel because a significant amount of
> development is directly linked to kernel code (think KVM ARM port here, for
> example). The secondary argument has been to make it easy for kernel
> developers to work on both userspace and kernel in tandem (like has happened
> with vhost drivers). In short: it speeds up development of Linux
> virtualization code.

Why? You've made this statement over and over and over again, and I've
dismissed it over and over and over again because I simply don't think
it's true.

It's simply a statement with nothing to back it up. Why repeat it?

THAT is my main contention. I told you why I think it's actually
actively untrue. You claim it helps, but what is it about kvmtool that
makes it so magically helpful to be inside the kernel repository? What
is it about this that makes it so critical that you get the kernel and
kvmtool with a single pull, and they have to be in sync? When you then
at the same time claim that you make very sure that they don't have to
be in sync at all. See your earlier emails about how you claim to have
worked very hard to make sure they work across different versions.

So you make these unsubstantiated claims about how much easier it is,
and they make no sense. You never explain *why* it's so magically
easier. Is git so hard to use that you can't do "git pull" twice? And
why would you normally even *want* to do git pull twice? 99% of the
work in the kernel has nothing what-so-ever to do with kvmtool, and
hopefully the reverse is equally true.

And tying into the kernel just creates this myopic world of only
looking at the current kernel. What if somebody decides that they
actually want to try to boot Windows with kvmtool? What if somebody
tells you that they are really tired of Xen, and actually want to turn
kvmtool into a *replacement* for Xen instead? What if somebody wants to
branch off their own work, concentrating on some other issue entirely,
and wants to merge with upstream kvmtool but not worry about the
kernel, because they aren't working on the Linux kernel at all, and
their work is about something else?

I just don't think it makes sense. I don't see what the huge advantage
of a single git tree is.

Anyway, I'm done arguing. You can do what you want, but just stop
misrepresenting it as being "linux-next" material unless you are
willing to actually explain why it should be so.

Linus


Re: kvmtool tree (Was: Re: [patch] config: fix make kvmconfig)

2013-02-11 Thread Linus Torvalds
On Mon, Feb 11, 2013 at 4:26 AM, Ingo Molnar  wrote:
>
> If you are asking whether it is critical for the kernel project
> to have tools/kvm/ integrated then it isn't. The kernel will
> live just fine without it, even if that decision is a mistake.

You go on to explain how this helps kvmtool, and quite frankly, I DO NOT CARE.

Everything you talk about is about helping your work, by making the
kernel maintenance more work. The fact that you want to use kernel
infrastructure in kvmtool is a great example: you may think it's a
great thing, but for the kernel it's just extra work, and extra layers
of abstraction etc etc.

And then you make it clear that you haven't even *bothered* to try to
make it a separate project.

Sorry, but with that kind of approach, I get less and less interested.
I think this whole tying together is a big mistake. It encourages
linkages that simply shouldn't be there.

And no, perf is not the perfect counter-example. With perf, the
linkages made sense! There's supposed to be deep linkages to profiling
and event counting. There is ABSOLUTELY NOT supposed to be deep
linkages with virtualization. Quite the reverse.

And no, I don't want to maintain the mess that is both. There's just
no gain, and lots of potential pain.

Linus


Re: kvmtool tree (Was: Re: [patch] config: fix make kvmconfig)

2013-02-11 Thread Linus Torvalds
On Mon, Feb 11, 2013 at 5:18 AM, David Woodhouse  wrote:
>
> That's complete nonsense. If you want to use pieces of the kernel
> infrastructure, then just *take* them. There are loads of projects which
> use the kernel config tools, for example. There's no need to be *in* the
> kernel repo.

Exactly. I do *not* want an abstraction layer just because somebody
wants to use it. It causes idiotic guards in the header files etc. We
already had that pain with the user-level header inclusions etc.

Just copy it.

Linus


Re: kvmtool tree (Was: Re: [patch] config: fix make kvmconfig)

2013-02-11 Thread Linus Torvalds
On Mon, Feb 11, 2013 at 9:58 AM, Ingo Molnar  wrote:
>
> So basically Pekka optimistically thought it's an eventual 'tit
> for tat', a constant stream of benefits to the kernel, in the
> hope of finding a home in the upstream kernel which would
> further help both projects. The kernel wants to keep the 'tit'
> only though.

Ingo, stop this idiotic nonsense.

You seem to think that "kvmtool is useful for kernel" is somehow relevant.

IT IS TOTALLY IRRELEVANT.

"sparse" is useful for kernel development. "git" is useful for kernel
development. "xterm" is useful for kernel development.

See a pattern? We have tons of tools that are helping develop the
kernel, and for absolutely NONE of them is that at all an argument for
merging them into the kernel.

If the Xen people came and asked me to merge their virtualizer code
into the kernel, I'd call them idiots.

Why is kvmtool so magical that you use this argument for merging it
into the kernel?

It makes no sense.

Yet you continue to use it as if it was somehow an argument for
merging it. Despite the hundreds of projects to the contrary.

So this whole "constant stream of benefits" you talk about is PURE AND
UTTER GARBAGE. And that's not a commentary on whether it is true or
not, it's a commentary on the fact that it is entirely irrelevant to
whether something should be merged.

Merging two projects does not make them easier to maintain. Quite the
reverse. It just ties the maintenance together in irrelevant ways.

   Linus


Re: [tip:x86/mm] x86, mm: Use a bitfield to mask nuisance get_user() warnings

2013-02-11 Thread Linus Torvalds
On Mon, Feb 11, 2013 at 5:37 PM, tip-bot for H. Peter Anvin
 wrote:
>
> However, we can declare a bitfield using sizeof(), which is legal
> because sizeof() is a constant expression.  This quiets the warning,
> although the code generated isn't 100% identical from the baseline
> before 96477b4 x86-32: Add support for 64bit get_user():

Christ. This is so ugly that it's almost a work of art.

Has anybody run this past any gcc developers? And if so, did they run
away screaming?

 Linus


Re: [tip:x86/mm] x86, mm: Use a bitfield to mask nuisance get_user() warnings

2013-02-11 Thread Linus Torvalds
On Mon, Feb 11, 2013 at 8:21 PM, H. Peter Anvin  wrote:
> On 02/11/2013 07:33 PM, Linus Torvalds wrote:
>
>> Has anybody run this past any gcc developers? And if so, did they run
>> away screaming?
>
> I haven't no... H.J., any comments on this patch?

I'd be most worried about any known pitfalls about bitfield code
generation. Looking at your code size numbers, it actually seems to
*improve* code generation except for the odd i386.pae case (bigger
code but also a different data size - odd) and i386 noconfig
(different bss, bigger code).

The code/data changes make me wonder if the variable sometimes gets
flushed to memory as an 8-byte entry, and maybe there are things gcc
people can suggest..

But I don't see anything fundamentally wrong with it. Certainly it
looks much better than the disgusting and warning-prone

unsigned long long __val_gu8

thing.

   Linus


Re: [tip:x86/mm] x86, mm: Use a bitfield to mask nuisance get_user() warnings

2013-02-11 Thread Linus Torvalds
On Mon, Feb 11, 2013 at 8:42 PM, Linus Torvalds
 wrote:
>
> But I don't see anything fundamentally wrong with it. Certainly it
> looks much better than the disgusting and warning-prone
>
> unsigned long long __val_gu8
>
> thing.

Oh. I just realized. That was your _baseline_ in the comparisons, wasn't it?

Can you please make the baseline be the current mainline git version
of , not the first "unsigned long long __val_gu8"
version of the 64-bit get_user()?

Because we should compare against the straightforward code, not the
one that could have messed things up already..

  Linus


Re: [patch for-3.8] fs, dlm: fix build error when EXPERIMENTAL is disabled

2013-02-12 Thread Linus Torvalds
On Tue, Feb 12, 2013 at 1:50 AM, Steven Whitehouse  wrote:
>
> That doesn't seem right to me... DLM has not been experimental for a
> long time now. Why not just select CRC32 in addition to IP_SCTP ?

Hmm. IP_SCTP already does a "select libcrc32c". So why doesn't that
end up working?

Linus


Re: [tip:x86/mm] x86, mm: Use a bitfield to mask nuisance get_user() warnings

2013-02-12 Thread Linus Torvalds
On Tue, Feb 12, 2013 at 8:38 AM, H.J. Lu  wrote:
>
> Can you do something similar to what we did in glibc:

No. Because we use macros to be type-independent (i.e. "get_user()"
works *regardless* of type), so casting to "uintptr_t" doesn't work.
It throws away the type information, and truncates 64-bit values on
32-bit architectures.

The whole point of the bitmask thing is that it doesn't have that
issue, and gets the size correct automatically. It's not pretty, but
it allows the rest of the sources to be readable.

Linus


Re: Abysmal HDD/USB write speed after sleep on a UEFI system

2013-02-12 Thread Linus Torvalds
On Mon, Feb 11, 2013 at 10:25 PM, Artem S. Tashkinov  wrote:
> Hello Linus,
>
> I've already posted a bug report
> (https://bugzilla.kernel.org/show_bug.cgi?id=53551),
> a message to LKML
> (http://lkml.indiana.edu/hypermail/linux/kernel/1302.1/00837.html)
> and so far I've received zero response even though the bug is quite critical
> as it prevents me from using suspend altogether.
>
> I wonder if you could tell me who is responsible for this problem and who I
> need to CC in bugzilla.

According to your bugzilla it doesn't really seem to be strictly
UEFI-specific, and it's hard to tell what subsystem is to blame.

A few things to try to pinpoint:

 (a) Is it *only* write performance that suffers, or is it other
performance too? Networking (DMA? Perhaps only writing *to* the
network?)? CPU?

 (b) the fact that it apparently happens with both SATA and USB
implies that it's neither, and is more likely something core like
memory speed (mtrr, caching) or PCI (DMA, burst sizes, whatever).

 (c) can you find anything that changes over the suspend/resume? IOW,
look at things like "lspci -vvxxx" before-and-after, and see what
changed on the bridges leading to both things etc.

The performance drop sounds extreme enough that it sounds like caches
got disabled or something, but that should show up as CPU performance
in general being slow, not just writes to disk. But basically, I think
we need more clues about which sub-area is actually the culprit. My
*guess* would be some core PCI thing not being initialized, but I
don't see how you could even make PCI go that slow. Interrupt
problems? DMA failures? I have no idea.

Has it ever worked? Suspend on desktop motherboards used to be quite
spotty (nobody ever used it, manufacturers didn't care), but it
generally has gotten better since people use it more these days..

Added lkml and Bjorn to the participants, in case anybody has any ideas..

Linus


Re: [tip:x86/mm] x86, mm: Use a bitfield to mask nuisance get_user() warnings

2013-02-12 Thread Linus Torvalds
On Tue, Feb 12, 2013 at 9:14 AM, H. Peter Anvin  wrote:
>
> No, I think what he is talking about is this bit:

Ok, I agree that the bitfield code actually looks cleaner.

That said, maybe gcc has an easier time using a few odd builtins and
magic typeof's. But at least the bitfield trick looks half-way
portable..

 Linus


Re: [tip:x86/mm] x86, mm: Use a bitfield to mask nuisance get_user() warnings

2013-02-12 Thread Linus Torvalds
On Tue, Feb 12, 2013 at 9:35 AM, H. Peter Anvin  wrote:
>
> On the other hand, it still uses two gcc extensions: long long bitfields and
> typeof.
>
> I'll see what kind of code we get with the macro.

At least one thing to look out for is the poor LLVM people who are
trying to make the kernel compile with that compiler.. We shouldn't
make it arbitrarily harder for them, so *some* level of portability is
a good idea.

Then there is icc, but I don't know how relevant that would ever be.
At least LLVM has the potential to be widely available.

Of course, they may both already support even the odd gcc builtins -
we already use a lot of the more straightforward ones...

Linus


Re: [tip:x86/mm] x86, mm: Use a bitfield to mask nuisance get_user() warnings

2013-02-12 Thread Linus Torvalds
On Tue, Feb 12, 2013 at 10:25 AM, H. Peter Anvin  wrote:
> I just thought up this variant, I'm about to test it, but H.J., do you
> see any problems with it?

Looks good to me. And we already use __builtin_choose_expr(), so it's
"portable". And it should avoid all the potential issues with
bitfields (rmk already pointed out how bitfields don't work well with
the ARM model, who knows what other pitfalls bitfield code generation
could have)
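
[ Editor's sketch, not part of the original mail: one way a
__builtin_choose_expr()-based type selection can look. The names
INTTYPE_SKETCH and holds_without_truncation are made up for
illustration, and this assumes gcc or a compatible compiler; the
actual patch under discussion may well differ. ]

```c
/*
 * Hypothetical sketch: pick an unsigned integer type wide enough to
 * hold an object of the same size as x, without using a bitfield.
 * Relies on the gcc extensions __typeof__ and __builtin_choose_expr.
 */
#define INTTYPE_SKETCH(x) \
	__typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL))

/*
 * Store a 64-bit value in the selected temporary type and check that
 * nothing was truncated. Returns 1 if the value survived intact.
 */
static int holds_without_truncation(void)
{
	unsigned long long big = 0x1122334455667788ULL;
	INTTYPE_SKETCH(big) tmp = big;

	return tmp == big;
}
```

[ The temporary is sized from the type of the expression at compile
time, so small types widen to "unsigned long" and anything larger
gets "unsigned long long". ]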

I wonder if we could/should eventually do some of the sizeof() in
generic code - not have these magic things duplicated in all the
architectures, just have the architectures specify the raw typed
details (__copy_to_user_4() etc). So cross-platform portability could
be a good thing. That's a separate discussion, though, and possibly
not worth it.

  Linus


Re: Abysmal HDD/USB write speed after sleep on a UEFI system

2013-02-12 Thread Linus Torvalds
On Tue, Feb 12, 2013 at 10:29 AM, Artem S. Tashkinov  wrote:
> Feb 12, 2013 11:30:20 PM, Linus Torvalds wrote:
>>
>>A few things to try to pinpoint:
>>
>> (a) Is it *only* write performance that suffers, or is it other
>>performance too? Networking (DMA? Perhaps only writing *to* the
>>network?)? CPU?
>
> I've tested hdparm -tT --direct and the output on boot and after suspend
> is quite similar.
>
> I've also checked my network read/write speed, and it's the same
> ~ 100MBit/sec (I have no 1Gbit computers on my network
> unfortunately).

Ok. So it really sounds like just USB and HD writes. Which is quite
odd, since they have basically nothing in common I can think of
(except the obvious block layer issues).

>> (b) the fact that it apparently happens with both SATA and USB
>>implies that it's neither, and is more likely something core like
>>memory speed (mtrr, caching) or PCI (DMA, burst sizes, whatever).
>
> I've no idea, please, check my bug report where I've just added lots of
> information including a diff between on boot and after suspend.

I'm not seeing anything particularly interesting there.

Except why/how did the MSI address/data change for the SATA
controller? The irq itself hasn't changed.. There's probably some sane
reason for that too (it's an odd encoding, maybe they code for the
same thing), and there's nothing like that for USB, so...

And if it was irq problems, I'd expect you to see it more for reads
than for writes anyway. Along with a few messages about missed irqs
and whatever.

I'm stumped, and have no ideas. I can't even begin to guess how this
would happen. One thing to try is if it happens for all USB ports (you
have multiple controllers) and I assume performance doesn't come back
if you unplug and replug the USB disk..

 Linus


Re: Debugging Thinkpad T430s occasional suspend failure.

2013-02-12 Thread Linus Torvalds
On Tue, Feb 12, 2013 at 11:39 AM, Dave Jones  wrote:
> My Thinkpad T430s suspend/resumes fine most of the time. But every so often
> (like one in ten times or so), as soon as I suspend, I get a black screen,
> and a blinking power button.
>
> (Note: Not the capslock lights like when we panic, this laptop 'conveniently'
>  doesn't have those. This is the light surrounding the power button, which
>  afaik isn't even OS controlled, so maybe we're dying somewhere in SMI/BIOS
>  land?)

Yeah, the blinking power light is a feature of the chipset, the SMI
code sets a magic bit in one of the registers and it will pulse a pin at a
given frequency so that you get the "power light blinking while
suspended" thing.

So the suspend finished, and

> I tried debugging this with pm_trace, which told me..
>
> [4.576035]   Magic number: 0:455:740
> [4.576037]   hash matches drivers/base/power/main.c:645
>
> Which points me at..
>
>  642  Complete:
>  643 complete_all(&dev->power.completion);
>  644
>  645 TRACE_RESUME(error);
>  646
>  647 return error;
>  648 }

I suspect it's the last tracepoint, and the kernel thinks it
successfully resumed all devices. You *should* be able to match the
magic number with the last device too, but that's only interesting if
you get the hash matching *before* the device is resumed (ie you can
try to figure out if the resume hung in the device resume list). And
it only works if it gets a matching name on the dpm_list (see
show_dev_hash), and it apparently didn't. I suspect it's some system
device and not interesting, and you really just hit the last entry in
the resume tree.

> The only thing interesting here I think is that this is the resume path.
> So perhaps something failed to suspend, and we tried to back out of
> suspending, but something was too screwed up to abort cleanly ?

Yes, the trace is definitely in the resume path. And maybe we have something

> I've tried hooking up a serial console, and even tried console_noblank,
> which yielded no additional info at all. (I'm guessing the consoles are
> suspended at the time of panic)

serial consoles and even nonblanking consoles seldom tend to work well
for suspend debugging. It *has* happened, but it's rare.

> I also tried unloading all the modules I have loaded before the suspend, which
> seemed to reduce the chances of it happening, but eventually it reoccurred.
>
> Any ideas on how I can further debug this ?

The design of the TRACE_RESUME() thing really is as a really poor man's
"printf()". IOW, the existing points are more "suggested starting
points" than anything else, and the idea is that you can start adding
more and more of them as you try to narrow down exactly where it
fails..

And it's painful as hell. Plus add too many of them, and you get hash
collisions etc. It's a last-ditch effort, but it exists mainly because
we have never really figured out anything better.

There's a reason I've asked Intel for better CPU lockup tracing
facilities for the last 10+ years ;)

   Linus


Re: [tip:x86/mm] x86, mm: Redesign get_user with a __builtin_choose_expr hack

2013-02-12 Thread Linus Torvalds
So this looks clean, but I noticed something (that was true even of
the old 64-bit accesses)

On Tue, Feb 12, 2013 at 12:55 PM, tip-bot for H. Peter Anvin
 wrote:
> +   register __inttype(*(ptr)) __val_gu asm("%edx");\

How does gcc even allow this?

On x86-32, you cannot put a 64-bit value in %edx.

Where do the upper bits go? It clearly cannot be %edx:%eax, since we
put the error value in %eax.

So is the rule for x86-32 that naming "long long" register values
names the first register, and the high bits go into the next one (I
forget the crazy register numbering, I assume it's %ecx). Or what?
This should have a comment.

Also, come to think of it, we have tried the "named register
variables" thing before, and it has resulted in problems with scope.
In particular, two variables within the same scope and the same
register have been problematic. And it *does* happen, when you have
things like

   /* copy_user */
   put_user(get_user(.., addr), addr2);

and then things go downhill.

Maybe we do not have these issues, but there are good reasons why
we've tried very hard on x86 to avoid named register variables.

(I realize that they happen, and some other architectures don't even
have good support for naming registers any other way so they are way
more common there, so I probably worry needlessly, but it does worry
me).

 Linus


Re: [tip:x86/mm] x86, mm: Redesign get_user with a __builtin_choose_expr hack

2013-02-12 Thread Linus Torvalds
On Tue, Feb 12, 2013 at 3:19 PM, H. Peter Anvin  wrote:
>
> Yes, but there doesn't seem to be any other way to do this.  gcc won't
> even allow "=cd" even if we know the variable is 64 bits, even though
> "=A" is documented to be equivalent to "=da".

No, "=da" means value "in %edx _or_ %eax". Not the same as "A".

But you're right, there's nothing similar for %ebx:%ecx. I thought
there was. I was really sure we did something special for 64-bit adc
etc.

> Let me know what you think.

I guess we don't have any choice. And the other cleanups certainly look good.

  Linus


Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Linus Torvalds
On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
>
> If you guys are already looking at this, the conversions between size_t,
> long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could
> trigger anything from:

Good catch.

Let's just change the "(long)bufsz < 0" into "bufsz > INT_MAX".

I suspect we should change some of the "int" arguments to "size_t" too
so that we don't have these kinds of odd "different routines see
different values due to subtle casting errors", but in the end we
don't really want to ever help people have these kinds of potential
overflow issues. We already limit normal read/write/sendmsg etc to
INT_MAX (although we tend to *truncate* it to INT_MAX rather than
return an error, but I think the simpler patch here is preferable
unless somebody complains).
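
[ Editor's sketch, not part of the original mail: a small userspace
illustration of why the two checks differ. The function names are
hypothetical and this is not the ipc/msg code itself. It assumes a
platform where "long" is wider than "int" for the interesting case. ]

```c
#include <limits.h>
#include <stddef.h>

/*
 * Old-style check: only rejects sizes that look negative when viewed
 * as a long. Where long is 64 bits, a value just above INT_MAX still
 * passes, even though a callee taking "int" would see garbage.
 */
static int rejected_by_long_cast(size_t bufsz)
{
	return (long)bufsz < 0;
}

/*
 * Proposed check: reject anything that cannot be represented as a
 * non-negative int, regardless of how wide long is.
 */
static int rejected_by_int_max(size_t bufsz)
{
	return bufsz > (size_t)INT_MAX;
}
```

[ With the second form, every routine further down the call chain
sees the same value whether it takes an int, a long or a size_t. ]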

Comments?

  Linus


Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Linus Torvalds
On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
>
> By just playing with the 'msgsz' parameter with MSG_COPY set.

Hmm. Looking closer, I suspect you're testing without commit
88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
should limit the size passed in to prepare_copy -> load_copy to
msg_ctlmax.

Now, I think it's possibly still a good idea to limit bufsz to INT_MAX
regardless, but as far as I can see that prepare_copy -> load_copy
path is the only place that can get confused. Everybody else uses
size_t (or "long" in the case of r_maxsize) as far as I can tell.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


af_unix udev startup regression

2013-04-04 Thread Linus Torvalds
[ Fixed odd legacy subject line that has nothing to do with the actual bug ]

Hmm. Can you double-check and verify that reverting that commit makes
things work again for you?

Also, what's your distribution and setup? I'd like this to get
verified, just to see that it's not some timing-dependent thing or a
bisection mistake, but if so, then the LSB test-cases obviously have
to be fixed, and the commit that causes the problem needs to be
reverted. Test-cases count for nothing compared to actual users.

Linus

On Thu, Apr 4, 2013 at 9:17 AM, Lai Jiangshan  wrote:
> Hi, ALL
>
> I also encountered the same problem.
>
> git bisect:
>
> 14134f6584212d585b310ce95428014b653dfaf6 is the first bad commit
> commit 14134f6584212d585b310ce95428014b653dfaf6
> Author: dingtianhong 
> Date:   Mon Mar 25 17:02:04 2013 +
>
> af_unix: dont send SCM_CREDENTIAL when dest socket is NULL
>
> SCM_SCREDENTIALS should apply to write() syscalls only either source or
> destination
> socket asserted SOCK_PASSCRED. The original implememtation in
> maybe_add_creds is wrong,
> and breaks several LSB testcases ( i.e.
> /tset/LSB.os/netowkr/recvfrom/T.recvfrom).
>
> Origionally-authored-by: Karel Srot 
> Signed-off-by: Ding Tianhong 
> Acked-by: Eric Dumazet 
> Signed-off-by: David S. Miller 
>
> :040000 040000 ef0356cc0fc168a39c0f94cff0ba27c46c4d0048
> ae34e59f235c379f04d6145f0103cccd5b3a307a M net
>
> ===
> Like Brian Gerst, no obvious bug, but the system can't boot, "service udev
> start" fails when boot
> (also DEBUG_PAGEALLOC=n, I did not try to test with it=y)
>
> [   11.022976] systemd[1]: udev-control.socket failed to listen on sockets:
> Address already in use
> [   11.023293] systemd[1]: Unit udev-control.socket entered failed state.
> [   11.182478] systemd-readahead-replay[399]: Bumped block_nr parameter of
> 8:16 to 16384. This is a temporary hack and should be removed one day.
> [   14.473283] udevd[410]: bind failed: Address already in use
> [   14.478630] udevd[410]: error binding udev control socket
> [   15.201158] systemd[1]: udev.service: main process exited, code=exited,
> status=1
> [   16.900792] udevd[427]: error binding udev control socket
> [   18.356484] EXT4-fs (sdb7): re-mounted. Opts: (null)
> [   19.738401] systemd[1]: udev.service holdoff time over, scheduling
> restart.
> [   19.742494] systemd[1]: Job pending for unit, delaying automatic restart.
> [   19.747764] systemd[1]: Unit udev.service entered failed state.
> [   19.752303] systemd[1]: udev-control.socket failed to listen on sockets:
> Address already in use
> [   19.770723] udevd[459]: bind failed: Address already in use
> [   19.771027] udevd[459]: error binding udev control socket
> [   19.771175] udevd[459]: error binding udev control socket
> [   19.813256] systemd[1]: udev.service: main process exited, code=exited,
> status=1
> [   19.914450] systemd[1]: udev.service holdoff time over, scheduling
> restart.
> [   19.918374] systemd[1]: Job pending for unit, delaying automatic restart.
> [   19.923392] systemd[1]: Unit udev.service entered failed state.
> [   19.923808] systemd[1]: udev-control.socket failed to listen on sockets:
> Address already in use
> [   19.943792] udevd[465]: bind failed: Address already in use
> [   19.944056] udevd[465]: error binding udev control socket
> [   19.944210] udevd[465]: error binding udev control socket
> [   19.946071] systemd[1]: udev.service: main process exited, code=exited,
> status=1
> [   20.047524] systemd[1]: udev.service holdoff time over, scheduling
> restart.
> [   20.051939] systemd[1]: Job pending for unit, delaying automatic restart.
> [   20.057539] systemd[1]: Unit udev.service entered failed state.
> [   20.058069] systemd[1]: udev-control.socket failed to listen on sockets:
> Address already in use
> [   20.081141] udevd[467]: bind failed: Address already in use
> [   20.087120] udevd[467]: error binding udev control socket
> [   20.092040] udevd[467]: error binding udev control socket
> [   20.096519] systemd[1]: udev.service: main process exited, code=exited,
> status=1
> [   20.184910] systemd[1]: udev.service holdoff time over, scheduling
> restart.
> [   20.189863] systemd[1]: Job pending for unit, delaying automatic restart.
> [   20.195440] systemd[1]: Unit udev.service entered failed state.
> [   20.196012] systemd[1]: udev-control.socket failed to listen on sockets:
> Address already in use
> [   20.220543] udevd[469]: bind failed: Address already in use
> [   20.220584] udevd[469]: error binding udev control socket
> [   20.220780] udevd[469]: error binding udev control socket
> [   20.222830] systemd[1]: udev.service: main process exited, code=exited,
> status=1
> [   20.323906] systemd[1]: udev.service holdoff time over, scheduling
> restart.
> [   20.329170] systemd[1]: Job pending for unit, delaying automatic restart.
> [   20.334785] systemd[1]: Unit udev.service entered failed state.
> [   20.335318] systemd[1]: 

Re: [PATCH] mm: prevent mmap_cache race in find_vma()

2013-04-04 Thread Linus Torvalds
On Thu, Apr 4, 2013 at 11:35 AM, Hugh Dickins  wrote:
>
> find_vma() can be called by multiple threads with read lock
> held on mm->mmap_sem and any of them can update mm->mmap_cache.
> Prevent compiler from re-fetching mm->mmap_cache, because other
> readers could update it in the meantime:

Ack. I do wonder if we should mark the unlocked update too some way
(also in find_vma()), although it's probably not a problem in practice
since there's no way the compiler can reasonably really do anything
odd with it. We *could* make that an ACCESS_ONCE() write too just to
highlight the fact that it's an unlocked write to this optimistic data
structure.
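
[ Editor's sketch, not part of the original mail: a simplified
userspace analogue of an optimistic lookup cache with both the read
and the write marked. ACCESS_ONCE is written here the way the kernel
macro is commonly defined; struct range, struct table and lookup()
are hypothetical and much simpler than the real find_vma(). ]

```c
#include <stddef.h>

/* Force exactly one access through a volatile-qualified lvalue. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct range {
	unsigned long start, end;	/* covers [start, end) */
};

struct table {
	struct range *cache;	/* optimistic hint, updated by readers */
	struct range *ranges;
	size_t count;
};

static struct range *lookup(struct table *t, unsigned long addr)
{
	/* Read the hint once into a local so it cannot be re-fetched. */
	struct range *r = ACCESS_ONCE(t->cache);
	size_t i;

	if (r && r->start <= addr && addr < r->end)
		return r;

	for (i = 0; i < t->count; i++) {
		r = &t->ranges[i];
		if (r->start <= addr && addr < r->end) {
			/* Unlocked write to the hint, marked as such. */
			ACCESS_ONCE(t->cache) = r;
			return r;
		}
	}
	return NULL;
}
```

[ The marked write generates the same store as a plain assignment
would here; its value is mostly that it can be found with grep. ]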

Anyway, applied.

  Linus


Re: [PATCH] mm: prevent mmap_cache race in find_vma()

2013-04-04 Thread Linus Torvalds
On Thu, Apr 4, 2013 at 12:01 PM, Hugh Dickins  wrote:
>
> When Paul reminded us of it yesterday, I came to wonder if actually
> every use of ACCESS_ONCE in the read form should strictly be matched
> by ACCESS_ONCE whenever modifying the location.
>
> My uneducated guess is that strictly it ought to, in the sense of
> insurance policy; but that (apart from that strange split writing
> issue which came up a couple of months ago) in practice our compilers
> have not "advanced" to the point of making this an issue yet.

I don't see how a compiler could reasonably really ever do anything
different, but I do think the ACCESS_ONCE() modification version might
be a good thing just as a "documentation".

This is a good example of this issue, exactly because we have a mix of
both speculative cases (the find_vma() lookup and modification)
together with strictly exclusive locked accesses to the same field
(the ones that invalidate the cache under the write lock). So
documenting that the write in find_vma() is this kind of "optimistic
unlocked access" is actually a potentially interesting piece of
information for programmers, completely independently of whether the
compiler will then treat it really differently or not.

Of course, a plain comment would do the same, but would be less greppable.

And despite the verbiage here, I don't really have a very strong
opinion on this. I'm going to let it go, and if somebody sends me a
patch with a good explanation in the next merge window, I'll probably
apply it.

Linus


Re: [GIT PULL] Sound fixes for 3.9-rc6

2013-04-05 Thread Linus Torvalds
On Fri, Apr 5, 2013 at 12:46 AM, Takashi Iwai  wrote:
>
> please pull sound fixes for v3.9-rc6 from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git for-linus

Argh, Takashi, you're usually so reliable...

But you actually meant for me to pull the sound-3.9 tag, didn't you?
That "for-linus" branch isn't a signed tag..

Please double-check your scripts,

   Linus


Re: GFS2: Pull request (fixes)

2013-04-05 Thread Linus Torvalds
On Fri, Apr 5, 2013 at 9:27 AM, David Teigland  wrote:
> On Fri, Apr 05, 2013 at 11:34:45AM +0100, Steven Whitehouse wrote:
>> Please consider pulling the following changes,
>
> There's some mixup here that should be cleared up first.
>
>> David Teigland (2):
>>   GFS2: Fix unlock of fcntl locks during withdrawn state
>>
>> Steven Whitehouse (1):
>>   GFS2: Fix unlock of fcntl locks during withdrawn state

Looks like the summary line for one got leaked through an email
follow-up to the other. So now the summary of the second commit is
meaningless and doesn't actually describe it.

   Linus


Re: [PATCH v2] firmware,IB/qib: revert firmware file move

2013-04-05 Thread Linus Torvalds
On Fri, Apr 5, 2013 at 11:15 AM, Mike Marciniszyn
 wrote:
> Commit e2eed58 ("IB/qib: change QLogic to Intel") moved a firmware file
> potentially breaking the ABI.

Please send things like this generated with the "-M" flag so that you
can see it as a rename, instead of a huge add/del patch.

Sure, some people may still use traditional "patch", but catering to
them when it actually hides what the patch does is just not worth it.

   Linus


Re: [git pull] Please pull powerpc.git merge branch

2013-01-28 Thread Linus Torvalds
On Mon, Jan 28, 2013 at 3:42 PM, Benjamin Herrenschmidt
 wrote:
>
> Whenever you have a chance between two dives, you might want to consider
> pulling my merge branch to pickup a few fixes for 3.8 that have been
> accumulating for the last couple of weeks (I was myself travelling
> then on vacation).

I'll have you know that I haven't quite even left for Au yet, and I
have LCA before diving. So no snarky "in between dives" comments,
please.

At least not for a few days.

>   git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git

Nothing there. Forgot to push? Or some unnamed branch/tag?

(And I _am_ leaving for the airport soon, so I may not get to it for a
while unless you reply asap)

  Linus


Re: BUG: circular locking dependency detected

2013-01-30 Thread Linus Torvalds
On Thu, Jan 31, 2013 at 9:19 AM, Russell King  wrote:
>
> So... what you seem to be telling me is that 3.9 is going to be a
> release which issues lockdep complaints when the console blanks, and
> you think that's acceptable?
>
> Adding Linus and Andrew so they're aware of this issue...

Oh, we're extremely aware of it. And it's not a new issue, the locking
problems have apparently been around forever, although I'm not sure why
the lockdep splat itself started happening only recently.

They'll make it into 3.9, it's 3.8 that won't have them. The patches
initially caused way *worse* behavior than just a lockdep splat - they
caused actual hard lockups (and that was *after* the initial series of
fixes). That got fixed (hopefully for the last case!) fairly recently,
and I'm not willing to take the scary patch-series that has had
several problem cases.

  Linus


Re: BUG: circular locking dependency detected

2013-01-30 Thread Linus Torvalds
On Thu, Jan 31, 2013 at 11:13 AM, Russell King  wrote:
>
> Which may or may not be a good thing depending how you look at it; it
> means that once your kernel blanks, you get a lockdep dump.  At that
> point you lose lockdep checking for everything else because lockdep
> disables itself after the first dump.

Fair enough, we may want to revert the lockdep checking for
console_lock, and make re-enabling it part of the patch-series that
fixes the locking.

Daniel/Dave? Does that sound reasonable?

  Linus


Re: [patch 00/40] CPU hotplug rework - episode I

2013-01-31 Thread Linus Torvalds
On Fri, Feb 1, 2013 at 8:48 AM, Thomas Gleixner  wrote:
>> Methinks Tejun needed a cc on this lot ;)
>
> Not really.

I think we want as many people as possible cc'd on this. You may think
it's an obvious improvement, but maybe it's just because you now
understand the code because you wrote it yourself, not because it's
*actually* better.

Having some explicitly documented states may be nice, but do we need
eleven of them? And do we want to expose them? At least not for the
f*cking notifiers, I hope. Notifiers are a disgrace, and almost all of
them are a major design mistake. They all have locking problems, they
introduce internal arbitrary API's that are hard to fix later (because
you have random people who decided to hook into them, which is the
whole *point* of those notifier chains).

Since the patches themselves weren't cc'd, I don't know if you
actually made each state transition do those insane notifiers or not,
but I seriously hope you didn't. With that many states, hopefully the
idea is that you don't have any notifiers at all, and you just then
call the people associated with a particular state directly. Yes? No?

Because if this adds tons of new notifiers, I'm going to say that we
need about a hundred people signing off on the patches.  Part of your
explanation made me think you got rid of the notifiers, but then it
became clear that you just renamed them as "state callbacks". If
that's some generic exposed interface, I'll NAK it. No way in hell do
we want to expose eleven states with some random generic "SMP state
callback interface". F*ck no.

Linus


Re: [patch 00/40] CPU hotplug rework - episode I

2013-01-31 Thread Linus Torvalds
On Fri, Feb 1, 2013 at 9:44 AM, Thomas Gleixner  wrote:
>
> Just face it. The current hotplug maze has 100+ states which are
> completely undocumented. They are asymmetric vs. startup and
> teardown. They just exist and work somehow aside of the occasional
> hard to decode hiccup.
>
> Do you really want to preserve that state by all means [F*ck no]?

No. But I also don't want to replace it with "there's now eleven
documented states, and random people hook into random documented
states".

So for me it's the "expose these states" that I get worried about. A
random driver should not necessarily even be able to *see* this, and
decide to be clever and take advantage of the ordering.

So I'd hope there would be some visibility restrictions. We currently
have drivers already being confused by DOWN_PREPARE vs DOWN_FAILED etc
etc random state transitions, and giving them even more flexibility to
pick random states sounds like a really bad idea. I'd like to make
sure that drivers and filesystems etc do not even *see* the states
that are meant for the scheduler or workqueues, for example.

So 11 states (although some of those seem to have lots of substates,
so there may be many more) is too many to *expose*. It's not
necessarily too many to "have and document", if you see the
difference.
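
One way to picture the "have and document, but don't expose" distinction
is the sketch below. It is a minimal illustration only, not code from the
patch series: every name in it (hp_state, hp_core_register,
hp_driver_register, hp_bring_up, example_driver_online) is hypothetical.
The full state list stays private to the core, and the driver-facing
entry point takes no state argument at all, so a driver cannot pick an
ordering slot that is meant for the scheduler or workqueues.

```c
#include <stddef.h>

/* Hypothetical full state list. In this model only the core sees it. */
enum hp_state {
    HP_OFFLINE,
    HP_SCHED_STARTING,     /* core-only: scheduler */
    HP_WORKQUEUE_ONLINE,   /* core-only: workqueues */
    HP_DRIVER_ONLINE,      /* the one slot drivers are allowed to use */
    HP_NR_STATES
};

typedef int (*hp_callback)(unsigned int cpu);

/* One directly-called owner per state, instead of a notifier chain. */
static hp_callback hp_callbacks[HP_NR_STATES];

/* Core-internal registration: any state may be claimed, but only once. */
static int hp_core_register(enum hp_state state, hp_callback cb)
{
    if ((int)state < 0 || state >= HP_NR_STATES || hp_callbacks[state] != NULL)
        return -1;
    hp_callbacks[state] = cb;
    return 0;
}

/* Driver-facing registration: the state is not a parameter at all. */
int hp_driver_register(hp_callback cb)
{
    return hp_core_register(HP_DRIVER_ONLINE, cb);
}

/* Bring a CPU up by calling each state's owner directly, in order. */
int hp_bring_up(unsigned int cpu)
{
    for (int s = HP_OFFLINE + 1; s < HP_NR_STATES; s++) {
        if (hp_callbacks[s] != NULL && hp_callbacks[s](cpu) != 0)
            return -1;     /* stop at the first state that fails */
    }
    return 0;
}

/* Example driver callback (hypothetical): remembers the last CPU seen. */
static unsigned int last_online_cpu = (unsigned int)-1;

static int example_driver_online(unsigned int cpu)
{
    last_online_cpu = cpu;
    return 0;
}
```

In a real tree the same effect would more likely come from keeping the
full enum in a private header; the single-file "static" here only stands
in for that.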

   Linus


Linux 3.8-rc6

2013-01-31 Thread Linus Torvalds
   DM-RAID: Fix RAID10's check for sufficient redundancy

Kukjin Kim (1):
  pinctrl: samsung: removing duplicated condition for PINCTRL_SAMSUNG

Larry Finger (1):
  rtlwifi: Fix build warning introduced by commit a290593

Lee Jones (1):
  mfd: Fix compile errors and warnings when !CONFIG_AB8500_BM

Li RongQing (2):
  ah4/esp4: set transport header correctly for IPsec tunnel mode.
  ah6/esp6: set transport header correctly for IPsec tunnel mode.

Li Zhong (1):
  powerpc: Fix MAX_STACK_TRACE_ENTRIES too low warning for ppc32

Liam Girdwood (2):
  regulator: MAINTAINERS: update email address
  ASoC: MAINTAINERS: Update email address.

Lingzhu Xiang (1):
  efivarfs: Drop link count of the right inode

Linus Torvalds (1):
  Linux 3.8-rc6

Linus Walleij (2):
  mfd: db8500-prcmu: Fix irqdomain usage
  mfd: tc3589x: Use simple irqdomain

Maarten Lankhorst (2):
  x86/dma-debug: Bump PREALLOC_DMA_DEBUG_ENTRIES
  x86, efi: remove attribute check from setup_efi_pci

Mark Brown (8):
  ASoC: dapm: Fix sense of regulator bypass mode
  ASoC: wm5102: Correct AEC loopback mask
  ASoC: wm5110: Correct AEC loopback mask
  ASoC: arizona: Use actual rather than desired BCLK when calculating LRCLK
  ASoC: wm_adsp: Use GFP_DMA for things that may be DMAed
  mfd: arizona: Disable control interface reporting for WM5102 and WM5110
  mfd: arizona: Check errors from regcache_sync()
  mfd: wm5102: Fix definition of WM5102_MAX_REGISTER

Matt Fleming (5):
  efivarfs: Never return ENOENT from firmware
  efivarfs: Delete dentry from dcache in efivarfs_file_write()
  x86, efi: Set runtime_version to the EFI spec revision
  efi: Make 'efi_enabled' a function to query EFI facilities
  samsung-laptop: Disable on EFI hardware

Matthias Schiffer (3):
  batman-adv: fix skb leak in batadv_dat_snoop_incoming_arp_reply()
  batman-adv: check for more types of invalid IP addresses in DAT
  batman-adv: filter ARP packets with invalid MAC addresses in DAT

Michal Kubecek (1):
  xfrm: fix freed block size calculation in xfrm_policy_fini()

Michel Dänzer (1):
  drm/radeon: Enable DMA_IB_SWAP_ENABLE on big endian hosts.

Mike Snitzer (1):
  dm thin: fix queue limits stacking

Nathan Zimmer (1):
  efi, x86: Pass a proper identity mapping in efi_call_phys_prelog

Neil Horman (1):
  sctp: refactor sctp_outq_teardown to insure proper re-initalization

Nicholas Santos (1):
  HID: usbhid: quirk for Formosa IR receiver

Nickolai Zeldovich (2):
  3c574_cs: fix operator precedence between << and &
  net/xfrm/xfrm_replay: avoid division by zero

Nithin Nayak Sujir (2):
  tg3: Avoid null pointer dereference in tg3_interrupt in netconsole mode
  tg3: Fix crc errors on jumbo frame receive

Olivier Sobrie (3):
  can: c_can: fix invalid error codes
  can: ti_hecc: fix invalid error codes
  can: pch_can: fix invalid error codes

Or Gerlitz (1):
  net/mlx4_core: Set number of msix vectors under SRIOV mode to firmware defaults

Pablo Neira Ayuso (2):
  netfilter: xt_CT: fix unset return value if conntrack zone are disabled
  netfilter: nf_conntrack: fix BUG_ON while removing nf_conntrack with netns

Paul Moore (2):
  selinux: add the "attach_queue" permission to the "tun_socket" class
  tun: fix LSM/SELinux labeling of tun/tap devices

Peter Korsgaard (1):
  dm9601: support dm9620 variant

Piotr Haber (1):
  brcmsmac: increase timer reference count for new timers only

Pravin B Shelar (1):
  IP_GRE: Fix kernel panic in IP_GRE with GRE csum.

Rahul Sharma (1):
  drm/exynos: let drm handle edid allocations

Ralf Baechle (5):
  MIPS: BCM47xx: Enable SSB prerequisite SSB_DRIVER_PCICORE.
  MIPS: Export .
  MIPS: Add struct p_format to union mips_instruction.
  MIPS: PNX833x: Fix comment.
  MIPS: Octeon: Fix warning.

Randy Dunlap (1):
  x86/olpc: Fix olpc-xo1-sci.c build errors

Rob Herring (1):
  net: calxedaxgmac: throw away overrun frames

Romain KUNTZ (1):
  ipv6: fix header length calculation in ip6_append_data()

Sachin Kamat (4):
  drm/exynos: Make g2d_userptr_get_dma_addr static
  drm/exynos: Make ipp_handle_cmd_work static
  drm/exynos: Add missing static specifiers in exynos_drm_rotator.c
  drm/exynos: Make 'drm_hdmi_get_edid' static

Sean Paul (2):
  drm/exynos: Replace mdelay with usleep_range
  drm/exynos: Remove "internal" interrupt handling

Sergio Cambra (1):
  Bluetooth device 04ca:3008 should use ath3k

Seung-Woo Kim (1):
  drm/exynos: added validation of edid for vidi connection

Shawn Guo (1):
  ASoC: fsl: fix multiple definition of init_module

Shirish S (1):
  drm/exynos: add check for the device power status

Simon Guinot (1):
  pinctrl: mvebu: fix MPP6 value for kirkwood driver

Stanislaw Gruszka (2):
  mac80211

Re: [git pull] fbcon locking fixes.

2013-01-24 Thread Linus Torvalds
On Thu, Jan 24, 2013 at 4:42 PM, Dave Airlie  wrote:
>
> These patches have been sailing around long enough, waiting for a maintainer
> to reappear, so I've decided enough is enough, lockdep is kinda useful to
> have.

Last time this was tried, these patches failed miserably.

They caused instant lockdep splat and then a total lockup with efifb.
It may be that Takashi's patch helps fix that problem, but it's in no
way clear that it does, so the patch series isn't at all obviously
stable.

Yes, lockdep is indeed "kinda useful", and there clearly are locking
problems in fbdev. But I'm not seeing myself pulling these for 3.8.
They've been too problematic to pull in at this late stage.

  Linus


Re: [git pull] fbcon locking fixes.

2013-01-24 Thread Linus Torvalds
On Thu, Jan 24, 2013 at 5:45 PM, Dave Airlie  wrote:
>
> Okay I've just sent out another fbcon patch to fix the locking harder.
>
> There was a path going into set_con2fb_path if an fb driver was
> already registered, I just pushed the locking out further on anyone
> going in there.
>
> it boots on my EFI macbook here.

Ok, good. Sounds like we'll finally get it fixed, but I'm still too
much of a scaredy-cat to take it for now, so -next it is...

  Linus


Re: [GIT PULL] Btrfs fixes

2013-01-24 Thread Linus Torvalds
On Thu, Jan 24, 2013 at 1:52 PM, Chris Mason  wrote:
>
> Update on this, we've tracked down the crc errors and are doing final
> checks on the patches.  Linus are you planning on taking this pull?  If
> not I can just fold the new stuff into a bigger request.

If you have them basically ready, add them to this, I haven't pulled
yet. So I'll just ignore this and wait for another pull request.

 Linus


Re: [tip:x86/asm] x86/defconfig: Turn on CONFIG_CC_OPTIMIZE_FOR_SIZE=y in the 64-bit defconfig

2013-01-26 Thread Linus Torvalds
On Sat, Jan 26, 2013 at 7:18 AM, H. Peter Anvin  wrote:
> On the CPUs Ling is testing on the downsides of -Os probably matter less, in
> particular since rep movsb works well.
>
> It is questionable as a generic default, though.

So being the person who really pushed for -Os to begin with (I think
I$ and instruction decode bandwidth is one of the most fundamental
limits to CPU performance), I wouldn't mind it if we reintroduced it.

HOWEVER.

It wasn't just "rep movs". The thing that killed -Os for me was that
it makes it impossible to try to optimize hot code, because -Os seems
to throw out branch prediction information. So when you use "likely()"
etc to try to teach the compiler to lay out code a certain way so that
code that never really gets executed isn't even brought into the I$,
-Os then screws it up completely.
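
For readers who have not used these annotations, here is a minimal sketch
of what "likely()"-style hints look like. The two macros follow the usual
kernel pattern built on the GCC/Clang __builtin_expect() builtin;
sum_buffer() is a made-up example function, not kernel code. Whether a
given compiler and -O level actually honors the hint when laying out the
code is exactly the point under discussion, and this sketch does not
demonstrate it either way.

```c
#include <stddef.h>

/* Branch hints in the style of the kernel's likely()/unlikely() macros. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Hypothetical helper: sum a buffer, treating bad arguments as the
 * rare case. The hint asks the compiler to keep the error return out
 * of the straight-line path through the loop.
 */
long sum_buffer(const int *buf, int len)
{
    long total = 0;

    if (unlikely(buf == NULL || len < 0))
        return -1;              /* cold path */

    for (int i = 0; i < len; i++)
        total += buf[i];        /* hot path */

    return total;
}
```

Comparing the output of "gcc -O2 -S" and "gcc -Os -S" on a file like this
is one way to see how each setting places the cold return.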

Of course, maybe newer versions of gcc might not suck so horribly with
-Os, I haven't actually tried in a while.

[ Just tested. Still does it ]

Also, I doubt Ling was testing a SB CPU. Because "rep movsb" still
sucks pretty bad on SB. What core *is* Ling testing? Haswell?

Ugh. We could make it depend on the optimization target. I'd also wish
there was some way to just tune gcc -Os to be closer to reasonable. Or
make -O2 not do some of the excessive crap it does (it aligns code
*much* too much, for example - who cares if you can do it with a
single instruction, if that instruction is so long that it uses up
half your decode bandwidth?)

The problem, of course, is that most -O2 code generation is done
assuming hot loops that don't show much if any I$ issues. And the -Os
thing is done *purely* for size, not taking any performance into
account at all. There's no balanced middle ground, which is what _we_
would want.

  Linus


Re: [PATCH]smp: Fix send func call IPI to empty cpu mask

2013-01-26 Thread Linus Torvalds
On Fri, Jan 25, 2013 at 11:53 PM, Wang YanQing  wrote:
> I get below warning every day with 3.7,
> one or two times per day.
>
> [ 2235.186027] WARNING: at /mnt/sda7/kernel/linux/arch/x86/kernel/apic/ipi.c:109 default_send_IPI_mask_logical+0x2f/0xb8()
> [ 2235.186030] Hardware name: Aspire 4741
> [ 2235.186032] empty IPI mask
> [ 2235.186079]  [] native_send_call_func_ipi+0x4f/0x57
> [ 2235.186087]  [] smp_call_function_many+0x191/0x1a9
> [ 2235.186097]  [] native_flush_tlb_others+0x21/0x24
> [ 2235.186101]  [] flush_tlb_page+0x63/0x89
> [ 2235.186105]  [] ptep_set_access_flags+0x20/0x26
> [ 2235.186111]  [] do_wp_page+0x234/0x502
> [ 2235.186121]  [] handle_pte_fault+0x50d/0x54c
> [ 2235.186148]  [] handle_mm_fault+0xd0/0xe2
> [ 2235.186153]  [] __do_page_fault+0x411/0x42d
> [ 2235.186166]  [] do_page_fault+0x8/0xa
> [ 2235.186170]  [] error_code+0x5a/0x60
>
> This patch fixes it.
>
> This patch also fixes a system hang problem:
> if data->cpumask is cleared after passing the check
>
>     if (WARN_ONCE(!mask, "empty IPI mask"))
>         return;
>
> then the problem that commit 83d349f3 fixed will happen again.

Hmm. We have very consciously tried to avoid the extra copy, although
I'm not entirely sure why (it might possibly hurt on the MAXSMP
configuration).

See for example commit 723aae25d5cd ("smp_call_function_many: handle
concurrent clearing of mask") which fixed another version of this
problem.

But I do agree that it looks like the copy is required, simply because
- as you say - once we've done the "list_add_rcu()" to add it to the
queue, we can have (another) IPI to the target CPU that can now see it
and clear the mask.

So by the time we get to actually send the IPI, the mask might have
been cleared by another IPI. So I do agree that your patch seems
correct, but I really really want to run it by other people.

Guys? Original patch on lkml. The other possible fix might be to take
the &call_function.lock earlier in
generic_smp_call_function_interrupt(), so that we can never clear the
bit while somebody is adding entries to the list... But I think it
very much tries to avoid that on purpose right now, with only the last
CPU responding to that IPI taking the lock.

So copying the IPI mask seems to be the reasonable approach. Comments?
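
To make the race concrete, here is a small user-space model of the "copy
the mask" approach. It is only a sketch under stated assumptions: the
cpumask is reduced to a 64-bit word, and call_data, prepare_call,
cpu_ack, send_call and send_ipi_mask are hypothetical stand-ins, not the
kernel's functions. The sender takes a private snapshot before the entry
would become visible to other CPUs, and later sends the IPI from that
snapshot rather than from the shared mask that responders are clearing.

```c
#include <stdatomic.h>

/* Toy stand-in for a cpumask: one bit per CPU, at most 64 CPUs. */
struct call_data {
    _Atomic unsigned long long cpumask;  /* responders clear their own bit */
};

/* Hypothetical stand-in for the arch IPI send: records what it was given. */
static unsigned long long last_ipi_mask;

static void send_ipi_mask(unsigned long long mask)
{
    last_ipi_mask = mask;
}

/* What a responding CPU does in this model: clear its bit in the shared mask. */
void cpu_ack(struct call_data *data, int cpu)
{
    atomic_fetch_and(&data->cpumask, ~(1ULL << cpu));
}

/*
 * Sender, step 1: fill in the shared mask and keep a private copy.
 * In the kernel the entry would be published (list_add_rcu) right
 * after this point, so other CPUs may start clearing bits at once.
 */
unsigned long long prepare_call(struct call_data *data,
                                unsigned long long targets)
{
    atomic_store(&data->cpumask, targets);
    return targets;                      /* private snapshot */
}

/* Sender, step 2: send from the snapshot, never from the shared mask. */
void send_call(unsigned long long snapshot)
{
    send_ipi_mask(snapshot);
}
```

If every target has already cleared its bit by the time step 2 runs, the
shared mask is empty but the snapshot still names the original targets,
so the send never sees an empty mask.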

Linus


Re: [GIT PULL] parisc updates for 3.9

2013-02-22 Thread Linus Torvalds
On Fri, Feb 22, 2013 at 1:16 PM, Helge Deller  wrote:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git parisc-3.9

In general, I'd love to also get a short human-readable explanation of
what the pull does for the merge message. As it is, I just made
something up.

   Linus


Re: [PATCH 2/2] irq: Cleanup context state transitions in irq_exit()

2013-02-23 Thread Linus Torvalds
On Sat, Feb 23, 2013 at 10:21 AM, Frederic Weisbecker wrote:
>
> But tick_nohz_irq_exit() may trigger the timer softirq itself.

Suggestion: merge it with the whole softirq handler.

The softirq code *already* knows about the whole "oops, one softirq
may trigger another" issue, and has a loop - with protection against
excess - for exactly this reason. See the whole "goto restart" thing.

And tick_nohz_irq_exit() really has very similar semantics to
softirqs, it's just "CPU is idle and no pending reschedule" instead of
a softirq. But the basic rules are the same ("only run this at the
top-level context when exiting the last irq").

So maybe the right thing to do is move the whole "goto restart" one
level up, and do softirq's and tick_nohz_irq_exit both inside that
loop.
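
The shape of that suggestion can be sketched in a few lines. This is a
toy model, not kernel code: exit_work, irq_exit_loop and the two model_*
handlers are hypothetical, and the bound of 10 passes is an assumption
chosen for the example. The point is only that both kinds of work sit
inside one bounded restart loop, so a timer softirq raised by the tick
handling is picked up on the next pass.

```c
#define MAX_RESTART        10   /* assumed bound on passes for this model */
#define TIMER_SOFTIRQ_BIT  1u

/* Hypothetical model of the work left to do when the last irq exits. */
struct exit_work {
    unsigned int pending_softirqs;  /* bitmask of raised softirqs */
    int tick_needs_update;          /* stand-in for tick_nohz_irq_exit() work */
};

typedef void (*work_fn)(struct exit_work *w);

/* Run softirqs and tick handling in one loop; return the number of passes. */
int irq_exit_loop(struct exit_work *w, work_fn run_softirqs, work_fn run_tick)
{
    int passes = 0;

    while ((w->pending_softirqs || w->tick_needs_update) &&
           passes < MAX_RESTART) {
        if (w->pending_softirqs)
            run_softirqs(w);
        if (w->tick_needs_update)
            run_tick(w);    /* may raise a softirq, handled on the next pass */
        passes++;
    }
    return passes;          /* anything still pending is left for later */
}

/* Model handlers: softirq processing clears everything that was pending. */
static void model_run_softirqs(struct exit_work *w)
{
    w->pending_softirqs = 0;
}

/* Model tick handling: finishes its work but raises the timer softirq. */
static void model_run_tick(struct exit_work *w)
{
    w->tick_needs_update = 0;
    w->pending_softirqs |= TIMER_SOFTIRQ_BIT;
}
```

Starting with only tick work pending, the model takes two passes: the
first runs the tick handling, the second drains the timer softirq it
raised.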

  Linus


Re: [git pull] signal.git

2013-02-23 Thread Linus Torvalds
On Wed, Feb 20, 2013 at 2:52 PM, Al Viro  wrote:
> * a bunch of signal-related syscalls (both native and compat) unified.

Ok, in the meantime I had merged the parisc and powerpc trees, which
had their own fixes in this area: powerpc added the transactional
memory support for power8 (which impacted signal save/restore), and
parisc had some fixes to the routines you then removed in favor of
generic ones.

I fixed up the conflicts, and they didn't look that bad, but I could
easily have messed something up, so people - please double-check the
end result.

 Linus


Re: [GIT PULL] KVM updates for the 3.9 merge window

2013-02-24 Thread Linus Torvalds
On Wed, Feb 20, 2013 at 5:17 PM, Marcelo Tosatti  wrote:
>
> Please pull from
>
> git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-3.9-1
>
> to receive the KVM updates for the 3.9 merge window [..]

Ok, particularly the s390 people should check my resolution of the
conflicts, since they include the renaming of IOINT_VIR to IRQIO_VIR.
But the uapi header file move should be double-checked by people who
use this too.

 Linus


Re: [git pull] drm merge for 3.9-rc1

2013-02-25 Thread Linus Torvalds
On Mon, Feb 25, 2013 at 4:05 PM, Dave Airlie  wrote:
>
> So up front, this has a massive merge conflict in
> drivers/gpu/drm/radeon/evergreen_cs.c I've fixed it up in drm-next-merged
> in the same tree, I fixed up some small ordering issues in my merge as
> well, however they aren't important if you want the fun of doing a major
> conflict resolution.

I did the fun conflict resolution, so my tree doesn't have the ordering changes.

I also did some things slightly differently from you - you had left
some direct ib[] accesses that I spotted (see for example "case 0x48",
aka "Copy L2T Frame to Field"), and yours apparently has a few cases
where you use "idx_value" instead of my mindless conflict resolution
that just re-did the brute-force "replace direct ib[] read accesses
with the radeon_get_ib_value() helper function". But you don't do it
for *all* the radeon_get_ib_value(p, idx+2) users, so whatever.

Anyway - my conflict resolution isn't exactly the same as yours, and
maybe I screwed something up. But it's damn close, and the differences
_seem_ to all be benign.

Btw, why is it ok that some functions still read the ib[] array
directly (eg evergreen_vm_packet3_check() or evergreen_cs_check_reg()
etc)?


Whatever. I prefer doing my own resolutions just so that I know what's
going on, and it all seems to build and looks reasonable, but it's
always good to get a second opinion. Particularly since I can't
actually test the radeon stuff, so just eyeballing it and saying
"looks semantically identical to Dave's resolution" may not be 100%
sufficient..

 Linus


Re: [GIT PULL] Load keys from signed PE binaries

2013-02-25 Thread Linus Torvalds
On Mon, Feb 25, 2013 at 7:28 PM, Matthew Garrett  wrote:
>
> You're happy advising Linux vendors that they don't need to worry about
> module signing because it's "not obvious" that Microsoft would actually
> enforce the security model they've spent significant money developing
> and advertising?

And you're happy shilling for a broken model?

The fact is, the only valid user for the whole security model is to
PROTECT THE USER.

Your arguments constantly seem to miss that rather big point. You
think this is about bending over when MS whispers sweet nothings in
your ear..

The whole and only reason I ever merged module signatures is because
it actually allows *users* to do a good job at security. You, on the
other hand, seem to have drunk the Kool-Aid on the whole "let's
control the user" crap.

Did you forget what security was all about?

  Linus


Re: [GIT PULL] Load keys from signed PE binaries

2013-02-25 Thread Linus Torvalds
On Mon, Feb 25, 2013 at 7:42 PM, Matthew Garrett  wrote:
>
> The user Microsoft care about isn't running Linux

How f*cking hard is it for you to understand?

Stop arguing about what MS wants. We do not care. We care about the
*user*. You are continually missing the whole point of security, and
then you make some idiotic arguments about what MS wants you to do.

It's irrelevant. The only thing that matters is what our *users* want
us to do, and protecting *their* rights. As long as you seem to treat
this as some kind of "let's please MS, not our users" issue, all your
arguments are going to be crap.

Linus


Re: [GIT PULL] Load keys from signed PE binaries

2013-02-25 Thread Linus Torvalds
On Mon, Feb 25, 2013 at 7:48 PM, Matthew Garrett  wrote:
>
> Our users want to be able to boot Linux. If Microsoft blacklist a
> distribution's bootloader, that user isn't going to be able to boot
> Linux any more. How does that benefit our users?

How does bringing up an unlikely and bogus scenario - and when people
call you on it, just double down on it - help users?

Stop the fear mongering already.

So here's what I would suggest, and it is based on REAL SECURITY and
on PUTTING THE USER FIRST instead of your continual "let's please
microsoft by doing idiotic crap" approach.

So instead of pleasing microsoft, try to see how we can add real security:

 - a distro should sign its own modules AND NOTHING ELSE by default.
And it damn well shouldn't allow any other modules to be loaded at all
by default, because why the f*ck should it? And what the hell should a
microsoft signature have to do with *anything*?

 - before loading any third-party module, you'd better make sure you
ask the user for permission. On the console. Not using keys. Nothing
like that. Keys will be compromised. Try to limit the damage, but more
importantly, let the user be in control.

 - encourage things like per-host random keys - with the stupid UEFI
checks disabled entirely if required. They are almost certainly going
to be *more* secure than depending on some crazy root of trust based
on a big company, with key signing authorities that trust anybody with
a credit card. Try to teach people about things like that instead.
Encourage people to do their own (random) keys, and adding those to
their UEFI setups (or not: the whole UEFI thing is more about control
than security), and strive to do things like one-time signing with the
private key thrown out entirely. IOW try to encourage *that* kind of
"we made sure to ask the user very explicitly with big warnings and
create his own key for that particular module" security. Real
security, not "we control the user" security.

Sure, users will screw that up too. They'll want to load crazy nvidia
binary modules etc crap. But make it *their* decision, and under
*their* control, instead of trying to tell the world about how this
should be blessed by Microsoft.

Because it really shouldn't be about MS blessings, it should be about
the *user* blessing kernel modules.

Quite frankly, *you* are what the key-hating crazies were afraid of.
You peddle the "control, not security" crap-ware. The whole "MS owns
your machine" is *exactly* the wrong way to use keys.

 Linus


Re: [GIT PULL] Load keys from signed PE binaries

2013-02-25 Thread Linus Torvalds
On Mon, Feb 25, 2013 at 8:23 PM, Matthew Garrett  wrote:
>
> If the user has explicitly enrolled a hash then they're stepping outside
> the trust model.

This is the kind of totally bogus crap that no sane person should ever
spout. Stop it.

If the user has explicitly enrolled a hash, then that should be the
*primary* trust model, dammit. That should be very much what you
should care about first and foremost, and that should be your goal in
life. That's when the user says "I'm in control of my own machine, and
I want to trust *this*".

It's not about "stepping outside of the trust model". Quite the
reverse. It's about actually being *part* of the trust model, and
taking control of your own machine. It's the *good* scenario. It's
what you should encourage users to do.

No, it likely can't be the default because we shouldn't expect users
to care enough, but on the other hand the default should definitely
*not* be "enable random third party modules signed indirectly by MS",
which is what your crazy world-view seems to be.

So the first order should be: "we provide modules to cover all normal
users". You use the RH key for that.

The *second* order should be: "we encourage and tell people how to add
their own keys and sign modules they trust".

The third order should probably be "we encourage people to use random
one-time keys - probably with UEFI key checking turned off entirely,
because let's face it, that doesn't really add any real security for
most people". It's what kernel developers and most servers would
probably want to use. They likely don't do the whole UEFI crap anyway,
and random one-time keys are actually better against things like
rootkits etc than *any* centrally administered chain of trust.

Only somewhere really really deep down should the "ok, what about a MS
signature" thing be. It could be part of the user-level application
(part of your distribution) that displays the "are you really sure you
want to load this module with an unrecognized signature? I can tell
that it has a MS signature on it". But by the time you get this far,
you've already failed the first few normal levels.

Linus


Re: [GIT PULL] PCI changes for v3.9

2013-02-25 Thread Linus Torvalds
On Sat, Feb 23, 2013 at 6:49 PM, Yinghai Lu  wrote:
>
> Please check if attached diff is right, and hope it could save Linus some
> time.

Hmm. I did things a bit differently, moving things around more in
drivers/acpi/internal.h.

Also, my *gut* feel is that the new _handle_hotplug_event_root()
function should do that whole dance with
acpi_scan_lock_acquire()/acpi_scan_lock_release(), but I didn't really
know if it's required or appropriate, so I left it alone. Could you
take a look?

Linus


Re: [GIT PULL] ACPI and power management fixes for v3.9-rc1

2013-02-25 Thread Linus Torvalds
On Mon, Feb 25, 2013 at 7:17 PM, Rafael J. Wysocki  wrote:
>
> I wonder if this went unnoticed or there's anything wrong with it or it just
> needs to wait for some more time?

Just going through things slowly. It's merged in my tree now.

Oh, and a request: _please_ don't use unknown TLA's like OPP. This has
become a huge problem, to the point that we have a
"Documentation/power/opp.txt" file THAT NEVER CLEARLY STATES WHAT THE
F*CK OPP ACTUALLY MEANS! What nice "documentation".

Ok, I can look up things like this and find that it is "Operating
Performance Points". At least in this context. But no, it's not some
kind of generic standard, and no, it's not something people should be
expected to know in general. Please stop doing "explanations" of
things that use TLA's like this. And people shouldn't have to even
wonder.

   Linus


Re: bug in generic strncpy_from_user

2013-02-26 Thread Linus Torvalds
On Tue, Feb 26, 2013 at 4:57 AM, Heiko Carstens wrote:
>
> I was wrong. -EFAULT will be returned, however the vma will grow nevertheless.
> Which in turn will cause trouble.

Ok. We should fix that too.

The whole "access just past the end of the previous vma" should
never cause the stack above to expand. The guard page at least gives
people a SIGSEGV, but one of the main reasons for the guard page was
actually to make sure that new "mmap()" calls do not create mappings
just under the stack (in addition to the obvious SIGSEGV when you then
access into that thing).

So while part of the meaning of the guard page is to get that SIGSEGV,
that part is "for safety". And apparently it works. But at the same
time, there is absolutely no reason to ever expand the stack only to
hit the guard page _anyway_, so if the stack expansion will cause the
requested address to be in the guard page, then the stack expansion
should just have failed.

I think the problem is that we add the guard page *after* we do the
normal "let's try to expand" logic.

I'll take a look.

 Linus


Re: bug in generic strncpy_from_user

2013-02-26 Thread Linus Torvalds
On Tue, Feb 26, 2013 at 7:51 AM, Linus Torvalds
 wrote:
>
> I think the problem is that we add the guard page *after* we do the
> normal "let's try to expand" logic.
>
> I'll take a look.

Ahh, no. The guard page logic happens later at the fault time. We do
this in two phases - first "find_extend_vma()" does what the name
claims, and then check_stack_guard_page() is done for the last-page
case from within do_anonymous_page() when we actually touch the last
page itself.

But that's actually fine. We can simply make "find_extend_vma()" do
the obvious "refuse to extend the vma all the way", because we will
later allow the guard page to extend downwards to "touch" the mapping,
but that uses separate logic. So the attached trivial patch seems to
make perfect sense:

It is totally untested, though.  Does it work for you (and we should
do the same thing for the grows-up case, obviously)?

 Linus


patch.diff
Description: Binary data
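
The attachment itself is not reproduced in this archive. As a rough
illustration of the idea described in the message above (and explicitly
not the actual patch), here is a minimal user-space sketch in C. It
assumes a toy `struct toy_vma` covering [start, end) and a hypothetical
helper `toy_extend_stack_down()`; neither is a kernel name, and the real
find_extend_vma() has more cases than this. The only point shown is the
"refuse to extend the vma all the way" check: growing the stack is
rejected when the new bottom page would touch the previous mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/* Toy stand-in for a VMA covering [start, end). Not the kernel struct. */
struct toy_vma {
	unsigned long start;
	unsigned long end;
};

/*
 * Hypothetical model of a grows-down stack extension.
 * `prev` is the mapping just below the stack, or NULL if there is none.
 * Returns true if `addr` is (now) covered by `stack`, false if the
 * extension was refused.
 */
static bool toy_extend_stack_down(struct toy_vma *stack,
				  const struct toy_vma *prev,
				  unsigned long addr)
{
	unsigned long new_start = addr & ~(TOY_PAGE_SIZE - 1);

	assert(stack->start < stack->end);

	if (addr >= stack->end)
		return false;	/* above the stack, nothing to extend */
	if (new_start >= stack->start)
		return true;	/* already inside the mapping */

	/*
	 * Assumed rule: never extend all the way down to the previous
	 * mapping, so an access just past the end of `prev` fails
	 * instead of growing the stack.
	 */
	if (prev && new_start <= prev->end)
		return false;

	stack->start = new_start;
	return true;
}
```

In this model, an access into the page directly above `prev->end` is
refused and leaves the stack untouched, while an access one page higher
still extends the stack as before.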


Re: [GIT PULL] PCI changes for v3.9

2013-02-26 Thread Linus Torvalds
On Mon, Feb 25, 2013 at 10:46 PM, Yinghai Lu  wrote:
> On Mon, Feb 25, 2013 at 9:19 PM, Linus Torvalds
>  wrote:
>>
>> Also, my *gut* feel is that the new _handle_hotplug_event_root()
>> function should do that whole dance with
>> acpi_scan_lock_acquire()/acpi_scan_lock_release(), but I didn't really
>> know if it's required or appropriate, so I left it alone. Could you
>> take a look?
>
> Yes, we need that for root bridge hot add path.
>
> for hot remove path, we already have lock acquire/release in
> acpi_bus_hot_remove_device().
>
> Please check attached patch for hot add path.

Quite frankly, doing this in handle_root_bridge_insertion() doesn't
match the pattern elsewhere. Elsewhere you also protected the whole
acpi_get_name() lookup etc. Which is why I felt that it would make
more sense to add this to _handle_hotplug_event_root().

But there may be good reasons why the root bridge case is different,
and I don't have strong opinions, I just wanted people to look at this
case. I'll let you and Bjorn sort it out...

Linus
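
The difference in lock scope discussed in the message above can be
sketched in plain C. This is a single-threaded toy, not kernel code:
the `toy_*` names are hypothetical, and the "lock" is only a flag that
records whether each step ran while it was held. It contrasts taking the
lock only around the insertion with taking it in the outer event
handler, where the name lookup is covered as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the scan lock: a flag, not a real mutex. */
static bool toy_scan_lock_held;

static void toy_scan_lock_acquire(void)
{
	assert(!toy_scan_lock_held);
	toy_scan_lock_held = true;
}

static void toy_scan_lock_release(void)
{
	assert(toy_scan_lock_held);
	toy_scan_lock_held = false;
}

/* Records whether each step ran with the lock held. */
struct toy_trace {
	bool lookup_locked;
	bool insert_locked;
};

/* Stand-in for the name lookup step. */
static void toy_name_lookup(struct toy_trace *t)
{
	t->lookup_locked = toy_scan_lock_held;
}

/* Stand-in for the root bridge insertion step. */
static void toy_bridge_insertion(struct toy_trace *t)
{
	t->insert_locked = toy_scan_lock_held;
}

/* Narrow scope: only the insertion is protected. */
static void toy_handle_event_narrow(struct toy_trace *t)
{
	toy_name_lookup(t);
	toy_scan_lock_acquire();
	toy_bridge_insertion(t);
	toy_scan_lock_release();
}

/* Wide scope: the whole handler, lookup included, is protected. */
static void toy_handle_event_wide(struct toy_trace *t)
{
	toy_scan_lock_acquire();
	toy_name_lookup(t);
	toy_bridge_insertion(t);
	toy_scan_lock_release();
}
```

Whether the wider scope is actually required for the root bridge case
is exactly the open question in the thread; the sketch only shows what
each placement covers.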


Re: [GIT PULL] ACPI and power management fixes for v3.9-rc1

2013-02-26 Thread Linus Torvalds
On Tue, Feb 26, 2013 at 8:10 AM, Nishanth Menon  wrote:
> On 16:55-20130226, Rafael J. Wysocki wrote:
>>
>> It says that in "Introduction", but it would be clearer if the title of the
>> doc was something like "Operating Performance Points (OPP) Library".  
>> Nishanth?
>
> Yes indeed. Will the following help? I can post it as an official patch
> if the direction is proper

Yes, this will definitely help. I didn't even find it in the
introduction (Rafael is correct that it is indeed there), because it's
hard to see when you don't know what to scan for and it's in a big
block of text.

I am also happy to note that it is in the Kconfig help and single-line
description. Which wasn't true for the new SATA_ZPODD ("Zero Power
ODD" - what the heck is ODD?) which was another new entry I wondered
about.

It turns out that ODD is an odd TLA for "Optical Disk Drive". I'm sure
it makes perfect sense if you are a SATA person, but it sure doesn't
for any normal human being, even otherwise highly technical ones.

Aaron, Tejun, Jeff, can I ask you to also not use specialized TLA's
without explaining them? Especially in help text and "documentation",
it's very unhelpful to have TLA's that aren't common.

We don't have to explain *all* TLA's, since there's a lot that really
are rather widespread. But there's a big difference between something
like CPU or TLB that have been in generic literature for decades, wrt
OPP and ODD that are specialized terms used inside a very particular
group and haven't been around for very long either.

  Linus

