Re: [report] renicing X, cfs-v5 vs sd-0.46

2007-04-23 Thread Michael K. Edwards

On 4/23/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

Basically this hack is bad on policy grounds because it is giving X a
"legislated, unfair monopoly" on the system. It's the equivalent of a
state-guaranteed monopoly in certain 'strategic industries'. It has some
advantages but it is very much net harmful. Most of the time the
"strategic importance" of any industry can be cleanly driven by the
normal mechanics of supply and demand: anything important is recognized
by 'people' as important via actual actions of giving it 'money'. (This
approach also gives formerly-strategic industries the boot quickly, were
they to become less strategic to people as things evolve.)


If you're going to drag free-market economics into it, why not
actually use the techniques of free-market economics?  Design a
bidding system in which agents (tasks) earn "money" by getting things
done, and can use that "money" to bid on "resources".  You will of
course need accurate cost accounting in order to decide which bids are
most "profitable" for the scheduler to accept, and accurate transfer
accounting to design price structures for contracts between agents in
which one agrees to accomplish work on behalf of another.  Actual
revenues come from doing the work that the consumer wants done and is
willing to pay for.  Etc., etc.  Has your horsepucky filter kicked in
yet?
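
Sketched as code, the horsepucky looks something like this -- a toy,
with every name and number below invented, illustrating the mechanism
rather than proposing it:

#include <linux/list.h>
#include <linux/kernel.h>

/* Toy "market" scheduler.  Tasks earn credits by retiring work; a
 * task's effective bid is capped by what it can actually pay; the
 * scheduler takes the best offer and collects. */
struct market_task {
        struct list_head run_list;
        unsigned long credits;          /* earned by getting things done */
        unsigned long bid;              /* offered for the next timeslice */
};

static struct market_task *pick_highest_bidder(struct list_head *rq)
{
        struct market_task *t, *best = NULL;
        unsigned long best_offer = 0;

        list_for_each_entry(t, rq, run_list) {
                unsigned long offer = min(t->bid, t->credits);

                if (!best || offer > best_offer) {
                        best = t;
                        best_offer = offer;
                }
        }
        if (best)
                best->credits -= best_offer;    /* the scheduler gets paid */
        return best;
}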

If your system doesn't work this way -- perhaps because you think as I
do that scheduler design is principally an engineering problem, not an
economics problem -- then analogies from economics are probably worth
zip.  Yes, I wrote earlier about "economic dispatch" -- that's an
operations problem, a control theory problem, an _engineering_
problem, that happens to have a set of engineering goals and
constraints that take profitability into account.  I think you might
be able to design a better Linux scheduler anchored in the techniques
and literature of control theory, perhaps specifically with reference
to electric-utility economic dispatch, because the systems under
control and the goals of control are similar.

But there's a good reason not to treat X as special.  Namely, that it
_isn't_.  It may be the only program on many people's Linux desktops
with an opaque control structure -- a separate class of interactive
activities hidden inside an oversubscribed push-model pipeline stage
-- but it's hardly the only program designed this way.  Treat the X
server as an easily instrumented exemplar of an event-loop-centric
design whose thread structure doesn't distinguish between fast-twitch
and best-effort activity patterns.  I wrote earlier about what one
might do about this (attach urgency to the work in the queue
instead of the worker being asked to do it).
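
To make that concrete, here is a minimal sketch -- types hypothetical,
decay and locking omitted -- of urgency attached to the queued work
rather than to the worker:

#include <linux/list.h>
#include <linux/kernel.h>

/* The enqueuer stamps each item (input event beats redraw beats
 * housekeeping); the scheduler transiently boosts the worker to the
 * urgency of the hottest item it is currently sitting on. */
struct urgent_work {
        struct list_head node;
        int urgency;                    /* stamped by the enqueuer */
        void (*fn)(struct urgent_work *);
};

static int queue_urgency(struct list_head *q)
{
        struct urgent_work *w;
        int u = 0;

        list_for_each_entry(w, q, node)
                u = max(u, w->urgency);
        return u;       /* the worker inherits this for one dequeue */
}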

Cheers,
- Michael


Re: Renice X for cpu schedulers

2007-04-20 Thread Michael K. Edwards

On 4/19/07, hui Bill Huey <[EMAIL PROTECTED]> wrote:

DSP operations, particularly with digital synthesis, tend to max
the CPU doing vector operations on as many processors as it can get
a hold of. In a live performance critical application, it's important
to be able to deliver a protected amount of CPU to a thread doing that
work as well as response to external input such as controllers, etc...


Actual fractional CPU reservation is a bit different, and is probably
best handled with "container"-type infrastructure (not quite
virtualization, but not quite scheduling classes either).  SGI
pioneered this (in "open systems" space -- IBM probably had it first,
as usual) with GRIO in XFS.  (That was I/O throughput reservation of
course, not "CPU bandwidth" -- but IIRC IRIX had CPU reservation too).
There's a more general class of techniques in which it's worth
spending idle cycles speculating along paths that might or might not
be taken depending on unpredictable I/O; I'd be surprised if you
couldn't approximate most of the sane balancing strategies in this
area within the "economic dispatch" scheduler model.  (Good JIT
bytecode engines more or less do this already if you let them, with a
cap on JIT cache size serving as a crude CPU throttle.)


> In practice, you probably don't want to burden desktop Linux with
> priority inheritance where you don't have to.  Priority queues with
> algorithmically efficient decrease-key operations (Fibonacci heaps and
> their ilk) are complicated to implement and have correspondingly high
> constant factors.  (However, a sufficiently clever heuristic for
> assigning quasi-static task priorities would usually short-circuit the
> priority cascade; if you can keep N small in the
> tasks-with-unpredictable-priority queue, you can probably use a
> simpler flavor with O(log N) decrease-key.  Ask someone who knows more
> about data structures than I do.)

These are app issues and not really something that's mutable in the
kernel per se with regard to the -rt patch.


I don't know where the -rt patch enters in.  But if you need agile
reprioritization with a deep runnable queue, either under direct
application control or as a side effect of priority inheritance or a
related OS-enforced protocol, then you need a kernel-level data
structure with a fancier interface than the classic
insert/find/delete-min priority queue.  From what I've read (this is
not my area of expertise and I don't have Knuth handy), the relatively
simple heap-based implementations of priority queues can't
reprioritize an entry any more quickly than find+delete+insert, which
pretty much rules them out as a basis for a scalable scheduler with
priority inheritance (let alone PCP emulation).
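
For illustration, the "simpler flavor" amounts to a binary min-heap
that tracks each entry's slot, so that reprioritizing an entry is a
sift-up rather than find+delete+insert.  A sketch in plain C (not
kernel code, no locking):

struct pq_entry {
        int key;        /* smaller = more urgent */
        int pos;        /* current slot in the heap array */
};

static void pq_swap(struct pq_entry **h, int a, int b)
{
        struct pq_entry *t = h[a];

        h[a] = h[b];
        h[b] = t;
        h[a]->pos = a;
        h[b]->pos = b;
}

/* O(log N): the entry's slot is known, so no O(N) find is needed. */
static void pq_decrease_key(struct pq_entry **h, struct pq_entry *e, int key)
{
        int i = e->pos;

        e->key = key;   /* caller guarantees key <= old key */
        while (i > 0 && h[(i - 1) / 2]->key > h[i]->key) {
                pq_swap(h, i, (i - 1) / 2);
                i = (i - 1) / 2;
        }
}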


I have Solaris style adaptive locks in my tree with my lockstat patch
under -rt. I've also modified my lockstat patch to track readers
correctly now with rwsem and the like to see where the single reader
limitation in the rtmutex blows it.


Ooh, that's neat.  The next time I can cook up an excuse to run a
kernel that won't load this damn WiFi driver, I'll try it out.  Some
of the people I work with are real engineers and respect in-system
instrumentation.


So far I've seen less than 10 percent of in-kernel contention events
actually worth spinning on and the rest of the stats imply that the
mutex owner in question is either preempted or blocked on something
else.


That's a good thing; it implies that in-kernel algorithms don't take
locks needlessly as a matter of cargo-cult habit.  Attempting to take
a lock (other than an uncontended futex, which is practically free)
should almost always transfer control to the thread that has the power
to deliver the information (or the free slot) that you're looking for
-- or in the case of an external data source/sink, should send you
into low-power mode until time and tide give you something new to do.
Think of it as a just-in-time inventory system; if you keep too much
product in stock (or free warehouse space), you're wasting space and
harming responsiveness to a shift in demand.  Once in a while you have
to play Sokoban in order to service a request promptly; that's exactly
the case that priority inheritance is meant to help with.

The fiddly part, on a non-real-time-no-matter-what-the-label-says
system with an opaque cache architecture and mysterious hidden costs
of context switching, is to minimize the lossage resulting from brutal
timer- or priority-inheritance-driven preemption.  Given the way
people code these days -- OK, it was probably just as bad back in the
day -- the only thing worse than letting the axe fall at random is to
steal the CPU away the moment a contended lock is released, because
the next 20 lines of code probably poke one last time at all the data
structures the task had in cache right before entering the critical
section.  That doesn't hurt so bad on RTOS-friendly hardware -- an
MMU-less system with either zero or near-infinite cache -- but it's
got to make this

Re: Renice X for cpu schedulers

2007-04-19 Thread Michael K. Edwards

On 4/19/07, Lee Revell <[EMAIL PROTECTED]> wrote:

IMHO audio streamers should use SCHED_FIFO thread for time critical
work.  I think it's insane to expect the scheduler to figure out that
these processes need low latency when they can just be explicit about
it.  "Professional" audio software does it already, on Linux as well
as other OS...


It is certainly true that SCHED_FIFO is currently necessary in the
layers of an audio application lying closest to the hardware, if you
don't want to throw a monstrous hardware ring buffer at the problem.
See the alsa-devel archives for a patch to aplay (sched_setscheduler
plus some cleanups) that converts it from "unsafe at any speed" (on a
non-RT kernel) to a rock-solid 18ms round trip from PCM in to PCM out.
(The hardware and driver aren't terribly exotic for an SoC, and the
measurement was done with aplay -C | aplay -P -- on a
not-particularly-tuned CONFIG_PREEMPT kernel with a 12ms+ peak
scheduling latency according to cyclictest.  A similar test via
/dev/dsp, done through a slightly modified OSS emulation layer to the
same driver, measures at 40ms and is probably tuned too
conservatively.)
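
(The interesting part of that aplay change is just the standard POSIX
incantation, reproduced here from memory rather than from the actual
diff -- the priority value is illustrative:)

#include <sched.h>
#include <string.h>

static int go_realtime(int prio)
{
        struct sched_param sp;

        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = prio;       /* e.g. 40; needs root/CAP_SYS_NICE */
        return sched_setscheduler(0, SCHED_FIFO, &sp);
}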

Note that SCHED_FIFO may be less necessary on an -rt kernel, but I
haven't had that option on the embedded hardware I've been working
with lately.  Ingo, please please pretty please pick a -stable branch
one of these days and provide a git repo with -rt integrated against
that branch.  Then I could port our chip support to it -- all of which
will be GPLed after the impending code review -- after which I might
have a prayer of strong-arming our chip vendor into porting their WiFi
driver onto -rt.  It's really a much more interesting scheduler use
case than make -j200 under X, because it's a best-effort
SCHED_BATCH-ish load that wants to be temporally clustered for power
management reasons.

(Believe it or not, a stable -rt branch with a clock-scaling-aware
scheduler is the one thing that might lead to this major WiFi vendor's
GPLing their driver core.  They're starting to see the light on the
biz dev side, and the nature of the devices their chip will go in
makes them somewhat less concerned about the regulatory fig leaf
aspect of a closed-source driver; but they would have to port off of
the third-party real-time executive embedded within the driver, and
mainline's task and timer granularity won't cut it.  I can't even get
more detail about _why_ it won't cut it unless there's some remotely
supportable -rt base they could port to.)

But I think SCHED_FIFO on a chain of tasks is fundamentally not the
right way to handle low audio latency.  The object with a low latency
requirement isn't the task, it's the device.  When it's starting to
get urgent to deliver more data to the device, the task that it's
waiting on should slide up the urgency scale; and if it's waiting on
something else, that something else should slide up the scale; and so
forth.  Similarly, responding to user input is urgent; so when user
input is available (by whatever mechanism), the task that's waiting
for it should slide up the urgency scale, etc.
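
Sketched in code -- structure names hypothetical, decay and revocation
omitted -- the propagation itself is nearly trivial, and walks wait
chains the way priority inheritance walks lock owners:

struct waiter {
        struct waiter *waiting_on;      /* task/driver supplying our input */
        int urgency;
};

static void propagate_urgency(struct waiter *w, int urgency)
{
        while (w && w->urgency < urgency) {
                w->urgency = urgency;   /* transient boost; decays later */
                w = w->waiting_on;
        }
}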

In practice, you probably don't want to burden desktop Linux with
priority inheritance where you don't have to.  Priority queues with
algorithmically efficient decrease-key operations (Fibonacci heaps and
their ilk) are complicated to implement and have correspondingly high
constant factors.  (However, a sufficiently clever heuristic for
assigning quasi-static task priorities would usually short-circuit the
priority cascade; if you can keep N small in the
tasks-with-unpredictable-priority queue, you can probably use a
simpler flavor with O(log N) decrease-key.  Ask someone who knows more
about data structures than I do.)

More importantly, non-real-time application coders aren't very smart
about grouping data structure accesses on one side or the other of a
system call that is likely to release a lock and let something else
run, flushing application data out of cache.  (Kernel coders aren't
always smart about this either; see LKML threads a few weeks ago about
racy, and cache-stall-prone, f_pos handling in VFS.)  So switching
tasks immediately on lock release is usually the wrong thing to do if
letting the task run a little longer would allow it to reach a point
where it has to block anyway.

Anyway, I already described the urgency-driven strategy to the extent
that I've thought it out, elsewhere in this thread.  I only held this
draft back because I wanted to double-check my latency measurements.

Cheers,
- Michael


Re: GPL-incompatible Module Error Message

2007-04-19 Thread Michael K. Edwards

On 4/19/07, Alan Cox <[EMAIL PROTECTED]> wrote:

The troll is back I see.


Troll, shmoll.  I call 'em like I see 'em.  As much as I like and
depend on Linux, and as much as I respect the contributions and the
ideals of the EXPORT_SYMBOL_GPL partisans, they're spreading needless
FUD by spraying "private-don't-touch-me" all over mechanisms that are
_explicitly_designed_ as interoperation boundaries.  They're also
aiding and abetting the FSF's hypocritical charlatanry about the
meaning of "derivative work".


Why don't you give him some useful information instead


Alternate technical solutions are also useful.  You seem to know them;
I don't pretend to.  Thanks for providing them.


- Turn off the paravirt option - you don't need it, and it's just bloat
and slows down the kernel. Then rebuild the kernel and other bits and it
should all work fine.


Just out of curiosity -- it seems thoroughly unlikely that ATI has
intentionally touched paravirt_ops in fglrx.  Do you think that
redefining bog-standard Linux interfaces when CONFIG_PARAVIRT (or
whatever) is enabled suddenly makes fglrx a derivative work of
whatever code underlies paravirt_ops?


The legality of the ati driver as a derivative work is another matter,
but I don't see what _GPL symbols have to do with its legality beyond
providing a hint.


Then surely you don't approve of spraying FATAL messages on people's
consoles under these circumstances.  Allowing code into one's kernel
whose integration problems can't or won't be diagnosed by mainline
developers may be foolish, but it's not FATAL.

Cheers,
- Michael


Re: Renice X for cpu schedulers

2007-04-19 Thread Michael K. Edwards

On 4/19/07, Con Kolivas <[EMAIL PROTECTED]> wrote:

The cpu scheduler core is a cpu bandwidth and latency
proportionator and should be nothing more or less.


Not really.  The CPU scheduler is (or ought to be) what electric
utilities call an economic dispatch mechanism -- a real-time
controller whose goal is to service competing demands cost-effectively
from a limited supply, without compromising system stability.

If you live in the 1960's, coal and nuclear (and a little bit of
fig-leaf hydro) are all you have, it takes you twelve hours to bring
plants on and off line, and there's no live operational control or
pricing signal between you and your customers.  So you're stuck
running your system at projected peak + operating margin, dumping
excess power as waste heat most of the time, and browning or blacking
people out willy-nilly when there's excess demand.  Maybe you get to
trade off shedding the loads with the worst transmission efficiency
against degrading the customers with the most tolerance for brownouts
(or the least regulatory clout).  That's life without modern economic
dispatch.

If you live in 2007, natural gas and (outside the US) better control
over nuclear plants give you more ability to ramp supply up and down
with demand on something like a 15-minute cycle.  Better yet, you can
store a little energy "in the grid" to smooth out instantaneous demand
fluctuations; if you're lucky, you also have enough fast-twitch hydro
(thanks, Canada!) that you can run your coal and lame-ass nuclear very
close to base load even when gas is expensive, and even pump water
back uphill when demand dips.  (Coal is nasty stuff and a worse
contributor by far to radiation exposure than nuclear generation; but
on current trends it's going to last a lot longer than oil and gas,
and it's a lot easier to stockpile next to the generator.)

Best of all, you have industrial customers who will trade you live
control (within limits) over when and how much power they take in
return for a lower price per unit energy.  Some of them will even dump
power back into the grid when you ask them to.  So now the biggest
challenge in making supply and demand meet (in the short term) is to
damp all the different ways that a control feedback path might result
in an oscillation -- or in runaway pricing.  Because there's always
some asshole greedhead who will gamble with system stability in order
to game the pricing mechanism.  Lots of 'em, if you're in California
and your legislature is so dumb, or so bought, that they let the
asshole greedheads design the whole system so they can game it to the
max.  (But that's a whole 'nother rant.)

Embedded systems are already in 2007, and the mainline Linux scheduler
frankly sucks on them, because it thinks it's back in the 1960's with
a fixed supply and captive demand, pissing away "CPU bandwidth" as
waste heat.  Not to say it's an easy problem; even academics with a
dozen publications in this area don't seem to be able to model energy
usage to the nearest big O, let alone design a stable economic
dispatch engine.  But it helps to acknowledge what the problem is:
even in a 1960's raised-floor screaming-air-conditioners
screw-the-power-bill machine room, you can't actually run a
half-decent CPU flat out any more without burning it to a crisp.

You can act ignorant and let the PMIC brown you out when it has to.
Or you can start coping in mainline the way that organizations big
enough (and smart enough) to feel the heat in their pocketbooks do in
their pet kernels.  (Boo on Google for not sharing, and props to IBM
for doing their damnedest.)  And guess what?  The system will actually
get simpler, and stabler, and faster, and easier to maintain, because
it'll be based on a real theory of operation with equations and things
instead of a bunch of opaque, undocumented shotgun heuristics.

This hypothetical economic-dispatch scheduler will still _have_
heuristics, of course -- you can't begin to model a modern CPU
accurately on-line.  But they will be contained in _data_ rather than
_code_, and issues of numerical stability will be separated cleanly
from the rule set.  You'll be able to characterize the rule set's
domain of stability, given a conservative set of assumptions about the
feedback paths in the system under control, with the sort of
techniques they teach in the engineering schools that none of us (me
included) seem to have attended.  (I went to school thinking I was
going to be a physicist.  Wishful thinking -- but I was young and
stupid.  What's your excuse?  ;-)

OK, it feels better to have that off my chest.  Apologies to those
readers -- doubtless the vast majority of LKML, including everyone
else in this thread -- for whom it's irrelevant, pseudo-learned
pontification with no patch attached.  And my sincere thanks to Ingo,
Con, and really everyone else CC'ed, without whom Linux wouldn't be as
good as it is (really quite good, all things considered) and wouldn't
contribute as much as it does

Re: GPL-incompatible Module Error Message

2007-04-19 Thread Michael K. Edwards

On 4/19/07, Chris Bergeron <[EMAIL PROTECTED]> wrote:

It just seemed like it might be interesting and I couldn't find anything
to shed light on the error itself in the mailing list logs, and I'm
curious at what's happening.


What's happening is that some kernel developers don't like Linus's
stance on binary-only drivers and are trying to circumvent the norms
of software copyright law using EXPORT_SYMBOL_GPL.  (Why some people
think that the GPL is magically exempt from Lotus v. Borland, Lexmark
v. Static Control, and their analogues in other jurisdictions is beyond
me -- but then I gave up smoking the FSF's parallel-legal-universe
herb some time ago.)

Just s/EXPORT_SYMBOL_GPL/EXPORT_SYMBOL/ throughout the kernel and
you'll be fine -- at a technical level.  But be prepared, when later
changes break extra-volatile quasi-private in-kernel APIs, to keep
both pieces -- and to be shunned by EXPORT_SYMBOL_GPL partisans.

Cheers (IANAL, TINLA),
- Michael


Re: Renice X for cpu schedulers

2007-04-19 Thread Michael K. Edwards

On 4/19/07, Gene Heskett <[EMAIL PROTECTED]> wrote:

Having tried re-nicing X a while back, and having the rest of the system
suffer in quite obvious ways for even 1 + or - from its default, it felt pretty
bad from this user's perspective.

It is my considered opinion (yeah I know, I'm just a leaf in the hurricane of
this list) that if X has to be re-niced from the 1 point advantage it's had
for ages, then something is basically wrong with the overall scheduling, cpu or
i/o, or both in combination.  FWIW I'm using cfq for i/o.


I think I just realized why the X server is such a problem.  If it
gets preempted when it's not actually selecting/polling over a set of
fds that includes the input devices, the scheduler doesn't know that
it's a good candidate for scheduling when data arrives on those
devices.  (That's all that any of these dynamic priority heuristics
really seem to do -- weight the scheduler towards switching to
conspicuously I/O bound tasks when they become runnable, without the
forced preemption on lock release that would result from a true
priority inheritance mechanism.)

One way of looking at this is that "fairness-driven" scheduling is a
poor man's priority ceiling protocol for I/O bound workloads, with the
implicit priority of an fd or lock given by how desperately the reader
side needs more data in order to accomplish anything.  "Nice" on a
task is sort of an indirect way of boosting or dropping the base
priority of the fds it commonly waits on.  I recognize this is a
drastic oversimplification, and possibly even a misrepresentation of
the design _intent_; but I think it's fairly accurate in terms of the
design _effect_.

The event-driven, non-threaded design of the X server makes it
particularly vulnerable to "non-interactive behavior" penalties, which
is appropriate to the extent that it's an output device having trouble
keeping up with rendering -- in fact, that's exactly the throttling
mechanism you need in order to exert back-pressure on the X client.
(Trying to exert back-pressure over Linux's local domain sockets seems
to be like pushing on a rope, but that's a different problem.)  That
same event-driven design would prioritize input events just fine --
except the scheduler won't wake the task in order to deliver them,
because as far as it's concerned the X server is getting more than
enough I/O to keep it busy.  It's not only not blocked on the input
device, it isn't even selecting on it at the moment that its timeslice
expires -- so no amount of poor-man's PCP emulation is going to help.

What "more negative nice on the X server than on any CPU-bound
process" seems to do is to put the X server on a hair-trigger,
boosting its dynamic priority in a render-limited scenario (on some
graphics cards!) just enough to cancel the penalty for non-interactive
behavior.  It's forced to share _some_ CPU cycles, but nobody else is
allowed a long enough timeslice to keep the X server off the CPU (and
insensitive to input events) for long.  Not terribly efficient in
terms of context switch / cache eviction overhead, but certainly
friendlier to the PEBCAK (who is clearly putting totally inappropriate
load on a single-threaded CPU by running both a local X server and
non-SCHED_BATCH compute jobs) than a frozen mouse cursor.

So what's the right answer?  Not special-casing the X server, that's
for sure.  If this analysis is correct (and as of now it's pure
speculation), any event-driven application that does compute work
opportunistically in the absence of user interaction is vulnerable to
the same overzealous squelching.  I wouldn't design a new application
that way, of course -- user interaction belongs in a separate thread
on any UNIX-legacy system which assigns priorities to threads of
control instead of to patterns of activity.  But all sorts of Linux
applications have been designed to implicitly elevate artificial
throughput benchmarks over user responsiveness -- that has been the
UNIX way at least since SVR4, and Linux's history of expensive thread
switches prior to NPTL didn't help.

If you want responsiveness when the CPU is oversubscribed -- and I for
one do, which is one reason why I abandoned the Linux desktop once
both Microsoft and Apple figured out how to make hyperthreading work
in their favor -- you should probably think about how to get it
without rewriting half of userspace.  IMHO, dinking around with
"fairness", as if there were any relationship these days between UIDs
or process groups or any other control structure and the work that's
trying to flow through the system, is not going to get you there.

If this were my problem, I might start by attaching urgency to
behavior instead of to thread ID, which demands a scheduler queue
built around a data structure with a cheap decrease-key operation.
I'd figure out how to propagate this urgency not just along lock
chains but also along chains of fds that need flushing (or refilling)
-- even if the reader (or writer) got preempted for unrelat

Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-18 Thread Michael K. Edwards

On 4/18/07, Matt Mackall <[EMAIL PROTECTED]> wrote:

For the record, you actually don't need to track a whole NxN matrix
(or do the implied O(n**3) matrix inversion!) to get to the same
result. You can converge on the same node weightings (ie dynamic
priorities) by applying a damped function at each transition point
(directed wakeup, preemption, fork, exit).

The trouble with any scheme like this is that it needs careful tuning
of the damping factor to converge rapidly and not oscillate and
precise numerical attention to the transition functions so that the sum of
dynamic priorities is conserved.


That would be the control theory approach.  And yes, you have to get
both the theoretical transfer function and the numerics right.  It
sometimes helps to use a control-systems framework like the classic
Takagi-Sugeno-Kang fuzzy logic controller; get the numerics right once
and for all, and treat the heuristics as data, not logic.  (I haven't
worked in this area in almost twenty years, but Google -- yes, I do
use Google+brain for fact-checking; what do you do? -- says that
people are still doing active research on TSK models, and solid
fixed-point reference implementations are readily available.)  That
seems like an attractive strategy here because you could easily embed
the control engine in the kernel and load rule sets dynamically.  Done
right, that could give most of the advantages of pluggable schedulers
(different heuristic strokes for different folks) without diluting the
tester pool for the actual engine code.
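
To make "get the numerics right once" concrete: the core of a
fixed-point TSK evaluator is tiny, and the damping lives in the rule
table where it can be analyzed, not smeared through transition code.
My sketch, not any particular reference implementation:

#include <stdint.h>

struct tsk_rule {
        int32_t weight;         /* firing strength, Q16.16, from memberships */
        int32_t consequent;     /* rule output, Q16.16, loaded as data */
};

/* Weighted average of rule consequents; returns Q16.16. */
static int32_t tsk_eval(const struct tsk_rule *r, int n)
{
        int64_t num = 0, den = 0;
        int i;

        for (i = 0; i < n; i++) {
                num += (int64_t)r[i].weight * r[i].consequent;
                den += r[i].weight;
        }
        return den ? (int32_t)(num / den) : 0;
}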

(Of course, different scheduling strategies require different input
data, and you might not want the overhead of collecting data that your
chosen heuristics won't use.  But that's not much different from the
netfilter situation, and is obviously a solvable problem, if anyone
cares to put that much work in.  The people who ought to be funding
this kind of work are Sun and IBM, who don't have a chance on the
desktop and are in big trouble in the database tier; their future as
processor vendors depends on being able to service presentation-tier
and business-logic-tier loads efficiently on their massively
multi-core chips.  MIPS should pitch in too, on behalf of licensees
like Cavium who need more predictable behavior on multi-core embedded
Linux.)

Note also that you might not even want to persistently prioritize
particular processes or process groups.  You might want a heuristic
that notices that some task (say, the X server) often responds to
being awakened by doing a little work and then unblocking the task
that awakened it.  When it is pinged from some highly interactive
task, you want it to jump the scheduler queue just long enough to
unblock the interactive task, which may mean letting it flush some
work out of its internal queue.  But otherwise you want to batch
things up until there's too much "scheduler pressure" behind it, then
let it work more or less until it runs out of things to do, because
its working set is so large that repeatedly scheduling it in and out
is hell on caches.

(Priority inheritance is the classic solution to the
blocked-high-priority-task problem _in_isolation_.  It is not without
its pitfalls, especially when the designer of the "server" didn't
expect to lose his timeslice instantly on releasing the lock.  True
priority inheritance is probably not something you want to inflict on
a non-real-time system, but you do need some urgency heuristic.  What
a "fuzzy logic" framework does for you is to let you combine competing
heuristics in a way that remains amenable to analysis using control
theory techniques.)

What does any of this have to do with "fairness"?  Nothing whatsoever!
There's work that has to be done, and choosing when to do it is
almost entirely a matter of staying out of the way of more urgent work
while minimizing the task's negative impact on the rest of the system.
Does that mean that the X server is "special", kind of the way that
latency-sensitive A/V applications are "special", and belongs in a
separate scheduler class?  No.  Nowadays, workloads where the kernel
has any idea what tasks belong to what "users" are the exception, not
the norm.  The X server is the canary in the coal mine, and a
scheduler that won't do the right thing for X without hand tweaking
won't do the right thing for other eyeball-driven,
multiple-tiers-on-one-box scenarios either.

If you want fairness among users to the extent that their demands
_compete_, you might as well partition the whole machine, and have a
separate fairness-oriented scheduler (let's call it a "hypervisor")
that lives outside the kernel.  (Talk about two students running gcc
on the same shell server, with more important people also doing things
on the same system, is so 1990's!)  Not that the design of scheduler
heuristics shouldn't include "fairness"-like considerations; but
they're probably only interesting as a fallback for when the scheduler
has no idea what it ought to schedule next.

So why is Ingo's s

Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Michael K. Edwards

On 4/17/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

The ongoing scheduler work is on a much more basic level than these
affairs I'm guessing you googled. When the basics work as intended it
will be possible to move on to more advanced issues.


OK, let me try this in smaller words for people who can't tell bitter
experience from Google hits.  CPU clock scaling for power efficiency
is already the only thing that matters about the Linux scheduler in my
world, because battery-powered device vendors in their infinite wisdom
are abandoning real RTOSes in favor of Linux now that WiFi is the "in"
thing (again).  And on the timescale that anyone will actually be
_using_ this shiny new scheduler of Ingo's, it'll be nearly the only
thing that matters about the Linux scheduler in anyone's world,
because the amount of work the CPU can get done in a given minute will
depend mostly on how intelligently it can spend its heat dissipation
budget.

Clock scaling schemes that aren't integral to the scheduler design
make a bad situation (scheduling embedded loads with shotgun
heuristics tuned for desktop CPUs) worse, because the opaque
heuristics are now being applied to distorted data.  Add a "smoothing"
scheme for the distorted data, and you may find that you have
introduced an actual control-path instability.  A small fluctuation in
the data (say, two bursts of interrupt traffic at just the right
interval) can result in a long-lasting oscillation in some task's
"dynamic priority" -- and, on a fully loaded CPU, in the time that
task actually gets.  If anything else depends on how much work this
task gets done each time around, the oscillation can easily propagate
throughout the system.  Thrash city.

(If you haven't seen this happen on real production systems under what
shouldn't be a pathological load, you haven't been around long.  The
classic mechanisms that triggered oscillations in, say, early SMP
Solaris boxes haven't bitten recently, perhaps because most modern
CPUs don't lose their marbles so comprehensively on context switch.
But I got to live this nightmare again recently on ARM Linux, due to
some impressively broken application-level threading/locking "design",
whose assumptions about scheduler behavior got broken when I switched
to an NPTL toolchain.)

I don't have the training to design a scheduler that isn't vulnerable
to control-feedback oscillations.  Neither do you, if you haven't
taken (and excelled at) a control theory course, which nowadays seems
to be taught by applied math and ECE departments and too often skipped
by CS types.  But I can recognize an impending train wreck when I see
it.

Cheers,
- Michael


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread Michael K. Edwards

On 4/17/07, Peter Williams <[EMAIL PROTECTED]> wrote:

The other way in which the code deviates from the original is that (for
a few years now) I no longer calculate CPU bandwidth usage directly.
I've found that the overhead is less if I keep a running average of the
size of a task's CPU bursts and the length of its scheduling cycle (i.e.
from on CPU one time to on CPU next time) and use the ratio of these
values as a measure of bandwidth usage.

Anyway it works and gives very predictable allocations of CPU bandwidth
based on nice.


Works, that is, right up until you add nonlinear interactions with CPU
speed scaling.  From my perspective as an embedded platform
integrator, clock/voltage scaling is the elephant in the scheduler's
living room.  Patch in DPM (now OpPoint?) to scale the clock based on
what task is being scheduled, and suddenly the dynamic priority
calculations go wild.  Nip this in the bud by putting an RT priority
on the relevant threads (which you have to do anyway if you need
remotely audio-grade latency), and the lock affinity heuristics break,
so you have to hand-tune all the thread priorities.  Blecch.
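
(For concreteness, my reconstruction -- not Peter's actual code -- of
the damped-average estimator he describes.  Note that nothing in it
knows the clock frequency, which is exactly where the nonlinearity
sneaks in:)

#include <stdint.h>

#define DAMP_SHIFT 3    /* new sample gets 1/8 weight */

struct cpu_usage {
        uint64_t avg_burst_ns;  /* damped average on-CPU burst */
        uint64_t avg_cycle_ns;  /* damped average full scheduling cycle */
};

static void usage_update(struct cpu_usage *u, uint64_t burst, uint64_t cycle)
{
        u->avg_burst_ns = (u->avg_burst_ns * 7 + burst) >> DAMP_SHIFT;
        u->avg_cycle_ns = (u->avg_cycle_ns * 7 + cycle) >> DAMP_SHIFT;
}

/* Bandwidth usage as the ratio of the two averages, in parts per 1000. */
static unsigned int usage_permille(const struct cpu_usage *u)
{
        if (!u->avg_cycle_ns)
                return 0;
        return (unsigned int)(u->avg_burst_ns * 1000 / u->avg_cycle_ns);
}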

Not to mention the likelihood that the task whose clock speed you're
trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority
than the application.  (You want to crank the CPU for this task
because it runs with the RF hot, which may cost you as much power as
the rest of the platform.)  You'd better hope you can remove it from
the dynamic priority heuristics with SCHED_BATCH.  Otherwise
everything _else_ has to be RT priority (or it'll be starved by the
soft MAC) and you've basically tossed SCHED_NORMAL in the bin.  Double
blecch!

Is it too much to ask for someone with actual engineering training
(not me, unfortunately) to sit down and build a negative-feedback
control system that handles soft-real-time _and_ dynamic-priority
_and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock
scaling?  And actually separates the accounting and control mechanisms
from the heuristics, so the latter can be tuned (within a well
documented stable range) to reflect the expected system usage
patterns?

It's not like there isn't a vast literature in this area over the past
decade, including some dealing specifically with clock scaling
consistent with low-latency applications.  It's a pity that people
doing academic work in this area rarely wade into LKML, even when
they're hacking on a Linux fork.  But then, there's not much economic
incentive for them to do so, and they can usually get their fill of
citation politics and dominance games without leaving their home
department.  :-P

Seriously, though.  If you're really going to put the mainline
scheduler through this kind of churn, please please pretty please knit
in per-task clock scaling (possibly even rejigged during the slice;
see, e.g., Yuan and Nahrstedt's GRACE-OS papers) and some sort of
linger mechanism to keep from taking context switch hits when you're
confident that an I/O will complete quickly.

Cheers,
- Michael


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-16 Thread Michael K. Edwards

On 4/16/07, Peter Williams <[EMAIL PROTECTED]> wrote:

Note that I talk of run queues
not CPUs as I think a shift to multiple CPUs per run queue may be a good
idea.


This observation of Peter's is the best thing to come out of this
whole foofaraw.  Looking at what's happening in CPU-land, I think it's
going to be necessary, within a couple of years, to replace the whole
idea of "CPU scheduling" with "run queue scheduling" across a complex,
possibly dynamic mix of CPU-ish resources.  Ergo, there's not much
point in churning the mainline scheduler through a design that isn't
significantly more flexible than any of those now under discussion.

For instance, there are architectures where several "CPUs"
(instruction stream decoders feeding execution pipelines) share parts
of a cache hierarchy ("chip-level multitasking").  On these machines,
you may want to co-schedule a "real" processing task on one pipeline
with a "cache warming" task on the other pipeline -- but only for
tasks whose memory access patterns have been sufficiently analyzed to
write the "cache warming" task code.  Some other tasks may want to
idle the second pipeline so they can use the full cache-to-RAM
bandwidth.  Yet other tasks may be genuinely CPU-intensive (or I/O
bound but so context-heavy that it's not worth yielding the CPU during
quick I/Os), and hence perfectly happy to run concurrently with an
unrelated task on the other pipeline.

There are other architectures where several "hardware threads" fight
over parts of a cache hierarchy (sometimes bizarrely described as
"sharing" the cache, kind of the way most two-year-olds "share" toys).
On these machines, one instruction pipeline can't help the other
along cache-wise, but it sure can hurt.  A scheduler designed, tested,
and tuned principally on one of these architectures (hint:
"hyperthreading") will probably leave a lot of performance on the
floor on processors in the former category.

In the not-so-distant future, we're likely to see architectures with
dynamically reconfigurable interconnect between instruction issue
units and execution resources.  (This is already quite feasible on,
say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with
as many Nios II cores as fit on the chip.)  Restoring task context may
involve not just MMU swaps and FPU instructions (with state-dependent
hidden costs) but processor reconfiguration.  Achieving "fairness"
according to any standard that a platform integrator cares about (let
alone an end user) will require a fairly detailed model of the hidden
costs associated with different sorts of task switch.

So if you are interested in schedulers for some reason other than a
paycheck, let the distros worry about 5% improvements on x86[_64].
Get hold of some different "hardware" -- say:
 - a Xilinx ML410 if you've got $3K to blow and want to explore
reconfigurable processors;
 - a SunFire T2000 if you've got $11K and want to mess with a CMT
system that's actually shipping;
 - a QEMU-simulated massively SMP x86 if you're poor but clever
enough to implement funky cross-core cache effects yourself; or
 - a cycle-accurate simulator from Gaisler or Virtio if you want a
real research project.
Then go explore some more interesting regions of parameter space and
see what the demands on mainline Linux will look like in a few years.

Cheers,
- Michael


do_acct_process bypasses vfs_write?

2007-03-14 Thread Michael K. Edwards

do_acct_process (in kernel/acct.c) bypasses vfs_write and calls
file->f_op->write directly.  It therefore bypasses various sanity
checks, some of which appear applicable (notably inode->i_flock &&
MANDATORY_LOCK(inode)) and others of which do not (oversize request,
access_ok, etc.).  It also neglects to call
fsnotify_modify(file->f_path.dentry) after a successful write, which
may or may not matter.
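
The obvious alternative is to route the record through vfs_write.  A
hedged sketch, untested and against 2.6.20-era APIs (acct.c already
flips to KERNEL_DS around the write, which should satisfy vfs_write's
access_ok check):

#include <linux/fs.h>
#include <asm/uaccess.h>

static ssize_t acct_write_record(struct file *file, const void *rec,
                                 size_t len)
{
        mm_segment_t old_fs = get_fs();
        loff_t pos = file->f_pos;
        ssize_t ret;

        set_fs(KERNEL_DS);
        /* vfs_write does the mandatory-lock check and fsnotify_modify */
        ret = vfs_write(file, (const char __user *)rec, len, &pos);
        set_fs(old_fs);
        if (ret > 0)
                file->f_pos = pos;
        return ret;
}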

Perhaps someone more knowledgeable than I could go through vfs_read
and vfs_write, distinguishing between those checks which are only
applicable to requests initiated from userspace and those which should
also be performed for in-kernel uses of f_op->read/write?

Cheers,
- Michael


Re: sys_write() racy for multi-threaded append?

2007-03-14 Thread Michael K. Edwards

Thanks Alan, this is really helpful -- I'll see if I can figure out
kprobes.  It is not immediately obvious to me how to use them to trace
function calls in userspace, but I'll read up.

On 3/13/07, Alan Cox <[EMAIL PROTECTED]> wrote:

> But on that note -- do you have any idea how one might get ltrace to
> work on a multi-threaded program, or how one might enhance it to
> instrument function calls from one shared library to another?  Or

I don't know a vast amount about ARM ELF user space so no.


This isn't ARM-specific, although it's certainly an ELF issue.  ltrace
uses libelf to inspect the program binary and replace calls to shared
library functions with breakpoints.  When the program hits one of
these breakpoints, ltrace inspects the function arguments, puts the
function call back in place of the breakpoint, single steps through
it, puts the breakpoint back, does some sort of sleight of hand to
ensure that it will break again on function return, and continues.
When it breaks again on function return, ltrace inspects the return
value and then continues.
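
The core of the trick looks roughly like this on x86 -- a sketch from
memory, not ltrace's actual code; error handling and backing the PC up
to the trap address are omitted:

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

static long plant_breakpoint(pid_t pid, long addr, long *saved)
{
        long word = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);

        *saved = word;
        /* 0xcc is int3 on x86; other arches have their own trap opcode */
        return ptrace(PTRACE_POKETEXT, pid, (void *)addr,
                      (void *)((word & ~0xffL) | 0xcc));
}

static void step_over(pid_t pid, long addr, long saved)
{
        int status;

        ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)saved);
        ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);     /* real insn runs */
        waitpid(pid, &status, 0);
        plant_breakpoint(pid, addr, &saved);            /* re-arm */
}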

I suppose I should wrap my head around the kwatch stuff and see if
there's a way of doing this without intrusive manipulation of the text
segment, which seems to be the thing preventing ltrace from being
extensible to multi-threaded programs.  The ELF handling code in
ltrace is sufficiently opaque to me that I doubt I could make it work
on function calls inside shared libraries without completely rewriting
it -- at which point I'd rather go the extra mile with kprobes, I
think.


> better yet, can you advise me on how to induce gdbserver to stream
> traces of library/syscall entry/exits for all the threads in a
> process?  And then how to cram it down into the kernel so I don't take

One way to do this is to use kprobes which will do exactly what you want
as it builds a kernel module. Doesn't currently run on ARM afaik.


There doesn't seem to be an arch/arm/kernel/kprobes.c, and I'm not
sure I have the skill to write one.  On the other hand, ltrace manages
something similar (through libelf), and I can trace ltrace's ptrace
activity with strace, so maybe I can clone that.  :-)


> the hit for an MMU context switch every time I hit a syscall or

Not easily with gdbstubs as you've got to talk to something to decide how
to log the data and proceed. If you stick it kernel side it's a lot of
ugly new code and easier to port kprobes over, if you do it remotely as
gdbstubs intends it adds latencies and screws all your timings.


Sure does.  I do have gdbserver working remotely but it's not as
useful as I hoped because real-world multi-threaded code is so full of
latent races that it dies instantly under gdb.  Unless, of course,
that multi-threaded code was systematically developed using gdb and
qemu and other techniques of flushing out racy designs before their
implementation is entrenched.  Not typical of the embedded world, I'm
sorry to say.


gdbstubs is also not terribly SMP aware and for low level work its
sometimes easier to have one gdb per processor if you can get your brain
around it.


That's a trick I don't know.  What do you do, fire up the target
process, put the whole thing to sleep with a SIGSTOP, and then attach
a separate gdb to each thread after they've been migrated and locked
down to the destination CPU?

Cheers,
- Michael


f_mapping->host vs. f_path.dentry->d_inode

2007-03-14 Thread Michael K. Edwards

It appears that there are (at least) two ways to get the inode (if
any) associated with an open file: f_mapping->host (pagecache?) and
f_path.dentry->d_inode (dentry cache?).  generic_file_llseek uses
f_mapping->host; everything else in read_write.c uses
f_path.dentry->d_inode.  do_sendfile checks for a null inode on its
input fd but not on its output fd; nothing else in read_write.c checks
for null inode.

Under what circumstances can f_mapping->host and f_path.dentry->d_inode differ?

Under what circumstances can either or both of these pointers be null?

Under what circumstances is it preferable to retrieve the inode from
f_mapping vs. f_path.dentry, either because they differ or because one
or the other is more likely to be warm in cache?

Cheers,
- Michael


Re: sys_write() racy for multi-threaded append?

2007-03-13 Thread Michael K. Edwards

In case anyone cares, this is a snippet of my work-in-progress
read_write.c illustrating how I might handle f_pos.  Can anyone point
me to data showing whether it's worth avoiding the spinlock when the
"struct file" is not shared between threads?  (In my world,
correctness comes before code-bumming as long as the algorithm scales
properly, and there are a fair number of corner cases to think through
-- although one might be able to piggy-back on the logic in
fget_light.)

Cheers,
- Michael

/*
 *  Synchronization of f_pos is not for the purpose of serializing writes
 *  to the same file descriptor from multiple threads.  It is solely to
 *  protect against corruption of the f_pos field leading to a severe
 *  violation of its semantics, such as:
 *  - a user-visible negative value on a file type which POSIX forbids
 *    ever to have a negative offset; or
 *  - an unexpected jump from (say) (2^32 - small) to (2^33 - small),
 *    due to an interrupt between the two 32-bit write instructions
 *    needed to write out an loff_t on some architectures, leading to
 *    a delayed overwrite of half of the f_pos value written by another
 *    thread.  (Applicable to SMP and CONFIG_PREEMPT kernels.)
 *
 *  Three tiers of protection on f_pos may be needed in order to trade off
 *  between performance and least surprise:
 *
 *  1. All f_pos accesses must go through accessors that protect against
 *     problems with atomic 64-bit writes on some platforms.  These
 *     accessors are only atomic with respect to one another.
 *
 *  2. Those few accesses that cannot handle transient negative values of
 *     f_pos must be protected from a race in some llseek implementations
 *     (including generic_file_llseek).  Correct application code should
 *     never encounter this race, and the syscall use cases that are
 *     vulnerable to it are relatively infrequent.  This is a job for an
 *     rwlock, although the sense is inverted (readers need exclusive
 *     access to a "stalled pipeline", while writers only need to be able
 *     to fix things up after the fact in the event of an exception).
 *
 *  3. Applications that cannot handle transient overshoot on f_pos, under
 *     conditions where several threads are writing to the same open file
 *     concurrently and one of them experiences a short write, can be
 *     protected from themselves by an rwsem around vfs_write(v) calls.
 *     (The same applies to multi-threaded reads, mutatis mutandis.)
 *     When CONFIG_WOMBAT (waste of memory, brain, and time -- thanks,
 *     Bodo!) is enabled, this per-struct-file rwsem is taken as necessary.
 */

#define file_pos_local_acquire(file, flags) \
        spin_lock_irqsave(file->f_pos_lock, flags)

#define file_pos_local_release(file, flags) \
        spin_unlock_irqrestore(file->f_pos_lock, flags)

#define file_pos_excl_acquire(file, flags)                              \
        do {                                                            \
                write_lock_irqsave(file->f_pos_rwlock, flags);          \
                spin_lock(file->f_pos_lock);                            \
        } while (0)

#define file_pos_excl_release(file, flags)                              \
        do {                                                            \
                spin_unlock(file->f_pos_lock);                          \
                write_unlock_irqrestore(file->f_pos_rwlock, flags);     \
        } while (0)

#define file_pos_nonexcl_acquire(file, flags)                           \
        do {                                                            \
                read_lock_irqsave(file->f_pos_rwlock, flags);           \
                spin_lock(file->f_pos_lock);                            \
        } while (0)

#define file_pos_nonexcl_release(file, flags)                           \
        do {                                                            \
                spin_unlock(file->f_pos_lock);                          \
                read_unlock_irqrestore(file->f_pos_rwlock, flags);      \
        } while (0)

/*
 *  Accessors for f_pos (the file descriptor "position" for seekable file
 *  types, also of interest as a bytes read/written counter on non-seekable
 *  file types such as pipes and FIFOs).  The f_pos field of struct file
 *  should be accessed exclusively through these functions, so that the
 *  changes needed to interlock these accesses atomically are localized to
 *  the accessor functions.
 *
 *  file_pos_write is defined to return the old file position so that it
 *  can be restored by the caller if appropriate.  (Note that it is not
 *  necessarily guaranteed that restoring the old position will not clobber
 *  a value written by another thread; see below.)  file_pos_adjust is also
 *  defined to return the old file position because it is more often needed
 *  immediately by the caller; the new position can always be obtained by
 *  adding the value passed into the "pos" parameter to file_pos_adjust.
 */

/*
 *  Architectures on which an aligned 64-bit read/write is atomic can omit
 *  l

Re: sys_write() racy for multi-threaded append?

2007-03-13 Thread Michael K. Edwards

On 3/13/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

Michael, please stop spreading this utter bullshit _now_.  You're so
full of half-knowledge that it's not funny anymore, and you try
to insult people knowing a few magnitudes more than you left and right.


Thank you Christoph for that informative response to my comments.  I
take it that you consider read_write.c to be code of the highest
quality and maintainability.  If you have something specific in mind
when you write "utter bullshit" and "half-knowledge", I'd love to hear
it.

Now, for those who still care to respond as if improving the kernel
were a goal that you and I can share, a question:  When
generic_file_llseek needs the inode in order to retrieve the current
file size, it goes through f_mapping (the pagecache entry?) rather
than through f_path.dentry (the dentry cache?).  All other inode
retrievals in read_write.c go through f_path.dentry.  Why?  Or is this
a question that can only be asked on linux-fsdevel?

Cheers,
- Michael


Re: sys_write() racy for multi-threaded append?

2007-03-13 Thread Michael K. Edwards

Clearly f_pos atomicity has been handled differently in the not-so-distant past:

http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg01628.html

And equally clearly the current generic_file_llseek semantics are
erroneous for large offsets, and we shouldn't be taking the inode
mutex in any case other than SEEK_END:

http://marc.theaimsgroup.com/?l=linux-fsdevel&m=100584441922835&w=2

read_write.c is a perfect example of the relative amateurishness of
parts of the Linux kernel.  It has probably gone through a dozen major
refactorings in fifteen years but has never acquired a "theory of
operations" section justifying the errors returned in terms of
standards and common usage patterns.  It doesn't have assertion-style
precondition/postcondition checks.  It's full of bare constants and
open-coded flag tests.  It exposes sys_* and vfs_* and generic_file_*
without any indication of which sanity checks you're bypassing when
you call the inner functions directly.  It leaks performance right and
left with missing likely() macros and function calls that can't be
inlined because there's no interface/implementation split.  And then
you want to tell me that it costs too much to spinlock around f_pos
updates when file->f_count > 1?  Give me a break.


Re: sys_write() racy for multi-threaded append?

2007-03-13 Thread Michael K. Edwards

On 3/13/07, David Miller <[EMAIL PROTECTED]> wrote:

You're not even safe over standard output, simply run the program over
ssh and you suddenly have socket semantics to deal with.


I'm intimately familiar with this one.  Easily worked around by piping
the output through cat or tee.  Not that one should ever write code
for a *nix box that can't cope with stdout being a socket or tty; but
sometimes the quickest way to glue existing code into a pipeline is to
pass /dev/stdout in place of a filename, and there's no real value in
reworking legacy code to handle short writes when you can just drop in
cat.


In short, if you don't handle short writes, you're writing a program
for something other than unix.


Right in one.  You're writing a program for a controlled environment,
whether it's a Linux-only API (netlink sockets) or a Linux-only
embedded product.  And there's no need for that controlled environment
to gratuitously reproduce the overly vague semantics of the *nix zoo.


We're not changing write() to interlock with other parallel callers or
messing with the f_pos semantics in such cases, that's stupid, please
cope, kthx.


This is not what I am proposing, and certainly not what I'm
implementing.  Right now f_pos handling is gratuitously thread-unsafe
even in the common, all writes completed normally case.  writev()
opens the largest possible window to f_pos corruption by delaying the
f_pos update until after the vfs_writev() completes.  That's the main
thing I want to see fixed.

_If_ it proves practical to wrap accessor functions around f_pos
accesses, and _if_ accesses to f_pos from outside read_write.c can be
made robust against transient overshoot, then there is no harm in
tightening up f_pos semantics so that successful pipelined writes to
the same file don't collide so easily.  If read-modify-write on f_pos
were atomic, it would be easy to guarantee that it is in a sane state
any time a syscall is not actually outstanding on that struct file.
There's even a potential performance gain from streamlining the
pattern of L1 cache usage in the common case, although I expect it's
negligible.

Making read-modify-write atomic on f_pos is of course not free on all
platforms.  But once you have the accessors, you can decide at kernel
compile time whether or not to interlock them with spinlocks and
memory barriers or some such.  Most platforms probably needn't bother;
there are much bigger time bombs lurking elsewhere in the kernel.  x86
is a bit dicey, because there's no way to atomically update a 64-bit
loff_t in memory, and so in principle you could get a race on carry
across the 32-bit boundary.  (This file instantaneously jumped from
4GB to 8GB in size?  What the hell?  How long would it take you to
spot a preemption between writing out the upper and lower halves of a
64-bit integer?)
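
To make that concrete, the accessor might look something like this --
a sketch only; CONFIG_ATOMIC_FPOS and f_pos_lock are names invented
for illustration:

#ifdef CONFIG_ATOMIC_FPOS
static inline loff_t fpos_add_return(struct file *file, loff_t delta)
{
        loff_t new;

        spin_lock(&file->f_pos_lock);   /* closes the carry race on 32-bit */
        new = file->f_pos + delta;
        file->f_pos = new;
        spin_unlock(&file->f_pos_lock);
        return new;
}
#else
static inline loff_t fpos_add_return(struct file *file, loff_t delta)
{
        /* platforms whose 64-bit stores can't tear skip the lock */
        return file->f_pos += delta;
}
#endif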

In any case, I have certainly gotten the -ENOPATCH message by now.  It
must be nice not to have to care whether things work in the field if
you can point at something stupid that an application programmer did.
I don't always have that luxury.

Cheers,
- Michael


Re: sys_write() racy for multi-threaded append?

2007-03-12 Thread Michael K. Edwards

On 3/12/07, Alan Cox <[EMAIL PROTECTED]> wrote:

> Writing to a file from multiple processes is not usually the problem.
> Writing to a common "struct file" from multiple threads is.

Not normally because POSIX sensibly invented pread/pwrite. Forgot
preadv/pwritev but they did the basics and end of problem


pread/pwrite address a minuscule fraction of lseek+read(v)/write(v)
use cases -- a fraction that someone cared about strongly enough to
get into X/Open CAE Spec Issue 5 Version 2 (1997), from which it
propagated into UNIX98 and thence into POSIX.1-2001.  The fact that no
one has bothered to implement preadv/pwritev in the decade since
pread/pwrite entered the Single UNIX standard reflects the rarity with
which they appear in general code.  Life is too short to spend it
rewriting application code that uses readv/writev systematically,
especially when that code is going to ship inside a widget whose
kernel you control.


> So what?  My products are shipping _now_.

That doesn't inspire confidence.


Oh, please.  Like _your_ employer is the poster child for code
quality.  The cheap shot is also irrelevant to the point that I was
making, which is that sometimes portability simply doesn't matter and
the right thing to do is to firm up the semantics of the filesystem
primitives from underneath.


> even funny.  If POSIX mandates stupid shit, and application
> programmers don't read that part of the manual anyway (and don't code
> on that assumption in practice), to hell with POSIX.  On many file

That's funny, you were talking about quality a moment ago.


Quality means the devices you ship now keep working in the field, and
the probable cost of later rework if the requirements change does not
exceed the opportunity cost of over-engineering up front.  Economy
gets a look-in too, and says that it's pointless to delay shipment and
bloat the application coding for cases that can't happen.  If POSIX
says that any and all writes (except small pipe/FIFO writes, whatever)
can return a short byte count -- but you know damn well you're writing
to a device driver that never, ever writes short, and if it did you'd
miss a timing budget recovering from it anyway -- to hell with POSIX.
And if you want to build a test jig for this code that uses pipes or
dummy files in place of the device driver, that test jig should never,
ever write short either.


> descriptors, short writes simply can't happen -- and code that

There is almost no descriptor this is true for. Any file I/O can and will
end up short on disk full or resource limit exceeded or quota exceeded or
NFS server exploded or ...


Not on a properly engineered widget, it won't.  And if it does, and
the application isn't coded to cope in some way totally different from
an infinite retry loop, then you might as well signal the exception
condition using whatever mechanism is appropriate to the API
(-EWHATEVER, SIGCRISIS, or block until some other process makes room).
And in any case files on disk are the least interesting kind of file
descriptor in an embedded scenario -- devices and pipes and pollfds
and netlink sockets are far more frequent read/write targets.


And on the device side about the only thing with the vaguest guarantees
is pipe().


Guaranteed by the standard, sure.  Guaranteed by the implementation,
as long as you write in the size blocks that the device is expecting?
Lots of devices -- ALSA's OSS PCM emulation, most AF_LOCAL and
AF_NETLINK sockets, almost any "character" device with a
record-structured format.  A short write to any of these almost
certainly means the framing is screwed and you need to close and
reopen the device.  Not all of these are exclusively O_APPEND
situations, and there's no reason on earth not to thread-safe the
f_pos handling so that an application and filesystem/driver can agree
on useful lseek() semantics.


> purports to handle short writes but has never been exercised is
> arguably worse than code that simply bombs on short write.  So if I
> can't shim in an induce-short-writes-randomly-on-purpose mechanism
> during development, I don't want short writes in production, period.

Easy enough to do and gcov plus dejagnu or similar tools will let you
coverage analyse the resulting test set and replay it.


Here we agree.  Except that I've rarely seen embedded application code
that wouldn't explode in my face if I tried it.  Databases yes, and
the better class of mail and web servers, and relatively mature
scripting languages and bytecode interpreters; but the vast majority
of working programmers in these latter days do not exercise this level
of discipline.
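
(The discipline in question is nothing more exotic than this -- a
minimal userspace sketch:

#include <errno.h>
#include <unistd.h>

static ssize_t write_all(int fd, const void *buf, size_t len)
{
        const char *p = buf;
        size_t left = len;

        while (left > 0) {
                ssize_t n = write(fd, p, left);
                if (n < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, not failed */
                        return -1;              /* real error; errno is set */
                }
                p += n;                         /* short write: advance, retry */
                left -= n;
        }
        return len;
}

-- plus actually exercising the short-write branch in test, which is
the part that never happens.)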


> Sure -- until the one code path in a hundred that handles the "short
> write" case incorrectly gets traversed in production, after having
> gone untested in a development environment that used a different
> filesystem that never happened to trigger it.

Competent QA and testing people test all the returns in the manual as
well as all the returns they can find in the code.

Re: sys_write() racy for multi-threaded append?

2007-03-12 Thread Michael K. Edwards

On 3/12/07, Bodo Eggert <[EMAIL PROTECTED]> wrote:

On Mon, 12 Mar 2007, Michael K. Edwards wrote:
> That's fine when you're doing integration test, and should probably be
> the default during development.  But if the race is first exposed in
> the field, or if the developer is trying to concentrate on a different
> problem, "spectacular crash and burn" may do more harm than good.
> It's easy enough to refactor the f_pos handling in the kernel so that
> it all goes through three or four inline accessor functions, at which
> point you can choose your trade-off between speed and idiot-proofness
> -- at _kernel_ compile time, or (given future hardware that supports
> standardized optionally-atomic-based-on-runtime-flag operations) per
> process at run-time.

CONFIG_WOMBAT

Waste memory, brain and time in order to grant an atomic write which is
neither guaranteed by the standard nor expected by any sane programmer,
just in case some idiot tries to write to one file from multiple
processes.

Warning: Programs expecting this behaviour are buggy and non-portable.


OK, I laughed out loud at this.  But I think you're missing my point,
which is that there's a time to be hard-core about code quality and
there's a time to be hard-core about _product_ quality.  Face it, all
products containing software more or less suck.  This is because most
programmers write crap code most of the time.  The only way to cope
with this, outside the confines of the European defense industry and
other niches insulated from economic reality, is to make the
production environment gentler on _application_ code than the
development environment is.  Hence CONFIG_WOMBAT.  (I like that name.
I'm going to use it in my patch, with your permission.  :-)

Writing to a file from multiple processes is not usually the problem.
Writing to a common "struct file" from multiple threads is.  99.999%
of the time it will work, because you're only writing as far as VFS
cache and then bumping f_pos, and your threads are probably on the
same processor anyway.  0.001% of the time the second thread will see
a stale f_pos and clobber the first write.  This is true even on file
types that can never return a short write.  If you remember to open
with O_APPEND so the pos argument to vfs_write is silently ignored, or
if the implementation underlying vfs_write effectively ignores the pos
argument irrespective of flags, you're OK.  If the pos argument isn't
ignored, or if you ever look at the result of a relative seek on any
fd that maps to that struct file, you're screwed.

(Note to the alert reader:  yes, this means shell scripts should
always use >> rather than > when routing stdout and/or stderr to a
file.  You're just as vulnerable to interleaving due to stdio
buffering issues as you are when stdout and stderr are sent to the tty,
and short writes may still be a problem if you are so foolish as to
use a filesystem that generates them on anything short of a
catastrophic error, but at least you get O_APPEND and sane behavior on
ftruncate().)
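
The race itself is trivial to demonstrate.  Something like the
following -- two threads appending through a shared struct file
without O_APPEND, on a kernel of this vintage -- will usually leave
race.out shorter than the 2,000,000 bytes written:

#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

static int fd;

static void *writer(void *arg)
{
        char buf[100];
        int i;

        memset(buf, (int)(long)arg, sizeof buf);
        for (i = 0; i < 10000; i++)
                write(fd, buf, sizeof buf);  /* read f_pos, write, store f_pos */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        /* no O_APPEND: both threads go through the racy f_pos path */
        fd = open("race.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        pthread_create(&a, NULL, writer, (void *)(long)'A');
        pthread_create(&b, NULL, writer, (void *)(long)'B');
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

Open with O_APPEND instead and the byte count comes out exact.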


> Frankly, I think that unless application programmers poke at some sort
> of magic "I promise to handle short writes correctly" bit, write()
> should always return either the full number of bytes requested or an
> error code.

If you assume that you won't have short writes, your programs may fail on
e.g. Solaris. There may be reasons for Linux to use the same semantics at
some time in the future, you never know.


So what?  My products are shipping _now_.  Future kernels are
guaranteed to break them anyway because sysfs is a moving target.
Solaris is so not in the game for my kind of embedded work, it's not
even funny.  If POSIX mandates stupid shit, and application
programmers don't read that part of the manual anyway (and don't code
on that assumption in practice), to hell with POSIX.  On many file
descriptors, short writes simply can't happen -- and code that
purports to handle short writes but has never been exercised is
arguably worse than code that simply bombs on short write.  So if I
can't shim in an induce-short-writes-randomly-on-purpose mechanism
during development, I don't want short writes in production, period.

In my world, GNU/Linux is not a crappy imitation Solaris that you get
to pay out the wazoo for to Red Hat (and get no documentation and
lousy tech support that doesn't even cover your hardware).  It's a
full-source-code platform on which you can engineer robust industrial
and consumer products, because you can control the freeze and release
schedule component-by-component, and you can point fix anything in the
system at any time.  If, that is, you understand that the source code
is not the software, and that you can't retrofit stability and
security overnight onto code that was written with no thought of
anythi

Re: sys_write() racy for multi-threaded append?

2007-03-12 Thread Michael K. Edwards

On 3/12/07, Bodo Eggert <[EMAIL PROTECTED]> wrote:

Michael K. Edwards <[EMAIL PROTECTED]> wrote:
> Actually, I think it would make the kernel (negligibly) faster to bump
> f_pos before the vfs_write() call.

This is a security risk.


other process:
unlink(secret_file)

Thread 1:
write(fd, large)
(interrupted)

Thread 2:
lseek(fd, -n, SEEK_CUR)
read(fd, buf)



I don't entirely follow this.  Which process is supposed to be secure
here, and what information is supposed to be secret from whom?  But in
any case, are you aware that f_pos is clamped to the inode's
max_bytes, not to the current file size?  Thread 2 can seek/read past
the position at which Thread 1's write began, whether f_pos is bumped
before or after the vfs_write.


BTW: The best thing you can do to a program where two threads race for
writing one fd is to let it crash and burn in the most spectacular way
allowed without affecting the rest of the system, unless it happens to
be a pipe and the number of bytes written is less than PIPE_BUF.


That's fine when you're doing integration test, and should probably be
the default during development.  But if the race is first exposed in
the field, or if the developer is trying to concentrate on a different
problem, "spectacular crash and burn" may do more harm than good.
It's easy enough to refactor the f_pos handling in the kernel so that
it all goes through three or four inline accessor functions, at which
point you can choose your trade-off between speed and idiot-proofness
-- at _kernel_ compile time, or (given future hardware that supports
standardized optionally-atomic-based-on-runtime-flag operations) per
process at run-time.

Frankly, I think that unless application programmers poke at some sort
of magic "I promise to handle short writes correctly" bit, write()
should always return either the full number of bytes requested or an
error code.  If they do poke at that bit, the (development) kernel
should deliberately split writes a few percent of the time, just to
exercise the short-write code paths.  And in order to find out where
that magic bit is, they should have to read the kernel code or ask on
LKML (and get the "standard lecture").

Really very IEEE754-like, actually.  (Harp, harp.)
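
(In the meantime, the shim can be faked in userspace.  A sketch -- an
LD_PRELOAD interposer, emphatically not the kernel mechanism I'm
describing:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <unistd.h>

/* build with: gcc -shared -fPIC -o shortwrite.so shortwrite.c -ldl */
ssize_t write(int fd, const void *buf, size_t count)
{
        static ssize_t (*real_write)(int, const void *, size_t);

        if (!real_write)
                real_write = (ssize_t (*)(int, const void *, size_t))
                                dlsym(RTLD_NEXT, "write");
        if (count > 1 && rand() % 100 < 5)
                count /= 2;             /* split a few percent of writes */
        return real_write(fd, buf, count);
}

Run a test suite under that and watch the short-write paths light up
-- or, more likely, explode.)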

Cheers,
- Michael


Re: sys_write() racy for multi-threaded append?

2007-03-09 Thread Michael K. Edwards

I apologize for throwing around words like "stupid".  Whether or not
the current semantics can be improved, that's not a constructive way
to characterize them.  I'm sorry.

As three people have ably pointed out :-), the particular case of a
pipe/FIFO isn't seekable and doesn't need the f_pos member anyway
(it's effectively always O_APPEND).  That's what I get for checking
against standards documents at 3AM.  Of course, this has nothing to do
with the point that led me to comment on pipes/FIFOs (which was that
there exist file types that never return 0).  As 1003.1 puts it:
The behavior of lseek() on devices which are incapable of seeking is
implementation-defined. The value of the file offset associated with
such a device is undefined.


Tracking f_pos accurately when writes from multiple threads hit the
same fd (pipe or not) isn't portable, but I recall situations where it
would have been useful.  And if f_pos has to be kept at all in the
uncontended case, it costs you little or nothing to do it in a
thread-safe manner -- as long as you don't overconstrain the semantics
such that you forbid the transient overshoot associated with a short
write.  In fact, unless there's something I've missed, increasing
f_pos before entering vfs_write() happens to be _faster_ than the
current code for common load patterns, both single- and multi-threaded
(although getting the full benefit in the multi-threaded case will
take some fiddling with f_count placement).

I say it costs "little or nothing" only because altering an loff_t
atomically is not free.  But even on x86, with its inability to
atomically modify any 64-bit entity in memory, an uncontended spinlock
on a cacheline already in L1 is so cheap that making the f_pos changes
atomic will (I think) be lost in the noise.

In any case, rewriting read_write.c is proving interesting.  I'll let
you all know if anything comes of it.  In the meantime, thanks for
your (really quite friendly under the circumstances) comments.

Cheers,
- Michael


Re: sys_write() racy for multi-threaded append?

2007-03-09 Thread Michael K. Edwards

On 3/8/07, Benjamin LaHaise <[EMAIL PROTECTED]> wrote:

Any number of things can cause a short write to occur, and rewinding the
file position after the fact is just as bad.  A sane app has to either
serialise the writes itself or use a thread safe API like pwrite().


Not on a pipe/FIFO.  Short writes there are flat out verboten by
1003.1 unless O_NONBLOCK is set.  (Not that f_pos is interesting on a
pipe except as a "bytes sent" indicator  -- and in the multi-threaded
scenario, if you do the speculative update that I'm suggesting, you
can't 100% trust it unless you ensure that you are not in
mid-read/write in some other thread at the moment you sample f_pos.
But that doesn't make it useless.)

As to what a "sane app" has to do: it's just not that unusual to write
application code that treats a short read/write as a catastrophic
error, especially when the fd is of a type that is known never to
produce a short read/write unless something is drastically wrong.  For
instance, I bomb on short write in audio applications where the driver
is known to block until enough bytes have been read/written, period.
When switching from reading a stream of audio frames from thread A to
reading them from thread B, I may be willing to omit app
serialization, because I can tolerate an imperfect hand-off in which
thread A steals one last frame after thread B has started reading --
as long as the fd doesn't get screwed up.  There is no reason for the
generic sys_read code to leave a race open in which the same frame is
read by both threads and a hardware buffer overrun results later.

In short, I'm not proposing that the kernel perfectly serialize
concurrent reads and writes to arbitrary fd types.  I'm proposing that
it not do something blatantly stupid and easily avoided in generic
code that makes it impossible for any fd type to guarantee that, after
10 successful pipelined 100-byte reads or writes, f_pos will have
advanced by 1000.

Cheers,
- Michael


Re: sys_write() racy for multi-threaded append?

2007-03-09 Thread Michael K. Edwards

On 3/8/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:

Don't even try, you *cannot* do that, without breaking the standards, or
without a performance drop.

The only safe way would be to lock the file during the whole read()/write()
syscall, and we don't want this (this would be more expensive than current).
Don't forget 'file' may be some sockets/tty/whatever, not a regular file.


I'm not trying to provide full serialization of concurrent
multi-threaded read()/write() in all exceptional scenarios.  I am
trying to think through the semantics of pipelined I/O operations on
struct file.  In the _absence_ of an exception, something sane should
happen -- and when you start at f_pos == 1000 and write 100 bytes each
from two threads (successfully), leaving f_pos at 1100 is not sane.


Standards are saying:

If an error occurs, file pointer remains unchanged.



From 1003.1, 2004 edition:

This volume of IEEE Std 1003.1-2001 does not specify the value of the
file offset after an error is returned; there are too many cases. For
programming errors, such as [EBADF], the concept is meaningless since
no file is involved. For errors that are detected immediately, such as
[EAGAIN], clearly the pointer should not change. After an interrupt or
hardware error, however, an updated value would be very useful and is
the behavior of many implementations.

This volume of IEEE Std 1003.1-2001 does not specify behavior of
concurrent writes to a file from multiple processes. Applications
should use some form of concurrency control.


The effect on f_pos of an error during concurrent writes is therefore
doubly unconstrained.  In the absence of concurrent writes, it is
quite harmless for f_pos to have transiently contained, at some point
during the execution of write(), an overestimate of the file position
after the write().  In the presence of concurrent writes (let us say
two 100-byte writes to a file whose f_pos starts at 1000), it is
conceivable that the second write will succeed at f_pos == 1100 but
the first will be short (let us say only 50 bytes are written),
leaving f_pos at 1150 and no bytes written in the range 1050 to 1099.
That would suck -- but the standard does not oblige you to avoid it
unless the destination is a pipe or FIFO with O_NONBLOCK clear, in
which case partial writes are flat out verboten.


You cannot know for sure how many bytes will be written, since write() can
return a count that is different than buflen.


Of course (except in the pipe/FIFO case).  But if it does, and you're
writing concurrently to the fd, you're not obliged to do anything sane
at all.  If you're not writing concurrently, the fact that you
overshot and then fixed it up after vfs_write() returned is _totally_
invisible.  f_pos is local to the struct file.


So you cannot update fpos before calling vfs_write()


You can speculatively update it, and in the common case you don't have
to touch it again.  That's a win.


About your L1 'miss', don't forget that multi-threaded apps are going to
atomic_dec_and_test(&file->f_count) anyway when fput() is done at the end of
the syscall. And you were concerned about multi-threaded apps, weren't you?


That does indeed interfere with the optimization for multi-threaded
apps.  Which doesn't mean it isn't worth having for single-threaded
apps.  And if we get to the point that that atomic_dec_and_test is the
only thing left (in the common case) that touches the struct file
after a VFS operation completes, then we can evaluate whether f_count
ought to be split out of the struct file and put somewhere else.  In
fact, if I understand the calls inside vfs_write() correctly, f_count
is (usually?) the only member of struct file that is written to during
a call to sys_pwrite64(); so moving it out of struct file and into
some place where it has to be kept cache-coherent anyway would also
improve the performance on SMP of distributed pwrite() calls to a
process-global fd.

Cheers,
- Michael


Re: sys_write() racy for multi-threaded append?

2007-03-08 Thread Michael K. Edwards

On 3/8/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:

Absolutely not. We don't want to slow down the kernel 'just in case a fool
might want to do crazy things'.


Actually, I think it would make the kernel (negligibly) faster to bump
f_pos before the vfs_write() call.  Unless fget_light sets fput_needed
or the write doesn't complete cleanly, you won't have to touch the
file table entry again after vfs_write() returns.  You can adjust
vfs_write to grab f_dentry out of the file before going into
do_sync_write.  do_sync_write is done with the struct file before it
goes into the aio_write() loop.  Result: you probably save at least an
L1 cache miss, unless the aio_write loop is so frugal with L1 cache
that it doesn't manage to evict the struct file.

Patch to follow.

Cheers,
- Michael


Re: sys_write() racy for multi-threaded append?

2007-03-08 Thread Michael K. Edwards

On 3/8/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:

Nothing in the manuals says that write() on the same fd should be non-racy: in
particular, the file pos might be undefined. There is a reason pwrite() exists.

The kernel doesn't have to enforce thread safety, as the standard is quite clear.


I know the standard _allows_ us to crash and burn (well, corrupt
f_pos) when two threads race on an fd, but why would we want to?
Wouldn't it be better to do something at least slightly sane, like add
atomically to f_pos the expected number of bytes written,
then do the write, then fix it up (again atomically) if vfs_write
returns an unexpected pos?


Only the O_APPEND case is specially handled (and NFS might fail to handle this
case correctly)


Is it?  How?


sys_write() racy for multi-threaded append?

2007-03-08 Thread Michael K. Edwards

from sys_write():

        file = fget_light(fd, &fput_needed);
        if (file) {
                loff_t pos = file_pos_read(file);
                ret = vfs_write(file, buf, count, &pos);
                file_pos_write(file, pos);
                fput_light(file, fput_needed);
        }

Surely that's racy when two threads write to the same fd and the first
vfs_write call blocks or is preempted.  Or does fget_light take some
per-file lock that I'm not seeing?


Re: [patch 2/5] signalfd v2 - signalfd core ...

2007-03-08 Thread Michael K. Edwards

On 3/8/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:

So anybody who would "go with the Berkeley crowd" really shows a lot of
bad taste, I'm afraid. The Berkeley crowd really messed up here, and it's
so long ago that I don't think there is any point in us trying to fix it
any more.


Well, they did invent the socket, which sucks less than a lot of other
things.  You would prefer STREAMS, perhaps?  :-)  My point is that
this is a message-oriented problem, not a stream-oriented problem, and
read() just isn't a very good interface for passing structured
messages around.  I agree that the details of recvmsg() ancillary data
are fairly grotty (optimization based on micro-benchmarks, as usual,
back in the PDP/VAX days), but the concept isn't that bad; you let
netlink sockets into the kernel, didn't you?


(But if somebody makes recvmsg a general VFS interface and makes it just
work for everything, I'd probably take the patch as a cleanup. I really
think it should have been a "struct file_operations" thing rather than
being a socket-only thing.. But since you couldn't portably use it
anyway, the thing is pretty moot)


sendmsg()/recvmsg() to a file makes perfect sense, if you put the
equivalent of an llseek tuple into ancillary data.  And it's also IMHO
the right way to do file AIO -- open up a netlink socket to the entity
that's scheduling the IOs for a given filesystem, use the SCM_RIGHTS
mechanism to do indirect opens (your credentials could come over an
AF_UNIX socket from a completely separate "open server" process), and
multiplex all the AIO to that filesystem across the netlink socket.

If adding this to VFS is something you would seriously consider, I'll
do the work.  But I will need some coaching to work around my relative
inexperience with the Linux core code and my lack of taste.  :-)

Cheers,
- Michael


Re: [patch 2/5] signalfd v2 - signalfd core ...

2007-03-08 Thread Michael K. Edwards

On 3/8/07, Davide Libenzi  wrote:

The reason for the special function, was not to provide a non-blocking
behaviour with zero timeout (that's just a side effect), but to read the
siginfo. I was all about using read(2) (and v1 used it), but when you have
to transfer complex structures over it, it becomes hell. How do you
cleanly compat over a f_op->read callback for example?


Make it a netlink socket and fetch your structures using recvmsg().
siginfo_t belongs in ancillary data.

The UNIX philosophy is "everything's a file".  The Berkeley philosophy
is "everything's a socket, except for files, which are feeble
mini-sockets".  I'd go with the Berkeley crowd here.
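
For concreteness, here is roughly what the receive side would look
like.  SOL_SIGNAL and SCM_SIGINFO are names I just made up -- no such
protocol exists; the point is the cmsg plumbing, which read(2) simply
can't express:

#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define SOL_SIGNAL  0x10        /* invented placeholder */
#define SCM_SIGINFO 0x01        /* invented placeholder */

static int recv_siginfo(int sigsock, siginfo_t *info)
{
        char payload[64], cbuf[CMSG_SPACE(sizeof(siginfo_t))];
        struct iovec iov = { .iov_base = payload, .iov_len = sizeof payload };
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = cbuf, .msg_controllen = sizeof cbuf,
        };
        struct cmsghdr *cmsg;

        if (recvmsg(sigsock, &msg, 0) < 0)
                return -1;
        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg))
                if (cmsg->cmsg_level == SOL_SIGNAL &&
                    cmsg->cmsg_type == SCM_SIGINFO)
                        memcpy(info, CMSG_DATA(cmsg), sizeof *info);
        return 0;
}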

Cheers,
- Michael


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Michael K. Edwards

On 3/7/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:

No, I just checked, and Intel's own optimization manual makes it clear
that you should be careful. They talk about performance penalties due to
resource constraints - which makes tons of sense with a core that is good
at handling its own resources and could quite possibly use those resources
better to actually execute the loads and stores deeper down the
instruction pipeline.


Certainly you should be careful -- and that usually means leaving it
up to the compiler.  But hinting to the compiler can help; there may
be an analogue of the (un)likely macros waiting to be implemented for
loop prefetch.  And while out-of-order execution and fancy hardware
prefetch streams greatly reduce the need for explicit prefetch in
general code, there's no substitute for the cache _bypassing_
instructions when trying to avoid excessive cache eviction in DDoS
situations.

For instance, if I wind up working on a splay-tree variant of Robert
Olsson's trie/hash work, I'll try to measure the effect of using SSE2
non-temporal stores to write half-open connections to the leaves of
the tree.  That may give some additional improvement in the ability to
keep servicing real load during a SYN flood.
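
The store side of that is a handful of SSE2 intrinsics -- a sketch,
with a 64-byte record assumed for illustration and dst assumed to be
16-byte aligned:

#include <emmintrin.h>

static void store_record_nt(void *dst, const void *src)
{
        const __m128i *s = src;
        __m128i *d = dst;
        int i;

        for (i = 0; i < 4; i++)                 /* 4 x 16 = 64 bytes */
                _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();                           /* order the NT stores */
}

The stream stores go around the cache hierarchy, so a flood of
half-open entries doesn't evict the working set.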


So it's not just 3DNow! making AMD look bad, or Intel would obviously
suggest people use it out of the wazoo ;)


Intel puts a lot of effort into educating compiler writers about the
optimal prefetch insertion strategies for particular cache
architectures.  At the same time, they put out the orange cones to
warn people off of hand-tuning prefetch placement using
micro-benchmarks.  People did that when 3DNow! first came out, with
predictable consequences.


> XScale gets it right.

Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching.
Calling that "getting it right" is ludicrous. If anything, it gets things
so wrong that prefetching is *required* for good performance.


That's not an accident.  Hardware prefetch units cost a lot in power
consumption.  Omitting the hardware prefetch unit and drastically
simplifying the pipeline is how they got a design whose clock they
could crank into the stratosphere and still run on battery power.  And
in the network processor space, they can bolt a myriad of on-chip
microengines and still have some prayer of accurately simulating the
patterns of internal bus cycles.  Errors in simulation can still be
fixed up with prefetch instruction placement to put memory accesses
from the XScale core into phases where the data path processors aren't
working so hard.

Moreover, because they're embedded targets and rarely have to run
third-party binaries originally compiled for older cores, it didn't
really cost them anything to say, "Sorry, this chip's performance is
going to suck if your compiler's prefetch insertion isn't properly
tuned."  The _only_ cost is a slightly less dense instruction stream.
That's not trivial but it's manageable; you budget for it, and the
increase in I-cache power consumption is more than made up for by the
reduction in erroneous data prefetches (hardware prefetch gets it
wrong a substantial fraction of the time).


I'm talking about real CPU's with real memory pipelines that already do
prefetching in hardware. The better the core is, the less the prefetch
helps (and often the more it hurts in comparison to how much it helps).


The more sophisticated the core is, the less software prefetch
instructions help.  But more sophisticated isn't always better; it
depends on your target applications.


But if you mean "doesn't try to fill the TLB on data prefetches", then
yes, that's generally the right thing to do.


AOL.


> (Oddly, Prescott seems to have initiated a page table walk on DTLB miss
> during software prefetch -- just one of many weird Prescott flaws.)

Netburst in general is *very* happy to do speculative TLB fills, I think.


Design by micro-benchmark.  :-)  They set out to push the headline MHz
and the real memory bandwidth to the limit in Prescott, and they
succeeded (data at
http://www.digit-life.com/articles2/rmma/rmma-p4.html).  At a horrible
cost in power per clock, and no gain in real application performance.
So NetBurst died a horrible death, and now we have "Intel Core" -- P6
warmed over, with caches sized such that for most applications the
second core ought to be used solely to soak up control path overheads.

Windows runs better on dual-core machines because the NT kernel will
happily eat an entire core doing memory bookkeeping.  Linux could take
a hint here and use the second core largely for interrupt handling and
force-priming the L2 cache on task switch.  (Prefetch instructions
aren't much use here, precisely because they give up on DTLB miss.)
Any kernel code paths that are known to stall a lot because of
usually-cold-cache access patterns (TCP connection establishment, for
instance) can also be punted over to the second core.  If you're
feeling industrious, use non-temporal memory accesses judiciously.

Re: f_owner.lock and file->pos updates

2007-03-07 Thread Michael K. Edwards

I wrote:

I didn't see any clean way to intersperse overwrites and appends to a
record-structured file without using vfs_llseek, which steps on f_pos.


The context, of course, is an attempt to fix -ENOPATCH with regard to
the netlink-based AIO submission scheme I outlined a couple of days
ago.  :-)

Maybe f_pos should be advanced atomically by the number of bytes
expected to be read/written, before entering the vfs_(read|write)(|v)
call?  And then if the read/write doesn't complete normally, f_pos
should be decremented by the number of bytes we failed to read/write?
Or do we have to make absolutely, positively sure that sampling f_pos
from another thread never returns any value outside (before)..(before
+ bytes read/written)?  If so, the only way to cure the worst symptom
of the append race appears to be to hold a per-fd lock for the
duration of the sys_(read|write).
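
In code, the speculative variant would be something like this -- a
sketch, with fpos_add() standing in for whatever atomic accessor we
end up with:

ssize_t speculative_write(struct file *file, const char __user *buf,
                          size_t count)
{
        /* advance f_pos optimistically; remember where we started */
        loff_t pos = fpos_add(file, count) - count;
        ssize_t written = vfs_write(file, buf, count, &pos);

        if (written < 0)
                fpos_add(file, -(loff_t)count);         /* back out entirely */
        else if ((size_t)written < count)               /* trim the overshoot */
                fpos_add(file, (loff_t)written - (loff_t)count);
        return written;
}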

Cheers,
- Michael


Re: f_owner.lock and file->pos updates

2007-03-07 Thread Michael K. Edwards

On 3/7/07, Alan Cox <[EMAIL PROTECTED]> wrote:

The right way IMHO would be to do the work that was done for pread/pwrite
and implement preadv/pwritev. The moment you want to do atomic things
with the file->f_pos instead of doing it with a local passed pos value it
gets ugly.. why do you need to do it with f_pos ?


I didn't see any clean way to intersperse overwrites and appends to a
record-structured file without using vfs_llseek, which steps on f_pos.
Actually, we may already have a problem with append races in
sys_write/sys_writev.  If two threads write() to the same file through
the same struct file (both intending to append), they
may wind up passing the same "pos" value into vfs_write().  Or does
fget_light/fput_light do some sort of locking that I'm not seeing?

Cheers,
- Michael


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Michael K. Edwards

On 3/7/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:

Yeah, I'm not at all surprised. Any implementation of "prefetch" that
doesn't just turn into a no-op if the TLB entry doesn't exist (which makes
them weaker for *actual* prefetching) will generally have a hard time with
a NULL pointer. Exactly because it will try to do a totally unnecessary
TLB fill - and since most CPU's will not cache negative TLB entries, that
unnecessary TLB fill will be done over and over and over again..


Data prefetch instructions should indeed avoid page table walks.
(Instruction prefetch mechanisms often do induce table walks on ITLB
miss.)  Not just because of the null pointer case, but because it's
quite normal to run off the end of an array in a loop with an embedded
prefetch instruction.  If you have an extra instruction issue unit
that shares the same DTLB, and you know you will really want that
data, you can sometimes use it to force DTLB preloads by doing an
actual data fetch from the foreseeable page.  This is potentially one
of the best uses of chip multi-threading on an architecture like Sun's
Niagara.

(I don't think Intel's hyper-threading works for this purpose; the
DTLB is shared but the entries are marked as owned by one thread or
the other.  HT can be used for L2 cache prefetching, although the
results so far seem to be mixed:
http://www.cgo.org/cgo2004/papers/02_80_Kim_D_REVISED.pdf)


In general, using software prefetching is just a stupid idea, unless

 - the prefetch really is very strict (ie for a linked list you do exactly
   the above kinds of things to make sure that you don't try to prefetch
   the non-existent end entry)
AND
 - the CPU is stupid (in-order in particular).

I think Intel even suggests in their optimization manuals to *not* do
software prefetching, because hw can usually simply do better without it.


Not the XScale -- it performs quite poorly without prefetch, as people
who have run ARMv5-optimized binaries on it can testify.  From the
XScale Core Developer's Manual:


The Intel XScale(r) core has a true prefetch load instruction (PLD).
The purpose of this instruction is to preload data into the data and
mini-data caches. Data prefetching allows hiding of memory transfer
latency while the processor continues to execute instructions. The
prefetch is important to compiler and assembly code because judicious
use of the prefetch instruction can enormously improve throughput
performance of the core. Data prefetch can be applied not only to
loops but also to any data references within a block of code. Prefetch
also applies to data writing when the memory type is enabled as write
allocate

The Intel XScale(r) core prefetch load instruction is a true prefetch
instruction because the load destination is the data or mini-data
cache and not a register. Compilers for processors which have data
caches, but do not support prefetch, sometimes use a load instruction
to preload the data cache. This technique has the disadvantages of
using a register to load data and requiring additional registers for
subsequent preloads and thus increasing register pressure. By
contrast, the prefetch can be used to reduce register pressure instead
of increasing it.

The prefetch load is a hint instruction and does not guarantee that
the data will be loaded. Whenever the load would cause a fault or a
table walk, then the processor will ignore the prefetch instruction,
the fault or table walk, and continue processing the next instruction.
This is particularly advantageous in the case where a linked list or
recursive data structure is terminated by a NULL pointer. Prefetching
the NULL pointer will not fault program flow.


People's prejudices against prefetch instructions are sometimes
traceable to the 3DNow! prefetch(w) botch, which some processors
"support" as no-ops and others are too aggressive about (Opteron
prefetches are reputed to be "strong", i. e., not dropped on DTLB
miss).  XScale gets it right.  So do most Pentium 4's using the SSE
prefetches, according to the IA-32 optimization manual.  (Oddly,
Prescott seems to have initiated a page table walk on DTLB miss during
software prefetch -- just one of many weird Prescott flaws.)  I'm
guessing Pentium M and its descendants (Core Solo and Duo) get it
right but I'm having a hell of a time finding out for sure.  Can any
of the x86 experts answer this?
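
(For reference, the "strict" linked-list pattern is just a guarded
hint.  A sketch using GCC's __builtin_prefetch, which compiles to PLD
on XScale and to the SSE prefetch hints on x86:

struct node {
        struct node *next;
        long payload;
};

static long sum_list(const struct node *n)
{
        long sum = 0;

        while (n) {
                if (n->next)            /* never prefetch the NULL terminator */
                        __builtin_prefetch(n->next, 0, 1);
                sum += n->payload;
                n = n->next;
        }
        return sum;
}

On cores that drop prefetches on DTLB miss the guard costs almost
nothing; on cores with "strong" prefetches it is what keeps you out of
trouble.)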

Cheers,
- Michael


f_owner.lock and file->pos updates

2007-03-07 Thread Michael K. Edwards

Suppose I want to create an atomic llseek+writev operation.  Is this
more or less sufficient:

        ssize_t ret = -EBADF;
        file = fget_light(fd, &fput_needed);
        if (file) {
                if (unlikely(origin > 2)) {
                        ret = -EINVAL;
                } else {
                        write_lock_irq(&file->f_owner.lock);
                        pos = vfs_llseek(file,
                                         ((loff_t) offset_high << 32) | offset_low,
                                         origin);
                        ret = (ssize_t)pos;
                        if (likely(ret >= 0)) {
                                ret = vfs_writev(file, vec, vlen, &pos);
                                file_pos_write(file, pos);
                        }
                        write_unlock_irq(&file->f_owner.lock);
                }
                fput_light(file, fput_needed);
        }

Or is this the wrong sort of lock to be using to protect against
having file->pos altered by another thread executing through the same
code?

Cheers,
- Michael


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-03-04 Thread Michael K. Edwards

On 3/4/07, Kyle Moffett <[EMAIL PROTECTED]> wrote:

Well, even this far into 2.6, Linus' patch from 2003 still (mostly)
applies; the maintenance cost for this kind of code is virtually
zilch.  If it matters that much to you clean it up and make it apply;
add an alarmfd() syscall (another 100 lines of code at most?) and
make a "read" return an architecture-independent siginfo-like
structure and submit it for inclusion.  Adding epoll() support for
random objects is as simple as a 75-line object-filesystem and a 25-
line syscall to return an FD to a new inode.  Have fun!  Go wild!
Something this trivially simple could probably spend a week in -mm
and go to linus for 2.6.22.


Or, if you want to do slightly more work and produce something a great
deal more useful, you could implement additional netlink address
families for additional "event" sources.  The socket - setsockopt -
bind - sendmsg/recvmsg sequence is a well understood and well
documented UNIX paradigm for multiplexing non-blocking I/O to many
destinations over one socket.  Everyone who has read Stevens is
familiar with the basic UDP and "fd open server" techniques, and if
you look at Linux's IP_PKTINFO and NETLINK_W1 (bravo, Evgeniy!) you'll
see how easily they could be extended to file AIO and other kinds of
event sources.

For file AIO, you might have the application open one AIO socket per
mount point, open files indirectly via the SCM_RIGHTS mechanism, and
submit/retire read/write requests via sendmsg/recvmsg with ancillary
data consisting of an lseek64 tuple and a user-provided cookie.
Although the process still has to have one fd open per actual open
file (because trying to authenticate file accesses without opening fds
is madness), the only fds it has to manipulate directly are those
representing entire pools of outstanding requests.  This is usually a
small enough set that select() will do just fine, if you're careful
with fd allocation.  (You can simply punt indirectly opened fds up to
a high numerical range, where they can't be accessed directly from
userspace but still make fine cookies for use in lseek64 tuples within
cmsg headers).
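
Submission itself is nothing more exotic than sendmsg() with
SCM_RIGHTS.  A sketch -- the request header is an invented wire
format, since the AIO-server protocol obviously doesn't exist yet:

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

struct aio_req {                /* invented wire format */
        uint64_t cookie;        /* comes back with the completion */
        uint64_t offset;        /* the lseek64 tuple */
        uint32_t length;
        uint32_t opcode;        /* e.g. 0 = read, 1 = write */
};

static int submit_aio(int aiosock, int target_fd, const struct aio_req *req)
{
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct iovec iov = { .iov_base = (void *)req, .iov_len = sizeof *req };
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = cbuf, .msg_controllen = sizeof cbuf,
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;           /* pass the target fd along */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &target_fd, sizeof(int));
        return sendmsg(aiosock, &msg, 0) == (ssize_t)sizeof *req ? 0 : -1;
}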

The same basic approach will work for timers, signals, and just about
any other event source.  Userspace is of course still stuck doing its
own state machines / thread scheduling / however you choose to think
of it.  But all the important activity goes through socketcall(), and
the data and control parameters are all packaged up into a struct
msghdr instead of the bare buffer pointers of read/write.  So if
someone else does come along later and design an ultralight threading
mechanism that isn't a total botch, the actual data paths won't need
much rework; the exception handling will just get a lot simpler.

Cheers,
- Michael


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-03-04 Thread Michael K. Edwards

Please don't take this the wrong way, Ray, but I don't think _you_
understand the problem space that people are (or should be) trying to
address here.

Servers want to always, always block.  Not on a socket, not on a stat,
not on any _one_ thing, but in a condition where the optimum number of
concurrent I/O requests are outstanding (generally of several kinds
with widely varying expected latencies).  I have an embedded server I
wrote that avoids forking internally for any reason, although it
watches the damn serial port signals in parallel with handling network
I/O, audio, and child processes that handle VoIP signaling protocols
(which are separate processes because it was more practical to write
them in a different language with mediocre embeddability).  There's a
lot of things that can block out there, not just disk I/O, but the
only thing a genuinely scalable server process ever blocks on (apart
from the odd spinlock) is a wait-for-IO-from-somewhere mechanism like
select or epoll or kqueue (or even sleep() while awaiting SIGRT+n, or
if it genuinely doesn't suck, the thread scheduler).

Furthermore, not only do servers want to block rather than shove more
I/O into the plumbing than it can handle without backing up, they also
want to throttle the concurrency of requests at the kernel level *for
the kernel's benefit*.  In particular, a server wants to submit to the
kernel a ton of stats and I/O in parallel, far more than it makes
sense to actually issue concurrently, so that efficient sequencing of
these requests can be left to the kernel.  But the server wants to
guide the kernel with regard to the ratios of concurrency appropriate
to the various classes and the relative urgency of the individual
requests within each class.  The server also wants to be able to
reprioritize groups of requests or cancel them altogether based on new
information about hardware status and user behavior.

Finally, the biggest argument against syslets/threadlets AFAICS is
that -- if done incorrectly, as currently proposed -- they would unify
the AIO and normal IO paths in the kernel.  This would shackle AIO to
the current semantics of synchronous syscalls, in which buffers are
passed as bare pointers and exceptional results are tangled up with
programming errors.  This would, in turn, make it quite impossible for
future hardware to pipeline and speculatively execute chains of AIO
operations, leaving "syslets" to a few RDBMS programmers with time to
burn.  The unimproved ease of long term maintenance on the kernel (not
to mention the complete failure to make the writing of _correct_,
performant server code any easier) makes them unworthy of
consideration for inclusion.

So, while everybody has been talking about cached and non-cached
cases, those are really total irrelevancies.  The principal problem
that needs solving is to model the process's pool of in-flight I/O
requests, together with a much larger number of submitted but not yet
issued requests whose results are foreseeably likely to be needed
soon, using a data structure that efficiently supports _all_ of the
operations needed, including bulk cancellation, reprioritization, and
batch migration based on affinities among requests and locality to the
correct I/O resources.  Memory footprint and gentle-on-real-hardware
scheduling are secondary, but also important, considerations.  If you
happen to be able to service certain things directly from cache,
that's gravy -- but it's not very smart IMHO to put that central to
your design process.

Cheers,
- Michael


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-03-02 Thread Michael K. Edwards

On 3/2/07, Davide Libenzi  wrote:

For threadlets, it might be. Now think about a task wanting to dispatch N
parallel AIO requests as N independent syslets.
Think about this task having USEDFPU set, so the FPU context is dirty.
When it returns from async_exec, with one of the requests being become
sleepy, it needs to have the same FPU context it had when it entered,
otherwise it won't prolly be happy.
For the same reason a schedule() must preserve/sync the "prev" FPU
context, to be reloaded at the next FPU fault.


And if you actually think this through, I think you will arrive at (a
subset of) the conclusions I did a week ago: to keep the threadlets
lightweight enough to schedule and migrate cheaply, they can't be
allowed to "own" their own FPU and TLS context.  They have to be
allowed to _use_ the FPU (or they're useless) and to _use_ TLS (or
they can't use any glibc wrapper around a syscall, since they
practically all set the thread-local errno).  But they have to
"quiesce" the FPU and stash any thread-local state they want to keep
on their stack before entering the next syscall, or else it'll get
clobbered.

Keep thinking, especially about FPU flags, and you'll see why
threadlets spawned from the _same_ threadlet entrypoint should all run
in the same pool of threads, one per CPU, while threadlets from
_different_ entrypoints should never run in the same thread (FPU/TLS
context).  You'll see why threadlets in the same pool shouldn't be
permitted to preempt one another except at syscalls that block, and
the cost of preempting the real thread associated with one threadlet
pool with another real thread associated with a different threadlet
pool is the same as any other thread switch.  At which point,
threadlet pools are themselves first-class objects (to use the snake
oil phrase), and might as well be enhanced to a data structure that
has efficient operations for reprioritization, bulk cancellation, and
all that jazz.

Did I mention that there is actually quite a bit of prior art in this
area, which makes a much better guide to the design of round wheels
than micro-benchmarks do?

Cheers,
- Michael


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-28 Thread Michael K. Edwards

On 2/28/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

130 lines skipped...


Yeah, I edited it down a lot before sending it.  :-)


I have only one question - wasn't it too lazy to write all that? :)


I'm pretty lazy all right.  But occasionally an interesting problem
(and revamping AIO is very interesting) makes me think, and what
little thinking I do is always accompanied by writing.  Once I've
thought something through to the point that I think I understand the
problem, I've even been known to attempt a solution.  Not always,
though; more often, I find a new interesting problem, or else I am
forcibly reminded that I should be spending my little store of insight
on revenue-producing activity.

In this instance, there didn't seem to be any harm in sending my
thoughts to LKML as I wrote them, on the off chance that Ingo or
Davide would get some value out of them in this design cycle (which
any code I eventually get around to producing will miss).  So far,
I've gotten some rather dismissive pushback from Ingo and Alan (who
seem to have no interest outside x86 and less understanding than I
would have thought of what real userspace code looks like), a "why
preach to people who know more than you do" from Davide, a brief aside
on the dominance of x86 from Oleg, and one off-list "keep up the good
work".  Not a very rich harvest from (IMHO) pretty good seeds.

In short, so far the "Linux kernel community" is upholding its
reputation for insularity, arrogance, coding without prior design,
lack of interest in userspace problems, and inability to learn from
the mistakes of others.  (None of these characterizations depends on
there being any real insight in anything I have written.)  Linus
himself has a very different reputation -- plenty of arrogance all
right, but genuine brilliance and hard work, and sincere (if cranky)
efforts to explain the "theory of operations" underlying central
design choices.  So far he hasn't commented directly on anything I
have had to say; it will be interesting to see whether he tells me to
stop annoying the pros and to go away until I have some code to
contribute.

Happy hacking,
- Michael

P. S.  I do think "threadlets" are brilliant, though, and reading
Ingo's patches gave me a much better idea of what would be involved in
prototyping Asynchronously Executed I/O Unit opcodes.


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-27 Thread Michael K. Edwards

On 2/27/07, Theodore Tso <[EMAIL PROTECTED]> wrote:

I think what you are not hearing, and what everyone else is saying
(INCLUDING Linus), is that for most programmers, state machines are
much, much harder to program, understand, and debug compared to
multi-threaded code.  You may disagree (were you a MacOS 9 programmer
in another life?), and it may not even be true for you if you happen
to be one of those folks more at home with Scheme continuations, for
example.  But it is true that for most kernel programmers, threaded
programming is much easier to understand, and we need to engineer the
kernel for what will be maintainable for the majority of the kernel
development community.


State machines are much harder to write without going through a real
on-paper design phase first.  But multi-threaded code is much harder
for a team of average working coders to write correctly, judging from
the numerous train wrecks that I've been called in to salvage over the
last ten years or so.

The typical 50-250KLoC multi-threaded C/C++/Java application, even if
it's been shipping to customers for several years, is littered with
locking constructs yet routinely corrupts shared data structures.
Change the number of threads in a pool, adjust the thread priorities,
or move a couple of lines of code around, and you're very likely to
get an unexplained deadlock.  God help you if you try to use a
debugger on it -- hundreds of latent race conditions will crop up that
didn't happen to trigger before because the thread didn't get
preempted there.

The only programming languages that I have seen widely used in US
industry (so Lisps and self-consciously functional languages are out)
in which mere mortals write passable multi-threaded applications are
Visual Basic and Python.  That's partly because programmers in these
languages are not in the habit of throwing pointers around; but if
that were all there was to it, Java programmers would be a lot more
successful than they are at actually writing threaded programs rather
than nibbling cautiously around the edges with EJB.  It also helps a
lot that strings are immutable; but again, Java shares this property.
No, the big difference is that VB and Python dicts and arrays are
thread-safed by the language runtime, and Java collections are not.
So while there may be all sorts of pointless and dangerous
mis-locking, it's "protecting" somewhat higher-level data structures.

What does this have to do with the kernel?  Well, if you're going to
create Yet Another Micro^WLightweight-Threading Construct for AIO, it
would be mighty nice not to be slinging bare pointers around until the
IO is actually complete and the kernel isn't going to be touching the
data buffer any more.  It would also be mighty nice to have a
thread-safe "request pool" data structure on which actions like bulk
cancellation and iteration over a subset can operate.  (The iterator
returned by, say, a three-sided query on an RCU priority queue may
contain _stale_ information, but never _inconsistent_ information.)
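
Something like the following is the flavor of thing I mean -- a
minimal sketch, every name in it hypothetical, not a proposal for
actual kernel entry points:

#include <stddef.h>

/* Hypothetical request-pool interface; none of these names exist
 * anywhere, they only illustrate the semantics argued for above. */
struct aio_pool;                /* opaque, internally thread-safe */
struct aio_request;

struct aio_pool *aio_pool_create(unsigned int max_outstanding);

struct aio_request *aio_pool_submit(struct aio_pool *pool, int fd,
                                    void *buf, size_t len, int prio);

/* Bulk cancellation: cancel every request the predicate matches,
 * returning the number cancelled. */
int aio_pool_cancel_if(struct aio_pool *pool,
                       int (*match)(const struct aio_request *, void *),
                       void *arg);

/* Iteration over a subset: what the callback sees may be stale,
 * but never inconsistent. */
void aio_pool_for_each(struct aio_pool *pool,
                       void (*visit)(const struct aio_request *, void *),
                       void *arg);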

I recognize that this is more object-oriented snake oil than kernel
programmers usually tolerate, but it would really help AIO-threaded
programs suck less.  It is also very much in the Unix tradition --
what are file descriptors and fd_sets if not object-oriented design?
And if following the socket model was good enough for epoll and
netlink, why not for threadlet pools?

In the best of all possible worlds, AIO would look just like the good
old socket-bind-listen-accept model, except that I/O is transacted on
the "listen" socket as long as it can be serviced from cache, and
accept() only gets a new connection when a delayed I/O arrives.  The
object hiding inside the fd returned by socket() would be the
"threadlet pool", and the object hiding inside each fd returned by
accept() would be a threadlet.  Only this time you do it right and
make errno(fd) be a vsyscall that returns a threadlet-local error
state, and you assign reasonable semantics to operations on an fd that
has already encountered an exception.  Much like IEEE 754, actually.
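
In code, the shape might be something like this -- a sketch under
heavy assumptions, with every function name invented for
illustration (errno_of() standing in for the errno(fd) vsyscall just
described):

/* Shape sketch only; threadlet_socket(), threadlet_bind(),
 * threadlet_accept(), errno_of(), recover() and finish() are all
 * made-up names. */
int threadlet_socket(void);             /* returns the "pool" fd     */
int threadlet_bind(int pool, long (*fn)(void *));
int threadlet_accept(int pool);         /* fd for each delayed I/O   */
int errno_of(int fd);                   /* threadlet-local error     */
void recover(int fd, int err);
void finish(int fd);

static void serve(long (*fn)(void *))
{
        int pool = threadlet_socket();

        threadlet_bind(pool, fn);       /* cf. socket-bind-listen    */
        for (;;) {
                int t = threadlet_accept(pool); /* delayed I/O done  */

                if (errno_of(t))
                        recover(t, errno_of(t));
                else
                        finish(t);
        }
}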

Anyway, like I said, good threaded code is quite rare.  On the other
hand, I have seen plenty of reasonably successful event-loop
programming in C and C++, mostly in MacOS and Windows and PalmOS GUIs
where the OS deals with event queues and event handler registration.
It's not the world's most CPU-efficient strategy because of all those
function pointers and virtual methods, but those costs are dwarfed by
the GUI bounding boxes and repaints and things anyway.  More to the
point, writing an event-loop framework for other people to use
involves extensive APIs that are stable in the tiniest details and
extensively documented.  Not, perhaps, Plan A for the Linux kernel
community.  :-)

Happily, a largely event-driven userspace framework can easily be
stacked on top of a threaded kernel -- as long as they're the right
kind of threads.  The right kind of threads do not proliferate malloc
aren

Re: GPL vs non-GPL device drivers

2007-02-26 Thread Michael K. Edwards

On 2/25/07, Trent Waddington <[EMAIL PROTECTED]> wrote:

On 2/26/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote:
> I know it's fun to blame everything on Redmond, but how about a
> simpler explanation?

Says the master of conspiracy.


Yes, I rather chuckled at the irony as I wrote that one.  :-)  But
there is a difference.  I've provided you with independent sources for
the basic data -- Nimmer and Corbin and dozens of appellate opinions
for the universal contract nature of copyright licenses, guidestar.org
for Form 990s, links that you can follow a couple of hops to two
actual letters of opinion that Moglen has written suggesting "GPL
circumvention" techniques, facts that you can verify about the
financial dealings surrounding the formation and acquisition of Cygnus
Solutions, the campaign to squeeze OpenSSL out of the GNUniverse, and
the unreproducibility of commercial cross-compilers built around GCC.
You don't have to believe a damn thing I say -- read the law and
follow the money trail for yourself.

If you care to know more about how the racket works, you can do your
own damn homework and see if you reach the same conclusions I did two
years ago.  Subsequent events -- the forced merger of OSDL into the
Linux Foundation, the flowering of the SFLC into the Software Freedom
Conservancy, the sunset of Oracle on Sun and rise of Unbreakable
Linux, the hocus-pocus around "license compatibility" as first Intel
and HP, then Sun, knuckle under, switch to the GPL, and start pressing
IBM and Oracle to do the same, the war of the "patent pledges" -- fit
into the same pattern.  I was disgusted then, and I haven't seen any
reason to become less so.  I don't actually enjoy this sort of
muckraking -- it's dirty work and I have better things to do.

Watch for more posturing about Microsoft/Novell -- funny how the
distro that regularly pays Eben Moglen to give keynote speeches has
been charging by the seat for years, but the distro that does a deal
with the _other_ devil is threatened with being written out of GPL v3.
Watch for -- but wait, I gave up ranting for Lent.  Watch for
anything you like, keep scanning the horizon for Moby Ballmer and his
secret deals to keep you dual-booting into Windows for your
frag-fests.  When your pet shark in law professor's clothing decides
_your_ arm would be tasty next, don't come crying to me.  This really
is the absolute last I have to say on this topic in a public forum, in
2007 anyhow.


Re: GPL vs non-GPL device drivers

2007-02-26 Thread Michael K. Edwards

On 2/25/07, Alan <[EMAIL PROTECTED]> wrote:

> Busy-wait loops were a rhetorical flourish, I grant you.

Thats a complicated fancy way of saying you were talking rubbish ?


No, it's a way of saying "yes, there are deliberate performance limits
in the driver code, but they're harder to explain than busy-wait
loops".  Ask anyone who has worked inside a peripheral device company,
especially one that sells into the OEM/ODM market.  It is
_standard_practice_ to fab a single chip design that runs as fast as
you know how, and then dumb it down in firmware to meet the spec that
the board vendor is willing to pay for.  If you don't, those guys will
steal you blind, diverting chips intended (and priced) for low-margin
commodity mobos into the high-margin gamer market.

There is, however, a gross error in my latest explanation of why ATI
and NVidia don't provide full driver source code.  Really an
embarrassing oversight.  Wouldn't blame you for ripping me a new one
over it.  Pity you were busy taking cheap shots at my tortured syntax.
Here's what I forgot:

Macrovision.

Not that any of the stuff about market segmentation, retail margins,
and the FCC isn't true, but it's not the scariest thing in a PC
graphics vendor's world.  The scariest thing in their world is the
possibility that it will become widely known how to defeat the
"Macrovision in, Macrovision out" mechanism (and the equivalent for
DVD playback and HDMI).

Just so you know I'm not making this up:  I know where the "defeat
Macrovision" bits are on certain devices, and if I told you, Jack
Valenti would personally come to my house and ream me out with a
red-hot poker.  (Alan, that's a special kind of "talking rubbish"
known as "comic hyperbole".  I learned it from your Monty Python.  :-)
Seriously though, those bits are in there, they're software
controlled (or were when I last fiddled with set-tops), and there are
large security deposits and large threats of being shut out of the
MPEG/DVD and DVI/HDMI patent pools riding on the graphics card makers'
promises to keep them hidden.  Not that they're that hard to reverse
engineer -- but they're not going to help you.

You're going to say this cat is already out of the bag; a quarter of
the teenagers in developed countries have the MaD sKiLlZ to rip DVDs
and pass them around on BitTorrent like so much 1970s kiddie porn.
And maybe that's so; but imagine next year's family PC with the HDMI
port attached to the 62" HDTV, dual-booting pirated Windows for
pirated first-person shooters and Linux for pirated DVDs, both of them
conveniently pre-installed by the neighborhood store-front beige-box
assembler.  That's the MPAA's worst nightmare -- and frankly, I'm not
too keen on it either.  I like seeing old movies lovingly restored and
published on DVD.  I'd like to see them come out someday in 16:9 720p,
preferably while my eyes are still good enough to tell the difference.

Anyway, that's more than enough purple prose.  Believe what you want to believe.


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-25 Thread Michael K. Edwards

On 2/25/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

Fundamentally a kernel thread is just its
EIP/ESP [on x86, similar on other architectures] - which can be
saved/restored in near zero time.


That's because the kernel address space is identical in every
process's MMU context, so the MMU doesn't have to be touched _at_all_.
Also, the kernel very rarely touches FPU state, and even when it
does, the FXSAVE/FXRSTOR pair is highly optimized for the "save state
just long enough to move some memory around with XMM instructions"
case.  (I know you know this; this is for the benefit of less
experienced readers.)  If your threadlet model shares the FPU state
and TLS arena among all threadlets running on the same CPU, and
threadlets are scheduled in bursts belonging to the same process (and
preferably the same threadlet entrypoint), then you will get similarly
efficient userspace threadlet-to-threadlet transitions.  If not, not.
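
(Again for those readers, the pattern in question looks roughly like
the sketch below.  kernel_fpu_begin()/kernel_fpu_end() are the real
x86 primitives; the function itself and its body are schematic, not
actual kernel source.)

#include <linux/types.h>
#include <asm/i387.h>   /* kernel_fpu_begin(), kernel_fpu_end() */

/* Schematic only: borrow the FPU just long enough to stream bytes
 * through the XMM registers. */
static void xmm_stream_copy(void *to, const void *from, size_t len)
{
        kernel_fpu_begin();     /* FXSAVEs user state only if live   */
        /* ... MOVAPS-based copy loop would go here ... */
        kernel_fpu_end();       /* user state lazily FXRSTORed later */
}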


scheduling only relates to the minimal context that is in the CPU. And
most of that context we save upon /every system call entry/, and restore
it upon every system call return. If it's so expensive to manipulate,
why can the Linux kernel do a full system call in ~150 cycles? That's
cheaper than the access latency to a single DRAM page.


That would be the magic of shadow register files.  When the software
does things that hardware expects it to do, everybody wins.  When the
software tries to get clever based on micro-benchmarks, everybody
loses.

Cheers,
- Michael


Re: GPL vs non-GPL device drivers

2007-02-25 Thread Michael K. Edwards

On 2/25/07, Alan <[EMAIL PROTECTED]> wrote:

> of other places too.  Especially when the graphics chip maker explains
> that keeping their driver source code closed isn't really an attempt
> to hide their knowledge under a bushel basket.  It's a defensive
> measure against having their retail margins destroyed by nitwits who
> take out all the busy-wait loops and recompile with -O9 to get another
> tenth of a frame per second, and then pretend ignorance when
> attempting to return their slagged GPU at Fry's.

Wrong as usual...


If this is an error, it's the first one _you've_ caught me in.  Or did
I miss something conformant to external reality in your earlier
critiques?


The Nvidia driver and ATI drivers aren't full of magical delay loops and
if there was a way to fry those cards easily in hardware the virus folks
would already be doing it. The reverse engineering teams know what is in
the existing code thank you. Creating new open source drivers from it is
however hard because of all the corner cases.


Several years ago I worked on a MIPS-based set-top prototype mated to
a graphics+HDTV board from a major PC 3-D vendor.  We had full
documentation and a fair amount of sample code under NDA.  We were on
the vendor's board spin 52 -- 52! and they'd sometimes spun the chip a
couple times internally between released boards -- before it could be
persuaded to do the documented thing with regard to video textures.
In the meantime, we frotzed a lot of boards and chips before we
decided to stick a triple-size fan to the damn thing with thermal
grease and to avoid taking any chances with new VRAM timings except in
the oversized chassis with the jet-engine fans.  Maybe things have
gotten better since then, but I doubt it.

Busy-wait loops were a rhetorical flourish, I grant you.  But software
interlocks on data movement that protect you against some "corner
case" you're unlikely to think of on your own are the norm, as is
software-assisted clock scaling guided by temperature sensors in half
a dozen places on chip and package.  You can drive enough watts
through a laptop GPU to fry an egg on it -- which is not kind to the
BGA bonds.  Yes, I have killed a laptop this way -- one that's
notorious for popping its GPU, but it's no accident that the last
thing I did to it was run a game demo that let me push my luck on the
texture quality.

It's also quite common, or used to be, for the enforcement of limits
on monitor sync timings to be done in software, and it's quite
possible to kill an underdesigned monitor or grossly exceed regulatory
emissions limits by overdriving its resolution and frame rate.  (I've
never actually blown one up myself, but I've pushed my luck
overdriving a laptop dock's DVI and gotten some lovely on-screen
sparklies and enough ultrasonics to set the neighbor's dog to
whining.)  Viruses that kill your monitor may be urban legend, but
it's a can of worms that a smart graphics vendor doesn't want to be
liable for opening.  The FCC also frowns on making it too easy for
hobbyists to turn a Class B device into a two-block-radius FM jammer.


You will instead find that both vendors stopping providing Linux related
source code at about the point they got Xbox related contracts. A matter
which has been pointed out to various competition and legal authorities
and for now where it seems to lie.


I know it's fun to blame everything on Redmond, but how about a
simpler explanation?  The technology and market for 3-D graphics is
now sufficiently mature to allow revenue maximization through market
segmentation -- in other words, charging some people more than others
for the same thing because they're willing and able to pay extra.  The
volumes are also high enough to test chips as they come out of fab and
bin them by maximum usable clock speed, just like Intel has done with
its CPUs for the last decade.  Blow a couple of dozen fuses to set a
chip ID, laser trim a couple of resistors to set a default clock
multiplier, and sell one for ten times what you get for the other.
It's the only way to survive in a mature competitive environment.

Now, suppose the silicon process for GPUs has stabilized to where chip
yields have greatly improved, and maybe 80% of the chips that work at
all are good enough to go in any but the top-of-the-line gamer
specials.  So where do they get the chips for a motherboard that sells
for $69 at Fry's?  They artificially bin them by target spec, blow
those fuses and trim those resistors, and charge what the market will
bear, niche by niche.  The driver picks up on the chip ID and limits
its in-system performance to the advertised spec, and everyone goes
home happy.
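
(Schematically -- and this is an invented illustration, not any
vendor's actual code -- the driver's side of that bargain is no more
than:)

/* Invented illustration; all names and numbers are mine. */
struct gpu { unsigned int chip_id; unsigned int max_core_mhz; };

static void apply_advertised_spec(struct gpu *g)
{
        switch (g->chip_id & 0x0f) {    /* bin ID from blown fuses */
        case 0x01:                      /* $69 commodity mobo part */
                g->max_core_mhz = 250;
                break;
        case 0x0e:                      /* top-of-the-line gamer   */
                g->max_core_mhz = 650;
                break;
        default:                        /* the mid-range bins      */
                g->max_core_mhz = 400;
                break;
        }
}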

The graphics chip vendors are not utterly stupid.  They have seen what
has happened to the retail distribution of x86 CPUs, in which
overclocker types take six mid-range chips home, see which one they
can clock into the stratosphere for 24 hours without actually setting
the motherboard on fire (they don't mind a little toxic 

Re: GPL vs non-GPL device drivers

2007-02-25 Thread Michael K. Edwards

On 2/25/07, Pavel Machek <[EMAIL PROTECTED]> wrote:

Ok, but this is not realistic. I agree that if Evil Linker only adds
two hooks "void pop_server_starting(), void pop_server_stopping()", he
can get away with that.

But... how does situation change when Evil Linker does #include
 from his
binary-only part?


There is no such thing as a "GPL header file".  There is an original
work of authorship, that is, your POP server.  There is a modified
work of authorship (not exactly a "derivative work", but let's call it
an Annotated Edition in order to bring it into the "derivative works"
category), that is, your POP server as altered by the Evil Linker in a
coherent and readable way.  There is an independent work of
authorship, the Evil Linker's program.  And there is a claim that the
independent work of authorship infringes on the original author's
copyright in the POP server.

If the sole factual basis for that claim is that the Evil Linker's
program contains #include "what_you_said.h", then it's not going to
fly in court (IANAL, TINLA).  The #include directive itself is not
protectable expression, and anything that winds up in the Evil
Linker's binary is either a "method of operation" or "fair use" under
a "minimum practical amount of copying needed to accomplish a
sanctioned purpose" standard.  Interoperation, even competitive
interoperation and circumvention of partner licensing programs, is a
sanctioned purpose under US law.  You have to go pretty far out of
your way to find a case like Atari v. Nintendo in which the court
ruled that the competitor had been too lazy or venal in its reverse
engineering methodology not to be entitled to copy the bits needed to
interoperate.


I believe situation in this case changes a lot... And that's what
embedded people are doing; I do not think they are creating their own
headers or their own inline functions where headers contain them.


Remember, I am not defending people who hack the kernel and then don't
release the source code to their hacked kernel.  (I'm not really
defending people who hack the kernel and write a closed-source
application locked to those hacks, either, although I am saying that
calling such tactics "illegal" or "GPL violation" is irrelevant and
wrong-headed.)  And when what they in fact did was to riffle shuffle
together two drivers from Linus's tarball and change the names of the
symbolic constants to match their hardware interface (itself doubtless
a half-assed clone of someone else's), there's no excuse for not
GPLing the result.

But a 20KLoC 3-D graphics driver that happens to #include
 is not thereby a "derivative work" of the kernel,
no matter how many entrypoints are labeled EXPORT_SYMBOL_GPL or
provided as inline functions.  And under the Lexmark standard,
MODULE_LICENSE("GPL") with a disclaimer in the doco is by far the
sanest way to deal with the EXPORT_SYMBOL_GPL mind games, and likely
(IANAL, TINLA) to be endorsed by any court in the US and probably lots
of other places too.  Especially when the graphics chip maker explains
that keeping their driver source code closed isn't really an attempt
to hide their knowledge under a bushel basket.  It's a defensive
measure against having their retail margins destroyed by nitwits who
take out all the busy-wait loops and recompile with -O9 to get another
tenth of a frame per second, and then pretend ignorance when
attempting to return their slagged GPU at Fry's.

Cheers,
- Michael


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-24 Thread Michael K. Edwards

On 2/24/07, Davide Libenzi  wrote:

Ok, roger that. But why are you playing "Google & Preach" games to Ingo,
that ate bread and CPUs for the last 15 years?


Sure I used Google -- for clickable references so that lurkers can
tell I'm not making these things up as I go along.  Ingo and Alan have
obviously forgotten more about x86en than I will ever know, and I'm
carrying coals to Newcastle when I comment on pros and cons of XMM
memcpy.

But although the latest edition of the threadlet patches actually has
quite good internal documentation and makes most of its intent clear
even to a reader (me) who is unfamiliar with the code being patched,
it lacks "theory of operations".  How is an arch maintainer supposed
to adapt this interface to a completely different CPU, with different
stuff in pt_regs and different cost profiles for blown pipelines and
reloaded coprocessor state?  What are the hidden costs of this
particular style of M:N microthreading, and will they explode when
this model escapes out of the microbenchmarks and people who don't
know CPUs inside and out start using it?  What standard
thread-pool-management use cases are being glossed over at kernel
level and left to Ulrich (or implementors of JVMs and other bytecode
machines) to sort out?

At some level, I'm just along for the ride; nobody with any sense is
going to pay me to design this sort of thing, and the level of effort
involved in coding an alternate AIO implementation is not something I
can afford to expend on non-revenue-producing activities even if I did
have the skill.  Maybe half of my quibbles are sheer stupidity and
four out of five of the rest are things that Ingo has already taken
account in v4 of his patch set.  But that would leave one quibble in
ten that has some substance, which might save some nasty rework down
the line.  Even if everything I ask about has a simple explanation,
and all that comes of Alan and Ingo wasting time spelling it out for
me is an accelerated "theory of operation" document, would that be a
bad thing?

Now I know very little about x86_64 other than that 64-bit code not
only has double-size integer registers to work with, it has twice as
many of them.  So for all I know the transition to pure-64-bit 2-4
core x 2-4 thread/core systems, which is going to be 90% or more of
the revenue-generating Linux market over the next few years, makes all
of my concerns moot for Ingo's purposes.  After all, as long as Linux
stays good enough to keep Oracle from losing confidence and switching
to Darwin or something, the 100 or so people who earn invites to the
kernel summit have cushy jobs for life.

The rest of us would perhaps like for major proposed kernel overhauls
to be accompanied by some kind of analysis of their impact on arches
that live elsewhere in CPU parameter space.  That analysis might
suggest small design refinements that make Linux AIO scale well on the
class of processors I'm interested in, too.  And I personally would like
to see Ingo get that Turing award for designing AIO semantics that are
as big an advance over the past as IEEE 754 was over its predecessors.
He'd have to earn it, though.

Cheers,
- Michael


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-24 Thread Michael K. Edwards

On 2/23/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

> This is a fundamental misconception. [...]

> The scheduler, on the other hand, has to blow and reload all of the
> hidden state associated with force-loading the PC and wherever your
> architecture keeps its TLS (maybe not the whole TLB, but not nothing,
> either). [...]

please read up a bit more about how the Linux scheduler works. Maybe
even read the code if in doubt? In any case, please direct kernel newbie
questions to http://kernelnewbies.org/, not [EMAIL PROTECTED]


This is not the first kernel I've swum around in, and I've been
mucking with the Linux kernel since early 2.2 and coding assembly for
heavily pipelined processors on and off since 1990.  So I may be a
newbie to your lingo, and I may even be a loud-mouthed idiot, but I'm
not a wet-behind-the-ears undergrad, OK?

Now, I've addressed the non-free-ness of a TLS swap elsewhere; what
about function pointers in state machines (with or without flipping
"supervisor mode" bits)?  Just because loading the PC from a data
register is one opcode in the instruction stream does not mean that it
is not quite expensive in terms of blown pipeline state and I-cache
stalls.  Really fast state machines exploit PC-relative branches that
really smart CPUs can speculatively execute past (after a few
traversals) because there are a small number of branch targets
actually hit.  The instruction prefetch / scheduler unit actually
keeps a table of PC-relative jump instructions found in I-cache, with
a little histogram of destinations eventually branched to, and
speculatively executes down the top branch or two.  (Intel Pentiums
have a fairly primitive but effective variant of this; see
http://www.x86.org/articles/branch/branchprediction.htm.)

More general mechanisms are called "branch target buffers" and US
Patent 6609194 is a good hook into the literature.  A sufficiently
smart CPU designer may have figured out how to do something similar
with computed jumps (add pc, pc, foo), but odds are high that it cuts
out when you throw function pointers around.  Syscall dispatch is a
special and heavily optimized case, though -- so it's quite
conceivable that a well designed userland switch/case state machine
that makes syscalls will outperform an in-kernel state machine data
structure traversal.  If this doesn't happen to be true on today's
desktop, it may be on tomorrow's desktop or today's NUMA monstrosity
or embedded mega-multi-MIPS.
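
To put the two dispatch styles side by side (a toy of my own
construction, not anyone's production state machine):

/* The switch compiles to a tabulated PC-relative jump that branch
 * prediction hardware can learn; the second loop forces an indirect
 * branch through a data register on every step. */
enum state { READ_HDR, READ_BODY, REPLY, DONE };

static enum state step(enum state s)
{
        switch (s) {                    /* PC-relative jump table */
        case READ_HDR:  return READ_BODY;
        case READ_BODY: return REPLY;
        case REPLY:     return DONE;
        default:        return DONE;
        }
}

typedef void *(*handler)(void);         /* returns the next handler */

static void run(handler h)
{
        while (h)
                h = (handler)h();       /* PC loaded from a register */
}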

There can also be other reasons why tabulated PC-relative jumps and
immediate PC loads are faster than PC loads from data registers.
Take, for instance, the Transmeta Crusoe, which (AIUI) used a trick
similar to the FX!32 x86 emulation on Alpha/NT.  If you're going to
"translate" CISC to RISC on the fly, you're going to recognize
switch/case idioms (including tabulated PC-relative branches), and fix
up the translated branch table to contain offsets to the
RISC-translated branch targets.  So the state transitions are just as
cheap as if they had been compiled to RISC in the first place.  Do it
with function pointers, and the execution machine is going to have
to stall while it looks up the text location to see if it has it
translated in I-cache somewhere.  Guess what:  the PIV works the same
way (http://www.karbosguide.com/books/pcarchitecture/chapter12.htm).

Are you starting to get the picture that syslets -- clever as they
might have been on a VAX -- defeat many of the mechanisms that CPU and
compiler architects have negotiated over decades for accelerating real
code?  Especially now that we have hyper-threaded CPUs (parallel
instruction decode/issue units sharing almost all of their cache
hierarchy), you can almost treat the kernel as if it were microcode
for a syscall coprocessor.  If you try to migrate application code
across the syscall boundary, you may perform well on micro-benchmarks
but you're storing up trouble for the future.

If you don't think this kind of fallout is real, talk to whoever had
the bright idea of hijacking FPU registers to implement memcpy in
1996.  The PIII designers rolled over and added XMM so
micro-optimizers would get their dirty mitts off the FPU, which it
appears that Doug Ledford and Jim Blandy duly acted on in 1999.  Yes,
you still need to use FXSAVE/FXRSTOR when you want to mess with the
XMM stuff, but the CPU is smart enough to keep a shadow copy of all
the microstate that the flag states represent.  So if all you do
between FXSAVE and FXRSTOR is shlep bytes around with MOVAPS, the
FXRSTOR costs you little or nothing.  What hurts is an FXRSTOR from a
location that isn't the last location you FXSAVEd to, or an FXRSTOR
after actual FP arithmetic instructions have altered status flags.

The preceding may contain errors in detail -- I am neither a CPU
architect nor an x86 compiler writer nor even a serious kernel hacker.
But hopefully it's at least food for thought.  If not, you know where
the "ignore this prolix nitwit" key is to be found on yo

Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-23 Thread Michael K. Edwards

On 2/23/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote:

which costs you a D-cache stall.)  Now put an sprintf with a %d in it
between a couple of the syscalls, and _your_ arch is hurting.  ...


er, that would be a %f.  :-)

Cheers,
- Michael


Re: Serial related oops

2007-02-23 Thread Michael K. Edwards

Russell, thanks again for offering to look at this; the more oopses
and soft lockups I see on this board, the more I think you're right
and we have an IRQ handling race.

Here's the struct irqchip setup:

/* mask irq; refer to section 2.6 of the chip 8618 document */
static void mv88w8xx8_mask_irq(unsigned int irq)
{
   MV88W8XX8_REG_WRITE(MV88W8XX8_INT_ENABLE_CLR,(1 << irq));
}

/* unmask irq; refer to section 2.6 of the chip 8618 document */
static void mv88w8xx8_unmask_irq(unsigned int irq)
{
   MV88W8XX8_REG_WRITE(MV88W8XX8_INT_ENABLE_SET,(1 << irq));
}

/* ack the CPU interrupt and also the individual timer interrupts */
static void mv88w8xx8_mask_ack_irq(unsigned int irq)
{
   mv88w8xx8_mask_irq(irq);

   if (irq < IRQ_TIMER1 || irq > IRQ_TIMER4) return;

   /* write 0 to clear the interrupt and re-enable further interrupts */
   MV88W8XX8_REG_WRITE(MV88W8XX8_TIMER_INT_SOURCE, ~(1<<(irq-4)));
}

static struct irqchip mv88w8xx8_chip = {
   .ack    = mv88w8xx8_mask_ack_irq,
   .mask   = mv88w8xx8_mask_irq,
   .unmask = mv88w8xx8_unmask_irq,
};

/**
 * called by core.c to initialize the IRQ module
 */
void mv88w8xx8_init_irq(void)
{
   int irq;

   for (irq = 0; irq < NR_IRQS; irq++) {
   set_irq_chip(irq, &mv88w8xx8_chip);
   set_irq_handler(irq, do_level_IRQ);
   set_irq_flags(irq, IRQF_VALID | IRQF_PROBE);
   }
}


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-23 Thread Michael K. Edwards

I wrote:

(On a pre-EABI ARM, there is even a substantial
cache-related penalty for encoding the syscall number in the syscall
opcode, because you have to peek back at the text segment to see it,
which costs you a D-cache stall.)


Before you say it, I'm aware that this is not directly relevant to TLS
switch costs, except insofar as the "arch-dependent syscalls"
introduced for certain parts of ARM TLS handling carry the same
overhead as any other syscall.  My point is that the system impact of
seemingly benign operations is not always predictable even to the arch
experts, and therefore one should be "parsimonious" (to use Kahan's
word) in defining what semantics programmers may rely on in
performance-critical situations.

If you arrange things so that threadlets are scheduled as much as
possible in bursts that share the same processor context (process
context, location in program text, TLS arena, FPU state -- basically
everything other than stack and integer registers), you are giving
yourself and future designers the maximum opportunity for exploiting
hardware optimizations.  This would be a good thing if you want
threadlets to be performance-competitive with state machine designs.

If you still allow application programmers to _use_ shared processor
state, in the knowledge that it will be clobbered on threadlet switch,
then threadlets can use most of the coding style with which
programmers of event-driven frameworks are familiar.  This would be a
good thing if you want threadlets to get wider use than the innards of
three or four databases and web servers.

Cheers,
- Michael


Re: [rfc][patch] dynamic resizing dentry hash using RCU

2007-02-23 Thread Michael K. Edwards

On 2/23/07, Zach Brown <[EMAIL PROTECTED]> wrote:

I'd love to see a generic implementation of RCU hashing that
subsystems can then take advantage of.  It's long been on the fun
side of my todo list.  The side I never get to :/.


There's an active thread on netdev about implementing an RCU hash.
I'd suggest a 2-left (or possibly even k-left) hash for statistical
reasons discussed briefly there, and in greater depth in a paper by
Michael Mitzenmacher at
www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/iproute.ps.
Despite his paper's emphasis on hardware parallelism, there's a bigger
win associated with Poisson statistics and decreasing occupation
fraction (and therefore collision probability) in successive hashes.
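
The core idea fits in a few lines of C (a schematic sketch of the
2-left case, not proposed kernel code):

#include <stddef.h>

#define NBUCKETS 1024

/* Two independent hash functions index two sub-tables; insert into
 * the less loaded bucket, breaking ties to the left.  Lookups probe
 * both buckets. */
struct bucket { size_t count; /* chain head would live here */ };

static struct bucket left[NBUCKETS], right[NBUCKETS];

static struct bucket *insert_bucket(unsigned long key,
                                    unsigned long (*h1)(unsigned long),
                                    unsigned long (*h2)(unsigned long))
{
        struct bucket *l = &left[h1(key) % NBUCKETS];
        struct bucket *r = &right[h2(key) % NBUCKETS];

        return (r->count < l->count) ? r : l;   /* ties go left */
}

The ties-go-left asymmetry is what keeps the right-hand table sparse,
which is where the decreasing-occupation-fraction win comes from.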

Cheers,
- Michael


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-23 Thread Michael K. Edwards

Thanks for taking me at least minimally seriously, Alan.  Pretty
generous of you, all things considered.

On 2/23/07, Alan <[EMAIL PROTECTED]> wrote:

That example touches back into user space, but doesnt involve MMU changes
or cache flushes, or tlb flushes, or floating point.


True -- on an architecture where a change of TLS does not
substantially affect the TLB and cache, which (AIUI) it does on most
or all ARMs.  (On a pre-EABI ARM, there is even a substantial
cache-related penalty for encoding the syscall number in the syscall
opcode, because you have to peek back at the text segment to see it,
which costs you a D-cache stall.)  Now put an sprintf with a %d in it
between a couple of the syscalls, and _your_ arch is hurting.  Deny
the userspace programmer the use of the FPU in threadlets, and they
become a lot less widely applicable -- and a lot flakier in a
non-wizard's hands, given that people often cheat around the small
number of x86 integer registers by using FP registers when copying
memory in bulk.


errno is thread specific if you use it but errno is as I said before
entirely a C library detail that you don't have to suffer if you don't
want to. Avoiding that saves a segment register load - which isn't too
costly but isn't free.


On your arch, it's a segment register -- and another
who-knows-how-many pages to migrate along with the stack and pt_regs.
On ARM, it's a coprocessor register that is incorrectly emulated by
most JTAG emulators (so bye-bye JTAG-assisted debugging and
profiling), or possibly a register stolen from the general purpose
register set.  On some MIPSes I have known you probably can't
implement TLS safely without a cache flush.

If you tell people up front not to touch TLS in threadlets -- which
means not to use routines from  and  -- then
implementors may have enough flexibility to make them perform well on
a wide range of architectures.  Alternately, if there are some things
that threadlet users will genuinely need TLS for, you can tell them
that all of the threadlets belonging to process X on CPU Y share a TLS
context, and therefore things like errno can't be trusted across a
syscall -- but then you had better make fairly sure that threadlets
aren't preempted by other threadlets in between syscalls.  Similar
arguments apply to FPU state.
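
Concretely, the kind of code that breaks under a shared TLS arena
(an illustrative fragment of mine, not anybody's patch):

#include <errno.h>
#include <unistd.h>

/* Under per-CPU-shared TLS, another threadlet may scribble on errno
 * between the syscall return and the load below, so even immediate
 * capture is not safe. */
static ssize_t checked_write(int fd, const void *buf, size_t len)
{
        ssize_t n = write(fd, buf, len);

        if (n < 0)
                return -errno;  /* racy if the TLS arena is shared */
        return n;
}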

IEEE 754.  Harp, harp.  :-)

Cheers,
- Michael


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-23 Thread Michael K. Edwards

On 2/23/07, Alan <[EMAIL PROTECTED]> wrote:

> Do you not understand that real user code touches FPU state at
> unpredictable (to the kernel) junctures?  Maybe not in a database or a

We don't care. We don't have to care. The kernel threadlets don't execute
in user space and don't do FP.


Blocked threadlets go back out to userspace, as new threads, after the
first blocking syscall completes.  That's how Ingo described them in
plain English, that's how his threadlet example would have to work,
and that appears to be what his patches actually do.


> web server, but in the GUIs and web-based monitoring applications that
> are 99% of the potential customers for kernel AIO?  I have no idea
> what a %cr3 is, but if you don't fence off thread-local stuff from the

How about you go read the intel architecture manuals then you might know
more.


Y'know, there's more to life than x86.  I'm no MMU expert, but I know
enough about ARM TLS and ptrace to have fixed ltrace -- not that that
took any special wizardry, just a need for it to work and some basic
forensic skill.  If you want me to go away completely or not follow up
henceforth on anything you write, say so, and I'll decide what to do
in response.  Otherwise, you might consider evaluating whether there's
a way to interpret my comments so that they reflect a perspective that
does not overlap 100% with yours rather than total idiocy.


Last time I checked glibc was in userspace and the interface for kernel
AIO is a matter for the kernel so errno is irrelevant, plus any
threadlets doing system calls will only be living in kernel space anyway.


Ingo's original example code:

long my_threadlet_fn(void *data)
{
  char *name = data;
  int fd;

  fd = open(name, O_RDONLY);
  if (fd < 0)
  goto out;

  fstat(fd, &stat);
  read(fd, buf, count)
  ...

out:
  return threadlet_complete();
}

You're telling me that runs entirely in kernel space when open()
blocks, and doesn't touch errno if fstat() fails?  Now who hasn't read
the code?

Cheers,
- Michael


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-23 Thread Michael K. Edwards

OK, having skimmed through Ingo's code once now, I can already see I
have some crow to eat.  But I still have some marginally less stupid
questions.

Cachemiss threads are created with CLONE_VM | CLONE_FS | CLONE_FILES |
CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM.  Does that mean they
share thread-local storage with the userspace thread, have
thread-local storage of their own, or have no thread-local storage
until NPTL asks for it?

When the kernel zeroes the userspace stack pointer in
cachemiss_thread(), presumably the allocation of a new userspace stack
page is postponed until that thread needs to resume userspace
execution (after completion of the first I/O that missed cache).  When
do you copy the contents of the threadlet function's stack frame into
this new stack page?

Is there anything in a struct pt_regs that is expensive to restore
(perhaps because it flushes a pipeline or cache that wasn't already
flushed on syscall entry)?  Is there any reason why the FPU context
has to differ among threadlets that have blocked while executing the
same userspace function with different stacks?  If the TLS pointer
isn't in either of these, where is it, and why doesn't
move_user_context() swap it?

If you set out to cancel one of these threadlets, how are you going to
ensure that it isn't holding any locks?  Is there any reasonable way
to implement a userland finally { } block so that you can release
malloc'd memory and clean up application data structures?
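
(The closest existing userland idiom I know of is the POSIX
cleanup-handler pair -- pthread_cleanup_push()/pop() are real
pthreads APIs; whether threadlet cancellation could honor something
like them is exactly what I'm asking:)

#include <pthread.h>
#include <stdlib.h>

static void release(void *p)
{
        free(p);                /* runs if the thread is cancelled */
}

static void *worker(void *arg)
{
        char *buf = malloc(4096);

        (void)arg;
        pthread_cleanup_push(release, buf);
        /* ... blocking calls that are cancellation points ... */
        pthread_cleanup_pop(1); /* nonzero: run the handler now    */
        return NULL;
}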

If you want to migrate a threadlet to another CPU on syscall entry
and/or exit, what has to travel other than the userspace stack and the
struct pt_regs?  (I am assuming a quiesced FPU and thread(s) at the
destination with compatible FPU flags.)  Does it make sense for the
userspace stack page to have space reserved for a struct pt_regs
before the threadlet stack frame, so that the entire userspace
threadlet state migrates as one page?

I now see that an effort is already made to schedule threadlets in
bursts, grouped by PID, when several have unblocked since the last
timeslice.  What is the transition cost from one threadlet to another?
Can that transition cost be made lower by reducing the amount of
state that belongs to the individual threadlet vs. the pool of
cachemiss threads associated with that threadlet entrypoint?

Generally, is there a "contract" that could be made between the
threadlet application programmer and the implementation which would
allow, perhaps in future hardware, the kind of invisible pipelined
coprocessing for AIO that has been so successful for FP?

I apologize for having adopted a hostile tone in a couple of previous
messages in this thread; remind me in the future not to alternate
between thinking about code and about the FSF.  :-)  I do really like
a lot of things about the threadlet model, and would rather not see it
given up on for network I/O and NUMA systems.  So I'm going to
reiterate again -- more politely this time -- the need for a
data-structure-centric threadlet pool abstraction that supports
request throttling, reprioritization, bulk cancellation, and migration
of individual threadlets to the node nearest the relevant I/O port.

I'm still not sold on syslets as anything userspace-visible, but I
could imagine them enabling a sort of functional syntax for chaining
I/O operations, with most failures handled as inline "Not-a-Pointer"
values or as "AEIOU" (asynchronously executed I/O unit?) exceptions
instead of syscall-test-branch-syscall-test-branch.  Actually working
out the semantics and getting them adopted as an IEEE standard could
even win someone a Turing award.  :-)

Cheers,
- Michael


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-22 Thread Michael K. Edwards

On 2/22/07, Alan <[EMAIL PROTECTED]> wrote:

> to do anything but chase pointers through cache.  Done right, it
> hardly even branches (although the branch misprediction penalty is a
> lot less of a worry on current x86_64 than it was in the
> mega-superscalar-out-of-order-speculative-execution days).  It's damn

Actually it costs a lot more on at least one vendors processor because
you stall very long pipelines.


You're right; I overreached there.  I haven't measured branch
misprediction penalties in dog's years (I focus more on system latency
issues these days), so I'm just going on rumor.  If your CPU vendor is
still playing the tune-for-SpecINT-at-the-expense-of-real-code game
(*cough* Itanic *cough*), get another CPU vendor -- while you still
can.


> threadlets promise that they will not touch anything thread-local, and
> that when the FPU is handed to them in a specific, known state, they
> leave it in that same state.  (Some of the flags can be

We don't use the FPU in the kernel except in very weird cases where it
makes an enormous performance difference. The threadlets also have the
same page tables so they have the same %cr3 so its very cheap to switch,
basically a predicted jump and some register loads


Do you not understand that real user code touches FPU state at
unpredictable (to the kernel) junctures?  Maybe not in a database or a
web server, but in the GUIs and web-based monitoring applications that
are 99% of the potential customers for kernel AIO?  I have no idea
what a %cr3 is, but if you don't fence off thread-local stuff from the
threadlets you are just begging for end-user Heisenbugs and a place in
the dustheap of history next to Symbolics LISP.


> Do me a favor.  Do some floating point math and a memcpy() in between
> syscalls in the threadlet.  Actually fiddle with errno and the FPU

We don't have an errno in the kernel because its a stupid idea. Errno is
a user space hack for compatibility with 1970's bad design. So its not
relevant either.


Dude, it's thread-local, and the glibc wrapper around most synchronous
syscalls touches it.  If you don't instantiate a new TLS context (or
whatever the right lingo for that is) for every threadlet, you are
TOAST -- if you let the user call stuff out of  (let alone
) from within the threadlet.

Cheers,
- Michael


Re: GPL vs non-GPL device drivers

2007-02-22 Thread Michael K. Edwards

On 2/22/07, Alan <[EMAIL PROTECTED]> wrote:

> Oh yeah?  For IRIX in 1991?  Or for that matter, for Linux/ARM EABI
> today?  Tell me, how many of what sort of users do you support

Solaris (NTL - very large ISP/Telco), Dec server 5000 (for fun), Irix (and
linux cross for Irix removal), MIPS embedded (including the port to Linux
of Algorithmics toolchain) for Sonix then 3COM routers.


My list of GNUs maintained is about the same:  SunOS 4.x, Solaris 2.x,
IRIX, ConvexOS, embedded MIPS and ARM and x86.  I've used, but didn't
maintain, GCC for embedded PowerPC and m68k, and until I found a
distro I could more or less trust to be point fixable, I did my own
desktop/server Linux toolchains for x86, PowerPC, and x86_64.  The
only one for which I resorted to coughing up the university's money to
the FSF was IRIX, and that's because it had to reach functional parity
with the Sun and Convex boxes pronto.


It's not a hard problem. gcc 2.x wasn't too hot on MIPS but it worked
although the Irix compiler generated vastly better code (and AFAIK still
does).


It worked until you tried a 64-bit target or stressed the floating
point.  I had one of the first R4400s that ever left SGI, in an Indigo
with the IndigoVideo board when it was still in alpha.  Part of the
horse-trade between the university and the start-up I worked for was
that they got to run batch jobs on the thing when I wasn't physically
at the keyboard.  I had built several experimental toolchains for the
thing but concluded rapidly that I didn't want to tech-support that
shit.  Best $5K of someone else's money I ever spent.


There are folks who maintain cross devel chains for just about every
Linux platform specifically for testing and while it isn't a small job
they do seem to be coping quite happily.


Er, I'm one of them.  :-)  When the ARM-based device I'm currently
working on first ships as an out-of-form-factor prototype to OEM
customers, it will be accompanied by a complete toolchain, kernel, and
userland, built from scratch using crosstool and ptxdist and extensive
patches I wrote, all of them to be contributed upstream because I
convinced my client that it's the right thing to do.  This includes
the latest upstream editions of each userland component, a gdbserver
that has been tested on multi-threaded soft-float NPTL binaries, the
first (public) ltrace to work correctly on Linux/ARM in at least three
years, the first (public) strace to understand ALSA ioctls, and
infrastructure for unit testing and system latency analysis.

It will be delivered as a set of git repositories with the complete
development history and tracking branches for outside projects, and
the only bits that aren't open source will be those encumbered by
inescapable trade secret agreements with chip vendors.  With the
exception of those closed binaries, everything from soup to nuts is
exactly reproducible from source on any Linux distro with a moderately
current native toolchain and autotools.  Before the first unit ships
retail, these git repositories will be carefully scrubbed of
encumbered material and opened to the public.  Pull one git repository
and run one script, and a few hours later you have your own JFFS2
image that you can burn to flash using facilities we will leave in
U-Boot for end-users' benefit.

Absolutely everything in the system can be point-fixed and recompiled
by the end user, with results as predictable and reproducible as I
know how to make them for myself.  Updates from third-party upstreams
can be merged using the tools that I believe to be best-in-class for
the purpose and use myself, daily.  No binary ever ships without
passing through an autobuild and unit test framework that I provide as
part of the end user source code release.  That's my personal standard
of release quality.  Now tell me, how does that compare to your
employer's?


> CodeSourcery and MontaVista and Red Hat stay in business?  Not with
> the quality of their code or their customer service, I'll tell you
> that -- although Mark Mitchell is probably the best release manager

Lots of people would disagree with you about that (and independant
surveys would back the disagreement), like they would disagree with you
about most things.


I have never particularly feared being in the minority, as long as I'm
right.  :-)  But seriously, if you haven't heard the complaints about
unreproducibility of Red Hat toolchains going back to the "GCC 2.96"
debacle, you haven't been listening -- and MontaVista became notorious
in the industry for deliberately mucking with kernel APIs as a vendor
lock-in tactic.  (They appear to have reformed substantially since the
2.4.x days.)  I don't know Mark personally but he appears to be as
open about CodeSourcery's processes and priorities as any toolchain
vendor has ever been, and GCC 4.1.2 looks like it's going to be as
stable as any upstream GCC release has ever been and perform decently
as well, so I don't have much to complain about in that department.
YMMV.

Re: GPL vs non-GPL device drivers

2007-02-22 Thread Michael K. Edwards

On 2/22/07, Alan <[EMAIL PROTECTED]> wrote:

> compiler people caught on to the economic opportunity.  Ever pay $5K
> for a CD full of GNU binaries for a commercial UNIX?  I did, after

No because I just downloaded them. Much easier and since they are GPL I
was allowed to do so, then rebuilt them all which took about 30 minutes
of brain time and a day of CPU time.


Oh yeah?  For IRIX in 1991?  Or for that matter, for Linux/ARM EABI
today?  Tell me, how many of what sort of users do you support
singlehandedly in an environment where you are a minority architecture
and everyone else takes the GNU tools for granted?  God knows I've got
better things to do with my time than roll the compiler-flag dice
again and again trying to get a sketchy GCC port not to ICE or, worse,
generate subtly broken code.  If it's so bloody easy, how do
CodeSourcery and MontaVista and Red Hat stay in business?  Not with
the quality of their code or their customer service, I'll tell you
that -- although Mark Mitchell is probably the best release manager
that any GNU project has ever had.


Oh, please.  Steve Ballmer doesn't waste his time trying to explain to
overpaid GNU/Linux Morlocks (among which I number myself) how the
legal and economic underpinnings of their industry work and why they
shouldn't RAPE THEMSELVES playing stupid EXPORT_SYMBOL_GPL games and
cryptographically signing kernel modules.  But don't worry -- neither
do I beyond a couple of weeks after something really dunderheaded sets
me off, and in the meantime, you know where to find your keyboard's
"stick my fingers in my ears and shout la-la-la-I-can't-hear-you" key.
:-)

Cheers,
- Michael


Re: GPL vs non-GPL device drivers

2007-02-22 Thread Michael K. Edwards

On 2/22/07, D. Hazelton <[EMAIL PROTECTED]> wrote:

> If you take the microsoft windows source code and compile it yourself
> believe me you will get sued if you ship the resulting binaries and you
> will lose in court.


"misappropriation of trade secrets" as well as copyright infringement


But that's because it is *WINDOWS*, which, unless specifically granted to you,
does not include a transfer of the right to distribute in *ANY* form. Every
PC manufacturer that wants to distribute Windows on new machines they produce
*MUST* sign an agreement with M$. As I have never seen any of those
agreements I cannot state what the terms are and whether they are different
for each company holding such a license.


contract in personam, creating causes of action for breach of contract
(for which remedies of specific performance are available) and
strengthening the case for misappropriation of trade secrets


And unless you've signed a licensing agreement over the source code to
Windows, you're more than likely to have another lawsuit on your hands for
possessing it.


Not for possessing it.  For acquiring it unlawfully, and for doing
things with it that violate M$ copyright.



> I would also note that the FSF makes no claim about dynamic v static
> linking, merely about derivative works - which is the boundary the law is
> interested in. Indeed the GPLv2 was written in the days where dynamic
> linking was quite novel which is one reason the LGPL talks about


The FSF makes lots of such claims, all the time, and Eben Moglen uses
them to finesse his letter-of-opinion / affidavit racket, along with
the fork/exec fetish.  Fluendo.  Vidomi.  XCode.  Tornado.  NeXT.
Progress Software.


> "For example, if you distribute copies of the library, whether gratis
>  or for a fee, you must give the recipients all the rights that we gave
>  you.  You must make sure that they, too, receive or can get the source
>  code.  If you link a program with the library, you must provide
>  complete object files to the recipients so that they can relink them
>  with the library, after making changes to the library and recompiling
>  it.  And you must show them these terms so they know their rights."

Eh? Complete *object* files so that after making changes and recompiling they
can relink it? Umm... I don't know about you, but that makes me laugh. What
is the purpose of providing "Complete Object Files" to everyone if they are
just going to recompile and relink the library?


Complete object files for the rest of the non-GPL application.  This
applies principally to static linking.  It also makes it much easier
to reverse engineer the application, because unless the person
jockeying the linker is really, really good, all of the interfaces
between the application components are visible with nm and objdump,
and you can use tools like ltrace to watch the calling sequences
between modules.  Selective symbol stripping and obfuscation on
partially linked binaries takes skill.


> Flex is more complex because the resulting binary contains both compiled
> work of yours and a support library of FSF owned code (-lfl).


No, flex is simpler because libfl is obviously a separate work of
authorship with a stable external interface, and the application that
links against it is not a derivative work of any of the creative
expression inside libfl.


Copyright *doesn't* extend to compiled code. It cannot, because compiled code
is a machine generated translation. A machine generated translation isn't the
product of a creative process. And you can also provide all the routines
normally provided by the support library. This means that the support library
is *NOT* a necessary part of the system.


That's rubbish.  Copyright in compiled code is very nearly identical
to copyright in the source code from which it was generated; see
references in Lexmark, especially the seminal Altai case.  Copyright
in silicon is _not_ identical to copyright in the RTL from which it
was synthesized -- the term of protection for a "mask work" is limited
to 10 years -- 2 if it's not registered properly.  This prima facie
bizarre situation reflects the difference in national origin and
lobbying power between software and hardware makers, as well as the
greater difficulty of extracting the theory of operation of a complex
chip using an electron microscope.  The chip design oligopoly is bound
by a web of contractual covenants not to do this anyway, and they like
being able to snap up key people from upstart maverick companies and
rip off their designs without pesky legal interference.


> The non
> computing analogy here is the difference between using a paint program to
> create a work, and using a paint program to create a work but also
> including other artwork (eg clipart).


Not really.  Using stock images to illustrate a manuscript requires a
license to copy and distribute them but rarely, if ever, creates a
derivative work.


Yes, but in both cases the result is *CLEARLY* the resul

Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-22 Thread Michael K. Edwards

On 2/22/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

Secondly, even assuming lots of pending requests/async-threads and a
naive queueing model, an open request will eat up resources on the
server no matter what.


Another fundamental misconception.  Kernel AIO is not for servers.
One programmer in a hundred is working on a server codebase, and one
in a thousand dares to touch server plumbing.  Kernel AIO is for
clients, especially when mated to GUIs with an event delivery
mechanism.  Ask yourself why the one and only thing that Windows NT
has ever gotten right about networking is I/O completion ports.
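
To make the client-side picture concrete, something like this is all
I'm asking the plumbing to support well -- a minimal sketch against
the existing libaio interface (io_setup/io_submit/io_getevents), with
setup and error handling elided, polled from the same loop that
services GUI events:

#include <libaio.h>
#include <stdio.h>
#include <time.h>

/* Sketch only: drain whatever AIO completions are ready without
 * blocking, so a GUI loop can interleave them with user events. */
static int drain_completions(io_context_t ctx)
{
        struct io_event ev[32];
        struct timespec zero = { 0, 0 };        /* poll, don't block */
        int i, n;

        n = io_getevents(ctx, 0, 32, ev, &zero);
        for (i = 0; i < n; i++)
                /* hand each completed iocb to the event loop here */
                printf("iocb %p done, res=%ld\n",
                       (void *)ev[i].obj, (long)ev[i].res);
        return n;
}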

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-22 Thread Michael K. Edwards

On 2/22/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

> It is not a TUX anymore - you had 1024 threads, and all of them will
> be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> a machine.

maybe it will, maybe it won't. Let's try? There is no true difference
between having a 'request structure' that represents the current state
of the HTTP connection plus a statemachine that moves that request
between various queues, and a 'kernel stack' that goes in and out of
runnable state and carries its processing state in its stack - other
than the amount of RAM they take. (the kernel stack is 4K at a minimum -
so with a million outstanding requests they would use up 4 GB of RAM.
With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)


This is a fundamental misconception.  The state machine doesn't have
to do anything but chase pointers through cache.  Done right, it
hardly even branches (although the branch misprediction penalty is a
lot less of a worry on current x86_64 than it was in the
mega-superscalar-out-of-order-speculative-execution days).  It's damn
near free -- but it's a pain in the butt to code, and it has to be
done either in-kernel or in per-CPU OS-atop-the-OS dispatch threads.
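
By way of illustration only, here's a toy with the shape I mean (the
states and handlers are hypothetical); the point is that advancing a
request is an indexed call and a pointer chase, not a context switch:

#include <stddef.h>

struct req;
typedef int (*step_fn)(struct req *);

struct req {
        int         state;
        struct req *next;       /* intrusive ready-list link */
};

enum { ST_READ, ST_WRITE, ST_DONE };

static int do_read(struct req *r)  { return ST_WRITE; }  /* stub */
static int do_write(struct req *r) { return ST_DONE; }   /* stub */

static step_fn steps[] = { do_read, do_write };

static void run_ready(struct req *head)
{
        struct req *r;
        for (r = head; r != NULL; r = r->next)      /* chase pointers */
                if (r->state != ST_DONE)
                        r->state = steps[r->state](r);  /* one step */
}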

The scheduler, on the other hand, has to blow and reload all of the
hidden state associated with force-loading the PC and wherever your
architecture keeps its TLS (maybe not the whole TLB, but not nothing,
either).  The only way around this that I can think of is to make
threadlets promise that they will not touch anything thread-local, and
that when the FPU is handed to them in a specific, known state, they
leave it in that same state.  (Some of the flags can be
unspecified-but-don't-touch-me.)  Then you can schedule threadlets in
bursts with negligible transition cost from one to the next.

There is, however, a substantial setup cost for a burst, because you
have to put the FPU in that known state and lock out TLS access (this
is user code, after all).  If the wrong process is in foreground, you
also need to switch process context at the start of a burst; no
fandangos on other processes' core, please, and to be remotely useful
the threadlets need access to process-global data structures and
synchronization primitives anyway.  That's why you need for threadlets
to have a separate SCHED_THREADLET priority and at least a weak
ordering by PID.  At which point you are outside the feature set of
the O(1) scheduler as I understand it, and you might as well schedule
them from the next tasklet following the softirq dispatcher.
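
In userspace terms, that handshake is roughly what C99's fenv.h
already expresses.  A sketch, with run_burst() standing in
(hypothetically) for dispatching a batch of threadlets:

#include <fenv.h>

extern void run_burst(void);    /* hypothetical threadlet batch */

static void dispatch_with_known_fpu(void)
{
        fenv_t saved;

        fegetenv(&saved);       /* capture the caller's FPU state */
        fesetenv(FE_DFL_ENV);   /* hand the FPU over in the agreed state */
        run_burst();            /* threadlets promise to leave rounding
                                   and exception flags as they found them */
        fesetenv(&saved);       /* restore on the way out */
}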


> My tests show that with 4k connections per second (8k concurrency)
> more than 20k connections of 80k total block in tcp_sendmsg() over
> gigabit lan between quite fast machines.

yeah. Note that you can have a million sleeping threads if you want, the
scheduler won't care. What matters more is the amount of true concurrency
that is present at any given time. But yes, i agree that overscheduling
can be a problem.


What matters is that a burst of I/O responses be scheduled efficiently
without taking down the rest of the box.  That, and the ability to
cancel no-longer-interesting I/O requests in bulk, without leaking
memory and synchronization primitives all over the place.  If you
don't have that, this scheme is UNUSABLE for network I/O.


btw., what is the measurement utility you are using with kevents ('ab'
perhaps, with a high -c concurrency count?), and which webserver are you
using? (light-httpd?)


Do me a favor.  Do some floating point math and a memcpy() in between
syscalls in the threadlet.  Actually fiddle with errno and the FPU
rounding flags.  Watch it slow to a crawl and/or break floating point
arithmetic horribly.  Understand why no one with half a brain uses
Java, or any other language which cuts FP corners for the sake of
cheap threads, for calculations that have to be correct.  (Note that
Kahan received the Turing award for contributions to IEEE 754.  If his
polemic is too thick, read
http://www-128.ibm.com/developerworks/java/library/j-jtp0114/.)
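
Concretely, the probe I have in mind looks something like this (a
sketch; write() stands in for any blocking syscall a threadlet might
make, and fd/buffer setup is assumed):

#include <assert.h>
#include <fenv.h>
#include <string.h>
#include <unistd.h>

void probe(int fd, const char *src, char *dst, size_t len)
{
        fesetround(FE_UPWARD);      /* deliberately non-default mode */
        memcpy(dst, src, len);      /* touch the FP/vector copy path */
        (void)write(fd, dst, len);  /* may block; the threadlet may be
                                       resumed in another context */
        assert(fegetround() == FE_UPWARD);  /* did our state survive? */
}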

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: x86 hardware and transputers (Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3)

2007-02-22 Thread Michael K. Edwards

On 2/22/07, Oleg Verych <[EMAIL PROTECTED]> wrote:

> Yes, I will go back and read the code for myself.  This will take me
> some time because I have only a hand-waving level of knowledge about
> task_structs and pt_regs, and have largely avoided the dark corners of
> the x86 architecture.

This architecture was brought to us by windows on our screens. And it
took years (a decade?) for them to use all hardware features:

  (IA-32, i386) --> (MS Windows NT, 9X)

Yet you must still do much system programming to use those features.


Actually, this architecture was brought to us largely by WordPerfect,
VisiCalc, and IEEE754.  Nobody really cares what bloody operating
system the thing is running; they cared then, and care now, about
being able to write reports and memos that print cleanly and to build
spreadsheets that calculate correctly.  Both of these things are made
much more practical by predictable floating point semantics, which
meant at first that you had to write your own floating point library.
The first (and for a long time the only) piece of hardware to put
_usable_ hardware floating point within reach of the desktop was the
Intel 80387.  Usable, not because it was more accurate than the soft
version (it wasn't, actually quite the reverse), but because it got
the exception semantics right.

The '387 is what made the PC architecture the _only_ choice for the
corporate desktop, pretty much immediately upon its release in 1987.
My first corporate job was porting an electric utility's in-house
revenue requirements application -- written in Fortran with assembly
display routines -- from a Prime mini to the PC.  I still have the
nice leather coffee coaster with the Prime logo on my desk.  The rest
of Prime is dead and forgotten, largely because of the '387.


transputers were (AFAIK) completely orthogonal to any of today's x86 CPU
architectures -- hardware parallelism, special programming language and
technique to match this hardware. And they were chosen to work on Mars in
the mid-'90s, while the crowd wanted more stupid windows on cheap CPUs.


Y'know, what goes around comes around.  The HyperTransport CPU
interconnect from AMD that all the overclocker types are so excited
about is just the Transputer ports warmed over, with a modern cache
coherency protocol stacked on top.  Worked on one of those too, on the
Intel HyperCube -- it's not my fault that the i860 (predecessor of the
Itanic, in a way) lost its marbles so comprehensively on IRQ that you
couldn't do anything I/O intensive with it.

My first government job (at NASA) started with a crash course in Occam
and explicit parallelism, but it was so blindingly obvious that this
had no future outside its little niche that I looked around for other
stuff to do.  The adjacent console belonged to a Symbolics LISP
machine -- also clearly a box with no future, since the only
applications for it that still mattered (expert systems) were in the
last stages of being ported to a C-based expert system engine
developed down the hall (which is now open source, and which I have
had occasion to use for many purposes since).  I was a Mac weenie at
the time, so I polished my C skills working on the Mac port of CLIPS
and writing a genetic algorithm engine.  Had I stuck to the
Transputer, I would probably know a lot more about faking NUMA using a
cache coherency protocol than I do today.


Thus, I think, you are thinking mostly on the hardware level, while it's
a longstanding software problem, i.e. how to use x86 (:.


I don't think much about hardware or software.  I think about shipping
products in volume at positive gross margin, even when what's coming
out of my fingertips is source code and shell commands.  That's why
I've worked mostly on widgets with ARMs inside in recent years.  But
I'm kinda bored with that, and Niagara or Octeon may wind up cheap in
volume if somebody with extra fab capacity scoops up the wreckage of
Sun or Cavium, so I'm here harassing people more competent than I
about what it takes to make them programmable by mere mortals.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-22 Thread Michael K. Edwards

On 2/21/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

the syslet/threadlet framework has been derived from Tux, which one can
accuse of many things, but which i definitely can not accuse of being
slow. It has no relationship whatsoever to Solaris 2.0 or later.


So how well does Tux fare on a NUMA box?  The Solaris 2.0 reference
was not about the origins of the code, it was about the era when SMP
first became relevant to the UNIX application programmer.  I remember
an emphasis on scheduler scalability then, too.  It took them quite a
while to figure out that having an efficient scheduler is of little
use if you are scheduling the wrong things on the wrong CPUs in the
wrong order and thereby thrashing continuously.  By that time we had
given up and gone back to message passing via the network stack, which
was the one kernel component that could figure out how to get state
from one CPU to another without taking all of its clothes off and
changing underwear in between.  Sound familiar?


Your other mail showed that you have very basic misunderstandings about
how threadlets work, on which you based a string of firm but incorrect
conclusions. In this discussion i'm mostly interested in specific
feedback about syslets/threadlets - thankfully we are past the years of
"unless Linux does generic technique X it will stay a hobby OS forever"
type of time-wasting discussions.


Well, maybe I'll let someone else weigh in about whether I understood
threadlets well enough to provide feedback worth reading.  As for the
"hobby OS forever" bit, that's an utter misrepresentation of my
comments and criticism.  Linux is now good enough for Oracle to have
more or less abandoned Sun for Linux.  That's as good as it needs to
be, as far as Oracle and IBM are concerned.  The question is now
whether it will ever get substantially _better_, so that you can do
something constructive with a NUMA box or a 64-core MIPS without
having the resources of an Oracle or a Google to build an
OS-atop-the-OS.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-22 Thread Michael K. Edwards

On 2/21/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

> [...] As for threadlets, making them kernel threads is not such a good
> design feature, O(1) scheduler or not.  You take the full hit of
> kernel task creation, on the spot, for every threadlet that blocks.
> [...]

this is very much not how they work. Threadlets share the same basic
infrastructure with syslets and they do /not/ take a 'full hit of kernel
thread creation' when they block. Please read the announcements, past
discussions on lkml and the code about how it works.


Sorry, you're right, I jumped to a conclusion here without reading the
implementation.  I read too much into your statement that "threadlets,
when they block, are regular kernel threads".  So tell me, at what
stage is NPTL going to need a new TLS context for errno and all that?
Immediately when the threadlet first blocks, right?  At most you can
delay the bulk page copies with CoW MMU tricks (about which I cannot
begin to match your knowledge).  But you can't let any code run that
might touch errno -- or FPU state or anything else you need to
preserve for when you do carry on with the threadlet -- until you've
at least told the hardware to fault on write.

Yes, I will go back and read the code for myself.  This will take me
some time because I have only a hand-waving level of knowledge about
task_structs and pt_regs, and have largely avoided the dark corners of
the x86 architecture.  But I think my point still stands: allowing
code inside threadlets to use the usual C library wrappers around the
usual synchronous syscalls is going to mean that the threadlet context
is fairly heavyweight, both in memory and CPU/MMU state.  This means
that you can't pull it cheaply over to whatever CPU core happened to
process the device I/O that delivered the data it wanted.

If the cost of threadlet migration isn't negligible, then you can't
just write code that initiates a zillion threadlets in a loop -- not
if you want to be able to exploit NUMA or even SMP efficiently.  You
have to distribute the threadlet initiation among parallel threads on
all the CPUs -- at which point you are back where you started, with
the application explicitly partitioned among CPU-locked dispatch
threads.  Any programming team prepared to cope with that is probably
going to stick to the userspace state machine they probably already
have for managing delayed I/O responses.


syslets are not meant to be directly exposed to application coders.
Syslets (like many of my previous mails stated) are meant as building
blocks to higher-level AIO interfaces such as in glibc or libaio. Then
user-space can build its state-machine based on syslet-implemented
glibc/libaio. In that specific role they are a very fast and scalable
mechanism.


With all due respect -- and I know how successful you have been with
major new kernel features in the past -- I think that's a cop-out.
That's like saying that floating point is not meant to be directly
exposed to application coders.  Sure, the details of the floating
point _pipeline_ are essentially opaque; but I don't have to package
up a string of floating point operations into a "floatlet" and send it
off to the FPU.  From the point of view of the integer unit, I move
things into and out of floating point registers, and in between I
issue instructions to the FPU to do arithmetic on its registers.  If
something goes mildly wrong (say, underflow), and I've told the FPU to
carry on under those circumstances, it does.  If something goes bad
wrong, or if underflow happens and I've told the FPU that my code
won't do the right thing on underflow, I get an exception.  That's it.

If the FPU decides to pipeline things, that's not my problem; when I
get to the operation that wants to pull a result back over to the
integer unit, I stall until it's ready.  And in a cleverer, more
tightly interlocked processor that issues some things in parallel and
speculatively executes others, exceptions may not be deliverable until
long after the arithmetic op that produced them (you might not wind up
taking that branch).  Hence other state may have to be rewound so that
the exception is delivered with everything else in the state it would
have been in if the processor weren't so damn clever.

You don't have to invent all that pipelining and migration and
speculative execution stuff up front for AIO.  But if you don't stick
to what is "reasonable and necessary as well as parsimonious" when
defining what's allowed inside a threadlet, you won't leave
implementations any future flexibility.  And "want speed? use syslets
instead" is no answer, especially if you tell me that syslets wrapped
in glibc are only really useful for short-circuiting chained states in
a state machine.  I. Don't. Want. To. Write. State. Machines. Any.
More.

Cheers,
- Michael

Oh, and while I haven't written a kernel or an RDBMS, I have written
some fairly serious non-blocking I/O code (without resorting to
threads; one socket and thousands of i

Re: GPL vs non-GPL device drivers

2007-02-21 Thread Michael K. Edwards

On 2/21/07, D. Hazelton <[EMAIL PROTECTED]> wrote:

Actually, on re-reading the GPL, I see exactly why they made that pair of
exceptions. Where it's quite evident that a small to mid scale parser that
could have been written *without* the use of Bison is clearly a
non-derivative work - Bison was not required, but was used as a means of
expediency. When you reach a large scale parser, such as one that meets all
requirements to act as the parser for an ANSI C99 compiler, Bison stops being
expedient - it'd likely take just as much time to hand craft the parser as it
would to debug the Bison input. However, it makes maintaining the parser
easier.


I repeat, that is not what "derivative work" means.  Do not try to
reason about the phrase "derivative work" without reference to its
definition in statute and its legal history in appellate decisions.
You will get it wrong every time.  You have not "recast, transformed,
or adapted" the _creative_expression_ in Bison or Flex in the course
of using it; so whether or not you have infringed the FSF's copyright,
you have NOT created a "derivative work".

If Bison and Flex were offered under the "bare" GPL, and you used them
to build a parser, and the FSF sued you for offering that parser to
other people on terms other than the GPL's, you don't defend by
claiming license under the GPL with regard to your parser.  You attack
the claim of "copying" itself by saying that the _creative_expression_
copied from Bison/Flex into your parser is "de minimis" under the
Altai Abstraction-Filtration-Comparison test.  You also point out
that, even if it weren't "de minimis", it's "fair use"; that's a
complex four-factor test, but the main thing is that you're not
interfering with the FSF's ability to monetize its copyright in
Bison/Flex.

If you have any sense, you will strenuously argue that the various
"special exceptions" for things like Bison/Flex and GNU Classpath are
_not_ part of the offer of contract contained in the GPL, any more
than the Preamble is.  They're declarations of intent on the part of
the copyright holder, and can be used to _estop_ the FSF from making
the copyright infringement claim against you in court in the first
place.  They promised you they wouldn't, not as part of the contract
terms, but as part of the inducement to form a contract with them by
acceptance through conduct.


But the fact is that it's the small to medium scale parsers that have a lower
ratio of original to GPL'd code that are at risk of being declared derivative
works. All of this because the GPL contains the following text in section 0:
"The act of running the Program is not restricted, and the output from the
Program is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does."


I'm sorry, but that "ratio test" has nothing to do with whether
something is a derivative work or not.  It comes up mostly in
evaluating a "fair use" defense, and it's the ratio of the amount of
creative expression _copied_ to the amount of creative expression in
the _original_work_ that matters.  Just because you're writing a
49-volume encyclopedia does not give you the right to copy two thirds
of my one-page essay.  Those weasel-words about "depending on what the
Program does" are nonsense.  It depends on what _creative_expression_
from the Program winds up in the output.


That clause, to me, seems specifically tailored to cover programs such as
Bison and Flex. (and is the reason that I try to use Byacc and when I need
to, write tokenizers by hand) This frankly stinks of attempts to cover all
possible code. (I actually started studying Copyright law in my free time
because I was wondering how legal the GPL was and was also puzzled by some
major online entities actually complaining about it)


I've written tokenizers in at least six different languages to date,
including Fortran.  It's a pain in the butt.  The last thing I want to
be worrying about when I write a tokenizer is whether somebody else's
peanut butter is winding up in my chocolate.  So I will only use a
tool which, like bison and flex, is accompanied by a promise not to
claim that its output infringes the tool author's copyright or is
otherwise encumbered in any way.  M$ VC++ is actually more trustworthy
in that respect than G++.  If you don't believe (as I do) that it is
arrant nonsense to claim that the use of interface headers or linking
of any kind creates a derivative work, then you must believe that any
programs compiled with G++ can only be distributed under the terms of
the GPL.  libstdc++ is GPL, not LGPL.


Since I tailor the license I apply to code I produce to meet the needs of the
person or entity I am writing it for, I've never run into this.


That's kind of silly.  I use the GPL for my own from-scratch programs
all the time.  It's quite a sensible set of terms for source code
created as a side effect of a consu

Re: GPL vs non-GPL device drivers

2007-02-21 Thread Michael K. Edwards

On 2/21/07, D. Hazelton <[EMAIL PROTECTED]> wrote:

On Wednesday 21 February 2007 19:28, Michael K. Edwards wrote:
> I think you just misread.  I said that the Evil Linker has cheerfully
> shipped the source code of the modified POP server.  He may not have
> given you the compiler he compiled it with, without which the source
> code is a nice piece of literature but of no engineering utility; but
> that's the situation the GPL is DESIGNED TO PRODUCE.

Actually, if memory serves, when you license a work under the GPL, part of the
terms is that you have to provide the source and any scripts needed to
produce a functioning executable.


Absolutely.  But not the toolchain, just the "scripts".  This is part
historical, since after all GNU got started when Sun started to
monetize its toolchain independently of its O/S, RMS got annoyed
enough to kick off another cloning project, and more competent
compiler people caught on to the economic opportunity.  Ever pay $5K
for a CD full of GNU binaries for a commercial UNIX?  I did, after
deciding that getting all their shit to compile was more Morlock-work
than I was up to.  So like I say, it's part historical -- RMS didn't
want to owe me a copy of Sun's toolchain along with that CD -- but
it's no accident that it's still in there, because THAT'S HOW CYGNUS
MADE MEGABUCKS.


As a side note: The distinct wording of the GPL actually *invalidates* the
GNU/FSF claim that dynamically linking a work with, say, the readline
library,  means the work is a derivative of said library. The GPL states (in
clause 0) that the license only covers copying, modification and
distribution. Unless they are confusing "Linking" with "copying" or "creating
a derivative work" the claim is invalid - because, as it has been shown, a
mechanical process such as compilation or linking *cannot* create a
derivative work.


Of course.  The FSF smokescreen around "derivative work" exists not
only to frighten potential commercial users of GPL libraries but to
squelch people like Eric A. Young (principal OpenSSL author) who have
the presumption to retain their own copyrights.  Eric has a quite
solid case (IMHO, IANAL) against the FSF and Eben Moglen personally
under at least three different U. S. antitrust and racketeering laws,
and it would be really entertaining to watch him take it up.


Related to that... Though a parser generated by Bison and a tokenizer
generated by Flex both contain large chunks of GPL'd code, their inclusion in
the source file that is to be compiled is mechanical - the true unique work
is in writing the file that is processed by the tool to produce the output.
Since the aggregation of the GPL'd code into the output source is done
mechanically - via mechanical translation (which is what compilation is as
well) - the result is *not* and under US copyright law *cannot* be a
derivative work. What this means is that the GNU/FSF "special" terms applied
to parsers generated by Bison and tokenizers generated by Flex is
unnecessary - they are granting you a right you already have.


Half true.  It's not a derivative work exactly, but it could
conceivably be held to infringe against the copyright in Flex/Bison,
if you could prove that the amount of _creative_expression_ copied
into the output exceeds a "de minimis" standard and doesn't constitute
a "fair use".  Those nifty photomosaics would probably infringe
against the photographers' copyright in the photos they're made up of,
if they weren't licensed through the graphic industry's "stock photo
archive" mechanism.  You could probably defend on "fair use" with
respect to Flex/Bison and the vanilla GPL, since the fact that I can
get some random program with a parser in it from you without needing
my own copy of bison doesn't cost the FSF anything.  But it's a
gamble, especially if that random program competes with something the
FSF makes $$$ off of; and I'd want that extra bit of estoppel in my
back pocket.

The LGPL is a very different story.  It's not just GPL with extra
estoppel, it's a booby trap.  It goes a lot farther to put over its
own perverse definition of "derivative work", and it tries to compel
you to provide all the "data and utility programs needed for
reproducing the executable from it".  I don't use the LGPL for my own
work, I wouldn't touch it with a ten-foot pole if it didn't have the
"GPL upgrade" clause in it, and I advise my employers and consulting
clients to go through the "GPL (v2 only!) upgrade" rigmarole with
respect to anything they so much as recompile.  They don't all take
that advice, but that's not my problem.


Anyway, it's been fun watching this thread. If I've made a mistake somewhere
in there, let me know - IA

Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-21 Thread Michael K. Edwards

On 2/21/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote:

You won't be able to do it later if you don't design for it now.
Don't reinvent the square wheel -- there's a model to follow that was
so successful that it has killed all alternate models in its sphere.
Namely, IEEE 754.  But please try not to make pipeline flushes suck as
much as they did on the i860.


To understand why I harp on IEEE 754 as a sane model for pipelined
AIO, you might consider reading (at least parts of):
   http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf

People who write industrial-strength floating point programs rely on
the IEEE floating point semantics to avoid having to check every
result of every arithmetic step to see whether it is a valid input to
the next step.  NaNs and +/-0 and overflows and all that jazz are
essential to efficient coding of things like matrix inversion, because
the only alternative is simply to fail.  But in-line indications of an
exceptional result aren't enough, because it may or may not have been
a coding error, and you may need fine-grained control over which
"failure conditions" are within the realm of the expected and which
are not.
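
A tiny example of the coding style those semantics buy you -- no
validity check per step, one test at the end:

#include <math.h>

/* Sketch: NaNs and infinities flow through the inner loop at full
 * speed; an exceptional input surfaces as a NaN in the result and is
 * checked exactly once, by the caller, with isnan(). */
double dot(const double *a, const double *b, int n)
{
        double s = 0.0;
        int i;

        for (i = 0; i < n; i++)
                s += a[i] * b[i];
        return s;
}

The caller tests isnan() on the result once, and decides whether that
was a coding error or an expected exceptional input -- which is
exactly the fine-grained control I mean.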

Here's a quotable bit from Kahan and Darcy's polemic:

To achieve Floating-Point Predictability:
Limit programmers' choices to what is reasonable and necessary as
well as parsimonious, and
Limit language implementors' choices so as always to honor the
programmer's choices.
To do so, language designers must understand floating-point well
enough to validate their determination of "what is reasonable and
necessary," or else must entrust that determination to someone else
with the necessary competency.


Now I ain't that "someone else", when it comes to AIO pipelining.  But
they're out there.  Figure out how to create an AIO model that honors
the RDBMS programmer's choices efficiently on a NUMA box without
making him understand NUMA, and you really will have created something
for the ages.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-21 Thread Michael K. Edwards

On 2/21/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

threadlets (and syslets) are parallel contexts and they behave so -
queuing and execution semantics are then ontop of that, implemented
either by glibc, or implemented by the application. There is no
'pipeline' of requests imposed - the structure of pending requests is
totally free-form. For example in threadlet-test.c i've in essence
implemented a 'set of requests' with the submission site only interested
in whether all requests are done or not - but any stricter (or even
looser) semantics and ordering can be used too.


In short, you have a dataflow model with infinite parallelism,
implemented using threads of control mapped willy-nilly onto the
underlying hardware.  This has not proven to be a successful model in
the past.


in terms of AIO, the best queueing model is i think what the kernel uses
internally: freely ordered, with barrier support. (That is equivalent to
a "queue of sets", where the queue are the barriers, and the sets are
the requests within barriers. If there is no barrier pending then
there's just one large freely-ordered set of requests.)


That's a big part of why Linux scales poorly for workloads that
involve a large volume of in-flight I/O transactions.  Unless you
essentially lock one application thread to each CPU core, with a
complete understanding of its cache sharing and latency relationships
to all the other cores, and do your own userspace I/O scheduling and
dispatching state machine -- which is what all industrial-strength
databases and other sorts of transaction engines currently do -- you
get the same old best-effort context-thrashing scheduler we've had
since Solaris 2.0.

Let's do something genuinely better this time, OK?

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-21 Thread Michael K. Edwards

On 2/21/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

pthread_cancel() [if/once threadlets are integrated into pthreads] ought
to do that. A threadlet, if it gets moved to an async context, is a
full-blown thread.


The fact that you are proposing pthread_cancel as a model for how to
abort an unfinished threadlet suggests -- to me, and I would think to
anyone who has ever written code that had no choice but to call
pthread_cancel -- that you have not thought about this part of the
problem.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-21 Thread Michael K. Edwards

On 2/21/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

threadlets, when they dont block, are just regular user-space function
calls - so no need to schedule or throttle them. [*]


Right.  That's a great design feature.


threadlets, when they block, are regular kernel threads, so the regular
O(1) scheduler takes care of them. If MMU thrashing is of any concern
then syslets should be used to implement the most performance-critical
events: under Linux a kernel thread that does not exit out to user-space
does not do any TLB switching at all. (even if there are multiple
processes active and their syslets intermix)


As far as I am concerned syslets by themselves are a dead letter,
because you can't do any of the things that potential application
coders need to do with them.  As for threadlets, making them kernel
threads is not such a good design feature, O(1) scheduler or not.  You
take the full hit of kernel task creation, on the spot, for every
threadlet that blocks.  You don't fence off the threadlet from any of
the stuff that it ought to be fenced off from for thread-safety
reasons, so you don't have much choice but to create a new TLS arena
for it (which you need anyway for errno), so they have horrible MMU
and memory overhead.  You can't dispatch them inexpensively, while the
data delivered by a softirq is still hot in cache, to traverse 1-3
lines of userspace code and make the next syscall.  So they're just
not usable for any of the things that a real AIO application actually
does.


throttling of outstanding async contexts is most easily done by
user-space - you can see an example in threadlet-test.c, but there's
also fio/engines/syslet-rw.c. v2 had a kernel-space throttling mechanism
as well, i'll probably reintroduce that in later versions.


You're telling me that scheduling parallel I/O is the kernel's job but
throttling it is userspace's job?


[*] although certain more advanced scheduling tactics like the detection
of frequently executed threadlet functions and their pushing out to
separate contexts is possible too - but this is an optional add-on
and for later.


You won't be able to do it later if you don't design for it now.
Don't reinvent the square wheel -- there's a model to follow that was
so successful that it has killed all alternate models in its sphere.
Namely, IEEE 754.  But please try not to make pipeline flushes suck as
much as they did on the i860.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Serial related oops

2007-02-21 Thread Michael K. Edwards

Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
to us, at least on an ARM target ...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-21 Thread Michael K. Edwards

I think you just misread.  I said that the Evil Linker has cheerfully
shipped the source code of the modified POP server.  He may not have
given you the compiler he compiled it with, without which the source
code is a nice piece of literature but of no engineering utility; but
that's the situation the GPL is DESIGNED TO PRODUCE.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-21 Thread Michael K. Edwards

On 2/21/07, Nuno Silva <[EMAIL PROTECTED]> wrote:

I can see that your argument is all about the definition of a
"derivative work".


Far from it.  Try reading to the end.


We all know that #include  is mostly non copyrightable, so I
mostly agree that some - very very simple - modules may not need to
include the source when distributing the resulting module.ko. (need =
from a legal standpoint... The intended spirit of the GPL is another story)


The "intended spirit of the GPL" is very different from what you think
it is, as $674 million of Red Hat stock can testify.  It is also
utterly irrelevant except when the circumstances surrounding someone's
acceptance of the GPL indicate that the two parties negotiated more or
less directly before settling on its terms.


In this context what do you think about porting Linux to another arch?
Do the people porting the OS need to distribute the source with the
[compiled] kernel?


Of course.  They're distributing a derivative work of the kernel, or
perhaps even (for legal purposes) distributing Linus's work of
authorship with trivial editorial changes that do not create a new
copyrightable work.  They need license to do so, and the only license
on offer is GPL v2, which conditions the license on distribution of
source code.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-21 Thread Michael K. Edwards

On 2/21/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

I believe this threadlet concept is what user-space will want to use for
programmable parallelism.


This is brilliant.  Now it needs just four more things:

1) Documentation of what you can and can't do safely from a threadlet,
given that it runs in an unknown thread context;

2) Facilities for manipulating pools of threadlets, so you can
throttle their concurrency, reprioritize them, and cancel them in
bulk, disposing safely of any dynamically allocated memory,
synchronization primitives, and so forth that they may be holding;

3) Reworked threadlet scheduling to allow tens of thousands of blocked
threadlets to be dispatched efficiently in a controlled, throttled,
non-cache-and-MMU-thrashing manner, immediately following the softirq
that unblocks the I/O they're waiting on; and

4) AIO vsyscalls whose semantics resemble those of IEEE 754 floating
point operations, with a clear distinction between a) pipeline state
vs. operands, b) results vs. side effects, and c) coding errors vs.
not-a-number results vs. exceptions that cost you a pipeline flush and
nonlocal branch.

When these four problems are solved (and possibly one or two more that
I'm not thinking of), you will have caught up with the state of the
art in massively parallel event-driven cooperative multitasking
frameworks.  This would be a really, really good thing for Linux and
its users.
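
For what it's worth, the pool-management piece (item 2) might look
something like this from the application's side.  Every name here is
hypothetical -- the shape of an API, not a proposal:

/* Hypothetical sketch: pools of threadlets that can be throttled,
 * reprioritized, and cancelled in bulk, with a per-threadlet destroy
 * hook so cancellation can reclaim memory and locks safely. */
struct tl_pool;                                 /* opaque */

struct tl_pool *tl_pool_create(int max_concurrent);
int  tl_pool_spawn(struct tl_pool *p,
                   long (*fn)(void *), void *arg,
                   void (*destroy)(void *));    /* runs if cancelled */
int  tl_pool_set_priority(struct tl_pool *p, int prio);
int  tl_pool_cancel_all(struct tl_pool *p);     /* invokes destroy() */
void tl_pool_destroy(struct tl_pool *p);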

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-21 Thread Michael K. Edwards

Actually, it's quite clear under US law what a derivative work is and
what rights you need to distribute it, and equally clear that
compiling code does not make a "translation" in a copyright sense.
Read Micro Star v. Formgen -- it's good law and it's funny and
readable.

I've drafted summaries from a couple of different angles since VJ
requested a "translation into English", and I think this is the most
coherent (and least foaming-at-the-mouth) I've crafted yet.  It was
written as an answer to a private query to this effect:  "I write a
POP server and release it under the GPL.  The Evil Linker adds some
hooks to my code, calls those hooks (along some of the existing ones)
from his newly developed program, and only provides recipients of the
binaries with source code for the modified POP server.  His code
depends on, and only works with, this modified version of my POP
server.  Doesn't he have to GPL his whole product, because he's
combined his work with mine?"

This is a fundamental misconception.  A <<combination>> is not a "work
of authorship".  Copyright is about "works of authorship" and cannot
be used to allow or disallow behavior based on whether you have
<<combined>> two things at an engineering level to make a product.  A
contract can be used to allow or disallow, and assign penalties to,
all sorts of things, and the GPL is an "offer of contract"; but its
plain text does _not_ disallow this <<combination>> -- largely because
the drafter was trying to put one over on you and me by pretending that
he could do that without recourse to contract law.

The fact that your Evil Linker's program will not do anything
interesting without your program is no more relevant than the fact
that Borland's spreadsheet program will not do anything interesting
without a spreadsheet document loaded.  Borland's interest lay in
making their macro language compatible with Lotus's so that users
didn't have to rewrite their documents from scratch.  The Evil
Linker's interest lies in making their program compatible with other
clients of your POP server so they don't have to rewrite your POP
server from scratch.  Borland won in court, and so will the Evil
Linker.  IANAL, TINLA.

Now, Borland _almost_ lost at the Supreme Court level.  Why?  Because
while they had a good case that it wasn't practical to copy the 1-2-3
macro language without copying its entire user interface, that gets
awfully close to the sort of expression that copyright is supposed to
protect.  You can take a picture of a skyscraper and sell copies of
that picture, not because it isn't in some sense an infringement on
the architect's copyright, but because it's "fair use" -- mostly
because it doesn't interfere with the architect's ability to make
money licensing _architectural_ replicas of her work.  When you take a
screenshot of a spreadsheet, you're on safe ground; but if you use
that screenshot to clone the spreadsheet, you're pushing your luck.

Borland won, sort of, when the Supremes split 4-4 (one was out sick or
recused or something).  If you want to know why, you can get hold of a
transcript of the oral argument before the Supreme Court, which is
mostly in plain English and about half debate between the Justices
about where they ought to draw the line.  For an example where
screenshots can be over the line, and where even unlicensed
distribution of data files can be held to infringe the copyright on
the program that reads them, read Micro Star v. Formgen (9th Circuit).
That involved a very different theory though, infringement on the
"characters and mise en scene" of a fictional work (Duke Nukem 3D),
and will not avail you against the Evil Linker.  All of this stuff is
covered in Lexmark v. Static Control (6th Circuit, cert. denied) --
the law of the land throughout the U. S. of A.

But wait, you say -- the Evil Linker modified, copied, and distributed
my POP server too!  That makes him subject to the terms of the GPL.
And you're right; but to understand what that means, you're going to
need to understand how a lawsuit for copyright infringement works.
The very, very, very concise version is:

You claim "copyright infringement".
He claims "copyright license" -- "acceptance through conduct" of a
"valid offer of contract".
You claim conduct outside the "scope of the license".
He claims the terms about distributing modified versions together with
source code are "covenants of return performance", which he duly
performed.
You claim the license covers the whole <<work based on the Program>>,
including his application.
He points out that <<work based on the Program>> is explicitly defined
in GPL Section 0 to be a "derivative work under copyright law", and
that while the paraphrase following this overstates the extent of the
"derivative works" category, a raft of case law says that his program
is not a "derivative work" of yours.  Furthermore, it would be
"contrary to the public interest" to allow a "contract of adhesion in
rem" to disallow the "universal industry practice" of <<combining>>,
for engineering purposes, many differently licensed works on comm

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Robert Hancock <[EMAIL PROTECTED]> wrote:

How do you propose to do this? Drivers can get loaded and unloaded at any
time. If you have a device generating spurious interrupts on a shared IRQ
line, there's no way you can use any device on that line until that interrupt
is shut off. Requiring all drivers to be loaded before any of them can use
interrupts is simply not practical.


Of course not.  But dealing with a stuck IRQ line by locking up isn't
very practical either.  IRQ sharing is stupid yet universal, and it
happens all the time that a device that has been sitting there minding
its own business since power-up, with no driver to drive it, decides
to assert its IRQ.  Maybe it just got hot-plugged, maybe it just got
its first dribble of input, whatever.  Other devices on the shared IRQ
are screwed (or at least semi-screwed; you could periodically
re-enable the IRQ long enough to make a run through the ISR chain
servicing the other devices).  But if you run "lspci" (or whatever)
and load a driver for the newly awake device, everything goes back to
normal.

For devices compiled into the kernel, you shouldn't have to play these
games.  If, that is, there were three stages of driver initialization,
called in successive passes:

1) installing an ISR with a fallback STFU path (device-specific but
not dependent on any particular pre-existing chip state), quiescing it
if you know how and registering for the IRQ if you know which it is;

2) going through the chip's soft-reset-wake-up-shut-up cycle and
populating driver data structures, possibly correcting the IRQ
registration along the way;

3) ready-as-we'll-ever-be, bring on the interrupts.

You probably can't help enabling the IRQ briefly during 2) so that you
can do tests like Russell's loopback.  But it's a needless gamble to
do that without doing 1) for all compiled-in drivers and platform
devices first, in a previous discovery pass.  And it's stupid to do 3)
in the same pass as 2), because you'll just open race condition
windows that will only bite when an all-the-way-live device raises its
IRQ at a moment when the writer of the wake-up-shut-up code wasn't
expecting it.  All code has bugs and they're only a problem when they
bite in the field.
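
To sketch pass 1 against the 2.6.16-era request_irq() signature
(everything named foo_* is a hypothetical driver hook): the point is
that a shut-it-up ISR is in the chain before any shared IRQ is ever
unmasked, and passes 2 and 3 can then proceed at leisure.

#include <linux/interrupt.h>

struct foo_chip { int irq; /* ... registers, state ... */ };

extern int  foo_irq_pending(struct foo_chip *);  /* hypothetical */
extern void foo_mask_all(struct foo_chip *);     /* hypothetical */

static irqreturn_t foo_stfu_isr(int irq, void *dev_id,
                                struct pt_regs *regs)
{
        struct foo_chip *chip = dev_id;

        if (!foo_irq_pending(chip))     /* not ours on this shared line */
                return IRQ_NONE;
        foo_mask_all(chip);             /* force the source quiet; pass 2
                                           will reprogram it properly */
        return IRQ_HANDLED;
}

static int foo_pass1(struct foo_chip *chip)
{
        return request_irq(chip->irq, foo_stfu_isr, SA_SHIRQ,
                           "foo-fallback", chip);
}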


If a system has a device that generates interrupts before they're enabled,
and the firmware doesn't fix it, then some platform-specific quirk has to
handle it and shut off the interrupt before it allows any interrupts
to be enabled. (We have such a quirk for certain network controllers where
the boot ROM can leave the chip generating interrupts on bootup.)


You don't need quirks if your driver initialization is bomb-proof to
begin with.  Devices that are quiet on power-up are purely
coincidental and should not be construed.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Serial related oops

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:

This can't happen because when __do_irq unmasks the interrupt source,
the CPU mask is set, thereby preventing any further interrupt exceptions
being taken.  This is done precisely to prevent this situation happening.

If you are seeing recursion for the same interrupt (two or more stack
frames containing asm_do_IRQ for that very same IRQ) then your interrupt
handling is buggy, plain and simple.


Imaginable.  I'll look at the mask/unmask code.  Thanks.


I don't doubt that it is on the same IRQ line - I have such setups here
and it works perfectly - multiple 8250 UARTs connected to a single
level-triggered interrupt input which also happens to be shared with
a SCSI host chip as well.  Absolutely no problems.


Can you do me a favor?  In the sys_open("/dev/console") path, turn on
the right bits in that second uart's IER, then insert a sleep in
request_irq or something (wherever seems best based on that
backtrace), and feed enough characters into the second UART during
that sleep to generate an IRQ.  Do you not get the same soft lockup?


I still say that your understanding is completely flawed.  Moreover,
you haven't read what I've said about the ordering of initialisation,
the stress on when we disable interrupts for the ports, etc.


Well, all I can say is that that's a real backtrace and it shouldn't
be hard to reproduce if it's anything other than a broken interrupt
controller or broken code called by the __do_irq postamble.  I don't
see any platform-provided unmask routines in that backtrace, but maybe
it got inlined; I'll go back and check.


You're actually *not* helping.  You're causing utter confusion through
misunderstanding, but it seems you're not open to the possibility that
your understanding is flawed.


Still open, though it's a pity you're more interested in my flawed
understanding than in the possibility that the kernel could be
systematically made more robust against hardware bugs and coding
errors by the simple expedient of putting all the ISRs in before
turning on any IRQ that might be shared.  Or are you telling me that's
already been done?  (Yes, I am aware that this interacts
entertainingly with hot-plug PCI.  Yes, I am aware that there is a
limit to how much software can fix stupid hardware.  But surely there
is room for an emergency IRQ suppressor to let chip initialization
code kick in and force the hardware to a known state.)


I'm offering to look through your code and point you at the source of
your issue for free.  Please don't throw that offer away without first
considering that maybe I have a clue about what's going on here.


I appreciate that offer, and I hope to take advantage of it as soon as
I have the source code at my fingertips (not just the chat log where I
recorded the backtrace).


... which showed the port being opened well after system initialisation
of devices, including all serial ports - including disabling of their
interrupt source at the IER, has been completed.


Now that you mention it, the backtrace I sent is the
serial8250_startup one, not the serial8250_init one.  Sorry, this
one's probably an artifact of brain damage specific to this UART.  I
need to dig through a different account to find the init-path example;
but in either case, we're getting a new interrupt during the __do_irq
postamble.  If you're telling me that that shouldn't happen, what
should the backtrace for a soft lockup due to a stuck level-triggered
IRQ look like on ARM?


Yes, and it's the same for any serial console with functioning break
support.  You'll find it in Documentation/sysrq.txt, though it does
misleadingly say "PC style standard serial ports only" whereas the
reality is "where possible".


Thank you very much; this will help me get to the bottom of some other
chip-support nastiness on this device.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-19 Thread Michael K. Edwards

And for those reading along at home, _surely_ you understand the
meanings of "ambiguities in an offer of contract must be construed
against the offeror", "'derivative work' and 'license' are terms of
art in copyright law", and "not a valid limitation of scope".  If not,
I highly recommend to you that master of exposition who goes by "Op.
Cit.".

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Trent Waddington <[EMAIL PROTECTED]> wrote:

that's what I figured yes, as you're obviously not interested in
convincing anyone of your opinions, otherwise you wouldn't mind
repeating yourself when someone asks you a simple question.


No, dear, I'm just not interested in convincing you if you can't be
bothered to look back in the thread and Google a bit.  Think of it as
a penny ante, which is pretty cheap in a card game with billion-dollar
table stakes.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Trent Waddington <[EMAIL PROTECTED]> wrote:

Hang on, you're actually debating that you have to abide by conditions
of a license before you can copy a copyright work?

Please, tell us the names of these appellate court decisions so that
we can read them and weep.


Can we put the gamesmanship on "low" here for a moment?  Ask yourself
which is more likely: am I a crank who spends years researching the
legal background of the GPL solely for the purpose of ranting
incoherently on debian-legal and LKML, or am I a mid-career embedded
software developer with an obsessive streak who has come to a
realization of how dangerous this whole EXPORT_SYMBOL_GPL business is to
his niche in the industry?  Do I have an irrational hatred for people
named Eben, or am I sick to the heart of seeing decent young kids'
beliefs about the law perverted by a racketeer in attorney's clothing
with (at an informed guess) somewhere between $300K and $2M a year in
easy money from his web of "non-profit" shell companies?  I may be
_wrong_ -- but I'm not _witless_.

Now, if you want to play games, get yourself a copy of this
new-fangled invention called a "web browser", and Google for each set
of capitalized words with a "v." in between that has appeared in my
posts to LKML in the last few days.  For extra credit, follow links to
older posts, cleverly signposted with "http://".  Repeat ad nauseam.
When you have an argument to offer that isn't already a blenderized
equine, preferably associated with a citation to one of those shiny
URL thingies with an "edu" or a "findlaw" in it, or even one of those
phrases with the magic "v.", I'm all ears.  Good morning -- and if I
don't see ya, good afternoon, good evening, and good night.

- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Serial related oops

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:

I think something else is going on here.  I think you're getting
an interrupt for the UART, and another interrupt is also pending.


Correct.  An interrupt for the other UART on the same IRQ.


When the UART interrupt is handled, it is masked at the interrupt
controller, and the CPU mask is dropped.


Correct.


The second interrupt comes in, and when you go to disable that
source, you inadvertently re-enable the UART interrupt, despite it
still being serviced.


Incorrect.  An attempt has been made to service the interrupt using
the only ISR currently in the chain for that IRQ -- the ISR for the
first UART.  That attempt was not successful, and when __do_irq
unmasks the interrupt source preparatory to exiting interrupt context,
__irq_svc is dispatched anew.


This leads to the UART interrupt again triggering an IRQ.


Right.  The _second_ UART's interrupt.  There's another problem with
these UARTs having to do with the implementor's inability to read and
follow a bog-standard twenty-year-old spec without asking software to
fix up corner cases, but that's another backtrace for another day.


Please show your interrupt controller (mask, unmask, mask_ack)
handling functions corresponding with the interrupt which your
UART is connected to.


Don't have 'em handy; I'll be happy to post them when I do, perhaps
later today.  I would hope they're pretty generic, though; it's a
Feroceon core pretending to be an ARM926EJ-S, hooked to the usual
half-assed Marvell imitation of an ARM licensed functional block.
Trust me for the moment, it's the same IRQ line.


This shows that you don't actually have an understanding of the Linux
kernel boot, especially in respect of serial devices.  At boot, devices
are detected and initialised to a safe state, where they will not
spuriously generate interrupts.


Sorry, 'taint so.  Not unless the chip support droid has put the right
stuff in arch/arm/mach-foo.  LKML is littered with the fall-out of the
decision to trust whoever jumped to main() to have left the hardware
in a sane state.  If you don't enjoy this sort of forensics (which I
for one do not, especially not when there is a project deadline
looming and a Heisenbug starts firing 9 times out of 10), you might
consider systematically installing ISRs that know how to shut
everything up before turning on any interrupt sources at all.
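
For concreteness, a minimal sketch of what I mean, using the
2.6.16-era handler signature (the handler name and its registration
point are invented for illustration, not code from our tree):

    #include <linux/interrupt.h>
    #include <linux/irq.h>

    /* Catch-all handler that platform init code installs on a shared
     * line before any real driver registers on it.  If a device the
     * bootloader left screaming asserts the line, mask it at the
     * controller instead of looping forever in __irq_svc; the real
     * driver's ISR takes over when it is finally installed. */
    static irqreturn_t quiesce_isr(int irq, void *dev_id,
                                   struct pt_regs *regs)
    {
            disable_irq_nosync(irq);    /* mask the source, don't spin */
            return IRQ_HANDLED;
    }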

As I said, this is not going to happen overnight, and is not even
particularly in the economic interest of people who get paid by the
hour to wear bringup wizard hats.  That category currently includes
me, but I am intensely bored with this game and aspire to greater
things.


When a userspace program opens a serial port, which can only happen
once the kernel boot has completed (ergo, devices have been initialised
and placed in a safe state) the interrupts are claimed, and enabled
at the source.


As you can see from the console dump I posted (which begins with
"Freeing init memory: 92K" and ends with do_exit -> init -> sys_open,
which is obviously sys_open("/dev/console")), this happens long before
userspace comes into the picture.  Our 8250.c has some nasty hacks in
it but otherwise this call chain is from a very nearly vanilla
2.6.16.recent.

We've already worked around this on our board, and the whole kit and
kaboodle will eventually be posted to linux-arm-kernel in tidy patches
when my client lets me spend billable hours on it (immediately after
the damn thing passes its first functional test, long before it
ships).  I'm not asking for anyone's help except in the
let's-all-help-one-another spirit.  I'm trying to help with root cause
analysis of Frederik's (Jose's?) fandango on core.  If it's not
relevant, my apologies; and although it goes without saying, I salute
you for both the serial driver and the ARM port.

Now please take a second look at the backtrace before toasting me
lightly again.  Mmm'kay?  Oh, and by the way -- is there an Alt-SysRq
equivalent on an ARM serial console?

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Trent Waddington <[EMAIL PROTECTED]> wrote:

Of course, now you're going to argue that there's no such thing as an
"incompatible license" or "mere aggregation" and that these are just
words that were made up for the GPL, so they can be ignored.. another
pointless semantic argument because it doesn't change the very simple
fact that you don't have any rights to copy the work unless you have a
license and you don't have a license if you fail to abide the
conditions of the license.


I don't have to argue these points, because they're obvious to anyone
who cares to do their own homework.  Appellate court decisions _are_
the law, my friend; read 'em and weep.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Trent Waddington <[EMAIL PROTECTED]> wrote:

On 2/20/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote:
> Bah.  Show us a citation to treaty, statute, or case law, anywhere in
> the world, Mr. Consensus-Reality.

It's a given.. are you seriously contending that if you combine two
copyright works you are not obliged to conform with the conditions of
the license on one of them when making a copy of the combined work?


There is no legal meaning to "combining" two works of authorship under
the Berne Convention or any national implementation thereof.  If you
"compile" or "collect" them, you're in one area of law, and if you
create a work that "adapts" or "recasts" (or more generally "derives
from") them, you're in another area of law.  A non-exclusive license
can authorize one of these categories of conduct and not the other, or
can slice and dice the scope of the license along almost any axis of
law or fact.  But what it can't do is use the phrase "derivative work"
passim -- a legal term of art invented by US judges and legislators to
corral a class of infringing works and treat them differently from
compilations -- and pretend that it also encompasses compilations
(copyrightable and otherwise; there is an originality threshold here
too) and use by reference.


If you are just arguing about the term, what term do you find more
appropriate?  Compilation?


If the amount of "adapting, translating, and recasting" done to the
pre-existing works crosses a minimum threshold of creative expression
(not "sweat of the brow", at least not under Feist), then you've
created a derivative work.  If the amount of "selecting and arranging"
done to create a compilation crosses a similar threshold of creative
expression, you've created a copyrightable compilation or collective
work.  If neither, then all you've done is copy and distribute.
That's how the law works.  IANAL, TINLA.


You guys seem to love pointless semantic arguments.  Are you always in
violent agreement?


Perhaps pointless if you were the sole audience, since you seem
disinclined to evaluate the accuracy of your beliefs given novel
information backed by extensive citations to the primary literature.
Not pointless if it disabuses other people, with less deeply ingrained
errors of understanding, of the notion that they can trust certain
very heavily financially interested parties to tell them the truth
about what the law says.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Serial related oops

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:

> setup_irq() is where things go wrong, at least for us, at least on
> 2.6.16.x.  Interrupts are not disabled at the point in request_irq()
> when the interrupt controller is poked to enable the IRQ source.  If
> you're lucky, and you're on an architecture where the UART interrupt
> is properly level-triggered, and the worst thing that happens when you
> attempt to service an interrupt that isn't yours is that it stays on,
> then you get a soft lockup with two or three recursive __irq_svc hits
> in the backtrace.  If you're not lucky you do a fandango on core.

That should not happen if your interrupt handling is correct - okay, you
might get an interrupt at that point, but while servicing that interrupt
the source will be disabled on the interrupt controller.


Right.  But as soon as you turn the source back on, in the postamble
of the interrupt dispatch handler, it fires again.  At least on ARM,
that gives you recursive hits to __irq_svc and a couple of nested
calls within it.  Here's a backtrace (embedded in a chat log with some
commentary):

6:42 PM me: we have definitely confirmed that the serial ISR is
failing to clear the interrupt and the (presumably level-triggered)
IRQ is firing again on exit from the ISR.

6:43 PM The reason that __do_softirq is usually the last function
entrypoint in the backtrace before the __irq_svc associated with the
timer is that it is the first place where interrupts are enabled
during the IRQ dispatcher postamble.

6:44 PM Here is a backtrace from a case where the timer interrupt hit
during the perpetually firing ISR instead of during the dispatch code
surrounding it (which is not visible in backtraces)

6:45 PM [ 54.23] Freeing init memory: 92K
[ 52.24] rcu_do_batch: rcu node is 0xC03D7540, callback is 0xC00864C8
[ 52.24] rcu_do_batch: rcu node is 0xC02CCDA0, callback is 0xC006E7E4
[ 52.25] rcu_do_batch: rcu node is 0xC03D7730, callback is 0xC00864C8
[ 52.26] rcu_do_batch: rcu node is 0xC03D7920, callback is 0xC00864C8
[ 51.24] BUG: soft lockup detected on CPU#0!
[ 52.24] [] (dump_stack+0x0/0x14) from []
(softlockup_tick+0xa8/0xe8)
[ 52.24] [] (softlockup_tick+0x0/0xe8) from []
(run_local_timers+0x18/0x1c)
[ 52.24] r8 = 00010105 r7 = 0005 r6 =  r5 = 
[ 52.24] r4 = C0299B40
[ 52.24] [] (run_local_timers+0x0/0x1c) from
[] (update_process_times+0x50/0x7c)
[ 52.24] [] (update_process_times+0x0/0x7c) from
[] (timer_tick+0xc4/0xe0)
[ 52.24] r6 =  r5 = C029DB48 r4 = C029DB48
[ 52.24] [] (timer_tick+0x0/0xe0) from []
(mv88w8xx8_timer_interrupt+0x30/0x68)
[ 52.24] r6 =  r5 = C029DB48 r4 = C024775C
[ 52.24] [] (mv88w8xx8_timer_interrupt+0x0/0x68) from
[] (__do_irq+0xf0/0x140)
[ 52.24] r5 =  r4 = C0204280
6:46 PM [ 52.24] [] (__do_irq+0x0/0x140) from
[] (do_level_IRQ+0x70/0xc8)
[ 52.24] [] (do_level_IRQ+0x0/0xc8) from []
(asm_do_IRQ+0x50/0x134)
[ 52.24] r6 = C029DB48 r5 = C0240E24 r4 = 0005
[ 52.24] [] (asm_do_IRQ+0x0/0x134) from []
(__irq_svc+0x38/0x190)
[ 52.24] r6 = 0020 r5 = C029DB7C r4 = 
[ 52.24] [] (__do_irq+0x0/0x140) from []
(do_level_IRQ+0x70/0xc8)
[ 52.24] [] (do_level_IRQ+0x0/0xc8) from []
(asm_do_IRQ+0x50/0x134)
[ 52.24] r6 = C029DBFC r5 = C0240F5C r4 = 000B
[ 52.24] [] (asm_do_IRQ+0x0/0x134) from []
(__irq_svc+0x38/0x190)
[ 52.24] r6 = 0800 r5 = C029DC30 r4 = 
[ 52.24] [] (__do_softirq+0x0/0xd8) from []
(irq_exit+0x48/0x5c)
[ 52.24] r6 = C029DC94 r5 = C0240E24 r4 = 0005
[ 52.24] [] (irq_exit+0x0/0x5c) from []
(asm_do_IRQ+0x11c/0x134)
[ 52.24] [] (asm_do_IRQ+0x0/0x134) from []
(__irq_svc+0x38/0x190)
[ 52.24] r6 = 0820 r5 = C029DCC8 r4 = 
[ 52.24] [] (setup_irq+0x0/0x15c) from []
(request_irq+0xa4/0xd0)
[ 52.24] r7 =  r6 =  r5 = 000B r4 = C0C1B5C0
[ 52.24] [] (request_irq+0x0/0xd0) from []
(serial_link_irq_chain+0x264/0x2a0)
[ 52.24] [] (serial_link_irq_chain+0x0/0x2a0) from
[] (serial8250_startup+0x2f4/0x4f0)
[ 52.24] [] (serial8250_startup+0x0/0x4f0) from
[] (uart_startup+0x164/0x48c)
[ 52.24] [] (uart_startup+0x0/0x48c) from []
(uart_open+0x1a8/0x238)
[ 52.24] [] (uart_open+0x0/0x238) from []
(tty_open+0x1cc/0x390)
[ 52.24] [] (tty_open+0x0/0x390) from []
(chrdev_open+0x1e4/0x220)
[ 52.24] [] (chrdev_open+0x0/0x220) from []
(__dentry_open+0x13c/0x294)
[ 52.24] r8 = C028E2A0 r7 = C0077C60 r6 = C0C29B94 r5 = 
[ 52.24] r4 = C02CC300
[ 52.24] [] (__dentry_open+0x0/0x294) from []
(nameidata_to_filp+0x34/0x48)
[ 52.24] [] (nameidata_to_filp+0x0/0x48) from
[] (do_filp_open+0x44/0x4c)
[ 52.24] r4 = 0002
[ 52.24] [] (do_filp_open+0x0/0x4c) from []
(do_sys_open+0x50/0x94)
[ 52.24] r5 =  r4 = 0002
[ 52.24] [] (do_sys_open+0x0/0x94) from []
(sys_open+0x24/0x28)
[ 52.24] r8 =  r7 = 00

Re: GPL vs non-GPL device drivers

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Trent Waddington <[EMAIL PROTECTED]> wrote:

If you can't even agree on that the legal concept of a combined work
exists then you're obviously too far from reality for anyone to reason
with.


Bah.  Show us a citation to treaty, statute, or case law, anywhere in
the world, Mr. Consensus-Reality.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-19 Thread Michael K. Edwards

On 2/19/07, linux-os (Dick Johnson) <[EMAIL PROTECTED]> wrote:

FWIW. A license is NOT a contract in the United States, according to
contract law. A primary requirement of a contract is an agreement. A
contract cannot, therefore, be forced. Licenses, on the other hand,
can be forced upon the user of the licensed material.


Wrong.  Acceptance through conduct has been integral to contract law
in common-law countries since the days of writs in Chancery, and is
part of the codification of the difference between contracts "in
personam" and "in rem".  Allow me to recommend Kevin Teeven's "A
History of the Anglo-American Common Law of Contract".  It is settled
law throughout the Western world that non-exclusive licenses of
copyright need not be formalized, or even put in writing.  Licenses
cannot in any sense be forced on anyone; they are simply a defense
against an action for tort, a conditional waiver of the right to sue,
and cannot even be introduced as evidence by a plaintiff.


A license is a document that states the conditions under which an
item may be used. A prerequisite of the licensor is that he/she/they
have a legal right to control the thing being licensed. When a licensed
item has its license modified by a party, not the original licensor,
it is quite possible that such attempts to control the item are
invalid (moot). Lawyers like this because it gives them work since
the final resolution of such an action can only be determined by a
court!


Wrong again.  A copyright license is a term in an otherwise valid
written, oral, or implied offer of contract, with certain limitations
of scope and certain conditions and covenants of return performance,
waiving the right to sue for the statutory tort of copyright
infringement.  Read Nimmer on Copyright, or follow the links in this
paragraph (another self-quotation from two years ago,
http://lists.debian.org/debian-legal/2005/01/msg00621.html):


Same difference, legally.  Non-exclusive license has a longer history
in patent cases than in copyright, and copyright cases frequently
point to patent cases as precedent.  The commonly cited Supreme Court
precedent that a non-exclusive patent license is "a mere waiver of the
right to sue" is a 1927 case (De Forest Radio Telephone v. United
States, http://laws.findlaw.com/us/273/236.html ), which in turn cites
Robinson on Patents -- so it was evidently already well established by
then, at least with respect to patents.  Everex Systems v Cadtrak (aka
in re CFLC) 1996, for instance, cites De Forest in concluding that
such a license constitutes significant continuing performance
(settling, as far as I am concerned, the question about whether GPL
release is a "one-shot" act with no continuing performance -- it's
not).  For an example that all this applies to copyright, see Jacob
Maxwell v. Veeck 1997 ( http://laws.findlaw.com/11th/962636opa.html ),
which brings in re CFLC over to the copyright arena.


Please do not bother to trot out Webster's definition or medieval uses
of the word "license", or the theory of unilateral license with regard
to trespass and third-party beneficiaries.  These are concepts
different from "license" as used in the phrase "non-exclusive
copyright license", and just happen to be spelled the same.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Serial related oops

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:

On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
> What we've seen on our embedded ARM is that enabling an interrupt that
> is shared between multiple UARTs, at a stage when you have not set up
> all the data structures touched by the ISR and softirq, can have
> horrible consequences, including soft lockups and fandangos on core.

Incorrect.  We have:

1. registered an interrupt handler at this point.
2. disabled interrupts (we're under the spin lock)


setup_irq() is where things go wrong, at least for us, at least on
2.6.16.x.  Interrupts are not disabled at the point in request_irq()
when the interrupt controller is poked to enable the IRQ source.  If
you're lucky, and you're on an architecture where the UART interrupt
is properly level-triggered, and the worst thing that happens when you
attempt to service an interrupt that isn't yours is that it stays on,
then you get a soft lockup with two or three recursive __irq_svc hits
in the backtrace.  If you're not lucky you do a fandango on core.


So, no interrupt will be seen by the CPU since the interrupt is masked.


The interrupt would need to be masked for the entire duration of the
outer loop that calls serial8250_init() or the equivalent for all
platform devices that share the IRQ.


The test is intentionally designed to be safe from the interrupt
generation point of view.


But its context is not.  Shared IRQ lines are a _problem_.  You cannot
safely enable an IRQ until all devices that share it have had their
ISRs installed, unless you can absolutely guarantee at a hardware
level that the uninitialized ones cannot assert the IRQ line.  That does
not apply to any device that might have been touched by the bootloader
or the early init code, especially a UART.
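
A sketch of the workaround shape, with made-up register and symbol
names (note that disable_irq() alone won't do here: on 2.6.16,
setup_irq() resets the disable depth when it installs the first
action, so the masking has to happen at the controller itself):

    /* Mask the shared UART line in the interrupt controller, install
     * every ISR that shares it, then unmask.  ICU_ENABLE_REG,
     * UART_IRQ_BIT, UART_IRQ, and the port array are invented names;
     * SA_SHIRQ is the 2.6.16-era shared-IRQ flag. */
    u32 en = readl(ICU_ENABLE_REG);

    writel(en & ~(1u << UART_IRQ_BIT), ICU_ENABLE_REG);   /* mask */
    for (i = 0; i < NR_UARTS; i++)
            if (request_irq(UART_IRQ, uart_isr, SA_SHIRQ,
                            "uart", &port[i]))
                    goto bail;
    writel(en | (1u << UART_IRQ_BIT), ICU_ENABLE_REG);    /* unmask */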

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Serial related oops

2007-02-19 Thread Michael K. Edwards

What we've seen on our embedded ARM is that enabling an interrupt that
is shared between multiple UARTs, at a stage when you have not set up
all the data structures touched by the ISR and softirq, can have
horrible consequences, including soft lockups and fandangos on core.
You will be vulnerable to this unless you lock out the interrupt
source (at the interrupt controller or, if you have to, globally)
across the UART registration process in your platform's
arch/mach-dependent core.c, in which case the TX irq test will of
course fail.  Roll-your-own SoC UARTs with bugs or "extended features"
in IRQ enabling and delivery make things worse.

I would love to see this disentangled in a maintainable way.  It's
such a nasty problem (especially given that bootloaders and early boot
code frequently turn on one or more UARTs and leave them in an unknown
state) that all we've been able to do so far is hack around it.  I'll
send an example patch when we've more or less isolated it, but it will
be of limited use to you unless you have the exact set of UART warpage
we do.
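
For flavor, the kind of quieting the hack boils down to, as a minimal
sketch assuming a standard 16550 register layout (offsets from
linux/serial_reg.h; foo_silence_uart and uart_base are made up):

    #include <linux/init.h>
    #include <linux/serial_reg.h>
    #include <asm/io.h>

    /* Shut up an 8250-class UART the bootloader left chattering,
     * before any ISR for it exists.  Reading the status registers
     * drains the sources a level-triggered line would re-assert. */
    static void __init foo_silence_uart(void __iomem *uart_base)
    {
            writeb(0, uart_base + UART_IER);   /* no interrupt sources */
            (void)readb(uart_base + UART_LSR); /* clear line status    */
            (void)readb(uart_base + UART_RX);  /* pop a receive byte
                                                * (loop on LSR_DR for
                                                * a deep FIFO)         */
            (void)readb(uart_base + UART_IIR); /* ack pending THRI     */
            (void)readb(uart_base + UART_MSR); /* clear modem status   */
    }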

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Alan <[EMAIL PROTECTED]> wrote:

> jurisdiction.  Copyright infringement is a statutory tort, and the
> only limits to contracting away the right to sue for this tort are
> those provided in the copyright statute itself.  A contract not to sue
> for tort is called a "license".)

I'd insert large quantities of "In the USA" in the above and probably
some "not what I've heard from a lawyer" cases.


Name ANY counterexample in the entire history of copyright, anywhere
in the world.  I've sifted through the past couple of decades of US
appellate history until I'm blue in the face, and reviewed Canadian
and British and German and French and EU statutes and decisions, and
read Nimmer on Copyright and Corbin on Contracts and historical
analyses of copyright law going back to the Statute of Anne.  And yes,
I too have conversed with attorneys and other individuals with legal
educations, in the US and Belgium and Brazil.

You have been lied to.  You have been hoodwinked.  You have neglected
to inform yourself about the simplest truths.  The GPL is not a
"copyright-based license", anywhere in the developed world.  There.
Is.  No.  Such.  Thing.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-18 Thread Michael K. Edwards

Eensy weensy follow-up.  No screed.  Well, maybe just a _little_ screed.

On 2/18/07, Oleg Verych <[EMAIL PROTECTED]> wrote:

Ulrich Drepper is known to be against current FSF's position on glibc
licence changing.


Will that stop the FSF?  Will it stop Red Hat, MontaVista, or
CodeSourcery?  Even if Ulrich tells the FSF to stuff it where the sun
don't shine, and hosts his new fork on Google Code along with the rest
of the GNU corpus, the moment he or you or any of us wants to run any
binary that was compiled against a newer FSF glibc, we're screwed.
All it takes is for that binary to have a post-v3-switch symbol in its
link table (which can easily happen _without_ any change in the
application source code).

It's all very well to say we can live without anything that ships as a
binary, but commercial applications atop glibc are a big part of how
Linux left the hobbyist ghetto, and I don't want to go back there.  Do
you?  Or do you really want to spend the rest of your precious time on
Earth shimming clean-room reverse-engineered crap into
glibc-GPLv2-fork to keep Oracle running on a distro that Moglen
doesn't have by the 'nads?  Not that it really matters whether you're
using Oracle or MySQL -- an RDBMS, like any racing thoroughbred, goes
lame if you look at it funny, and once you've gotten burned once in
production by the unreproducibility of the golden bits, you won't try
it again.

The existence of GNU-free userlands is nice for us embedded folk but
ultimately useless for desktops and servers.  Talk to the people who
have spent untold person-years scrubbing bashisms out of Debian's
/etc/init.d.  Talk to the people who have ported their
industrial-scale multithreaded apps from linuxthreads (with its
annoying non-POSIX behaviors) to NPTL (with its equally annoying POSIX
behaviors).  Talk to the people struggling to package Gnome and KDE
for anything other than glibc, libstdc++, and (worst of all) ld.so.2.
Portability ain't all it's cracked up to be.

If you think "embrace, extend, destroy" is tattooed on Ballmer's
forehead, you should take a good look at an RMS word-portrait
sometime.  And imagine a nice private chat with the
if-you-can't-beat-'em-join-'em club -- from IBM and HP to Apple, Wind
River, and Sun.  Say what you will about Steve Jobs, he's a survivor
-- but the ex-CEOs of all the rest will give you an earful about being
shot out of the saddle by the "free software" mafia and replaced with
people who know how to do a deal with the devil and call it a
come-to-Jesus moment in the press release.

Intel is keeping mum -- they've made an industry out of playing both
ends against the middle, and they've got a compiler that can more or
less do Linux anyway, so they don't really care.  Google doesn't care
either -- they've got more cash than they can spend and can afford to
fork internally and go their merry way.  The only heavyweight that had
refused to get on board until very recently was Oracle.  Not just
because Larry Ellison likes to fly solo, either -- read
http://www.internetnews.com/dev-news/article.php/3614721.  Then read
http://www.internetnews.com/dev-news/article.php/3655261.  Larry, too,
is a survivor.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 05/11] syslets: core code

2007-02-18 Thread Michael K. Edwards

On 2/18/07, Davide Libenzi  wrote:

Clets would execute in userspace, like signal handlers,


or like "event handlers" in cooperative multitasking environments
without the Unix baggage


but under the special schedule() handler.


or, better yet, as the next tasklet in the chain after the softirq
dispatcher, since I/Os almost always unblock as a result of something
that happens in an ISR or softirq


In that way chains happens by the mean of
natural C code, and access to userspace variables happen by the mean of
natural C code too (not with special syscalls to manipulate userspace
memory).


yep.  That way you can exploit this nice hardware block called an MMU.


I'm not a big fan of chains of syscalls for the reasons I
already explained,


to a kernel programmer, all userspace programs are chains of syscalls.  :-)


but at least clets (or whatever name) has a way lower
cost for the programmer (easier to code than atom chains),


except you still have the 80% of the code that is half-assed exception
handling using overloaded semantics on function return values and a
thread-local errno, which is totally unsafe with fibrils, syslets,
clets, and giblets, since none of them promise to run continuations in
the same thread context as the submission.  Of course you aren't going
to use errno as such, but that means that async-ifying code isn't
s/syscall/aio_syscall/, it's a complete rewrite.  If you're going to
design a new AIO interface, please model it after the only standard
that has ever made deeply pipelined, massively parallel execution
programmer-friendly -- IEEE 754.
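
To make the errno half of that concrete (everything here is invented
for the example; no such completion API is being proposed):

    #include <errno.h>

    /* A completion that runs in whatever thread context the kernel
     * borrowed.  Writing errno here updates the *borrowed* thread's
     * errno; the submitter's errno never changes, so its classic
     * "if (ret == -1) inspect errno" path reads stale garbage. */
    static void continuation(long ret)
    {
            if (ret < 0)
                    errno = -ret;   /* invisible to the submitter */
    }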


and for the kernel (no need of all that atom handling stuff,


you still need this, but it has to be centered on a data structure
that makes request throttling, dynamic reprioritization, and bulk
cancellation practical


no need of limited cond/jump interpreters in the kernel,


you still need this, for efficient handling of speculative execution,
pipeline stalls, and exception propagation, but it's invisible to the
interface and you don't have to invent it up front


and no need of nightmare compat code).


Compat code, yes; nightmare, no.  Just like kernel FP emulation on any
processor other than an x86.  Unimplemented instruction traps.  x86 is
so utterly the wrong architecture on which to prototype this it isn't
even funny.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-17 Thread Michael K. Edwards

On second thought, let's not deconstruct this.  It's too much work,
and it's a waste of time.  Because if you can't read "anything other
people wrote is fair game, but what we write is sacred; our strategy
is to cajole when we can and strong-arm when we can't, and the law be
damned" into that, no amount of verbiage from me is going to change
your mind.


Er, that would be, "cajole when we must and strong-arm whenever we
can".  That didn't stay on the floor long enough to pick up any germs;
doesn't count.  :-)

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-17 Thread Michael K. Edwards

This screed is the last that I am going to pollute LKML with, at least
for a while.  I'll write again if and when I have source code to
contribute, and if my off-topic vitriol renders my technical
contributions (if and when) unwelcome, I'll understand.  FSF
skulduggery is not very relevant to the _engineering_ of the kernel,
but it is (or ought to be) relevant to people's beliefs about whether
EXPORT_SYMBOL_GPL is the right thing to do.

On 2/17/07, Trent Waddington <[EMAIL PROTECTED]> wrote:

On 2/18/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote:
> If you can
> read that and still tolerate the stench of the FSF's argument that
> linking against readline means they 0wn your source code, you have a
> stronger stomach than I.

Such a strange attitude.. to go to all this effort to quote carefully
and correctly one set of people and to then total misconstrue the
words of another.


Dammit, the world at large has been far too nice to these people for
far too long.  There's no sin in people getting rich and/or famous,
but how they did it deserves some scrutiny.  The FSF is leading
thousands of idealistic young people all over the world up the garden
path by bullshitting about the nature of US law, and if Moglen at
least isn't making bank doing it, it isn't for lack of trying.  For
starters, read http://lists.debian.org/debian-legal/2005/07/msg00531.html
(another self-reference that doesn't need flowing in).  Google around
for his letters of opinion (not pro bono, I assure you) to Fluendo and
Vidomi.  How do they smell to you?

The GPL is big money, folks; the FSF General Counsel's designing "GPL
circumvention" schemes for other people's software -- and estopping
away his ability to contest them in court, one letter of opinion (and
one hefty lump-sum fee) at a time -- isn't a tenth of it.  Ask the
OSDL, who (I am very sorry to report) funneled $4M of certain hardware
makers' money to the SFLC to bankroll the expansion of Moglen's
protection racket.  Yes, protection racket.  What other phrase could
possibly describe the Software Freedom Conservancy?

Speaking of protection rackets, how about Moglen's plaintive comments,
back in the day, to the effect of (not a direct quote):  "The
conditions of the GPL can't touch Red Hat's new trademark policy,
subscription agreement, and ISV support lock-ins because they aren't
about copyright.  The GPL, as we all know, is a creature of copyright
law.  Even though the GPL says plainly, 'You must cause any work that
you distribute or publish, that in whole or in part contains or is
derived from the Program or any part thereof, to be licensed as a
whole at no charge to all third parties under the terms of this
License', that can't possibly be taken as an 'entire agreement'
clause, because the GPL isn't an offer of contract.  Don't like having
to pay per seat for RHEL?  Sorry, we can't help you."

This would be the same Red Hat that bought Cygnus Solutions in 1999 at
an estimated price of 674 MILLION dollars (in stock, of course).  That
made some individual Cygnus stockholders rich; see
http://www.salon.com/tech/feature/1999/11/18/red_hat/print.html.  Want
to bet Moglen held a Cygnus share or two, along with (via the FSF)
control over all the source code Cygnus had ever produced?  How does
it smell now?

There's yet more money in the toolchain oligopoly that Moglen and
Redmond tacitly share, and in embedded targets generally.  (Have you
ever asked yourself how the XCode and Tornado IDEs happened?  Have you
ever tried to obtain their source code?  Now do you understand why the
FSF fetishizes the fork/exec boundary?)  Redmond is not the FSF's
enemy; the phantom menace called "software patents" is, because they
protect other software makers from the FSF's volunteer army of reverse
engineers.  All those crocodile tears over TiVo, just because they
forked GCC, put the lower layers of MPEG into silicon, and wrote their
own damn DRM in RTL, instead of toeing Moglen's line on "give me naked
ripped media or give me death".  (No, I don't have first-hand
knowledge of any of these dealings; do your own research if you want
the sordid details.)

The FSF doesn't bother with kernels.  The HURD (nee Alix) is RMS's pet
project, and there are some charming young fellows who truly believe
it's going to win time trials and cure cancer someday, but I doubt
anyone in the know ever put cash money into it.  Some kernels are
better than others but once they work they're pretty interchangeable
for desktop and low-grade network server workloads.  They do, however,
take more engineering skill than the world's most prolific cloner of
other people's interfaces (RMS, in case that isn't blindingly obvious)
has ever been able to muster.  (RMS's skill exceed

Re: GPL vs non-GPL device drivers

2007-02-17 Thread Michael K. Edwards

On 2/17/07, Neil Brown <[EMAIL PROTECTED]> wrote:

Suppose someone created a work of fiction titled - for example -
"Picnic at Hanging Rock".  And suppose further that this someone left
some issues unresolved at the end of the story, leaving many readers
feeling that they wanted one more chapter to complete the story and
give them a sense of closure.

Suppose that a number of independent individuals wrote such a chapter
that in very different ways completed the story.

[snip]

They are derived works because they borrow the characters, the setting,
the theme, etc of the original work, and build on it.


Very well put.  That doctrine is sometimes known as "mise en scene",
and is every bit as applicable to software as to any other sort of
creative work.  When, that is, the software has characters, setting,
theme, etc.  See Micro Star v. Formgen (available anywhere Google hits
are sold).


In a similar way, people claim that any driver written for Linux will
inevitably borrow some creative content that is in Linux, via the
various interfaces that are used (and it is the nature of kernel
modules that the interface between the module and the kernel is quite
intimate).  And so, they claim that any driver written for Linux will
ipso-facto be a derived work.  The interface that ties the kernel and
the module together is certainly more intimate than the interface
between the Printer and the Toner in the Lexmark case.


Yes, people claim these things.  It's just that they're wrong.  Read
Lexmark.  Read the First Circuit opinion in Lotus v. Borland.  For
some really eye-opening dialogue, read the transcript of oral argument
before the Supreme Court in the Lotus v. Borland certiorari
proceeding.  For some long-winded but cogent discourse, read the
amicus curiae brief of the League for Programming Freedom in Lotus v.
Borland, submitted to the Supremes by one Eben Moglen.  If you can
read that and still tolerate the stench of the FSF's argument that
linking against readline means they 0wn your source code, you have a
stronger stomach than I.


Also, the "every practical way" point doesn't entirely apply.  In a
growing number of cases, it is possible to write a driver in
user-space.  This is apparently true for USB and is becoming true for
PCI.  And writing drivers as user-space programs is explicitly not a
derived work for the purposes of the Linux kernel license.


"Possible" doesn't mean "practical".  Compare Galoob and Micro Star,
Atari v. Nintendo and Sega v. Accolade.  There's a fine line, and
Judge Sutton walked up one side of it and down the other, and his
fellow panelists ably advocated drawing it either to the left or to
the right of where he had.  When the Supremes denied cert. -- in a
case where the appellate court had vacated and remanded to the
district court, meaning that they had to demonstrate that the lower
court had erred _as_a_matter_of_law_ -- they endorsed Judge Sutton's
reading of the record.  Lexmark is now settled law.
MODULE_LICENSE("GPL") on a binary-only turd is -- insofar as you can
demonstrate to the court of fact that it resembles the Lexmark fact
pattern, anywhere in the US -- as legal as an 8.5" x 14" pad of yellow
paper.  IANAL, TINLA.


So while that case sets an interesting precedent, I don't think it can
apply to the general issue of Linux kernel modules.


I mean this in the nicest possible way:  Think again.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-17 Thread Michael K. Edwards

On 2/17/07, Giuseppe Bilotta <[EMAIL PROTECTED]> wrote:

Which shows how that case is different from writing Linux drivers. For
example, looking at the example the OP was himself proposing a few
alternative approaches to work around the limitation they were hitting:
they could just switch to static major/minors instead of dynamic ones, they
could skip sysfs, or they could even reimplement something like sysfs
themselves, or whatever other interface they deem useful for the purpose of
plopping in their own binary blob on top of it, sort of like what nVidia
and ATi do for their stuff.


Or they could run:
    find . -type f -exec perl -i.bak -pe 's/EXPORT_SYMBOL_GPL/EXPORT_SYMBOL/g' {} +
and be done with it.  Or even just MODULE_LICENSE("GPL") in their
module -- that's not "lying about the module license", it's "doing the
minimum necessary in order to interoperate efficiently with the
kernel".  Atari v. Nintendo is still good law, but _only_ to the
extent that it does not conflict with Lexmark, which now has the seal
of Supreme Court approval.  And (IMHO, IANAL) if writing
MODULE_LICENSE("GPL") is obviously the only remotely efficient way to
achieve the goal of interoperation with the kernels that people
already have on their systems, through the documented, tested,
currently recommended APIs (like sysfs), then you have a Sega / Altai
/ Lexmark fact pattern, not an Atari v. Nintendo fact pattern.
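
(For the record, the declaration in question is one line of standard
module boilerplate:)

    #include <linux/module.h>

    MODULE_LICENSE("GPL");  /* what the kernel checks before resolving
                             * EXPORT_SYMBOL_GPL symbols for a module */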

So what's the penalty for MODULE_LICENSE("GPL") on code that is not
actually offered under the GPL?  Being shunned by the kernel
community.  Maintaining a fork.  Getting to keep both halves when it
breaks.  Friends don't let friends write non-GPL drivers.  But friends
also don't let friends go off into delusional spasms of denial.
nVidia and ATI do what they do so that their code has more than a
snowball's chance in hell of running on people's desktops, not out of
fear that the Big Bad LKML Wolf will come blow down their houses.
Their hardware is doubtless so fiddly and buggy and crash-prone that
four out of five attempts to compile a driver for it reorder the
instructions enough to slag the GPU, under Windows or Linux.  _That's_
why they ship binary drivers.  Capisce?

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-17 Thread Michael K. Edwards

On 2/17/07, Dave Neuer <[EMAIL PROTECTED]> wrote:

I think you are reading Lexmark wrong. First off, Lexmark ruled that
scenes a faire applied to the toner-level calculation, not "make a
toner cartridge that works with a particular Lexmark printer." It was
the toner-calculation algorithm that couldn't be done any other sane
way, which made the TLP unprotectable via copyright. The opinion says,
"Both prongs of the infringement test, in other words, consider
'copyrightability,' which at its heart turns on the principle that
copyright protection extends to expression, not to ideas."


David S. is reading Lexmark right (IMHO, IANAL).  The byte-for-byte
content of the TLP had to be copied in order to interoperate with the
printer, because the printer checksummed it (apparently using SHA-1,
but possibly truncating the result to 8 bits, which is rather comical;
it is not 100% clear to me based on the appellate decision, which also
seems to say that the printer cartridge contains the SHA-1 algorithm,
which is probably just an error).  That rendered it a "lock-out code"
within the sense of Sega v. Accolade, and ultimately that's why the
circuit court vacated the district court's decision in Lexmark's
favor.

In order to vacate and remand, the appellate court had to demonstrate
that the district court's grant of preliminary injunction to Lexmark
was wrong _as_a_matter_of_law_.  So they had to construe the facts of
the case in a light as favorable to Lexmark as humanly possible.  They
concluded that, _even_ if the TLP contained copyrightable expression,
and _even_ if all of the district court's reasoning about the other
prongs of a preliminary injunction test (potential for irreparable
harm, balance of harms, the public interest) were correct, neither the
Copyright Act nor the DMCA could be used to establish the fourth
prong: likelihood of success on the merits.

Cloners rejoice:  the US Supreme Court has, by denying certiorari on
Judge Sutton's opinion, given you carte blanche to copy and distribute
freely any software or firmware whose author has been so stupid as to
cryptographically checksum it as an anti-interoperability measure.
Using copyrighted software as a "lock-out code" to create a cause of
action against reverse engineers has the paradoxical effect of
rendering it uncopyrightable _as_a_matter_of_law_ in the US, unless
and until Congress or a later Supreme Court creates new law to the
contrary.  I am not a lawyer, this is not legal advice, on your own
head be it.


You're saying that there's no other way to interface device drivers to
an operating system than the current Linux driver model? That's
strange, since it's a different driver model than Linux had
previously, and it's also different from the BeOS driver interface,
etc. If the Linux driver interface is protectable, it doesn't seem
like scenes a faire applies.


The Linux driver interface is, as a matter of law, not copyrightable
in the U. S. of A., no matter how many EXPORT_SYMBOL_GPLs and dactylic
hexameters you adorn it with.  That was already true under Baker v.
Selden, and didn't get any less so as a result of Lotus v. Borland,
and is now inescapable (IMHO) under Lexmark, and it's not likely to
get any less true unless RMS is elected president and appoints Eben
Moglen to the Supreme Court.  Sorry, folks; I'm just the messenger.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-17 Thread Michael K. Edwards

On 2/17/07, Scott Preece <[EMAIL PROTECTED]> wrote:

Well, compilation is probably equivalent to "translation", which is
specifically included in the Act as forming a derivative work.


Nix.  "Translation" is something that humans do.  What's governed by
copyright is the creative expression contained in a work, and it makes
no difference whether it's source code or object code, RTL or silicon,
PDF or parchment.  There's a requirement of tangible fixation for
registration purposes, so you can't claim copyright on a story that's
in your head which you haven't written down.  But what's copyrightable
about a computer program is neither the "ideas and methods of
operation" nor the blob of bits (compiled or not); it's the
idiosyncrasies, the human touches, the things that would differ
between two equally skilled coders' ways of putting those ideas into
language.

A judge doesn't care whether a C compiler will spin silk purses or
spit chunks when fed this language, except insofar as duplicating
another coder's language (by trial and error or by blatant, arrant,
heartless enslavement of poor little bytes in ROM) is obligatory in
order to build a silk-purse-spinner.  You can't claim copyright on the
only way to accomplish some engineering purpose.  Even if that purpose
is to interoperate with, or even substitute for, someone else's
software or hardware in a way that destroys its marketability or turns
its author's moral imperatives into subjunctives.  Them's the breaks,
folks; if you don't like it, write poetry instead.  (And don't use it
as a passphrase for a printer cartridge.)

(You also can't claim copyright on something that isn't your work of
authorship, so you can't just write down someone else's sermon and go
obtain copyright registration on it.  Or rather, you can, but you will
lose when you try to sue someone else for infringing it, because
you've falsified the registration.  Under the Berne Convention, you
have also not spoiled the actual author's opportunity to write it
down, register her copyright, and sue you and anyone else for
infringing it.  People who thought they licensed it legitimately from
the ostensible copyright holder may have a defense of innocent
infringement, depending on whether the author can demonstrate
negligence according to the usual tort standards in the relevant
jurisdiction.  Copyright infringement is a statutory tort, and the
only limits to contracting away the right to sue for this tort are
those provided in the copyright statute itself.  A contract not to sue
for tort is called a "license".)

Again, I recommend the Lexmark v. Static Control decision to you.  It
references the major appellate decisions in this space from the late
70's forward, mostly from the 9th and 2nd Circuits.  The full text is
available from FindLaw; the few older decisions not available from
FindLaw are easily Googled.  Or you could just mine the debian-legal
archives for the links; sadly, 80% or more of the actual citations to
case law or statute in the debian-legal archives are in my
handwriting, so you can't take that as an independent source of
information.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-16 Thread Michael K. Edwards

On 2/15/07, Gene Heskett <[EMAIL PROTECTED]> wrote:
[ignorant silliness]

There is no one to my knowledge here, who would not cheer loudly once a
verdict was rendered because that courts decision would give the FOSS
community a quotable case law as to exactly what is, and is not legal for
you to do with GPL'd code.  We would, after 16+ years of the GPL,
finally have a firm, well-defined line drawn in the sand, a precedent
in US case law that at present only exists in Germany.


Oferchrissake.  We do have a US precedent, insofar as a decision in a
court of fact on issues of fact can ever be a precedent in a common
law system (hint: zero, unless the later judge feels like quoting some
compelling prose).  That would be Progress Software v. MySQL (also
known as MySQL v. NuSphere in some commentators' writings).  The FSF
interpretation of the GPL lost.  Completely.  Which is true also of
the Munich and Frankfurt decisions.

The plaintiffs, as authors of GPL works, got a full hearing in each
case -- via routine reasoning about the GPL as an offer of contract,
whose conditions either had (Progress Software) or had not
(Fortinet/Sitecom and D-Link) been performed to the extent necessary
for the defendant to claim license under the GPL.  MySQL did obtain a
preliminary injunction, but on unrelated trademark license grounds;
the GPL claim got them nowhere, for at least four distinct reasons
stated in the opinion.  Harald's recovery was limited to statutory
costs and, in the Munich case, an injunction to _either_ offer the
source code of netfilter/iptables itself _or_ stop shipping product.
Both German courts refused to find contract "in personam" (necessary
to a breach of contract claim, in turn necessary to a demand for
specific performance).

"GPL is a creature of copyright law" lost in court, every time.
"Section 4 is a limitation of scope, not a conditional performance"
lost.  "You can lose your license irrevocably" lost.  "We can compel
disclosure of source code with no alternative" lost.  "We can
circumvent contract law standards of breach and remedy" lost.
Everything RMS and Eben Moglen have ever written about the legal
meaning of the GPL is wrong, and where "derivative works" are
concerned, embarrassingly hypocritical as well.  Take the Big Lie
elsewhere, please!

- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-15 Thread Michael K. Edwards

On 2/15/07, Jeff Garzik <[EMAIL PROTECTED]> wrote:

Michael K. Edwards wrote:
> Bzzzt.  The whole point of the GPL is to "guarantee your freedom to
> share and change free software--to make sure the software is free for
> all its users."

No, that's the FSF marketing fluff you've been taught to recite.


 Didn't read far into that e-mail, did you?  That's a quote
from the GPL preamble, which isn't part of the offer of contract but
is arguably legally relevant context.  It's largely consistent with
the license as I read it.  It's also consistent with the origins of
the GPL, as you can read anywhere that folk history is sold.


In the context of the Linux kernel, I'm referring to the original reason
why Linus chose the GPL for the Linux kernel.


Can't say; wasn't there.  But I doubt that anyone truly "funnels
contributions back" to the kernel because of the letter of the GPL.
They do so because they think it will lower their costs, raise their
revenues, hedge their risks, earn goodwill from peers, enhance their
employability, stroke their egos, save the world, please their Author
and/or Linus, and so on and so forth.

The GPL tells us what is likely to happen in the event of a conflict
that gets as far as a courtroom.  I think the smart money's on
conflict avoidance (that's kind of ironic, isn't it), and failing
that, on ignoring the FSF and other armchair lawyers and reading the
law (spelled A-P-P-E-L-L-A-T-E D-E-C-I-S-I-O-N-S) for yourself.  That
is, after all, what judges have done and will continue to do, at least
until the Revolution comes.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-15 Thread Michael K. Edwards

On 2/15/07, Jeff Garzik <[EMAIL PROTECTED]> wrote:

v j wrote:
> So far I have heard nothing but, "if you don't contribute, screw you."
> All this is fine. Just say so. Make it black and white. Make it

It is black and white in copyright law and the GPL.

The /whole point/ of the GPL is to funnel contributions back.


Bzzzt.  The whole point of the GPL is to "guarantee your freedom to
share and change free software--to make sure the software is free for
all its users."  Undo the EXPORT_SYMBOL_GPL brain damage if you like;
it's not part of the offer of contract, which ties itself in knots to
avoid any measure of derivative-ness other than that enshrined in the
appropriate jurisdiction's copyright law (conformant to the Berne
Convention anywhere that matters).  Fork to your heart's content; if
it breaks, you get to keep both pieces.  Just remember that there is
also no severability clause in the offer of contract (very much the
contrary; see Section 7), so if the contract breaks, you do not
necessarily get to keep both pieces, and weird things happen when you
wander off into the vagaries of your jurisdiction's tort law and
commercial code.


We happen to think there are solid engineering reasons for this.
Customers think there are solid reasons for this. Libre activists think
there are freedom aspects to this.


There are no solid engineering reasons for demanding code that is
probably crap, tied to hardware you will never see, from people with
regard to whom you indulge in mutual hostility and distrust.  There
are, however, solid economies of scale in warranting the unwarrantable
and hand-holding at each stage of the skill hierarchy, and they depend
on keeping the interchangeable parts interchangeable.  The "IP asset"
delusion interferes with these economies of scale and annoys the
Morlocks who keep the gears turning.

99% of customers couldn't care less: show them good, fast, and cheap,
offer them two out of three, and they'll leave "good" on the table every
time.  Libre activists don't understand the meaning of the word
"freedom", at least not in the same way I do; freedom is spelled
R-U-L-E O-F L-A-W and they seem to have contempt for this concept.


But ignoring all the justifications, it /is/ the letter of the law.  And
you are expected to follow it, just like everybody else.


Fiddlesticks.  The letter of the law is very different from what you
have been led to believe.  Most law is created in the marketplace,
discovered in a courtroom, codified by a legislature, reconciled by
treaty and gunship, and enforced by insurers and other financial
institutions.  This kind of law considers the "free software"
philosophy to be a curiosity at best, and renders the GPL down to:

You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work ...

You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any part
thereof, to be licensed as a whole at no charge to all third parties
under the terms of this License.

... you must cause it ... to print or display an announcement
including an appropriate copyright notice and a notice that there is
no warranty (or else, saying that you provide a warranty) ...

... Accompany it with the complete corresponding machine-readable
source code ...

*** BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, *** THERE IS
NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE
LAW.

That's a straightforward license of modification and distribution
rights in return for attribution and warranty disclaimer, just like
any other "open source" license; except that it creates SEVERE
obstacles to commercial exploitation without cutting the originator
in, just like most "free as in beer" licenses.  The rest of the GPL is
composed of no-ops in any legal system that matters.  Don't take my
word for it; read the actual court decisions in which the GPL has come
up, and Nimmer on Copyright for the backstory.

IANAL, YMMV, HTH, HAND, Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 05/11] syslets: core code

2007-02-15 Thread Michael K. Edwards

On 2/15/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:

Would it make the interface less cool? Yeah. Would it limit it to just a
few linked system calls (to avoid memory allocation issues in the kernel)?
Yes again. But it would simplify a lot of the interface issues.


Only in toy applications.  Real userspace code that lives between
networks+disks and impatient humans is 80% exception handling,
logging, and diagnostics.  If you can't do any of that between stages
of an async syscall chain, you're fscked when it comes to performance
analysis (the "which 10% of the traffic do we not abort under
pressure" kind, not the "cut overhead by 50%" kind).  Not that real
userspace code could get any simpler by using this facility anyway,
since you can't jump the queue, cancel in bulk, or add cleanup hooks.

Efficiently interleaved execution of high-latency I/O chains would be
nice.  Low overhead for cache hits would be nicer.  But least for the
workloads that interest me, neither is anywhere near as important as
the ability to say, "This 10% (or 90%) of my requests are going to
take forever?  Nevermind -- but don't cancel the 1% I can't do
without."

This is not a scheduling problem, it is a caching problem.  Caches are
data structures, not thread pools.  Assume that you need to design for
dynamic reprioritization, speculative fetch, and opportunistic flush,
even if you don't implement them at first.  Above all, stay out of the
way when a synchronous request misses cache -- and when application
code decides that a bunch of its outstanding requests are no longer
interesting, take the hint!
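
Purely as a hedged sketch of what "caches are data structures"
implies for the interface (every name below is invented, not a
proposal for specific symbols):

    /* A batch is the unit of throttling, reprioritization, and bulk
     * cancellation; requests are cache entries, not captive threads. */
    struct aio_batch;

    struct aio_req {
            struct aio_batch *batch;  /* owning batch, for bulk ops */
            int prio;                 /* mutable while still queued */
            void (*done)(struct aio_req *req, long result);
    };

    int aio_reprioritize(struct aio_batch *b, int new_prio); /* queue-jump  */
    int aio_cancel_batch(struct aio_batch *b);               /* bulk cancel */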

Oh, and while you're at it: I'd like to program AIO facilities using a
C compiler with an explicitly parallel construct -- something along
the lines of:

try (my_aio_batch, initial_priority, ...) {
} catch {
} finally {
}

Naturally the compiler will know how to convert synchronous syscalls
to their asynchronous equivalent, will use an analogue of IEEE NaNs to
minimize the hits to the exception path, and won't let you call
functions that aren't annotated as safe in IO completion context.  I
would also like five acres in town and a pony.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPL vs non-GPL device drivers

2007-02-15 Thread Michael K. Edwards

I sympathize with you, VJ, I really do, but you're going to have to do
a lot more homework in order to understand what is happening to Linux
in embedded space.  Allow me to supply some worthless, non-lawyer,
non-measurable-kernel-contributor observations, which are at least
quite different from those of most people who will bother to respond
in this thread.  Let's start with: Don't believe everything you read
on the Internet.  That goes double for press releases and public
mailing lists.

On 2/14/07, v j <[EMAIL PROTECTED]> wrote:

This is in reference to the following thread:

http://lkml.org/lkml/2006/12/14/63

I am not sure if this has ever been addressed on LKML, but Linux is
_very_ popular in the embedded space. We (an embedded vendor) chose
Linux 3 years back because of its lack of a royalty model, its
robustness, and the availability of an infinite number of open-source
tools.


That's not why you chose Linux.  You chose Linux because it made your
lives easy in the short term and you didn't know or care what the
long-term consequences would be.  You didn't have to interact with any
human beings, cut any deals, make any business case, obtain any
executive approvals, sign any contracts, or forecast any per-seat,
per-unit, or per-incident costs.  You did have to put work into making
it work for you, but you chalked your salaries up to "investment" in
an "IP asset".  Now you're finding out that the difference between a
sunk cost and an asset is that an asset produces a revenue stream, and
that few of the people who actually work on the mainline Linux kernel
care whether or not your investment in building things with Linux
inside will continue to produce revenues.  Live and learn.


We recently decided to move to Linux 2.6 for our next product, mainly
because Linux has worked so well for us in the past, and we would like
to move up to keep up with the latest and greatest.


Don't kid a kidder.  2.6 was the latest and greatest three years ago.
Now it's the only game in town if you want anyone to talk to you
without a cash retainer up front.  Mind you, 2.4 still runs great on
anything it ran great on back then; but nowadays the sort of
application coders who work cheap are dead in the water without the
(relatively) inexpensive threads of NPTL/TLS.  Even the most un-hip
and non-showing-up-at-the-table of chip vendors realize that their
price/performance slides will look better under 2.6 unless they are
really, really memory constrained.  Sticking to 2.4 would require a
business case for carrying your own maintenance weight, and we've
already established that proactive analysis of business cases is not
your enterprise's strong point.


However, in moving to 2.6, we noticed a number of alarming things.
Porting drivers over from devfs to udev, though easy, raised a number
of alarming issues. Drivers could no longer dynamically allocate
their MAJOR/MINOR numbers. Doing so would mean they would have to use
sysfs. However, it seems that sysfs (and the class_ interface) is only
available to GPL modules. This is very concerning. The drivers which
we have written over the last three years are suddenly under threat.
We don't mind statically assigning MAJOR/MINOR numbers to our drivers.
We can do this and modify our user space applications too.


Given all the technical and cultural changes in the Linux ecosystem
since you last checked in, it's kinda funny you picked on device
numbering.  Do you really expect your customers to combine your
drivers with someone else's closed-source special sauce that happens
to pick the same major number?  Or if you're concerned about sharing a
major with other devices in the same conceptual class, do you really
think there's a legal risk in using a standardized, published,
arm's-length interface to interchangeable drivers, no matter what the
symbols are labeled?  Ask your lawyer to interpret Lotus v. Borland
and Lexmark v. Static Control for you, and to show you parallels to
the law in other markets that are relevant to you.  Then ask yourself
whether you really want to be calling EXPORT_SYMBOL_MOVING_TARGET
entrypoints from your out-of-tree drivers anyway.
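
For the record, the 2.6 idiom at issue looks roughly like the
following minimal sketch.  alloc_chrdev_region() is a plain
EXPORT_SYMBOL; class_create() and device_create() are the
EXPORT_SYMBOL_GPL entry points in question (and device_create()'s
signature has drifted across 2.6 releases, so treat this as shape,
not gospel):

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/err.h>

static dev_t mydev;
static struct cdev my_cdev;
static struct class *my_class;

static const struct file_operations my_fops = {
        .owner = THIS_MODULE,
};

static int __init my_init(void)
{
        /* dynamic major/minor: no static assignment needed */
        int err = alloc_chrdev_region(&mydev, 0, 1, "mydrv");
        if (err)
                return err;

        cdev_init(&my_cdev, &my_fops);
        err = cdev_add(&my_cdev, mydev, 1);
        if (err)
                goto out_region;

        /* class_create()/device_create() are GPL-only exports; this is
         * exactly the sysfs "class_ interface" complained about above.
         * udev sees the resulting uevent and creates /dev/mydrv0. */
        my_class = class_create(THIS_MODULE, "mydrv");
        if (IS_ERR(my_class)) {
                err = PTR_ERR(my_class);
                goto out_cdev;
        }
        device_create(my_class, NULL, mydev, "mydrv0");
        return 0;

out_cdev:
        cdev_del(&my_cdev);
out_region:
        unregister_chrdev_region(mydev, 1);
        return err;
}

static void __exit my_exit(void)
{
        device_destroy(my_class, mydev);
        class_destroy(my_class);
        cdev_del(&my_cdev);
        unregister_chrdev_region(mydev, 1);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");  /* without this, those symbols vanish */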


However, we have a worrying trend here. If at some point it becomes
illegal to load our modules into the Linux kernel, then that is
unacceptable to us. We would have been better off choosing VxWorks or
OSE 3 years ago when we made an OS choice. The fact that Linux is
becoming more and more closed is very, very alarming.


Illegal, shmillegal.  What it's becoming is pointless, because all
the kernel "features" that matter in a modern embedded system (bounded
preemption latency, effective power management, reliable I/O
peristalsis with predictable system impact) can't be localized to
"core code".  I can tell you from experience that they're next to
impossible to achieve when there's a big opaque lump of binary driver
stuck in the platform integrator's craw.  Fans of EXPORT_SYMBOL_GPL
may be perversely motivated, but they're telling you in advance what
a
