
Re: Kernel panic - r8169 on 2.6.11-rc1-mm1

2005-03-19 Thread Francois Romieu

Cameron Harris <[EMAIL PROTECTED]> :
[r8169 crash]
> Linux laptop 2.6.11-rc1-mm1 #2 SMP Sun Jan 16 22:36:26 GMT 2005 i686
   ^^
[...]
> I would try a newer kernel, but the command line options for
> specifying the framebuffer settings seems to have changed in the
> latest kernel and i haven't had enough time to work out how to specify
> it.

If you can not upgrade the kernel, I can not do anything for you since
several fixes went in after 2.6.11-rc1-mm1.

Regarding your r8169 issue, I suggest:
1) download linux kernel 2.6.12-rc1
2) apply on top of it:
   http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.11/r8169/patches/r8169-570.patch
3) avoid the proprietary tainting module

Please Cc: netdev@oss.sgi.com for issues related to network drivers.

--
Ueimor
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Kernel panic - r8169 on 2.6.11-rc1-mm1

2005-03-19 Thread Cameron Harris

Every time I try to use eth1, which uses the r8169 driver, I get a kernel
panic, but only when the interface is actually used, not when it is configured.

e.g.
laptop ~ # ifconfig eth1 up 192.168.1.1

laptop ~ # ping 192.168.1.2

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

Oops:  [#1]

Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq_oss
seq_midi_event snd_seq snd_seq_device irtty_sir sir_dev irda pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm snd_timer

snd snd_page_alloc wlan_wep fglrx sis_agp psmouse r8169 ath_pci
ath_rate_onoe wlan ath_hal

CPU:0

EIP:0060:[]   Tainted: P  VLI

EFLAGS: 00010206(2.6.11-rc1-mm1)

EIP is at rtl8169_start_xmit+0x55/0x2b0 [r8169]

eax: 003f   ebx: cf236140   ecx: cc9c6000   edx: 

esi: cf236240   edi: cfd9b280   ebp: cfd9b280   esp: c0685e00

ds: 007b   es: 007b   ss: 0068

Process swapper (pid: 0, threadinfo=c0684000 task=c05b6ba0)

Stack: c0107e50 cf236140 cf935080 cfd9b280  d14da000 cc9c6000 

   8000 cf236140 cf935080 cf236000 cfd9b280 c049141e cfd9b280 cf236000

     cf236000 cfd9b280 cf236140 c048387f cf236000 cf935080



 <0>Kernel panic - not syncing: Fatal exception in interrupt

I never had time to write down the whole stack trace (and no core file
was created).
This driver used to work in a previous kernel version (although it did
produce IRQ #x nobody cared messages, usually when my ethernet cable got
disconnected for whatever reason). The panic is always reproducible.

uname -a:
Linux laptop 2.6.11-rc1-mm1 #2 SMP Sun Jan 16 22:36:26 GMT 2005 i686
Intel(R) Pentium(R) 4 CPU 2.80GHz GenuineIntel GNU/Linux

I would try a newer kernel, but the command line options for
specifying the framebuffer settings seem to have changed in the
latest kernel and I haven't had enough time to work out how to specify
them.
-- 
Cameron Harris

Re: 2.6.11-rc1-mm1

2005-01-25 Thread Karim Yaghmour


Roman Zippel wrote:
> Ok, great.
> BTW I don't really expect the first version to be fully optimized (unless 
> you want to :) ), but once the basics are right, that can still be added 
> later.

Agreed. Tom will post updated patches sometime this week. I'll follow up
with the LTT stuff separately as agreed.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

[2.6.11-rc1-mm1] Strange MCE errors

2005-01-24 Thread Vincent Vanackere

Hi all,

since I replaced my Athlon XP 1800 with an Athlon XP 3000
(Barton/FSB333), my logs have been flooded with these warnings:

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 1: d4004152
MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 2: d400417a

(In case it matters, my motherboard is a Gigabyte 7VAXP-Ultra.
I've tried another RAM module, but the behaviour did not change at all.)

Apart from that, this system is running flawlessly, so I'm tempted to
just disable MCE in the kernel. But... I'd like to know if this is a
kernel mistake or if I really have some configuration/hardware
problem. I could not deduce anything meaningful from the parsemce
(version 0.09) utility.

Any help or advice would be appreciated...

Vincent

P.S.: here follows more information on this cpu
--
# cat /proc/cpuinfo
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 6
model   : 10
model name  : AMD Athlon(tm) XP 3000+
stepping: 0
cpu MHz : 2170.470
cache size  : 512 KB
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 1
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse pni syscall mmxext 3dnowext 3dnow
bogomips: 4292.60

# cpuid
 eax in    eax      ebx      ecx      edx
 0001 68747541 444d4163 69746e65
0001 06a0   0383fbff
8000 8008 68747541 444d4163 69746e65
8001 07a0   c1c3fbff
8002 20444d41 6c687441 74286e6f 5820296d
8003 30332050 002b3030  
8004    
8005 0408ff08 ff20ff10 40020140 40020140
8006  41004100 02008140 
8007    0001
8008 2022   

Vendor ID: "AuthenticAMD"; CPUID level 1

AMD-specific functions
Version 06a0:
Family: 6 Model: 10 [Duron/Athlon model 10]

Standard feature flags 0383fbff:
Floating Point Unit
Virtual Mode Extensions
Debugging Extensions
Page Size Extensions
Time Stamp Counter (with RDTSC and CR4 disable bit)
Model Specific Registers with RDMSR & WRMSR
PAE - Page Address Extensions
Machine Check Exception
COMPXCHG8B Instruction
APIC
SYSCALL/SYSRET or SYSENTER/SYSEXIT instructions
MTRR - Memory Type Range Registers
Global paging extension
Machine Check Architecture
Conditional Move Instruction
PAT - Page Attribute Table
PSE-36 - Page Size Extensions
MMX instructions
FXSAVE/FXRSTOR
25 - reserved
Generation: 7 Model: 10
Extended feature flags c1c3fbff:
Floating Point Unit
Virtual Mode Extensions
Debugging Extensions
Page Size Extensions
Time Stamp Counter (with RDTSC and CR4 disable bit)
Model Specific Registers with RDMSR & WRMSR
PAE - Page Address Extensions
Machine Check Exception
COMPXCHG8B Instruction
APIC
SYSCALL/SYSRET or SYSENTER/SYSEXIT instructions
MTRR - Memory Type Range Registers
Global paging extension
Machine Check Architecture
Conditional Move Instruction
PAT - Page Attribute Table
PSE-36 - Page Size Extensions
AMD MMX Instruction Extensions
MMX instructions
FXSAVE/FXRSTOR
3DNow! Instruction Extensions
3DNow instructions

Processor name string: AMD Athlon(tm) XP 3000+
L1 Cache Information:
2/4-MB Pages:
   Data TLB: associativity 4-way #entries 8
   Instruction TLB: associativity 255-way #entries 8
4-KB Pages:
   Data TLB: associativity 255-way #entries 32
   Instruction TLB: associativity 255-way #entries 16
L1 Data cache:
   size 64 KB associativity 2-way lines per tag 1 line size 64
L1 Instruction cache:
   size 64 KB associativity 2-way lines per tag 1 line size 64

L2 Cache Information:
2/4-MB Pages:
   Data TLB: associativity L2 off #entries 0
   Instruction TLB: associativity L2 off #entries 0
4-KB Pages:
   Data TLB: associativity Direct mapped #entries 0
   Instruction TLB: associativity Direct mapped #entries 0
   size 2 KB associativity L2 off lines per tag 129 line size 64

Advanced Power Management Feature Flags
Has temperature sensing diode
Maximum linear address: 32; maximum phys address 34

Re: 2.6.11-rc1-mm1

2005-01-23 Thread Roman Zippel

Hi,

On Sun, 23 Jan 2005, Karim Yaghmour wrote:

> But how does relayfs organize the namespace then? What if I have
> multiple channels per CPU, each for a different type of data, will
> all channels for the same CPU be under the same directory or will
> each type of data have its own directory with one entry per CPU?

I'd say the latter, you already do this for ltt.

> I don't have an answer to that, and I don't know that we should. Why
> not just leave it to the client to organize his data as he wishes.
> If we must assume that everyone will have at least one channel per
> CPU, then why not provide helper functions built on top of very
> basic functions instead of fixing the namespace in stone?

How simple do you want these helper functions to be? Isn't
something like relay_create(path, num_chan) simple enough?
I don't think a directory structure is that bad, as it allows adding
more control files to the relay stream and still leaves the option to write
out all buffers into one file.

> > I have to modify it a little (only the if (!buffer) part is new):
> > 
> > cpu = get_cpu();
> > buffer = relay_get_buffer(chan, cpu);
> > while(1) {
> > offset = local_add_return(buffer->offset, length);
> > if (likely(offset + length <= buffer->size))
> > break;
> > buffer = relay_switch_buffer(chan, buffer, offset);
> > if (!buffer) {
> > put_cpu();
> > return;
> > }
> > }
> > memcpy(buffer->data + offset, data, length);
> > put_cpu();
> > 
> > This has a very short fast path and I need very good reasons to change/add 
> > anything here. OTOH the slow path with relay_switch_buffer() is less 
> > critical and still leaves a lot of flexibility.
> 
> This is not good for any client that doesn't know beforehand the exact
> size of their data units, as in the case of LTT. If LTT has to use this
> code that means we are going to lose performance because we will need to
> fill an intermediate data structure which will only be used for relay_write().
> Instead of zero-copy, we would have an extra unnecessary copy. There has
> got to be a way for clients to directly reserve and write as they wish.

Ok, let's change it a little so it's more familiar. :)

void *relay_reserve(chan, length, cpu)
{
	buffer = relay_get_buffer(chan, cpu);
	while(1) {
		offset = local_add_return(buffer->offset, length);
		if (likely(offset + length <= buffer->size))
			return buffer->data + offset;
		buffer = relay_switch_buffer(chan, buffer, offset);
		if (!buffer)
			return NULL;
	}
}

All you have to do is to put it between get_cpu()/put_cpu().
The same is also possible as a macro, which allows you to jump directly out
of it to the failure code and avoid one test.
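
The fast path described above can be tried outside the kernel with a
small user-space model. This is only a sketch under stated assumptions:
every sketch_* name, the buffer size and the number of sub-buffers are
made up for illustration, C11 atomic_fetch_add() stands in for the
kernel's local add, and the buffer switch is deliberately left
unsynchronized because that part is the job of relay_switch_buffer().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_BUF_SIZE 64	/* hypothetical sub-buffer size */
#define SKETCH_NBUFS    2	/* hypothetical number of sub-buffers */

struct sketch_buf {
	atomic_size_t offset;	/* next free byte */
	size_t size;		/* capacity of data[] */
	unsigned char data[SKETCH_BUF_SIZE];
};

struct sketch_chan {
	struct sketch_buf bufs[SKETCH_NBUFS];
	size_t cur;		/* index of the active sub-buffer */
};

static void sketch_chan_init(struct sketch_chan *chan)
{
	for (size_t i = 0; i < SKETCH_NBUFS; i++) {
		atomic_init(&chan->bufs[i].offset, 0);
		chan->bufs[i].size = SKETCH_BUF_SIZE;
	}
	chan->cur = 0;
}

/* Slow path: move to the next sub-buffer, or NULL when none is left. */
static struct sketch_buf *sketch_switch_buffer(struct sketch_chan *chan)
{
	if (chan->cur + 1 >= SKETCH_NBUFS)
		return NULL;
	return &chan->bufs[++chan->cur];
}

/* Fast path: one atomic add and one comparison. */
static void *sketch_reserve(struct sketch_chan *chan, size_t length)
{
	struct sketch_buf *buf = &chan->bufs[chan->cur];

	assert(length <= SKETCH_BUF_SIZE);	/* stand-in for BUG_ON */
	for (;;) {
		/* atomic_fetch_add() returns the offset before the add */
		size_t offset = atomic_fetch_add(&buf->offset, length);

		if (offset + length <= buf->size)
			return buf->data + offset;
		buf = sketch_switch_buffer(chan);
		if (!buf)
			return NULL;
	}
}
```

A caller that gets a non-NULL pointer back owns exactly length bytes
and can fill them in place; a NULL return means the event has to be
dropped, which matches the early return in the relay_write() variant.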

> > Look closer, it's already interrupt safe, the synchronization for the 
> > buffer switch is left to relay_switch_buffer().
> 
> Sorry, I'm still missing something. What exactly does local_add_return()
> do? I assume this code has got to be interrupt safe? Something like:
> #define local_add_return(OFFSET, LEN) \
> do {\
> ...
>   local_irq_save(); \
>   OFFSET += LEN;
>   local_irq_restore(); \
> ...
> } while(0);
> 
> I'm assuming local_irq_XXX because we were told by quite a few people
> in the related thread to avoid atomic ops because they are more expensive
> on most CPUs than cli/sti.

That would be about the generic implementation, but it allows archs to 
provide more efficient implementations in , e.g. i386 can use 
xadd.
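
To make the "add and return" idea concrete, here is a user-space sketch.
It is an assumption-laden model, not the kernel code: sketch_local_t and
both helpers are hypothetical names, and C11 atomic_fetch_add() is used
in place of an irq-disabling generic version or an arch-specific xadd.
It also spells out one detail the pseudocode glosses over: an
add-and-return primitive that yields the new value needs the length
subtracted again to get the start of the reserved area.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a per-CPU counter type. */
typedef struct {
	atomic_long v;
} sketch_local_t;

/* Add i and return the NEW value as one indivisible read-modify-write. */
static long sketch_add_return(sketch_local_t *l, long i)
{
	return atomic_fetch_add(&l->v, i) + i;
}

/* Start offset of a reservation of len bytes: new value minus len. */
static long sketch_reserve_offset(sketch_local_t *l, long len)
{
	return sketch_add_return(l, len) - len;
}
```

Because the add is indivisible, an interrupt that arrives between two
reservations simply gets the next range; no two callers can be handed
overlapping offsets.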

> Also how does relay_get_buffer() operate?

#define relay_get_buffer(chan, cpu) chan->buffer[cpu]

> What if I'm writing an event
> from within a system call and I'm about to switch buffers and get
> an interrupt at the if(likely(...))? Isn't relay_get_buffer() going to
> return the same pointer as the one obtained for the syscall, and aren't
> both cases now going to effect relay_switch_buffer(), one of which will
> be superfluous?

The synchronization has to be done in relay_switch_buffer(), but catching
it there is still cheaper than in the fast path.

> > This adds a conditional and is not really needed. Above shows how to make 
> > it interrupt safe and if the clients wants to reuse the same buffer, leave 
> > the locking to the client.
> 
> Fine, but how is the client going to be able to reuse the same buffer if
> relayfs always assumes per-CPU buffer as you said above? This would be
> solved if at its core relayfs' functions worked on single channels and
> additional code provided helpers for making the SMP case very simple.

What do you mean? Why not make the SMP case simple (less to get wrong)? The
client can still serialize everything with a simple spinlock.

> > That's quite a lot of code with at least 14 conditions (or 13 conditions 
> > too much) and this is just

Re: 2.6.11-rc1-mm1

2005-01-23 Thread Karim Yaghmour


Karim Yaghmour wrote:
> This is not good for any client that doesn't know beforehand the exact
> size of their data units, as in the case of LTT. If LTT has to use this
> code that means we are going to lose performance because we will need to
> fill an intermediate data structure which will only be used for relay_write().
> Instead of zero-copy, we would have an extra unnecessary copy. There has
> got to be a way for clients to directly reserve and write as they wish.
> Even Zach Brown recognized this in his tracepipe proposal, here's from
> his patch:
> + *   - let caller reserve space and get a pointer into buf

Also, if the reserve is exported, then a client that chooses so, can
do something like:

local_irq_save();
relay_reserve();
write(); write(); write(); ...
local_irq_restore();

And therefore enforce in-order events if he so chooses.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-22 Thread Karim Yaghmour


Karim Yaghmour wrote:
> This is not good for any client that doesn't know beforehand the exact
> size of their data units, as in the case of LTT. If LTT has to use this
> code that means we are going to lose performance because we will need to
> fill an intermediate data structure which will only be used for relay_write().
> Instead of zero-copy, we would have an extra unnecessary copy. There has
> got to be a way for clients to directly reserve and write as they wish.
> Even Zach Brown recognized this in his tracepipe proposal, here's from
> his patch:
> + *   - let caller reserve space and get a pointer into buf

Actually, come to think of it, this code is not good for any client that
needs to fill complex data structures, whether they be fixed-size or not,
because it requires having a prepackaged structure already available.
Any client that wants to have zero-copying will want to write data
directly into the buffer instead of filling an intermediate buffer first.
And this requires being able to atomically reserve.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-22 Thread Karim Yaghmour

Hello Roman,

Roman Zippel wrote:
> Well, let's concentrate for a moment on the last thing and check later 
> if and how they fit into relayfs. Since ltt will be first main user, let's 
> optimize it for this.
> Also since relayfs is intended for large, fast data transfers, per cpu 
> buffers are pretty much always required, so it would make sense to leave 
> this to relayfs (less to get wrong for the client).

But how does relayfs organize the namespace then? What if I have
multiple channels per CPU, each for a different type of data, will
all channels for the same CPU be under the same directory or will
each type of data have its own directory with one entry per CPU?
I don't have an answer to that, and I don't know that we should. Why
not just leave it to the client to organize his data as he wishes.
If we must assume that everyone will have at least one channel per
CPU, then why not provide helper functions built on top of very
basic functions instead of fixing the namespace in stone?

> I have to modify it a little (only the if (!buffer) part is new):
> 
>   cpu = get_cpu();
>   buffer = relay_get_buffer(chan, cpu);
>   while(1) {
>   offset = local_add_return(buffer->offset, length);
>   if (likely(offset + length <= buffer->size))
>   break;
>   buffer = relay_switch_buffer(chan, buffer, offset);
>   if (!buffer) {
>   put_cpu();
>   return;
>   }
>   }
>   memcpy(buffer->data + offset, data, length);
>   put_cpu();
> 
> This has a very short fast path and I need very good reasons to change/add 
> anything here. OTOH the slow path with relay_switch_buffer() is less 
> critical and still leaves a lot of flexibility.

This is not good for any client that doesn't know beforehand the exact
size of their data units, as in the case of LTT. If LTT has to use this
code that means we are going to lose performance because we will need to
fill an intermediate data structure which will only be used for relay_write().
Instead of zero-copy, we would have an extra unnecessary copy. There has
got to be a way for clients to directly reserve and write as they wish.
Even Zach Brown recognized this in his tracepipe proposal, here's from
his patch:
+ * - let caller reserve space and get a pointer into buf

>>1) get_cpu() and put_cpu() won't do. You need to outright disable
>>interrupts because you may be called from an interrupt handler.
> 
> 
> Look closer, it's already interrupt safe, the synchronization for the 
> buffer switch is left to relay_switch_buffer().

Sorry, I'm still missing something. What exactly does local_add_return()
do? I assume this code has got to be interrupt safe? Something like:
#define local_add_return(OFFSET, LEN) \
do {\
...
local_irq_save(); \
OFFSET += LEN;
local_irq_restore(); \
...
} while(0);

I'm assuming local_irq_XXX because we were told by quite a few people
in the related thread to avoid atomic ops because they are more expensive
on most CPUs than cli/sti.

Also how does relay_get_buffer() operate? What if I'm writing an event
from within a system call and I'm about to switch buffers and get
an interrupt at the if(likely(...))? Isn't relay_get_buffer() going to
return the same pointer as the one obtained for the syscall, and aren't
both cases now going to effect relay_switch_buffer(), one of which will
be superfluous?

> This adds a conditional and is not really needed. Above shows how to make 
> it interrupt safe and if the clients wants to reuse the same buffer, leave 
> the locking to the client.

Fine, but how is the client going to be able to reuse the same buffer if
relayfs always assumes per-CPU buffer as you said above? This would be
solved if at its core relayfs' functions worked on single channels and
additional code provided helpers for making the SMP case very simple.

> That's quite a lot of code with at least 14 conditions (or 13 conditions 
> too much) and this is just relayfs.

I believe Tom has refactored the code with your comments in mind, and has
something ready for review. I just want to clear up the above before we
make this final. Among other things, he just dropped all modes, and there's
only a basic relay_write() that closely resembles what you have above.

> That's not always true, where performance matters we provide different
> functions (e.g. spinlocks), so having an alternative version of 
> relay_write is a possibility (although I'd like to see the user first).

Sure, see above in the case of LTT.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-21 Thread Roman Zippel

Hi,

On Fri, 21 Jan 2005, Karim Yaghmour wrote:

> I should have avoided earlier confusing the use of a certain type of
> relayfs channel for a given purpose (i.e. LTT should not necessarily
> depend on the managed mode.) I believe that there is a need for
> more than one mode in relayfs independently of LTT. There are users
> who want to be able to manage the data in a buffer (by manage I mean:
> receive notification of important buffer events, be able to insert
> important data at boundaries, etc.), and there are users who just
> want to dump as much information as possible in as fast a way as
> possible without having to deal with non-essential codepaths.

Well, let's concentrate for a moment on the last thing and check later
if and how they fit into relayfs. Since ltt will be the first main user, let's
optimize it for this.
Also since relayfs is intended for large, fast data transfers, per cpu 
buffers are pretty much always required, so it would make sense to leave 
this to relayfs (less to get wrong for the client).

> looking at this code:

I have to modify it a little (only the if (!buffer) part is new):

	cpu = get_cpu();
	buffer = relay_get_buffer(chan, cpu);
	while(1) {
		offset = local_add_return(buffer->offset, length);
		if (likely(offset + length <= buffer->size))
			break;
		buffer = relay_switch_buffer(chan, buffer, offset);
		if (!buffer) {
			put_cpu();
			return;
		}
	}
	memcpy(buffer->data + offset, data, length);
	put_cpu();

This has a very short fast path and I need very good reasons to change/add 
anything here. OTOH the slow path with relay_switch_buffer() is less 
critical and still leaves a lot of flexibility.

> 1) get_cpu() and put_cpu() won't do. You need to outright disable
> interrupts because you may be called from an interrupt handler.

Look closer, it's already interrupt safe, the synchronization for the 
buffer switch is left to relay_switch_buffer().

> 3) I'm unclear about the need for local_add_return(), why not
> just:
>   if (likely(buffer->offset + length <= buffer->size))
> In any case, here's what we do in relay_write():
>   write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting);

Ok, let's take a closer look at the fast path of relay_write (via 
relay_managed.c):

>   rchan_get(rchan);

This is not needed, it's the responsibility of the client to keep a 
reference to the channel. A synchronize_kernel() is enough to get rid of 
current users of the channel on other cpus.

>   relay_lock_channel(rchan, flags);

which becomes:

>   FLAGS = 0;
>   if (RCHAN->flags & RELAY_USAGE_SMP) local_irq_save(FLAGS);
>   else spin_lock_irqsave(&(RCHAN)->mode.managed.lock, FLAGS);

This adds a conditional and is not really needed. Above shows how to make 
it interrupt safe and if the clients wants to reuse the same buffer, leave 
the locking to the client.

>   write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting);

which becomes:

>   if (rchan == NULL) ...

Is this really needed?

>   if (slot_len >= rchan->buf_size) ...

You can leave it to the caller to check for this, a BUG_ON should be enough
here.

>   if (rchan->initialized == 0) ...

Does this really have to be in the fast path?

>   if (in_progress_event_size(rchan)) ...

What's the point of this? You already disable interrupts, so how can 
anything else be in progress?

>   if (cur_write_pos(rchan) + slot_len > write_limit(rchan)) ...

Ok. This leads to the slow path and is not interesting right now.

>   if (likely(write_pos != NULL)) {

After 7 conditions we finally have a valid write position (and that's 
without ltt).

>   relay_write_direct(write_pos, data_ptr, count);

If write_pos is just a normal memory pointer, why not also just use 
memcpy?

>   relay_commit(rchan, write_pos, count, reserve_code, interrupting);

which becomes:

>   if (rchan == NULL)
>   return;

Hopefully no comment needed.

>   if (interrupting) ...

Same comment as above for in_progress_event_size().

>   if (deliver) ...
>   ...
>   if (deliver &&  waitqueue_active(&rchan->mmap_read_wait))

Why is that hook needed here? Why can't this be done by the client?
A buffer switch notification can be done somewhere else.

>   relay_unlock_channel(rchan, flags);
>   rchan_put(rchan);

Same comment as above.

That's quite a lot of code with at least 14 conditions (or 13 conditions
too many) and this is just relayfs.

> The difference between these modes is akin to the
> difference between GFP_KERNEL, GFP_ATOMIC, GFP_USER, etc.: same API,
> different underlying functionality.

That's not always true, where performance matters we provide different
functions (e.g. spinlocks), so having an alternative version of 
relay_write is a possibility (although I'd like

Re: 2.6.11-rc1-mm1

2005-01-20 Thread Karim Yaghmour

OK, I'm finally getting around to answering this ...

Roman Zippel wrote:
> Sorry, you misunderstood me. At the moment I'm only secondarily
> interested in the API details, primarily I want to work out the details of 
> what exactly relayfs/ltt are supposed to do. One main question here I 
> can't answer yet, why you insist on multiple relayfs modes.

I should have avoided earlier confusing the use of a certain type of
relayfs channel for a given purpose (i.e. LTT should not necessarily
depend on the managed mode.) I believe that there is a need for
more than one mode in relayfs independently of LTT. There are users
who want to be able to manage the data in a buffer (by manage I mean:
receive notification of important buffer events, be able to insert
important data at boundaries, etc.), and there are users who just
want to dump as much information as possible in as fast a way as
possible without having to deal with non-essential codepaths.

> This is what I basically have in mind for the relay_write function:
> 
>   cpu = get_cpu();
>   buffer = relay_get_buffer(chan, cpu);
>   while(1) {
>   offset = local_add_return(buffer->offset, length);
>   if (likely(offset + length <= buffer->size))
>   break;
>   buffer = relay_switch_buffer(chan, buffer, offset);
>   }
>   memcpy(buffer->data + offset, data, length);
>   put_cpu();

looking at this code:

1) get_cpu() and put_cpu() won't do. You need to outright disable
interrupts because you may be called from an interrupt handler.

2) You assume that relayfs creates one buffer per cpu for each
channel. We think this is wrong. Relayfs should not need to care
about the number of CPUs, it's the clients' responsibility to
create as many channels as they see fit, whether it be one channel
per CPU or 10 channels per CPU or 1 channel per interrupt, etc.

3) I'm unclear about the need for local_add_return(), why not
just:
	if (likely(buffer->offset + length <= buffer->size))
In any case, here's what we do in relay_write():
write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting);
If there's any buffer switching required, that will be done in
relay_reserve. This has the added advantage that clients that
want to write directly to the buffer without using relay_write()
can do so by calling relay_reserve() and not care about required
buffer switching.

4) After securing the area, you simply go ahead and do a memcpy()
and leave. We think that this is insufficient. Here's what we
do:
	if (likely(write_pos != NULL)) {
		relay_write_direct(write_pos, data_ptr, count);
		relay_commit(rchan, write_pos, count, reserve_code,
			     interrupting);
		*wrote_pos = write_pos;
The relay_write_direct() is basically a memcpy(). We also do
a relay_commit(). This actually effects the delivery of the
event. If, for example, there had been a buffer switch at the
previous relay_reserve(), then this call to relay_commit() will
generate a call to the client's deliver() callback function.
In the case of LTT, for example, this is how it knows that it's
got to notify the user-space daemon that there are buffers to
consume (i.e. write to disk.)
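
The commit step in point 4 can be pictured with a small user-space
model. This is a sketch under assumptions, not the relayfs code: all
sketch_* names and struct fields are hypothetical, and the only
behaviour modelled is "a buffer switch noticed at reserve time causes
the client's deliver() callback to run at the next commit".

```c
#include <assert.h>
#include <stddef.h>

struct sketch_rchan;
typedef void (*sketch_deliver_fn)(struct sketch_rchan *chan, size_t full_buf);

struct sketch_rchan {
	size_t cur_buf;		/* sub-buffer currently being written */
	int switched;		/* set when reserve changed sub-buffers */
	unsigned delivered;	/* number of sub-buffers handed over */
	size_t last_full;	/* last sub-buffer seen by the client */
	sketch_deliver_fn deliver;	/* client callback, may be NULL */
};

/* Called from the reserve slow path when a sub-buffer fills up. */
static void sketch_mark_switch(struct sketch_rchan *chan)
{
	chan->cur_buf++;
	chan->switched = 1;
}

/* After the data is written, hand any finished sub-buffer to the client. */
static void sketch_commit(struct sketch_rchan *chan)
{
	if (!chan->switched)
		return;
	assert(chan->cur_buf > 0);	/* a switch always leaves one behind */
	chan->switched = 0;
	chan->delivered++;
	if (chan->deliver)
		chan->deliver(chan, chan->cur_buf - 1);
}

/* Example client: a tracer would wake its user-space daemon here. */
static void sketch_wake_daemon(struct sketch_rchan *chan, size_t full_buf)
{
	chan->last_full = full_buf;
}
```

The point of the split is that ordinary commits cost one test, while
the notification work only happens once per filled sub-buffer.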

> ltt_log_event should only be a few lines more (for writing header and 
> event data).

Actually no, you don't want ltt_log_event using relay_write(),
for one thing because it can generate variable-size events.
Instead, ltt_log_event does (basically):
	data_size = sizeof(event_id) + sizeof(time_delta) + sizeof(data_size);

	relay_lock_channel();
	relay_reserve();

	relay_write_direct(&event_id, sizeof(event_id));
	relay_write_direct(&time_delta, sizeof(time_delta));
	if (var_data) {
		relay_write_direct(var_data, var_data_len);
		data_size += var_data_len;
	}
	relay_write_direct(&data_size, sizeof(data_size));

	relay_commit();
	relay_unlock_channel();

> What I'd like to know now are the reasons why you need more than this.

I hope the above explanation clarifies things.
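
For readers who want to see the reserve-then-write-in-place pattern run,
here is a user-space sketch. It rests on several assumptions that are
not taken from the LTT sources: the sketch_* names, the field widths
(8-bit id, 32-bit time delta, 16-bit trailing size), the flat 256-byte
buffer, and the absence of any locking or buffer switching.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sketch_event_buf {
	unsigned char data[256];
	size_t offset;		/* next free byte */
};

/* Reserve len bytes; NULL when the buffer is full (no switching here). */
static unsigned char *sketch_ev_reserve(struct sketch_event_buf *b, size_t len)
{
	unsigned char *p;

	assert(b->offset <= sizeof(b->data));
	if (len > sizeof(b->data) - b->offset)
		return NULL;
	p = b->data + b->offset;
	b->offset += len;
	return p;
}

/* Copy one field into the reserved slot and advance the cursor. */
static unsigned char *sketch_put(unsigned char *pos, const void *src, size_t len)
{
	memcpy(pos, src, len);
	return pos + len;
}

/* Returns the total event size, or 0 if no space was available. */
static uint16_t sketch_log_event(struct sketch_event_buf *b, uint8_t event_id,
				 uint32_t time_delta,
				 const void *var_data, uint16_t var_len)
{
	uint16_t data_size;
	unsigned char *pos;

	if (!var_data)
		var_len = 0;
	data_size = (uint16_t)(sizeof(event_id) + sizeof(time_delta) +
			       sizeof(data_size) + var_len);
	pos = sketch_ev_reserve(b, data_size);
	if (!pos)
		return 0;

	/* Each field goes straight into the buffer, no staging struct. */
	pos = sketch_put(pos, &event_id, sizeof(event_id));
	pos = sketch_put(pos, &time_delta, sizeof(time_delta));
	if (var_len)
		pos = sketch_put(pos, var_data, var_len);
	sketch_put(pos, &data_size, sizeof(data_size));
	return data_size;
}
```

Unlike the pseudocode above, this version computes the full size before
reserving, since the model has no separate commit step that could grow
the slot afterwards.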

> It's not the amount of data and any timing requirements have to be done by 
> the caller. During processing you either take the events in the order they 
> were recorded (often that's good enough) or you sort them which is not 
> that difficult.

Ordering is a non-issue to be honest. Unless you've got some hardware
scope in there, it's almost impossible to pinpoint exactly when an
event occurred. There is no single line of code where an event occurs,
so it's all an educated guess anyway. You want things to resemble what
really happened as much as possible though.

> I know you don't want to touch the topic of kernel debugging, but its 
> requirements greatly overlap with what you want to do with ltt, e.g. one 
> needs very often information about scheduling events as many kernel 
> processes rely more and more on kernel threads. The only real

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-20 Thread Karim Yaghmour

Werner Almesberger wrote:
>  - if the probe target is an instruction long enough, replace it with
>a jump or call (that's what I think the kprobes folks are working
>on. I remember for sure that they were thinking about it.)

I heard about this years ago, but I don't know that anything came of
it. I suspect that this is not as simple as it looks and that the
only reliable way to do it is with a trap.

> Probably because everybody saw that it was good :-)

Great, thanks. That's what we'll aim for then. We've already got
the "disable" and "static" implemented, so now we need to figure
out how best to implement this tagging. IBM's kernel hooks
allowed the NOP solution, so I'm guessing it shouldn't be that
much of a stretch to extend it for marking up the code for kprobes
and friends. I don't know whether this code is still maintained or
not, but I'd like to hear input as to whether this is a good basis,
or whether you're thinking of something like your uml-sim hooks?

> So you need seeking, even in the presence of fine-grained control
> over what gets traced in the first place ? (As opposed to extracting
> the interesting data from the full trace, given that the latter
> shouldn't contain too much noise.)

The problem is that you don't necessarily know beforehand what
the problem is. So here's an actual example:

I had a client who had this box on which a task was always getting
picked up by the OOM killer. Try as they might, the development
team couldn't figure out which part of the code was causing this.
So we put LTT in there and in less than 5 minutes we found the
problem. It turned out that a user-space access to a memory-mapped
FPGA caused an unexpected FP interrupt to occur, and the application
found itself in a recursive signal handler. In this case there was
an application symptom, but it was a hardware problem.

This is just a simple example, but there are plenty of other
examples where a sysadmin will be experiencing some weird,
hard-to-reproduce bugs on some of his systems and he'll spend
a considerable amount of time trying to guess what's happening.
This is especially complicated when there's no indication as to
what's the root of the problem. So at that point being able to
log everything and being able to rapidly browse through it is
critical.

Once you've done such a first trace you _may_ _possibly_ be
able to refine your search requirements and relog with that in
mind, but that's after the fact.

> Or that they have been consumed. My question is just whether this
> kind of aggregation is something you need.

Absolutely. If you're thinking about short 100kb or MBs traces,
then a simpler scheme would be possible. But when we're talking
about GB and 100GBs spanning days, there's got to be a managed
way of doing it.

>>I have nothing against kprobes. People keep referring to it as if
>>it magically made all the related problems go away, and it doesn't.
> 
> 
> Yes, I know just too well :-) In umlsim, I have pretty much the
> same problems, and the solutions aren't always nice. So far, I've
> been lucky enough that I could almost always find a suitable
> function entry to abuse.

Glad you acknowledge as much.

> However, since a kprobes-based mechanism is - in the worst case,
> i.e. when needing markup - as good as direct calls to LTT, and gives
> you a lot more flexibility if things aren't quite as hostile, I
> think it makes sense to focus on such a solution.

You certainly have a lot more experience than I do with that, so
I'd like to solicit your help. As above: what's the best way to
provide this in addition to the static and disable points?

> Yup, but you could move even more intelligence outside the kernel.
> All you really need in the kernel is a place to put the probe,
> plus some debugging information to tell you where you find the
> data (the latter possibly combined with gently coercing the
> compiler to put it at some accessible place).

Right, but then you end up with a mechanism with generalized hooks.
Actually there was a time when LTT was a driver and you could
either build it as a module or keep it built-in. However, when
we published patches to get LTT accepted in 2.5 we were told on
LKML to move LTT into kernel/ and avoid all this driver stuff.
Having it, or parts of it, in the kernel makes it much simpler
and much more likely that the existing ad-hoc tracing code
spreading across the sources be removed in exchange for a
single agreed upon way of doing things.

It must be said that like I had done with relayfs, the LTT patch
will go through a major redux and I will post the patches for
review like before on LKML.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.htm

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-20 Thread Werner Almesberger

[ 3rd try. Apologies to Karim, Thomas, and Roman, who apparently also
  received my previous attempts. For some reason, one of my upstream
  DNS servers decided to send me highly bogus MX records. ]

Karim Yaghmour wrote:
> Might I add that this is part of the problem ... No personal
> offence intended, but there's been _A LOT_ of things said about
> LTT that were based on third-hand account and no direct contact
> with the toolset/code.

Sigh, yes, guilty as charged ...

At least today, I have a good excuse: my cable modem died, and I
couldn't possibly have downloaded things to look at :)

> > As far as kprobes go, then you still need to have some form or another
> > of marking the code for key events, unless you keep maintaining a set
> > of kprobes-able points separately, which really makes it unusable for
> > the rest of us, as the users of LTT have discovered over time (having
> > to create a new patch for every new kernel that comes out.)

Yes, I think you will need some set of "pads" in the code, where you
can attach probes. I'm not sure how many, though. An alternative, at
least in some cases, would be to move such things into separate
functions, so that you could put the probe just at function entry.
Then add a comment that this function isn't supposed to be torn
apart without dire need.

> > Generating new interrupts is simply unacceptable for LTT's functionality.

Absolutely. If I remember correctly, this is in the process of being
addressed in kprobes. You basically have the following choices:

 - if the probe target is an instruction long enough, replace it with
   a jump or call (that's what I think the kprobes folks are working
   on. I remember for sure that they were thinking about it.)
 - if the probe target is in a basic block with enough room after the
   target, see above (needs feedback from compiler or assembler)
 - if all else fails, add some NOPs (i.e. the marker approach)
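
The three choices above can be read as a small decision procedure. What
follows is only a minimal sketch of that reading, not anything taken from
kprobes itself: the function name, the strategy labels, and the 5-byte
threshold (the size of a near jmp/call with a 32-bit displacement on ia32)
are all assumptions made for illustration.

```python
# Hypothetical decision procedure for the three choices listed above.
# JUMP_LEN is an assumption: 5 bytes, the size of a near jmp/call with a
# 32-bit displacement on ia32. Names and labels are illustrative only.
JUMP_LEN = 5


def choose_probe_strategy(insn_len, spare_bytes_in_block):
    """Pick a way to attach a probe without taking a trap.

    insn_len: length in bytes of the instruction at the probe target.
    spare_bytes_in_block: bytes after the target that still belong to the
        same basic block (knowing this needs compiler/assembler feedback).
    """
    if insn_len >= JUMP_LEN:
        # The target instruction alone is long enough to hold a jump.
        return "replace-instruction"
    if insn_len + spare_bytes_in_block >= JUMP_LEN:
        # The jump fits if it may also cover following instructions
        # of the same basic block.
        return "replace-within-basic-block"
    # Nothing fits: the source has to carry NOP padding (a marker).
    return "nop-marker"


if __name__ == "__main__":
    for insn_len, spare in [(5, 0), (2, 3), (1, 0)]:
        print(insn_len, spare, choose_probe_strategy(insn_len, spare))
```

The point of the ordering is that the marker case is the fallback: it is
the only one that requires touching the source.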

> I have received very little feedback on this suggestion,

Probably because everybody saw that it was good :-)

> As for the location of ltt trace points, then they are very rarely
> at function boundaries. Here's a classic:
>   prepare_arch_switch(rq, next);
>   ltt_ev_schedchange(prev, next);
>   prev = context_switch(rq, prev, next);

Yes, in some cases, you don't have a choice but to add some marker.

> > Removing this data would require more data for each event to
> > be logged, and require parsing through the trace before reading it in
> > order to obtain markers allowing random access.

So you need seeking, even in the presence of fine-grained control
over what gets traced in the first place ? (As opposed to extracting
the interesting data from the full trace, given that the latter
shouldn't contain too much noise.)

> If I understand you correctly, you are talking about the fact that
> the transport layer's management of the buffers is synchronized
> with some user-space entity that consumes the buffers produced
> and talks back to relayfs (albeit indirectly) to let it know that
> said buffers are now available?

Or that they have been consumed. My question is just whether this
kind of aggregation is something you need.

> I have nothing against kprobes. People keep referring to it as if
> it magically made all the related problems go away, and it doesn't.

Yes, I know just too well :-) In umlsim, I have pretty much the
same problems, and the solutions aren't always nice. So far, I've
been lucky enough that I could almost always find a suitable
function entry to abuse.

However, since a kprobes-based mechanism is - in the worst case,
i.e. when needing markup - as good as direct calls to LTT, and gives
you a lot more flexibility if things aren't quite as hostile, I
think it makes sense to focus on such a solution.

> Nothing precludes us to move in this direction once something is
> in the kernel, it's all currently hidden away in a .h, and it would
> be the same with this.

Yup, but you could move even more intelligence outside the kernel.
All you really need in the kernel is a place to put the probe,
plus some debugging information to tell you where you find the
data (the latter possibly combined with gently coercing the
compiler to put it at some accessible place).

- Werner

-- 
  _
 / Werner Almesberger, Buenos Aires, Argentina [EMAIL PROTECTED] /
/_http://www.almesberger.net//

Re: 2.6.11-rc1-mm1

2005-01-19 Thread Barry K. Nathan

On Wed, Jan 19, 2005 at 11:06:10PM +0000, Marcos D. Marado Torres wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On Fri, 14 Jan 2005, Barry K. Nathan wrote:
> 
> >This isn't new to 2.6.11-rc1-mm1, but it has the infamous (to Fedora
> >users) "ACPI shutdown bug" -- poweroff hangs instead of actually turning
> >the computer off, on some computers. Here's the RH Bugzilla report where
> >most of the discussion took place:
> >
> >https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=132761
> 
> This is the same bug I've talked here:
> http://lkml.org/lkml/2005/1/11/88

FWIW the RH Bugzilla bug is (unfortunately) discussing several different
similar but not identical bugs, as far as I can tell.

> This only happens with -mm and not with vanilla sources.
> 
> I'm reporting about this issue in an ASUS M3N laptop with Debian.
> 
> Best regards,
> Mind Booster Noori

FWIW my report against -mm (where I narrowed it down to one of the kexec
patches in particular) is here:
http://bugme.osdl.org/show_bug.cgi?id=4041

-Barry K. Nathan <[EMAIL PROTECTED]>


Re: 2.6.11-rc1-mm1

2005-01-19 Thread Marcos D. Marado Torres

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Fri, 14 Jan 2005, Barry K. Nathan wrote:

> This isn't new to 2.6.11-rc1-mm1, but it has the infamous (to Fedora
> users) "ACPI shutdown bug" -- poweroff hangs instead of actually turning
> the computer off, on some computers. Here's the RH Bugzilla report where
> most of the discussion took place:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=132761

This is the same bug I've talked here:
http://lkml.org/lkml/2005/1/11/88

This only happens with -mm and not with vanilla sources.

I'm reporting about this issue in an ASUS M3N laptop with Debian.

Best regards,
Mind Booster Noori

> In the Fedora kernels it turned out to be due to kexec. I'll see if I
> can narrow it down further.
> -Barry K. Nathan <[EMAIL PROTECTED]>
- -- 
/* *** */
   Marcos Daniel Marado Torres	 AKA	Mind Booster Noori
   http://student.dei.uc.pt/~marado   -	  [EMAIL PROTECTED]
   () Join the ASCII ribbon campaign against html email, Microsoft
   /\ attachments and Software patents.   They endanger the World.
   Sign a petition against patents:  http://petition.eurolinux.org
/* *** */
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Made with pgp4pine 1.76

iD8DBQFB7ufzmNlq8m+oD34RAmsIAKDM55tzy957YqEXtNkz9l2O3O7V1ACeKXQB
v2LuSPMWch9A7NQApq6Bm8c=
=F7on
-END PGP SIGNATURE-

Re: 2.6.11-rc1-mm1 (and others): heavy disk I/O -> poor performance

2005-01-19 Thread Fabio Coatti

At 13:42 on Wednesday, 19 January 2005, bert hubert wrote:
> On Tue, Jan 18, 2005 at 10:39:35PM +0100, Fabio Coatti wrote:
> > vmstat under load is the following, and config.gz attached. Of course I
> > can provide any other needed detail; many thanks for any hint.
>
> Looks mightily like DMA is not on, even though you compiled the PIIX driver
> in, which lists
>
> > 0000:00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) IDE
> > Controller
>
> Can you show the output of hdparm /dev/hda ? Can you show dmesg?

Sure, here it is:

/dev/hda:
 multcount    = 16 (on)
 IO_support   =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    = 256 (on)
 geometry     = 65535/16/63, sectors = 60040544256, start = 0

I've cut dmesg down to the IDE-relevant part; please let me know if more
details are needed.

Jan 19 21:43:53 kefk Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
Jan 19 21:43:53 kefk ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Jan 19 21:43:53 kefk ICH5: IDE controller at PCI slot 0000:00:1f.1
Jan 19 21:43:53 kefk ACPI: PCI interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 169
Jan 19 21:43:53 kefk ICH5: chipset revision 2
Jan 19 21:43:53 kefk ICH5: not 100% native mode: will probe irqs later
Jan 19 21:43:53 kefk ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:pio
Jan 19 21:43:53 kefk ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio, hdd:pio
Jan 19 21:43:53 kefk Probing IDE interface ide0...
Jan 19 21:43:53 kefk hda: MAXTOR 6L060J3, ATA DISK drive
Jan 19 21:43:53 kefk ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Jan 19 21:43:53 kefk Probing IDE interface ide1...
Jan 19 21:43:53 kefk hdc: TEAC DV-W58G, ATAPI CD/DVD-ROM drive
Jan 19 21:43:53 kefk ide1 at 0x170-0x177,0x376 on irq 15
Jan 19 21:43:53 kefk Probing IDE interface ide2...
Jan 19 21:43:53 kefk ide2: Wait for ready failed before probe !
Jan 19 21:43:53 kefk Probing IDE interface ide3...
Jan 19 21:43:53 kefk ide3: Wait for ready failed before probe !
Jan 19 21:43:53 kefk Probing IDE interface ide4...
Jan 19 21:43:53 kefk ide4: Wait for ready failed before probe !
Jan 19 21:43:53 kefk Probing IDE interface ide5...
Jan 19 21:43:53 kefk ide5: Wait for ready failed before probe !
Jan 19 21:43:53 kefk hda: max request size: 128KiB
Jan 19 21:43:53 kefk hda: 117266688 sectors (60040 MB) w/1819KiB Cache, CHS=65535/16/63, UDMA(100)
Jan 19 21:43:53 kefk hda: cache flushes supported
Jan 19 21:43:53 kefk hda: hda1 hda2 < hda5 hda6 hda7 > hda3 hda4
Jan 19 21:43:53 kefk PCI: 0000:03:06.0 has unsupported PM cap regs version (1)
Jan 19 21:43:53 kefk ACPI: PCI interrupt 0000:03:06.0[A] -> GSI 22 (level, low) -> IRQ 177
Jan 19 21:43:53 kefk PCI: 0000:03:06.0 has unsupported PM cap regs version (1)
Jan 19 21:43:53 kefk ahc_pci:3:6:0: Host Adapter Bios disabled.  Using default SCSI device parameters
Jan 19 21:43:53 kefk scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
Jan 19 21:43:53 kefk
Jan 19 21:43:53 kefk aic7850: Single Channel A, SCSI Id=7, 3/253 SCBs
Jan 19 21:43:53 kefk
Jan 19 21:43:53 kefk Vendor: Nikon     Model: COOLSCANIII       Rev: 1.31
Jan 19 21:43:53 kefk Type:   Scanner                            ANSI SCSI revision: 02
Jan 19 21:43:53 kefk (scsi0:A:3): 10.000MB/s transfers (10.000MHz, offset 15)
Jan 19 21:43:53 kefk Vendor: PLEXTOR   Model: CD-ROM PX-40TS    Rev: 1.01
Jan 19 21:43:53 kefk Type:   CD-ROM                             ANSI SCSI revision: 02
Jan 19 21:43:53 kefk (scsi0:A:5): 10.000MB/s transfers (10.000MHz, offset 15)
Jan 19 21:43:53 kefk Vendor: YAMAHA    Model: CRW6416S          Rev: 1.0c
Jan 19 21:43:53 kefk Type:   CD-ROM                             ANSI SCSI revision: 02
Jan 19 21:43:53 kefk libata version 1.10 loaded.
Jan 19 21:43:53 kefk ata_piix version 1.03
Jan 19 21:43:53 kefk ACPI: PCI interrupt 0000:00:1f.2[A] -> GSI 18 (level, low) -> IRQ 169
Jan 19 21:43:53 kefk PCI: Setting latency timer of device 0000:00:1f.2 to 64
Jan 19 21:43:53 kefk ata1: SATA max UDMA/133 cmd 0xC000 ctl 0xC402 bmdma 0xD000 irq 169
Jan 19 21:43:53 kefk ata2: SATA max UDMA/133 cmd 0xC800 ctl 0xCC02 bmdma 0xD008 irq 169
Jan 19 21:43:53 kefk ata1: dev 0 cfg 49:2f00 82:7c6b 83:7f09 84:4003 85:7c69 86:3e01 87:4003 88:207f
Jan 19 21:43:53 kefk ata1: dev 0 ATA, max UDMA/133, 320173056 sectors: lba48
Jan 19 21:43:53 kefk ata1: dev 0 configured for UDMA/133
Jan 19 21:43:53 kefk scsi1 : ata_piix
Jan 19 21:43:53 kefk ata2: SATA port has no device.
Jan 19 21:43:53 kefk scsi2 : ata_piix
Jan 19 21:43:53 kefk Vendor: ATA       Model: Maxtor 6Y160M0    Rev: YAR5
Jan 19 21:43:53 kefk Type:   Direct-Access                      ANSI SCSI revision: 05
Jan 19 21:43:53 kefk SCSI device sda: 320173056 512-byte hdwr sectors (163929 MB)
Jan 19 21:43:53 kefk SCSI device sda: drive cache: write back
Jan 19 21:43:53 kefk SCSI device sda: 320173056 512-byte hdwr sectors (163929 MB)
Jan 19 21:43:53 kefk SCSI device

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-19 Thread Karim Yaghmour


Werner Almesberger wrote:
> From all I've heard and seen of LTT (and I have to admit that most
> of it comes from reading this thread, not from reading the code),

Might I add that this is part of the problem ... No personal
offence intended, but there's been _A LOT_ of things said about
LTT that were based on third-hand account and no direct contact
with the toolset/code. And part of the problem is that _many_
people on this list, and elsewhere, have done some form of
tracing or another as part of their development, so they all
have their idea of how this is best done. Yet, while such
experience can help provide additional ideas to LTT's development,
it also often requires re-explaining to every new suggestor why we
added features he couldn't imagine would be useful to any of
his/her own tracing needs ... Sometimes I wish my interests lay
in some arcane feature that few had ever played with ;)

IOW, while I don't discount anybody else's experience with tracing,
please give us at least the benefit of the doubt by actually:
a) Looking at the code
b) Looking at the mailing list archives
c) Asking us questions directly related to the code

> I have the impression that it may try to be a bit too specialized,
> and thus might miss opportunities for synergy. 

Bear with me on this one ...

> You must be getting tired of people trying to redesign things from
> scratch, but maybe you'll humor me anyway ;-)

Hey, from you Werner I'll take anything. It's always a pleasure
talking with you :)

> Karim Yaghmour wrote:
> 
>>If you really want to define layers, then there are actually four
>>layers:
>>1- hooking mechanism
>>2- event definition / registration
>>3- event management infrastructure
>>4- transport mechanism
> 
> 
> For 1, kprobes would seem largely sufficient. In cases where you
> don't have a usable attachment point (e.g. in the middle of a
> function and you need access to variables with unknown location),
> you can add lightweight instrumentation that arranges the code
> flow suitably. [1, 2]

Let me say outright, as I said to Andi early on in the sister thread,
that I have no problems with having the trace points being fed by
kprobes. In fact, in 2000, way back before kprobes even existed, LTT
was already interfacing with DProbes for dynamic insertion of trace
points.

... There I said it ... now watch me have to repeat this yet again
later on ... :/

However, kprobes is not magic:
a) Like I said to Andi:
> As far as kprobes go, then you still need to have some form or another
> of marking the code for key events, unless you keep maintaining a set
> of kprobes-able points separately, which really makes it unusable for
> the rest of us, as the users of LTT have discovered over time (having
> to create a new patch for every new kernel that comes out.)

b) Like I said to Andrew back in July:
> I've double-checked what I already knew about kprobes and have looked again
> at the site and the patch, and unless there's some feature of kprobes I don't
> know about that allows using something else than the debug interrupt to add
> hooks,
...
> Generating new interrupts is simply unacceptable for LTT's functionality.
> Not to mention that it breaks LTT because tracing something will generate
> events of its own, which will generate tracing events of their own ...
> recursion.

Ok, you can argue about the recursion thing with an "if()", but you'll
have to admit that like in the case I described to Roman:
> ... Say you're getting
> 2MB/s of data (which is not unrealistic on a loaded system.) That means
> that if I'm tracing for 2 days, I've got 345GB of data (~7.5GB/hour).
IOW, something like 200,000events/s (average of 10bytes/event). Do I
really need to explain that 200,000 traps/interrupts per second is
not something you want ... ?
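
The figures quoted above can be checked with a few lines of arithmetic.
This is a minimal sketch, assuming decimal units (1 MB = 10**6 bytes) and
the stated average of 10 bytes per event; the function names are
illustrative only.

```python
# Back-of-the-envelope check of the trace volume figures quoted above.
# Assumptions: decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes)
# and an average event size of 10 bytes, as stated in the message.
BYTES_PER_SECOND = 2 * 10**6
BYTES_PER_EVENT = 10
SECONDS_PER_HOUR = 3600


def trace_volume_gb(hours):
    """Trace data produced over the given number of hours, in GB."""
    return BYTES_PER_SECOND * SECONDS_PER_HOUR * hours / 10**9


def events_per_second():
    """Events logged per second at the assumed average event size."""
    return BYTES_PER_SECOND // BYTES_PER_EVENT


if __name__ == "__main__":
    print("GB per hour:      ", trace_volume_gb(1))
    print("GB over two days: ", trace_volume_gb(48))
    print("events per second:", events_per_second())
```

Under these assumptions two days come to roughly 345 GB, one hour to a
little over 7 GB, and the event rate to 200,000 per second, so if every
event cost a trap the system would be taking 200,000 traps per second.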

But don't despair, like I said to Andi:
> So lately I've been thinking that there may be a middle-ground here
> where everyone could be happy. Define three states for the hooks:
> disabled, static, marker. The third one just adds some info into
> System.map for allowing the automation of the insertion of kprobes
> hooks (though you would still need the debugging info to find the
> values of the variables that you want to log.) Hence, you get to
> choose which type of poison you prefer. For my part, I think the
> noop/early-check should be sufficient to get better performance from
> the existing hook-set.
I have received very little feedback on this suggestion, though I
really think it's worth entertaining, especially with your mention
of uml-sim markers further below.

As for the location of ltt trace points, then they are very rarely
at function boundaries. Here's a classic:
	prepare_arch_switch(rq, next);
	ltt_ev_schedchange(prev, next);
	prev = context_switch(rq, prev, next);

> 2 and 3 should be the main domain of LTT, with 2 sitting on top
> of kprobes. kprobes currently doesn't have a nice way for
> describing handlers, but that c

Re: 2.6.11-rc1-mm1

2005-01-19 Thread Tom Zanussi

Christoph Hellwig wrote:
> On Sun, Jan 16, 2005 at 01:05:19PM -0600, Tom Zanussi wrote:
>> One of the things that uses these functions to read from a channel
>> from within the kernel is the relayfs code that implements read(2), so
>> taking them away means you wouldn't be able to use read() on a relayfs
>> file.
>
> Removing them from the public API is different from disallowing the
> read operation.

Right, but we were planning on removing all that code in the interest of
stripping relayfs down to its bare minimum as a high-speed data
transfer mechanism.

>> That wouldn't matter for ltt since it mmaps the file, but there
>> are existing users of relayfs that do use relayfs this way.  In fact,
>> most of the bug reports I've gotten are from people using it in this
>> mode.  That doesn't mean though that it's necessarily the right thing
>> for relayfs or these users to be doing if they have suitable
>> alternatives for passing lower-volume messages in this way.  As others
>> have mentioned, that seems to be the major question - should relayfs
>> concentrate on being solely a high-speed data relay mechanism or
>> should it try to be more, as it currently is implemented?
>
> I'd say let it do one thing well, that is high-volume data transfer.

Yes, I think that's the one thing everyone's agreed on.

>> If the
>> former, then I wonder if you need a filesystem at all - all you have
>> is a collection of mmappable buffers and the only thing the filesystem
>> provides is the namespace.  Removing read()/write() and filesystem
>> support would of course greatly simplify the code; I'd like to hear
>> from any existing users though and see what they'd be missing.
>
> What else would manage the namespace?

I have to confess I haven't had the time to look at it in detail, but I
previously suggested that we might be able to recover the read()
operations by providing them in userspace on top of the mmapped relayfs
buffer, using FUSE.  If we did that, our FUSE filesystem could also
provide the namespace, I assume.

Anyway, I don't think I've seen any objections in principle to the 
filesystem part of relayfs, so maybe it's not an issue - any other 
suggestions would be welcome, of course...

Tom

Re: 2.6.11-rc1-mm1 (and others): heavy disk I/O -> poor performance

2005-01-19 Thread bert hubert

On Tue, Jan 18, 2005 at 10:39:35PM +0100, Fabio Coatti wrote:
> vmstat under load is the following, and config.gz attached. Of course I can 
> provide any other needed detail; many thanks for any hint.

Looks mightily like DMA is not on, even though you compiled the PIIX driver
in, which lists 
> 0000:00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) IDE 
> Controller 

Can you show the output of hdparm /dev/hda ? Can you show dmesg?


-- 
http://www.PowerDNS.com  Open source, database driven DNS Software 
http://lartc.org   Linux Advanced Routing & Traffic Control HOWTO

Re: 2.6.11-rc1-mm1

2005-01-19 Thread Christoph Hellwig

On Sun, Jan 16, 2005 at 01:05:19PM -0600, Tom Zanussi wrote:
> One of the things that uses these functions to read from a channel
> from within the kernel is the relayfs code that implements read(2), so
> taking them away means you wouldn't be able to use read() on a relayfs
> file.

Removing them from the public API is different from disallowing the
read operation.

> That wouldn't matter for ltt since it mmaps the file, but there
> are existing users of relayfs that do use relayfs this way.  In fact,
> most of the bug reports I've gotten are from people using it in this
> mode.  That doesn't mean though that it's necessarily the right thing
> for relayfs or these users to be doing if they have suitable
> alternatives for passing lower-volume messages in this way.  As others
> have mentioned, that seems to be the major question - should relayfs
> concentrate on being solely a high-speed data relay mechanism or
> should it try to be more, as it currently is implemented?

I'd say let it do one thing well, that is high-volume data transfer.

> If the
> former, then I wonder if you need a filesystem at all - all you have
> is a collection of mmappable buffers and the only thing the filesystem
> provides is the namespace.  Removing read()/write() and filesystem
> support would of course greatly simplify the code; I'd like to hear
> from any existing users though and see what they'd be missing.

What else would manage the namespace?


Re: 2.6.11-rc1-mm1

2005-01-19 Thread Christoph Hellwig

On Sun, Jan 16, 2005 at 02:30:33PM -0600, Tom Zanussi wrote:
> This would allow an application to write trace events of its own to a
> trace stream for instance.

I don't think this is a good idea.  Userspace could as well easily write
its trace into shared memory segments.

> Also, I added a user-requested 'feature'
> whereby write()s on a relayfs channel would be sent to a callback that
> could be used to interpret 'out-of-band' commands sent from the
> userspace application.

Now write as a control channel makes lots of sense, but I'd encapsulate
that differently.  Basically a net ctl file for each stream (and get
rid of ioctl in favour of this one while we're at it)


Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-18 Thread Werner Almesberger

From all I've heard and seen of LTT (and I have to admit that most
of it comes from reading this thread, not from reading the code),
I have the impression that it may try to be a bit too specialized,
and thus might miss opportunities for synergy. 

You must be getting tired of people trying to redesign things from
scratch, but maybe you'll humor me anyway ;-)

Karim Yaghmour wrote:
> If you really want to define layers, then there are actually four
> layers:
> 1- hooking mechanism
> 2- event definition / registration
> 3- event management infrastructure
> 4- transport mechanism

For 1, kprobes would seem largely sufficient. In cases where you
don't have a usable attachment point (e.g. in the middle of a
function and you need access to variables with unknown location),
you can add lightweight instrumentation that arranges the code
flow suitably. [1, 2]

2 and 3 should be the main domain of LTT, with 2 sitting on top
of kprobes. kprobes currently doesn't have a nice way for
describing handlers, but that can be fixed [3]. But you probably
don't need a "nice" interface right now, but might be satisfied
with one that works and is fast (?)

From the discussion, it seems that the management is partially
done by relayfs. I find this a little strange. E.g. instead of
filtering events, you may just not generate them in the first
place, e.g. by not placing a probe, or by filtering in LTT,
before submitting the event.

Timestamps may be fine either way. Restoring sequence should be
a task user-space can handle: in the worst case, you'd have to
read and merge from #cpus streams. Seeking works in that context,
too.
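
Restoring the global sequence in user space, as described above, amounts
to a k-way merge of per-CPU streams that are each already in timestamp
order. A minimal sketch, assuming timestamps are comparable across CPUs
and that an event is a (timestamp, cpu, payload) tuple; that layout and
the function name are hypothetical, not LTT's or relayfs's format.

```python
import heapq


def merge_cpu_streams(streams):
    """Merge per-CPU event streams into one stream in timestamp order.

    streams: one iterable per CPU, each already sorted by timestamp.
    Events are (timestamp, cpu, payload) tuples; this layout is an
    assumption made for the sketch.
    The merge is lazy, so it also works on streams too large for memory.
    """
    return heapq.merge(*streams, key=lambda event: event[0])


if __name__ == "__main__":
    cpu0 = [(1, 0, "sched"), (4, 0, "irq")]
    cpu1 = [(2, 1, "syscall"), (3, 1, "sched")]
    for event in merge_cpu_streams([cpu0, cpu1]):
        print(event)
```

Since only the head of each stream is looked at, memory use is
proportional to the number of CPUs rather than to the trace size.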

Last but not least, 4 should be simple. Particularly since you're
worried about extreme speeds, there should be as little
processing as you can afford. If you need to seek efficiently
(do you, really ?), you may not even want message boundaries at
that level.

Something that isn't entirely clear to me is if you also need to
aggregate information in buffers. E.g. by updating a record until
it has been retrieved by user space, or by updating a record
when there is no space to create a new one. Such functionality
would add complexity and needs tight synchronization with the
transport.

[1] I've seen the argument that kprobes aren't portable. This
strikes me as highly questionable. Even if an architecture
doesn't have a trap instruction (or equivalent code sequence)
that is at least as short as the shortest instruction, you
can always fall back to adding instrumentation [2]. Also, if
you know where your basic blocks are, you may be able to
use traps that span multiple instructions. I recall that
things of this kind are already planned for kprobes.

[2] See the "reliable markers" of umlsim from umlsim.sf.net.
Implementation: cd umlsim/lib; make; tail -50 markers_kernel.h
Examples: cd umlsim/sim/tests; cat sbug.marker
They're basically extra-light markup in the source code.
Works on ia32, but I haven't found a way to get the assembler
to cooperate for amd64, yet.

[3] I've already solved this problem in umlsim: there, I have a
Perl/C-like scripting language that allows handlers to do
pretty much anything they want. Of course, kprobes would
want pre-compiled C code, not some scripts, but I think the
design could be developed in a direction that would allow
both. Will take a while, but since I'll eventually have to
rewrite the "microcode" anyway, ...

So my comments are basically as follows:

1) kprobes seems like a suitable and elegant mechanism for
   placing all the hooks LTT needs, so I think that it would
   be better to build on this basis, and extend it where
   necessary, than to build yet another specialized variant
   in parallel.
2) LTT should do what it is good at, and not have to worry
   about the rest (i.e. supporting infrastructure).
3) relayfs should be lean and fast, as you intend it to be, so
   that non-LTT tracing or fnord debugging fnord code may find
   it useful, too.

- Werner

-- 
  _
 / Werner Almesberger, Buenos Aires, Argentina [EMAIL PROTECTED] /
/_http://www.almesberger.net//

2.6.11-rc1-mm1 (and others): heavy disk I/O -> poor performance

2005-01-18 Thread Fabio Coatti

Under heavy disk I/O, the system becomes very unresponsive (e.g. even a
drop-down menu takes several seconds to open).
I've noticed this under 2.6.11-rc1-mm1 and 2.6.10-mm2, but I can try whatever
version is suggested. It is quite simple to reproduce: I'm using Gentoo, and
when emerge --sync rebuilds its cache the system slows to a crawl; the same
behaviour can be seen during an updatedb run. In top, bdflush is often stuck
in "D" state, as is the I/O-bound process (say, emerge or updatedb).
vmstat output under load follows, and config.gz is attached. Of course I can
provide any other needed detail; many thanks for any hint.


[EMAIL PROTECTED] ~ $ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0    628   5252 499696 217712    0    0    19    14   80    60  3  1 95  1
 0  1    628  25764 498764 205384    0    0   444  1252 2121   943  7  6 48 39
 0  1    628  24412 498812 206628    0    0   596   948 2032  1634 11  5 58 27
 0  1    628  23584 498816 207372    0    0   380  2604 2045  1408  6  5 70 18
 0  1    628  23360 498816 207576    0    0    56  1528 1982   559  3  2 50 45
 0  1    628  22292 498820 208592    0    0   496   980 2092  1120 11  5 51 33
 0  1    628  20372 498856 210120    0    0   772  1504 2293  1621 21  9 49 21
 0  1    628  18964 498912 211356    0    0   620  1432 2170  1615 13  7 53 28
 0  1    628  18340 498920 211892    0    0   292  2924 2137   883  5  4 57 34
 0  0    628  17636 498956 212536    0    0   264   712 2018   954  5  3 65 28
 0  1    628  17316 498968 212796    0    0   148  1096 1983   607  2  3 51 44
 0  1    628  16356 499032 213548    0    0   416   952 2061  1417  7  3 58 32
 0  0    628  15708 499060 214132    0    0   256  1912 1993  1409  4  4 53 38
 1  0    628  14804 499068 214736    0    0   352  2644 2136  1475  7  4 72 16
 0  1    628  14548 499076 215136    0    0   196  1676 2046   526  4  2 49 45
 0  1    628  13972 499104 215856    0    0   384   816 2062  1033  9  4 51 37
 0  1    628  12916 499172 216808    0    0   504  1056 2135  1311 14  5 51 30
 0  1    628  12020 499236 217560    0    0   448  1044 2111  1280 17  5 51 27
 0  0    628  11380 499268 218072    0    0   256  2048 2039   838 10  4 62 24
 1  0    628  11060 499288 218392    0    0   156  2436 2043   832  7  4 83  5
 0  1    628  10612 499328 218692    0    0   124  2180 1899   442  5  2 50 44
 1  0    628  10292 499336 21????    0    0   104   368 1883   599  2  2 50 47
 0  1    628   8292 499384 220540    0    0   788  1536 2283  1524 18  8 49 27
 0  0    628   7652 499388 221080    0    0   276  2044 2039   796  5  4 69 22
 0  1    628   6948 499392 221688    0    0   288  2352 2086   783  6  4 52 38
 1  0    628   6308 499396 222228    0    0   256   356 2008   797  7  3 50 41
 1  0    628   5024 499404 223104    0    0   476  1012 2092   983 13  5 49 32
 0  1    628   9848 498300 223936    0    0   420  1096 2075  1243  8  4 53 34
 0  1    628   9344 498312 224400    0    0   236  3744 2097  1181  5  4 73 19

To be honest, I can't say when this started: I installed Gentoo, and first saw
the emerge --sync load, only with 2.6.10-mm2.

system: P4 IV 2.8/1Gb ram/i875p MB (abit IC7-g)
ide:  
hda: MAXTOR 6L060J3
hdc: TEAC DV-W58G 
scsi/Sata:
PLEXTOR CD-ROM PX-40TS  1.01
YAMAHA  CRW6416S1.0c
ATA Maxtor 6Y160M0  YAR5

lspci -v:
kefk ide # lspci -v
:00:00.0 Host bridge: Intel Corp. 82875P/E7210 Memory Controller Hub (rev 
02)
Subsystem: ABIT Computer Corp.: Unknown device 1014
Flags: bus master, fast devsel, latency 0
Memory at d000 (32-bit, prefetchable)
Capabilities: [e4] #09 [2106]
Capabilities: [a0] AGP version 3.0

:00:01.0 PCI bridge: Intel Corp. 82875P Processor to AGP Controller (rev 
02) (prog-if 00 [Normal decode])
Flags: bus master, 66Mhz, fast devsel, latency 64
Bus: primary=00, secondary=01, subordinate=01, sec-latency=32
Memory behind bridge: f000-f1ff
Prefetchable memory behind bridge: e800-efff

:00:03.0 PCI bridge: Intel Corp. 82875P/E7210 Processor to PCI to CSA 
Bridge (rev 02) (prog-if 00 [Normal decode])
Flags: bus master, 66Mhz, fast devsel, latency 32
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
I/O behind bridge: 9000-9fff
Memory behind bridge: f200-f20f
Expansion ROM at 9000 [disabled] [size=4K]

:00:1d.0 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI 
Controller #1 (rev 02) (prog-if 00 [UHCI])
Subsystem: ABIT Computer Corp.: Unknown device 1014
Flags: bus master, medium devsel, latency 0, IRQ 193
I/O ports at bc00 [size=32]

:00:1d.1 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI 
Controller

Re: 2.6.11-rc1-mm1

2005-01-18 Thread Tom Zanussi

Karim Yaghmour writes:
 > 
 > Tom Zanussi wrote:
 > > I have to disagree.  Awhile back, if you remember, I posted a patch to
 > > the LTT daemon that would monitor the trace stream in real time, and
 > > process it using an embedded Perl interpreter, no less:
 > > 
 > > http://marc.theaimsgroup.com/?l=linux-kernel&m=109405724500237&w=2
 > > 
 > > It didn't seem to have any problems keeping up with the trace stream
 > > even though it was monitoring all LTT event types (and a couple of
 > > others - custom events injected using kprobes) and not doing any
 > > filtering in the kernel, through kernel compiles, normal X traffic,
 > > etc.  I don't know what volume of event traffic would cause this model
 > > to break down, but I think it shows that at least some level of
 > > non-trivial live processing is possible...
 > 
 > Good Point.
 > 
 > My bad. Thanks for bringing this up. Obviously this didn't get as
 > much attention as it should've had the last time it was posted,
 > especially as it allows very easy scripting of filtering in userspace.
 > That email you refer to is pretty loaded and I'm sure those who
 > are interested will dig through it. But in the interest of helping
 > everyone get a rapid understanding of what it does and how it does it,
 > can you break it down in to a short description, possibly with a
 > diagram? I'm sure many will find this very interesting.

It's so simple it doesn't really deserve a diagram, which I'm pretty
bad at anyway...

Basically all it does is loop around the received buffer, reading each
event and sending it off to a handler.  In this case the handler
massages the data into a form that allows it to be passed to the Perl
interpreter as arguments to a Perl function that in turn acts as
callback handler in the Perl interpreter.

At that point, the Perl callback can do whatever it wants with the
data - save events matching a certain pid and discard everything else,
keep running counts or time totals e.g. total syscall counts for each
pid, function call tracing (if you dynamically instrumented function
call entry/exit with kprobes for example), etc, etc, etc.  Probably
even more useful is the ability to monitor the event stream looking
for sporadically occurring events, again under the control of the Perl
interpreter, so your criteria for deciding what an 'important event'
is can be arbitrarily complex and incorporate past history.  It also
means that you don't have to save anything at all to disk until you
detect your specified condition (which makes tracing for days or weeks
on end more practical), at which point you can dump out the currently
mapped buffer containing the last bufsize number of events most likely
to be of interest anyway.

Perl makes this kind of quick and dirty processing extremely easy and
it has a lot of powerful language features such as nested hashes built
in, which is why I chose it, but you could of course avoid the extra
layer and the interpreter and do your filtering in straight C, or
create a binding for any language you want.
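
As a rough illustration of the "straight C" variant of the loop described
above, here is a minimal sketch. The record layout (struct trace_event), the
event id EV_SYSCALL_ENTRY and all function names are assumptions made up for
this example; they are not the LTT trace format or the daemon's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header; the real LTT event layout differs. */
struct trace_event {
    uint16_t id;    /* event type, e.g. EV_SYSCALL_ENTRY */
    uint16_t len;   /* payload bytes following this header */
    uint32_t pid;   /* process the event belongs to */
};

enum { EV_SYSCALL_ENTRY = 1, MAX_PID = 1024 };

typedef void (*event_handler)(const struct trace_event *ev,
                              const unsigned char *payload, void *ctx);

/* Walk one received buffer and hand every complete event to the handler.
 * Returns the number of events dispatched. */
static size_t process_buffer(const unsigned char *buf, size_t size,
                             event_handler handler, void *ctx)
{
    size_t off = 0, n = 0;
    struct trace_event ev;

    assert(handler != NULL);
    while (size - off >= sizeof(ev)) {
        memcpy(&ev, buf + off, sizeof(ev));    /* avoid unaligned reads */
        if (ev.len > size - off - sizeof(ev))
            break;                              /* truncated record */
        handler(&ev, buf + off + sizeof(ev), ctx);
        off += sizeof(ev) + ev.len;
        n++;
    }
    return n;
}

/* Example handler: keep a running syscall count per pid. */
static void count_syscalls(const struct trace_event *ev,
                           const unsigned char *payload, void *ctx)
{
    unsigned long *counts = ctx;

    (void)payload;
    if (ev->id == EV_SYSCALL_ENTRY && ev->pid < MAX_PID)
        counts[ev->pid]++;
}
```

A handler that watches for a rare condition would follow the same shape: keep
whatever history it needs in ctx and only trigger a dump of the mapped buffer
once its criteria are met.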

IMHO being able to do most of the filtering in user space like this
opens up a lot of avenues for not only one-off problem determination
hacks, but a proliferation of more substantial tools, considering how
easy it is to put together applications using for instance the copious
number of Perl modules available.

Tom

 > 
 > Thanks,
 > 
 > Karim
 > -- 
 > Author, Speaker, Developer, Consultant
 > Pushing Embedded and Real-Time Linux Systems Beyond the Limits
 > http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

-- 
Regards,

Tom Zanussi <[EMAIL PROTECTED]>
IBM Linux Technology Center/RAS


Re: 2.6.11-rc1-mm1

2005-01-18 Thread Karim Yaghmour

Tom Zanussi wrote:
> I have to disagree.  Awhile back, if you remember, I posted a patch to
> the LTT daemon that would monitor the trace stream in real time, and
> process it using an embedded Perl interpreter, no less:
> 
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109405724500237&w=2
> 
> It didn't seem to have any problems keeping up with the trace stream
> even though it was monitoring all LTT event types (and a couple of
> others - custom events injected using kprobes) and not doing any
> filtering in the kernel, through kernel compiles, normal X traffic,
> etc.  I don't know what volume of event traffic would cause this model
> to break down, but I think it shows that at least some level of
> non-trivial live processing is possible...

Good Point.

My bad. Thanks for bringing this up. Obviously this didn't get as
much attention as it should've had the last time it was posted,
especially as it allows very easy scripting of filtering in userspace.
That email you refer to is pretty loaded and I'm sure those who
are interested will dig through it. But in the interest of helping
everyone get a rapid understanding of what it does and how it does it,
can you break it down in to a short description, possibly with a
diagram? I'm sure many will find this very interesting.

Thanks,

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-18 Thread Karim Yaghmour

Thomas,

Thomas Gleixner wrote:
> Yes, I did already start cleaning
> 
> cat ../broken-out/ltt* | patch -p1 -R

:D

If it gives you a warm and fuzzy feeling to have the last
cheap-shot, then I'm all for it, it is of no consequence anyway.
And _please_ don't forget to answer this very email with
something of the same substance.

For my part I consider that I've invested a substantial amount
of time in responding to both your conceptual and practical
feedback, as the archives clearly show.

That being said, I have to thank you for making sure that all
the obvious questions have been asked. I now have more than a
dozen archive links of my answers to those. They'll sure come in
handy when writing an FAQ.

Thanks again,

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-18 Thread Roman Zippel

Hi,

On Mon, 17 Jan 2005, Karim Yaghmour wrote:

> With that said, I hope we've agreed that we'll have a callback for
> letting relayfs clients know that they need to write the beginning of
> the buffer event. There won't be any associated reserve. Conversely,
> I hope it is not too much to ask to have an end-of-buffer callback.

There of course has to be some kind of end marker, but that's less 
critical as it's not the active buffer anymore.

> Roman, of all people I've been more than happy to change my stuff following
> your recommendations. Do I have to list how far down relayfs has been
> stripped down?

Sorry, you misunderstood me. At the moment I'm only secondarily 
interested in the API details, primarily I want to work out the details of 
what exactly relayfs/ltt are supposed to do. One main question here I 
can't answer yet, why you insist on multiple relayfs modes.
This is what I basically have in mind for the relay_write function:

        cpu = get_cpu();
        buffer = relay_get_buffer(chan, cpu);
        while(1) {
                offset = local_add_return(buffer->offset, length);
                if (likely(offset + length <= buffer->size))
                        break;
                buffer = relay_switch_buffer(chan, buffer, offset);
        }
        memcpy(buffer->data + offset, data, length);
        put_cpu();

ltt_log_event should only be a few lines more (for writing header and 
event data).
What I'd like to know now are the reasons why you need more than this.
It's not the amount of data and any timing requirements have to be done by 
the caller. During processing you either take the events in the order they 
were recorded (often that's good enough) or you sort them which is not 
that difficult.
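
For readers who want to experiment with the reserve-by-atomic-add idea outside
the kernel, here is a minimal userspace sketch using C11 atomics. It is an
illustration under assumptions, not the relayfs code: struct channel, struct
subbuf, channel_write() and switch_buffer() are hypothetical names, the sizes
are arbitrary, and the switch path assumes a single writer per channel (which
get_cpu() plus a per-CPU buffer would arrange in the kernel, interrupts aside).
Note that atomic_fetch_add() returns the value before the addition, so it can
be used directly as the start of the reserved slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 64
#define N_SUBBUFS   4

struct subbuf {
    atomic_size_t offset;            /* next free byte; may overshoot */
    unsigned char data[SUBBUF_SIZE];
};

struct channel {
    struct subbuf bufs[N_SUBBUFS];
    atomic_size_t cur;               /* index of the active sub-buffer */
};

static void channel_init(struct channel *ch)
{
    for (size_t i = 0; i < N_SUBBUFS; i++)
        atomic_init(&ch->bufs[i].offset, 0);
    atomic_init(&ch->cur, 0);
}

/* Advance to the next sub-buffer, overwriting the oldest data. */
static struct subbuf *switch_buffer(struct channel *ch)
{
    size_t next = (atomic_load(&ch->cur) + 1) % N_SUBBUFS;

    atomic_store(&ch->bufs[next].offset, 0);
    atomic_store(&ch->cur, next);
    return &ch->bufs[next];
}

/* Copy len bytes into the channel and return where they were stored. */
static unsigned char *channel_write(struct channel *ch,
                                    const void *data, size_t len)
{
    struct subbuf *b = &ch->bufs[atomic_load(&ch->cur)];
    size_t start;

    assert(len <= SUBBUF_SIZE);
    for (;;) {
        /* fetch_add returns the old offset: the start of our slot */
        start = atomic_fetch_add(&b->offset, len);
        if (start + len <= SUBBUF_SIZE)
            break;
        b = switch_buffer(ch);
    }
    memcpy(b->data + start, data, len);
    return b->data + start;
}
```

The fast path is one atomic add, one comparison and one memcpy; everything
else only runs when a sub-buffer fills up.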

> You ask what compromises can be found from both sides to obtain a
> single implementation. I have looked at this, and given how
> stripped down it has become, anything less from relayfs will make
> it useless for LTT. IOW, I would have to reimplement a buffering
> scheme within LTT outside of relayfs.

I know you don't want to touch the topic of kernel debugging, but its 
requirements greatly overlap with what you want to do with ltt, e.g. one 
needs very often information about scheduling events as many kernel 
processes rely more and more on kernel threads. The only real requirement 
for kernel debugging is low runtime overhead, which you certainly like to 
have as well. So what exactly are these requirements and why can't there
be a reasonable alternative?

bye, Roman

Re: [Lkst-develop] Re: 2.6.11-rc1-mm1

2005-01-18 Thread Masami Hiramatsu

Hi,

Andi Kleen wrote:
> On Tue, Jan 18, 2005 at 08:19:18PM +0900, Masami Hiramatsu wrote:
> > Hello,
> >
> > I'm a developer of yet another kernel tracer, LKST. I and co-developers
> > are very glad to hear that LTT was merged into -mm tree and to talk
> > about the kernel tracer on this ML. Because we think that the kernel
> > event tracer is useful to debug Linux systems, and to improve the kernel
> > reliability.
>
> I haven't looked at your code, but I would suggest you also post
> it for review so that it can be evaluated in the same way
> as other more noisy proposals.
>
> Perhaps Andrew can test both for some time in MM like he used
> to do for the various schedulers.

Thanks for your advice.
The latest release package of LKST based on linux-2.6.9 can be
downloaded from
http://sourceforge.net/projects/lkst/

I'll release the LKST based on the latest kernel as soon as possible.

Regards,
--
Masami HIRAMATSU
Hitachi, Ltd., Systems Development Laboratory
E-mail: [EMAIL PROTECTED]

Re: 2.6.11-rc1-mm1

2005-01-18 Thread Andi Kleen

On Tue, Jan 18, 2005 at 08:19:18PM +0900, Masami Hiramatsu wrote:
> Hello,
> 
> I'm a developer of yet another kernel tracer, LKST. I and co-developers 
> are very glad to hear that LTT was merged into -mm tree and to talk 
> about the kernel tracer on this ML. Because we think that the kernel 
> event tracer is useful to debug Linux systems, and to improve the kernel 
> reliability.

I haven't looked at your code, but I would suggest you also post
it for review so that it can be evaluated in the same way
as other more noisy proposals.

Perhaps Andrew can test both for some time in MM like he used
to do for the various schedulers.

-Andi

Re: 2.6.11-rc1-mm1

2005-01-18 Thread Masami Hiramatsu

Hello,
I’m a developer of yet another kernel tracer, LKST. I and co-developers 
are very glad to hear that LTT was merged into -mm tree and to talk 
about the kernel tracer on this ML. Because we think that the kernel 
event tracer is useful to debug Linux systems, and to improve the kernel 
reliability.

Andi Kleen wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> > - Added the Linux Trace Toolkit (and hence relayfs).  Mainly because I
> >   haven't yet taken as close a look at LTT as I should have.  Probably neither
> >   have you.
>
> I think it would be better to have a standard set of kprobes instead
> of all the ugly LTT hooks. kprobes could then log to relayfs or another
> fast logging mechanism.

I agree.
I’m interested in kprobes. Currently, LKST can switch off and on each
hook. But, even if a hook was disabled, there is a little overhead-time
(one conditional-jump instruction should be executed). I think
kprobes-based hooks can completely remove this overhead-time. Moreover,
kprobes-based hooks can be inserted dynamically into the code-point
specified by user. This feature is greatly useful for debugging. So, I
have an idea to renew LKST to kprobes-based hooks.
Also, I’m developing a prototype implementation.

> The problem relayfs has IMHO is that it is too complicated. It
> seems to either suffer from a overfull specification or second system
> effect. There are lots of different options to do everything,
> instead of a nice simple fast path that does one thing efficiently.
> IMHO before merging it should go through a diet and only keep
> the paths that are actually needed and dropping a lot of the current
> baggage.
>
> Preferably that would be only the fastest options (extremely simple
> per CPU buffer with inlined fast path that drop data on buffer overflow),
> with leaving out anything more complicated. My ideal is something
> like the old SGI ktrace which was an extremely simple mechanism
> to do lockless per CPU logging of binary data efficiently and
> reading that from a user daemon.

LKST’s logging buffer is (much) simpler than relayfs. It is just the
linked-perCPU-buffer.

If you are interested in this, please try LKST.
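
To make the "simple per-CPU buffer that drops data on overflow" idea quoted
above concrete, here is a small userspace sketch. It is neither SGI ktrace nor
the LKST buffer: struct cpu_log, log_record() and log_drain() are hypothetical
names, the caller passes the CPU number explicitly, and a kernel version would
additionally have to keep the writer on its CPU and deal with interrupts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS  4
#define BUF_SIZE 128

struct cpu_log {
    size_t used;                  /* bytes written so far */
    unsigned long dropped;        /* records lost to overflow */
    unsigned char data[BUF_SIZE];
};

static struct cpu_log logs[NR_CPUS];

/* Fast path: one bounds check and one copy. Returns 0 on success, or -1
 * if the record was dropped because this CPU's buffer is full. */
static inline int log_record(unsigned cpu, const void *rec, size_t len)
{
    struct cpu_log *l;

    assert(cpu < NR_CPUS);
    l = &logs[cpu];
    if (len > BUF_SIZE - l->used) {
        l->dropped++;
        return -1;
    }
    memcpy(l->data + l->used, rec, len);
    l->used += len;
    return 0;
}

/* Reader side: copy out what one CPU has logged and free that space. */
static size_t log_drain(unsigned cpu, unsigned char *out, size_t out_size)
{
    struct cpu_log *l;
    size_t n;

    assert(cpu < NR_CPUS);
    l = &logs[cpu];
    n = l->used < out_size ? l->used : out_size;
    memcpy(out, l->data, n);
    memmove(l->data, l->data + n, l->used - n);
    l->used -= n;
    return n;
}
```

Because each CPU only touches its own buffer, the write path needs no lock;
the price is that a slow reader shows up as a growing dropped counter rather
than as back-pressure on the code being traced.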
--
Masami HIRAMATSU
Hitachi, Ltd., Systems Development Laboratory
E-mail: [EMAIL PROTECTED]

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-18 Thread Thomas Gleixner

On Mon, 2005-01-17 at 18:57 -0500, Karim Yaghmour wrote: 
> Thomas Gleixner wrote:
> > If we add another hardwired implementation then we do not have said
> > benefits.
> 
> Please stop handwaving. Folks like Andrew, Christoph, Zwane, Roman,
> and others actually made specific requests for changes in the code.
> What makes you think you're so special that you think you are
> entitled to stay on the side and handwave about concepts.

So the points you added to your todo list which were brought up by me
are worthless ?

I'm not handwaving. I started this RFC to move the discussion into a
general discussion about instrumentation. A couple of people are
seriously interested to do this. If you are not interested then ignore
the thread, but you're way not in a position to tell me to shut up.

You turned this thread into your LTT prayer wheel.

Roman pointed out your unwillingness to create a common framework
before. But I have to disagree with him in one point. It's not amazing,
it's annoying.

> If there is a limitation with the code, please present actual
> snippets that need to be changed and suggest alternatives. That's
> what everyone else does on this list.

I pointed you to actually broken code and you accused me of throwing
mud.

> Save the bandwidth 

Please remove me from cc, it's a good start to save bandwidth.

> and start cleaning.

Yes, I did already start cleaning

cat ../broken-out/ltt* | patch -p1 -R

tglx


Re: 2.6.11-rc1-mm1

2005-01-17 Thread Tom Zanussi

Karim Yaghmour writes:
 > 
 > Aaron Cohen wrote:
 > >   I've got a quick question and I just want to be clear that it
 > > doesn't have a political agenda behind it.
 > 
 > :)
 > 
 > > Here goes, why can't LTT and/or relayfs, work similar to the way
 > > syslog does and just fill a buffer (aka ring-buffer or whatever is
 > > appropriate), while a userspace daemon of some kind periodically reads
 > > that buffer and massages it.  I'm probably being naive but if the
 > > difficulty is with huge several hundred-gig files, the daemon if it
 > > monitors the buffer often enough could stuff it into a database or
 > > whatever high-performance format you need.
 > 
 > Because of the bandwidth it is not possible to do any sort of live
 > processing of any kind. The only thing the daemon can possibly do
 > is write large blocks of tracing info to disk as rapidly as possible.

I have to disagree.  Awhile back, if you remember, I posted a patch to
the LTT daemon that would monitor the trace stream in real time, and
process it using an embedded Perl interpreter, no less:

http://marc.theaimsgroup.com/?l=linux-kernel&m=109405724500237&w=2

It didn't seem to have any problems keeping up with the trace stream
even though it was monitoring all LTT event types (and a couple of
others - custom events injected using kprobes) and not doing any
filtering in the kernel, through kernel compiles, normal X traffic,
etc.  I don't know what volume of event traffic would cause this model
to break down, but I think it shows that at least some level of
non-trivial live processing is possible...

Tom

 > 
 > >  It also seems to me that Linus' nascent "splice and tee" work would
 > > be really useful for something like this to avoid a lot of unnecessary
 > > copying by the userspace daemon.
 > 
 > There is no copying by the userspace daemon. All it does is open(),
 > then mmap(), and then it sleeps until it is woken up by the ltt
 > kernel subsystem. When that happens, it only does a write() on the
 > mmaped area, tells the ltt subsystem that it committed X number of
 > sub-buffers and goes back asleep. This is all zero-copy.
 > 
 > Karim
 > -- 
 > Author, Speaker, Developer, Consultant
 > Pushing Embedded and Real-Time Linux Systems Beyond the Limits
 > http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

-- 
Regards,

Tom Zanussi <[EMAIL PROTECTED]>
IBM Linux Technology Center/RAS


Re: 2.6.11-rc1-mm1

2005-01-17 Thread Karim Yaghmour

Aaron Cohen wrote:
>   I've got a quick question and I just want to be clear that it
> doesn't have a political agenda behind it.

:)

> Here goes, why can't LTT and/or relayfs, work similar to the way
> syslog does and just fill a buffer (aka ring-buffer or whatever is
> appropriate), while a userspace daemon of some kind periodically reads
> that buffer and massages it.  I'm probably being naive but if the
> difficulty is with huge several hundred-gig files, the daemon if it
> monitors the buffer often enough could stuff it into a database or
> whatever high-performance format you need.

Because of the bandwidth it is not possible to do any sort of live
processing of any kind. The only thing the daemon can possibly do
is write large blocks of tracing info to disk as rapidly as possible.

>  It also seems to me that Linus' nascent "splice and tee" work would
> be really useful for something like this to avoid a lot of unnecessary
> copying by the userspace daemon.

There is no copying by the userspace daemon. All it does is open(),
then mmap(), and then it sleeps until it is woken up by the ltt
kernel subsystem. When that happens, it only does a write() on the
mmaped area, tells the ltt subsystem that it committed X number of
sub-buffers and goes back asleep. This is all zero-copy.
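
The open()/mmap()/write() sequence described above can be sketched with plain
POSIX calls as follows. This is only an approximation under assumptions:
write_from_mapping() is a hypothetical helper, the channel is treated as an
ordinary file descriptor, and the kernel-specific parts (sleeping until the
ltt subsystem wakes the daemon, and reporting how many sub-buffers were
committed) are left out because their interface is not shown in this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map len bytes of the channel and write them to out_fd straight from the
 * mapping, without an intermediate userspace buffer. Returns the number of
 * bytes written, or -1 on error. */
static ssize_t write_from_mapping(int chan_fd, size_t len, int out_fd)
{
    const unsigned char *map;
    size_t done = 0;

    assert(len > 0);
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, chan_fd, 0);
    if (map == MAP_FAILED)
        return -1;

    while (done < len) {
        /* write() may be partial, so loop until everything is out */
        ssize_t n = write(out_fd, map + done, len - done);
        if (n < 0) {
            munmap((void *)map, len);
            return -1;
        }
        done += (size_t)n;
    }
    munmap((void *)map, len);
    return (ssize_t)done;
}
```

A long-running daemon would map the channel once at startup and keep the
mapping around, calling write() on the relevant sub-buffer range each time it
is woken up.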

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Aaron Cohen

Hi,
   I'm very much a newbie to all of this, but I'm finding this
discussion fairly interesting.

  I've got a quick question and I just want to be clear that it
doesn't have a political agenda behind it.
  
Here goes, why can't LTT and/or relayfs, work similar to the way
syslog does and just fill a buffer (aka ring-buffer or whatever is
appropriate), while a userspace daemon of some kind periodically reads
that buffer and massages it.  I'm probably being naive but if the
difficulty is with huge several hundred-gig files, the daemon if it
monitors the buffer often enough could stuff it into a database or
whatever high-performance format you need.

 It also seems to me that Linus' nascent "splice and tee" work would
be really useful for something like this to avoid a lot of unnecessary
copying by the userspace daemon.


On Mon, 17 Jan 2005 23:03:46 -0500, Karim Yaghmour <[EMAIL PROTECTED]> wrote:
> 
> Hello Roman,
> 
> Roman Zippel wrote:
> > Why is so important that it's at the start of the buffer? What's wrong
> > with a special event _near_ the start of a buffer?
> [snip]
> > What gives you the idea, that you can't do this with what I proposed?
> > You can still seek freely within the data at buffer boundaries and you
> > only have to search a little into the buffer to find the delimiter. Events
> > are not completely at random, so that the little reordering can be done at
> > runtime. Sorry, but I don't get what kind of unsolvable problems you see
> > here.
> 
> Actually I just checked the code and this is a non-issue. The callback
> can only be called when the condition is met, which itself happens only
> on buffer switch, which itself only happens when we try to reserve
> something bigger than what is left in the buffer. IOW, there is no need
> for reserving anything. Here's what the code does:
>         if (!finalizing) {
>                 bytes_written = rchan->callbacks->buffer_start ...
>                 cur_write_pos(rchan) += bytes_written;
>         }
> 
> With that said, I hope we've agreed that we'll have a callback for
> letting relayfs clients know that they need to write the beginning of
> the buffer event. There won't be any associated reserve. Conversely,
> I hope it is not too much to ask to have an end-of-buffer callback.
> 
> > Wrong question. What compromises can be made on both sides to create a
> > common simple framework? Your unwillingness to compromise a little on the
> > ltt requirements really amazes me.
> 
> Roman, of all people I've been more than happy to change my stuff following
> your recommendations. Do I have to list how far down relayfs has been
> stripped down? I mean, we got rid of the lockless scheme (which was
> one of ltt's explicit requirements), we got rid of the read/write capabilities
> for user-space, etc. And we are now only left with the bare-bones API:
> rchan* relay_open(channel_path, bufsize, nbufs, flags, *callbacks);
> int    relay_close(*rchan);
> int    relay_reset(*rchan);
> int    relay_write(*rchan, *data_ptr, count, **wrote-pos);
> 
> char*  relay_reserve(*rchan, len, *ts, *td, *err, *interrupting);
> void   relay_commit(*rchan, *from, len, reserve_code, interrupting);
> void   relay_buffers_consumed(*rchan, u32);
> 
> #define relay_write_direct(DEST, SRC, SIZE) \
> #define relay_lock_channel(RCHAN, FLAGS) \
> #define relay_unlock_channel(RCHAN, FLAGS) \
> 
> This is a far-cry from what we had before, have a look at the
> relayfs.txt file in 2.6.11-rc1-mm1's Documentation/filesystems if
> you want to compare. Please at least acknowledge as much.
> 
> I'm more than willing to compromise, but at least give me something
> substantive to feed on. I've explained why I believe there needs to be
> two modes for relayfs. If you don't think they are appropriate, then
> please explain why. Either my experience blinds me or it rightly
> compels me to continue defending it.
> 
> You ask what compromises can be found from both sides to obtain a
> single implementation. I have looked at this, and given how
> stripped down it has become, anything less from relayfs will make
> it useless for LTT. IOW, I would have to reimplement a buffering
> scheme within LTT outside of relayfs.
> 
> Can't you see that not all buffering schemes are adapted to all
> applications and that it's preferable to have a single API
> transparently providing separate mechanisms instead of a single
> mechanism that doesn't satisfy any of its users?
> 
> If I can't convince you of the concept, can I at least convince
> you to withhold your final judgement until you actually see the
> code f

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Karim Yaghmour

Hello Roman,

Roman Zippel wrote:
> Why is so important that it's at the start of the buffer? What's wrong 
> with a special event _near_ the start of a buffer?
[snip]
> What gives you the idea, that you can't do this with what I proposed?
> You can still seek freely within the data at buffer boundaries and you 
> only have to search a little into the buffer to find the delimiter. Events 
> are not completely at random, so that the little reordering can be done at 
> runtime. Sorry, but I don't get what kind of unsolvable problems you see 
> here.

Actually I just checked the code and this is a non-issue. The callback
can only be called when the condition is met, which itself happens only
on buffer switch, which itself only happens when we try to reserve
something bigger than what is left in the buffer. IOW, there is no need
for reserving anything. Here's what the code does:
        if (!finalizing) {
                bytes_written = rchan->callbacks->buffer_start ...
                cur_write_pos(rchan) += bytes_written;
        }

With that said, I hope we've agreed that we'll have a callback for
letting relayfs clients know that they need to write the beginning of
the buffer event. There won't be any associated reserve. Conversely,
I hope it is not too much to ask to have an end-of-buffer callback.
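
For illustration only, the start-of-buffer callback arrangement discussed
above might look like this in a self-contained userspace model. None of these
names come from relayfs: struct chan, struct chan_callbacks, chan_reserve()
and the 4-byte "STRT" marker are assumptions invented for the sketch, and the
buffer_end hook is modelled as a plain notification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 32

struct chan;

struct chan_callbacks {
    /* Writes a client-defined marker at pos and returns its size. */
    size_t (*buffer_start)(struct chan *c, unsigned char *pos);
    /* Optional: told how many bytes of the old sub-buffer were used. */
    void (*buffer_end)(struct chan *c, size_t used);
};

struct chan {
    unsigned char buf[2][SUBBUF_SIZE];
    int cur;                   /* active sub-buffer */
    size_t write_pos;          /* offset within the active sub-buffer */
    unsigned long switches;
    const struct chan_callbacks *cb;
};

static void chan_init(struct chan *c, const struct chan_callbacks *cb)
{
    memset(c, 0, sizeof(*c));
    c->cb = cb;
    c->write_pos = cb->buffer_start(c, c->buf[0]);
}

/* Only reached when a reservation no longer fits, so the client never
 * has to reserve space for its start-of-buffer event itself. */
static void chan_switch(struct chan *c, int finalizing)
{
    if (c->cb->buffer_end)
        c->cb->buffer_end(c, c->write_pos);
    c->cur ^= 1;
    c->write_pos = 0;
    c->switches++;
    if (!finalizing)
        c->write_pos += c->cb->buffer_start(c, c->buf[c->cur]);
}

static unsigned char *chan_reserve(struct chan *c, size_t len)
{
    unsigned char *slot;

    if (len > SUBBUF_SIZE - c->write_pos)
        chan_switch(c, 0);
    assert(len <= SUBBUF_SIZE - c->write_pos);
    slot = c->buf[c->cur] + c->write_pos;
    c->write_pos += len;
    return slot;
}

/* Example client callback: a fixed 4-byte start marker. */
static size_t write_start_marker(struct chan *c, unsigned char *pos)
{
    (void)c;
    memcpy(pos, "STRT", 4);
    return 4;
}
```

In this model every sub-buffer begins with the client's marker, so a reader
can seek to any sub-buffer boundary and find it immediately.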

> Wrong question. What compromises can be made on both sides to create a 
> common simple framework? Your unwillingness to compromise a little on the 
> ltt requirements really amazes me.

Roman, of all people I've been more than happy to change my stuff following
your recommendations. Do I have to list how far down relayfs has been
stripped down? I mean, we got rid of the lockless scheme (which was
one of ltt's explicit requirements), we got rid of the read/write capabilities
for user-space, etc. And we are now only left with the bare-bones API:
rchan* relay_open(channel_path, bufsize, nbufs, flags, *callbacks);
int    relay_close(*rchan);
int    relay_reset(*rchan);
int    relay_write(*rchan, *data_ptr, count, **wrote-pos);

char*  relay_reserve(*rchan, len, *ts, *td, *err, *interrupting);
void   relay_commit(*rchan, *from, len, reserve_code, interrupting);
void   relay_buffers_consumed(*rchan, u32);

#define relay_write_direct(DEST, SRC, SIZE) \
#define relay_lock_channel(RCHAN, FLAGS) \
#define relay_unlock_channel(RCHAN, FLAGS) \

This is a far-cry from what we had before, have a look at the
relayfs.txt file in 2.6.11-rc1-mm1's Documentation/filesystems if
you want to compare. Please at least acknowledge as much.

I'm more than willing to compromise, but at least give me something
substantive to feed on. I've explained why I believe there needs to be
two modes for relayfs. If you don't think they are appropriate, then
please explain why. Either my experience blinds me or it rightly
compels me to continue defending it.

You ask what compromises can be found from both sides to obtain a
single implementation. I have looked at this, and given how
stripped down it has become, anything less from relayfs will make
it useless for LTT. IOW, I would have to reimplement a buffering
scheme within LTT outside of relayfs.

Can't you see that not all buffering schemes are adapted to all
applications and that it's preferable to have a single API
transparently providing separate mechanisms instead of a single
mechanism that doesn't satisfy any of its users?

If I can't convince you of the concept, can I at least convince
you to withhold your final judgement until you actually see the
code for the managed vs. ad-hoc schemes?

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Karim Yaghmour


Thomas Gleixner wrote:
> Provide a hook, export it and load your filters as a module, but keep
> the filters out of the mainline kernel code. 

Great idea! I will do exactly that.

Thanks,

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Karim Yaghmour


Hello Roman,

Roman Zippel wrote:
> An additional comment about the order of events. What you're doing in 
> lockless_reserve is bogus anyway. There is no single correct time to 
> write into the event. By artificially synchronizing event order and event 
> time you only cheat yourself. You either take it into account during 
> postprocessing that events can be interrupted or the time stamp doesn't 
> seem to be that important, but there is nothing you can do during the 
> recording of the event except of completely disabling interrupts.

Correct and like I said before, we are dropping the lockless scheme.
Ergo, disabling interrupts we will.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Roman Zippel

Hi,

On Mon, 17 Jan 2005, Karim Yaghmour wrote:

> a) create indexes, b) reorder events, and likely c) have to rewrite

An additional comment about the order of events. What you're doing in 
lockless_reserve is bogus anyway. There is no single correct time to 
write into the event. By artificially synchronizing event order and event 
time you only cheat yourself. You either take it into account during 
postprocessing that events can be interrupted or the time stamp doesn't 
seem to be that important, but there is nothing you can do during the 
recording of the event except of completely disabling interrupts.

bye, Roman

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Daniel Drake

J.A. Magallon wrote:
> This does not patch against -mm1. -mm1 looks like a mix of plain 2.6.10
> and your code.
> Could you revamp it against -mm1, please ? I looked at it but seems out
> of my understanding...

My patch replaces the one in -mm1.
Just revert the waiting-10s-... patch that is in 2.6.11-rc1-mm1 using
patch -p1 -R, then apply the one I attached to the last mail normally.

I'll also be sending in a cleaner version of the patch shortly.

Daniel

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Thomas Gleixner

On Mon, 2005-01-17 at 18:41 -0500, Karim Yaghmour wrote:
> Thomas Gleixner wrote:
> > I know, what I have said. I said reduce the filtering to the absolute
> > minimum and do the rest in userspace.
> 
> You keep adopting the interpretation which best suits you, taking
> quotes out of context, and keep repeating things that have already
> been answered. There are limits to one's patience.

I said before: "Sorting out disabled events is the filtering you 
have to do in kernel and you should do it in the hot path or
remove the unnecessary tracepoints at compile time."

This is exactly what I stated above. I omitted the addon of "do the rest 
in userspace", as it was obvious enough.

> What you did is change your position twice. It's there for anyone to see.

Sorry, I didn't know that you are representing anyone.

> > The now builtin filters are defined to fit somebody's needs or idea of
> > what the user should / wants to see. They will not fit everybody's
> > needs / ideas. So we start modifying, adding and #ifdefing kernel
> > filters, which is a scary vision.
> 
> Ah, finally. Here's an actual suggestion. _IF_ you want, I'll just
> export a ltt_set_filter(*callback) and rewrite the if in
> _ltt_log_event() to:
> if ((ltt_filter != NULL) && !(   return -EINVAL;
> 
> You're always welcome to do the following from anywhere in your code:
> ltt_set_filter(NULL);

Provide a hook, export it and load your filters as a module, but keep
the filters out of the mainline kernel code. 
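
As a rough illustration of the hook idea, here is a self-contained
userspace sketch. Only the name ltt_set_filter() comes from this thread;
its signature, struct trace_event, the ltt_log_event() stand-in and the
only_pid_42() filter are assumptions made up for the example, not LTT code.
In a kernel setting the setter would be exported and the filter supplied
by a loadable module.

```c
#include <stddef.h>
#include <errno.h>

/* Hypothetical event record; the real LTT structures are not shown here. */
struct trace_event {
    int id;
    int pid;
};

/* A filter returns nonzero to keep the event, zero to drop it. */
typedef int (*ltt_filter_fn)(const struct trace_event *ev);

static ltt_filter_fn ltt_filter;      /* NULL means no filter installed */
static unsigned long events_logged;   /* stands in for the trace buffer */

/* Install a filter, or pass NULL to remove the current one. */
void ltt_set_filter(ltt_filter_fn fn)
{
    ltt_filter = fn;
}

/* Stand-in for the logging path: consult the hook, then record. */
int ltt_log_event(const struct trace_event *ev)
{
    if (ltt_filter != NULL && !ltt_filter(ev))
        return -EINVAL;               /* rejected by the filter */
    events_logged++;                  /* real code would write the event */
    return 0;
}

/* Example filter a module might supply: keep only events from one pid. */
int only_pid_42(const struct trace_event *ev)
{
    return ev->pid == 42;
}
```

With no filter installed every event is logged; after
ltt_set_filter(only_pid_42) events from other pids are rejected with
-EINVAL, and ltt_set_filter(NULL) restores the unfiltered behaviour.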

> > Enabling and disabling events is a valid basic filter request, which
> > should live in the kernel. Anything else should go into userspace, IMO.
> 
> What you are suggesting is that a system administrator that wants to
> monitor his sendmail server over a period of three weeks should
> just postprocess 1.8TB (1MB/s) of data because Thomas Gleixner didn't
> like the idea of kernel event filtering based on anything but events.

A real common scenario with a broad range of users. And everybody has to
like the idea of hardwired filters in the kernel code to make the life
of this sysadmin easier.

See above about hooks.

Maybe some simple pipe would be helpful too:
read_stream | prefilter | buildbuffers | storeit

tglx



Re: 2.6.11-rc1-mm1

2005-01-17 Thread Roman Zippel

Hi,

On Mon, 17 Jan 2005, Karim Yaghmour wrote:

> > Periodically can also mean a buffer start call back from relayfs 
> > (although that would mean the first entry is not guaranteed) or a 
> > (per cpu) eventcnt from the subsystem. The amount of needed search would 
> > be limited. The main point is from the relayfs POV the buffer structure 
> > has always the same (simple) structure.
> 
> But two e-mails ago, you told us to drop the start_reserve and end_reserve
> and move the details of the buffer management into relayfs and out of
> ltt? Either we have a callback, like you suggest, and then we need to
> reserve some space to make sure that the callback is guaranteed to have
> the first entry, or we drop the callback and provide an option to the
> user for relayfs to write this first entry for him. Providing a callback
> without reservation is no different than relying purely on the heartbeat,
> which, like I said before and for the reasons illustrated below, is
> unrealistic.

Why is so important that it's at the start of the buffer? What's wrong 
with a special event _near_ the start of a buffer?

> > Why is it "totally unrealistic"?
> 
> Ok, let's expand a little here on the amount of data. Say you're getting
> 2MB/s of data (which is not unrealistic on a loaded system.) That means
> that if I'm tracing for 2 days, I've got 345GB of data (~7.2GB/hour).
> In practice, users aren't necessarily interested in plowing through the
> entire 345GB, they just want to view a given portion of it. Now, if I
> follow what you are suggesting, I have to go through the entire 345GB to:
> a) create indexes, b) reorder events, and likely c) have to rewrite
> another 345GB of data. And I haven't yet discussed the kind of problems
> you would encounter in trying to reorder such a beast that contains,
> by definition, variable-sized events. For one thing, if event N+1 doesn't
> follow N, then you would be forced to browse forward until you actually
> found it before you could write a properly ordered trace. And it just
> takes a few processes that are interrupted and forced to sleep here and
> there to make this unusable. That's without the RAM or fs space required
> to store those index tables ... At 3 to 12 bytes per event, that's a lot
> of space for indexes ...
> 
> If I keep things as they are with ordered events and delimiters on buffer
> boundaries, I can skip to any place within this 345GB and start processing
> from there.

What gives you the idea, that you can't do this with what I proposed?
You can still seek freely within the data at buffer boundaries and you 
only have to search a little into the buffer to find the delimiter. Events 
are not completely at random, so that the little reordering can be done at 
runtime. Sorry, but I don't get what kind of unsolvable problems you see 
here.

> Rhetorical: Couldn't the ad-hoc mode case be a special case of the
> managed mode?

Wrong question. What compromises can be made on both sides to create a 
common simple framework? Your unwillingness to compromise a little on the 
ltt requirements really amazes me.

bye, Roman

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-17 Thread Karim Yaghmour

Thomas Gleixner wrote:
> If we add another hardwired implementation then we do not have said
> benefits.

Please stop handwaving. Folks like Andrew, Christoph, Zwane, Roman,
and others actually made specific requests for changes in the code.
What makes you think you're so special that you think you are
entitled to stay on the side and handwave about concepts?

If there is a limitation with the code, please present actual
snippets that need to be changed and suggest alternatives. That's
what everyone else does on this list.

If you want to clean-up the existing tracing code in the kernel,
then here are some ltt calls you may be interested in:
int ltt_create_event(char *event_type,
 char *event_desc,
 int format_type,
 char *format_data);
int ltt_log_raw_event(int event_id, int event_size, void *event_data);

And here's an actual example:
...
  delta_id = ltt_create_event("Delta",
  NULL,
  CUSTOM_EVENT_FORMAT_TYPE_HEX,
  NULL);
...
  ltt_log_raw_event(delta_id, sizeof(a_delta_event), &a_delta_event);
...
  ltt_destroy_event(delta_id);

You can then use LibLTT to read the trace and extract your custom
events and format your binary data as it suits you.

Save the bandwidth and start cleaning.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Thomas Gleixner

On Mon, 2005-01-17 at 17:42 -0500, Robert Wisniewski wrote:

> I believe (and Karim can correct me if I'm wrong) the idea is to have
> groups of events that can be disabled and enabled via a one word mask.  No
> checking multiple variables, no #ifdefing, something very streamlined.  By
> userspace I assume you mean post-processing, i.e., if the user/library/etc
> needs to log events they use the same simple facility.

Yes, I was talking about postprocessing in userspace. 

The logging of userspace events is a completely separate issue. You have
to solve the timestamp problem and do the correlation to kernel events
in the postprocessing.

> I think we agree to optimize/streamline performance for the gathering and
> do work in the post processing.  There is an outstanding patch that makes
> strides in this direction.

Ack.

Have you any plans to separate the layers into different pieces, so they
provide better reusability ?

tglx



Re: 2.6.11-rc1-mm1

2005-01-17 Thread Karim Yaghmour

Thomas Gleixner wrote:
> I know, what I have said. I said reduce the filtering to the absolute
> minimum and do the rest in userspace.

You keep adopting the interpretation which best suits you, taking
quotes out of context, and keep repeating things that have already
been answered. There are limits to one's patience.

What you did is change your position twice. It's there for anyone to see.

> The now builtin filters are defined to fit somebody's needs or idea of
> what the user should / wants to see. They will not fit everybody's
> needs / ideas. So we start modifying, adding and #ifdefing kernel
> filters, which is a scary vision.

Ah, finally. Here's an actual suggestion. _IF_ you want, I'll just
export a ltt_set_filter(*callback) and rewrite the if in
_ltt_log_event() to:
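if ((ltt_filter != NULL) && !(   return -EINVAL;

You're always welcome to do the following from anywhere in your code:
ltt_set_filter(NULL);

> Enabling and disabling events is a valid basic filter request, which
> should live in the kernel. Anything else should go into userspace, IMO.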

What you are suggesting is that a system administrator that wants to
monitor his sendmail server over a period of three weeks should
just postprocess 1.8TB (1MB/s) of data because Thomas Gleixner didn't
like the idea of kernel event filtering based on anything but events.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-17 Thread J.A. Magallon


On 2005.01.16, Daniel Drake wrote:
> Hi,
> 
> Joseph Fannin wrote:
> > On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
> > 
> >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
> > 
> > 
> >>waiting-10s-before-mounting-root-filesystem.patch
> >>  retry mounting the root filesystem at boot time
> > 
> > 
> > With this patch, initrds seem to get 'skipped'.  I think this is
> > probably the cause for the reports of problems with RAID too.
> 
> This patch should do the job. Replaces the existing 
> waiting-10s-before-mounting-root-filesystem.patch in 2.6.11-rc1-mm1.
> 
> Daniel
> 

> Retry up to 20 times if mounting the root device fails.  This fixes booting
> from usb-storage devices, which no longer make their partitions immediately
> available. Also cleans up the mount_block_root() function.
> 
> Based on an earlier patch from William Park <[EMAIL PROTECTED]>
> 
> Signed-off-by: Daniel Drake <[EMAIL PROTECTED]>
> 

This does not patch against -mm1. -mm1 looks like a mix of plain 2.6.10
and your code.
Could you revamp it against -mm1, please ? I looked at it but seems out
of my understanding...

TIA

--
J.A. Magallon  \   Software is like sex:
werewolf!able!es \ It's better when it's free
Mandrakelinux release 10.2 (Cooker) for i586
Linux 2.6.10-jam4 (gcc 3.4.3 (Mandrakelinux 10.2 3.4.3-3mdk)) #2


Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-17 Thread Thomas Gleixner

On Mon, 2005-01-17 at 15:34 -0500, Karim Yaghmour wrote:
> Thomas Gleixner wrote:
> > That's the point. Adding another hardwired implementation does not give
> > us a possibility to solve the hardwired problem of the already available
> > stuff.
> 
> Well then, like I said before, you know what you need to do:
> http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/

Oh, I guess my English must be really bad.

I was talking about separation of layers, so why do I need
kernelhooks?

The separation of layers makes it possible to actually reuse
functionality and gives the possibility that existing hardwired stuff
can be cleaned up to use the new functionality too.

If we add another hardwired implementation then we do not have said
benefits.

tglx


Re: 2.6.11-rc1-mm1

2005-01-17 Thread Robert Wisniewski


Thomas Gleixner writes:
 > On Mon, 2005-01-17 at 15:32 -0500, Karim Yaghmour wrote:
 > > You're either on crack or I don't know how to read english. Here's what
 > > you said:
 > 
 > Maybe you should read your own comment about ad-hominem attacks earlier
 > in this thread and consider if it might apply to you.
 > 
 > I know, what I have said. I said reduce the filtering to the absolute
 > minimum and do the rest in userspace.
 > 
 > The now builtin filters are defined to fit somebody's needs or idea of
 > what the user should / wants to see. They will not fit everybody's
 > needs / ideas. So we start modifying, adding and #ifdefing kernel
 > filters, which is a scary vision.
 > 
 > Enabling and disabling events is a valid basic filter request, which
 > should live in the kernel. Anything else should go into userspace, IMO.
 > 
 > tglx

I believe (and Karim can correct me if I'm wrong) the idea is to have
groups of events that can be disabled and enabled via a one word mask.  No
checking multiple variables, no #ifdefing, something very streamlined.  By
userspace I assume you mean post-processing, i.e., if the user/library/etc
needs to log events they use the same simple facility.
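
A one-word mask of this kind could look roughly like the sketch below.
The group names and the trace_enable(), trace_disable() and
trace_enabled() helpers are hypothetical, chosen only to show that the
hot-path test reduces to a single load, AND and branch.

```c
#include <stdint.h>

/* Hypothetical event groups, one bit each. */
enum {
    TRACE_GRP_SCHED = 1u << 0,
    TRACE_GRP_IRQ   = 1u << 1,
    TRACE_GRP_FS    = 1u << 2,
    TRACE_GRP_NET   = 1u << 3
};

static uint32_t trace_mask;   /* the single word tested on the hot path */

/* Turn whole groups on or off by setting or clearing their bits. */
static inline void trace_enable(uint32_t groups)  { trace_mask |= groups; }
static inline void trace_disable(uint32_t groups) { trace_mask &= ~groups; }

/* Hot-path check: nonzero when the event's group is currently enabled. */
static inline int trace_enabled(uint32_t group)
{
    return (trace_mask & group) != 0;
}
```

A tracepoint would then be guarded by something like
if (trace_enabled(TRACE_GRP_IRQ)) before any event is built, so disabled
groups cost only the mask test.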

I think we agree to optimize/streamline performance for the gathering and
do work in the post processing.  There is an outstanding patch that makes
strides in this direction.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[EMAIL PROTECTED]

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Thomas Gleixner

On Mon, 2005-01-17 at 15:32 -0500, Karim Yaghmour wrote:
> You're either on crack or I don't know how to read english. Here's what
> you said:

Maybe you should read your own comment about ad-hominem attacks earlier
in this thread and consider if it might apply to you.

I know, what I have said. I said reduce the filtering to the absolute
minimum and do the rest in userspace.

The now builtin filters are defined to fit somebody's needs or idea of
what the user should / wants to see. They will not fit everybody's
needs / ideas. So we start modifying, adding and #ifdefing kernel
filters, which is a scary vision.

Enabling and disabling events is a valid basic filter request, which
should live in the kernel. Anything else should go into userspace, IMO.

tglx


Re: 2.6.11-rc1-mm1

2005-01-17 Thread William Lee Irwin III

On Fri, Jan 14, 2005 at 06:58:10PM -0800, William Lee Irwin III wrote:
> No idea what hit me just yet. x86-64 doesn't boot. Still going through
> the various architectures. The same system (including the initrd FPOS
> bullcrap, though, of course, I'm using an initrd built just for this
> kernel) boots various 2.6.x up to 2.6.10-mm1. There are vague indications
> something in/around SCSI and/or initrd's has violently exploded in my face.

With the waiting 10s patch backed out, things seem to be going well:

$ ssh analyticity
Last login: Mon Jan 17 14:03:13 2005 from meromorphy
Linux analyticity 2.6.11-rc1-mm1 #5 SMP Sat Jan 15 01:25:23 PST 2005 sparc64 
GNU/Linux
$ uptime
 14:10:55 up 10 min,  7 users,  load average: 0.10, 0.40, 0.31

Now I just have to remember to set up ip route add 192.168.1.0/24 dev
eth3 via 192.168.1.1 instead of just ip route add 192.168.1.0/24 dev
eth3 so I can tftpboot the thing (well, it took all of 10s to figure
out, but it may not next time). Routing changes are painful.


-- wli

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Karim Yaghmour


Hello Christoph,

Christoph Hellwig wrote:
> The thing I'm unhappy with is what the code does currently.  I haven't
> looked at the code enough nor thought about the problem enough to tell
> you what's the right thing to do.  Knowing that will involve review of
> the architecture and serious benchmarking on a few platforms.

Like I was saying elsewhere, we are likely going to drop the lockless
code for now (i.e. the code that does the cmpxchg). Instead we will
depend on normal cli/sti abstractions.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Karim Yaghmour

Hello Roman,

Roman Zippel wrote:
> Periodically can also mean a buffer start call back from relayfs 
> (although that would mean the first entry is not guaranteed) or a 
> (per cpu) eventcnt from the subsystem. The amount of needed search would 
> be limited. The main point is from the relayfs POV the buffer structure 
> has always the same (simple) structure.

But two e-mails ago, you told us to drop the start_reserve and end_reserve
and move the details of the buffer management into relayfs and out of
ltt? Either we have a callback, like you suggest, and then we need to
reserve some space to make sure that the callback is guaranteed to have
the first entry, or we drop the callback and provide an option to the
user for relayfs to write this first entry for him. Providing a callback
without reservation is no different than relying purely on the heartbeat,
which, like I said before and for the reasons illustrated below, is
unrealistic.

> You have to be more specific, what's so special about this amount of data. 
> You likely want to (incrementally) build an index file, so you don't have 
> to repeat the searches, but even with your current format you would 
> benefit from such an index file.
[snip]
>>As above, restoring the original order of events is fine if you are
>>looking at mbs or kbs of data. It's just totally unrealistic for
>>the amounts of data we want to handle.
> 
> 
> Why is it "totally unrealistic"?

Ok, let's expand a little here on the amount of data. Say you're getting
2MB/s of data (which is not unrealistic on a loaded system.) That means
that if I'm tracing for 2 days, I've got 345GB of data (~7.2GB/hour).
In practice, users aren't necessarily interested in plowing through the
entire 345GB, they just want to view a given portion of it. Now, if I
follow what you are suggesting, I have to go through the entire 345GB to:
a) create indexes, b) reorder events, and likely c) have to rewrite
another 345GB of data. And I haven't yet discussed the kind of problems
you would encounter in trying to reorder such a beast that contains,
by definition, variable-sized events. For one thing, if event N+1 doesn't
follow N, then you would be forced to browse forward until you actually
found it before you could write a properly ordered trace. And it just
takes a few processes that are interrupted and forced to sleep here and
there to make this unusable. That's without the RAM or fs space required
to store those index tables ... At 3 to 12 bytes per event, that's a lot
of space for indexes ...

If I keep things as they are with ordered events and delimiters on buffer
boundaries, I can skip to any place within this 345GB and start processing
from there.

And that's for two days. If you're a sysadmin encountering a transient
problem on a server, you may actually want more than that.
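
The volume figures can be checked with a few lines of C. This is only a
back-of-the-envelope helper: trace_bytes() is a hypothetical name, and it
assumes decimal megabytes (1MB = 1,000,000 bytes) and a constant data rate.

```c
#include <stdint.h>

#define BYTES_PER_MB 1000000ULL   /* assumption: decimal megabytes */

/* Bytes produced at a steady `mb_per_sec` over `seconds`. */
static uint64_t trace_bytes(uint64_t mb_per_sec, uint64_t seconds)
{
    return mb_per_sec * BYTES_PER_MB * seconds;
}
```

trace_bytes(2, 2 * 86400) gives 345,600,000,000 bytes, i.e. about 345GB
for two days at 2MB/s, and trace_bytes(2, 3600) gives 7.2GB for one hour.
At 1MB/s over three weeks, trace_bytes(1, 21 * 86400) comes to roughly
1.8TB.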

>>But like I said earlier, the added relayfs mode (kdebug) would allow
>>for exactly what you are suggesting:
>>  event_id = atomic_inc_return(&event_cnt);
> 
> 
> Actually that would be already too much for low level kernel debugging.
> Why do you want to put this into relayfs?

I don't. I was just saying that with the adhoc mode, a relayfs client
could use the code snippet you were suggesting.
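
For illustration, the counter approach can be tried out in userspace with
C11 atomics. Everything here is an assumption made for the sketch:
struct rec, next_event_id() and reorder_by_id() are hypothetical names,
and atomic_fetch_add() plus one stands in for the kernel's
atomic_inc_return().

```c
#include <stdatomic.h>
#include <stdlib.h>

static atomic_uint event_cnt;     /* global sequence counter, starts at 0 */

struct rec {
    unsigned id;                  /* sequence number taken at log time */
    unsigned payload;
};

/* Userspace analogue of atomic_inc_return(): returns the new value. */
static unsigned next_event_id(void)
{
    return atomic_fetch_add(&event_cnt, 1u) + 1u;
}

/* Comparator for qsort(): ascending by sequence number. */
static int cmp_id(const void *a, const void *b)
{
    const struct rec *x = a, *y = b;
    return (x->id > y->id) - (x->id < y->id);
}

/* Postprocessing step: restore the original order of the records. */
static void reorder_by_id(struct rec *r, size_t n)
{
    qsort(r, n, sizeof *r, cmp_id);
}
```

Each event takes its id when it is logged; records that land in the
buffer out of order are sorted by id afterwards. Counter wraparound is
ignored in this sketch.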

> What are the _specific_ reasons you need these various modes, why can't 
> you build any special requirements on top of a very light weight relay 
> mechanism?

Because of the opposite requirements.

Here are the two modes I'm suggesting in relayfs and how they operate:

Managed:
- Presumes active user-space daemon interested in catching _all_ events.
- Allows N buffers in buffer ring
- Provides limit-checking (callback on end of sub-buffer)
- Provides buffer delimiters (writes timestamp at beg and end)
- Suited for all types of event sizes (both fixed and variable) at
  very high frequency.
- Daemon is woken up when buffer is ready for writing, executes a
  write() on an mmaped area and notifies relevant kernel subsystem,
  which in turn notifies relayfs that buffer can now be reused.
- Relies on proper abstraction of cli/sti.

Ad-Hoc:
- Presumes transient userspace tool interested in event snapshots.
- Single circular buffer.
- No limits checking (or very basic: as in stop if overwrite).
- No buffer delimiters.
- Best suited for fixed-size events at extreme high frequency.
- User-space tool simply does a write() on an mmaped area and
  exits or goes back to sleep.
- Relies on proper abstraction of cli/sti.

Basically, the ad-hoc mode abides by the principles of KISS, whereas
the managed mode is more elaborate, for clients like LTT.
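
To give an idea of how little the ad-hoc mode needs, here is a userspace
sketch of a single circular buffer holding fixed-size events, with no
delimiters and plain overwrite of the oldest entry. The names
(struct adhoc_buf, adhoc_write(), adhoc_snapshot()) and the slot count are
hypothetical; this is not relayfs code, and it leaves out the cli/sti
protection a kernel version would need around the write.

```c
#include <stdint.h>

#define ADHOC_SLOTS 8u            /* hypothetical size; must be a power of two */

struct adhoc_event {              /* fixed-size record */
    uint32_t id;
    uint32_t data;
};

struct adhoc_buf {
    struct adhoc_event slot[ADHOC_SLOTS];
    uint32_t head;                /* total number of events ever written */
};

/* Write one event, overwriting the oldest one once the buffer is full. */
static void adhoc_write(struct adhoc_buf *b, uint32_t id, uint32_t data)
{
    struct adhoc_event *e = &b->slot[b->head & (ADHOC_SLOTS - 1)];

    e->id = id;
    e->data = data;
    b->head++;
}

/* Copy the most recent events, oldest first; returns how many were copied. */
static uint32_t adhoc_snapshot(const struct adhoc_buf *b,
                               struct adhoc_event *out, uint32_t max)
{
    uint32_t avail = b->head < ADHOC_SLOTS ? b->head : ADHOC_SLOTS;
    uint32_t n = avail < max ? avail : max;
    uint32_t start = b->head - n;

    for (uint32_t i = 0; i < n; i++)
        out[i] = b->slot[(start + i) & (ADHOC_SLOTS - 1)];
    return n;
}
```

The write path is an index mask and two stores, with none of the
buffer-switch, callback or delimiter branches the managed mode requires.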

Rhetorical: Couldn't the ad-hoc mode case be a special case of the
managed mode? In theory yes, in practice no. The various conditionals
and code paths for switching buffers, invoking callbacks, writing
delimiters and the like, which make this mode useful to clients like
LTT, will always be a problem for those seeking the shortest path to
buffer committal. In the case of Ingo, for example, I'm sure he'd

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-17 Thread Karim Yaghmour


Thomas Gleixner wrote:
> That's the point. Adding another hardwired implementation does not give
> us a possibility to solve the hardwired problem of the already available
> stuff.

Well then, like I said before, you know what you need to do:
http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Karim Yaghmour


Thomas Gleixner wrote:
> Sorting out disabled events is the filtering you have to do in kernel
> and you should do it in the hot path or remove the unnecessary
> tracepoints at compile time.

Do you actually read my replies or do you just grep for something
you can object to? If you care to read my replies you will see that
this has already been answered.

> You are not answering my argument. 8MB/sec is an event frequency of
> 128kHz when we assume 64byte/event. It's one event every 8us. So every
> unnecessary computation, every leaving the hotpath for nothing is just
> giving you performance loss.

I have, you just choose not to read. Here's what I said earlier:
> Note, however, that we are thinking of dropping the lockless scheme
> for now. We will pick up this discussion separately further down the
> road.

IOW, we will be using cli/sti. So there is no "leaving the hotpath".

> I said:
> 
>>>Sorting out disabled events in the hot path 
> 
> 
> s/Sorting/Filtering/
> 
> I never said this should not be done.

You're either on crack or I don't know how to read english. Here's what
you said:
> Sorting out disabled events in the hot path and moving the if
> (pid/gid/grp) whatever stuff into userspace postprocessing is not an
> alien request.

Clearly you are suggesting moving the filtering into user-space.

> Separating layers as I suggested before is not making it a generic
> debugging tool. It makes parts of those layers available for other usage
> and gives us the chance to reuse the parts for cleaning up already
> available code which has the same hardwired structure.

This has already been answered.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Matthias Urlichs

Hi, Andrew Morton wrote on Fri, 14 Jan 2005 10:35:34 -0800:

> What filesystem(s) do you use, and why?

sshfs (best idea for file access through firewalls).
gmailfs (best free off-site backup facility).
Will use encfs as soon as FUSE is in mainline
  (I'm using cryptoloop now, but that's not sanely backupable.)

-- 
Matthias Urlichs   |   {M:U} IT Design @ m-u-it.de   |  [EMAIL PROTECTED]



Re: 2.6.11-rc1-mm1

2005-01-17 Thread Tom Zanussi

Karim Yaghmour writes:
 > 
 > Hello Roman,
 > 
 > 
 > What we are dropping for later review: read/write semantics from
 > user-space. It has to be understood that we believe that this is
 > a major drawback. For one thing, you won't be able to do something
 > like:
 > $ cat /relayfs/xchg/my-file > ~/test-data
 > 
 > Instead, you will have to write a custom app that does open(),
 > mmap(), write(). We could still provide a small app/library that
 > did this automagically, but you've got to admit that nothing
 > beats the real thing.
 > 

Maybe we could use FUSE to provide read()/write() for relayfs files -
opening a FUSE relayfs file would open and mmap the actual relayfs
file, read() would move around in the buffer using basically the
current relayfs read logic moved down into the FUSE filesystem read
fileop, and write() could write directly to the buffer...

Tom

 > Also note that there are people who currently use this already,
 > so there will be some unhappy campers.
 > 
 > Karim
 > -- 
 > Author, Speaker, Developer, Consultant
 > Pushing Embedded and Real-Time Linux Systems Beyond the Limits
 > http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

-- 
Regards,

Tom Zanussi <[EMAIL PROTECTED]>
IBM Linux Technology Center/RAS


Re: 2.6.11-rc1-mm1

2005-01-17 Thread Christoph Hellwig

On Mon, Jan 17, 2005 at 10:48:52AM -0500, Robert Wisniewski wrote:
> Wow - disabling interrupts is handfuls to tens of cycles, so that means
> some architectures take thousands of cycles to do atomic operations.  Then
> I would definitely agree we should not be using atomic operations on those.
> FWIW, out of curiosity, which archs make atomic ops so expensive?
> 
> Andrew, on the broader note.  If the community feels disabling interrupts
> is the better way to go for the variables (I think it's index and count) we
> were protecting with atomic ops then as the code stands things should be
> fine with that approach and we can make that change.

The thing I'm unhappy with is what the code does currently.  I haven't
looked at the code enough nor thought about the problem enough to tell
you what's the right thing to do.  Knowing that will involve review of
the architecture and serious benchmarking on a few platforms.


Re: 2.6.11-rc1-mm1

2005-01-17 Thread Robert Wisniewski

Arjan van de Ven writes:
 > On Sun, 2005-01-16 at 16:06 -0500, Robert Wisniewski wrote:
 > 
 > > :-) - as above.  Furthermore, it seems that reducing the places where
 > > interrupts are disabled would be a good thing?  
 > 
 > Depends on the price. On several CPUs, disabling interrupts is hundreds
 > of times cheaper than doing an atomic op. 

Wow - disabling interrupts is handfuls to tens of cycles, so that means
some architectures take thousands of cycles to do atomic operations.  Then
I would definitely agree we should not be using atomic operations on those.
FWIW, out of curiosity, which archs make atomic ops so expensive?

Andrew, on the broader note.  If the community feels disabling interrupts
is the better way to go for the variables (I think it's index and count) we
were protecting with atomic ops then as the code stands things should be
fine with that approach and we can make that change.

Thanks for your attention to looking through this.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[EMAIL PROTECTED]

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Roman Zippel

Hi,

On Sun, 16 Jan 2005, Karim Yaghmour wrote:

> > You can make it even simpler by dropping this completely. Every buffer is 
> > simply a list of events and you can let ltt write periodically a timer 
> > event. In userspace you can randomly seek at buffer boundaries and search 
> > for the timer events. It will require a bit more work for userspace, but 
> > even large amounts of tracing data stay manageable.
> 
> We already do write a heartbeat event periodically to have readable
> traces in the case where the lower 32 bits of the TSC wrap-around.
> 
> As I mentioned elsewhere, please don't think of this in terms of
> kbs or mbs of data. What we're talking about here is gbs if not
> 100gbs of data. Having to start reading each sub-buffer until you
> hit a heartbeat really is a killer for such large traces. If there
> was a significant impact on relayfs for having this I would have
> understood the argument, but relayfs needs to do buffer-management
> anyway, so I don't see that much complexity being added by allowing
> the channel user to ask relayfs for delimiters.

Periodically can also mean a buffer-start callback from relayfs
(although that would mean the first entry is not guaranteed) or a
(per-cpu) eventcnt from the subsystem. The amount of searching needed would
be limited. The main point is that, from the relayfs POV, the buffer
always has the same (simple) structure.
You have to be more specific: what's so special about this amount of data?
You likely want to (incrementally) build an index file, so you don't have 
to repeat the searches, but even with your current format you would 
benefit from such an index file.

> > Userspace can then easily restore the original order of events.
> 
> As above, restoring the original order of events is fine if you are
> looking at mbs or kbs of data. It's just totally unrealistic for
> the amounts of data we want to handle.

Why is it "totally unrealistic"?

> But like I said earlier, the added relayfs mode (kdebug) would allow
> for exactly what you are suggesting:
>   event_id = atomic_inc_return(&event_cnt);

Actually that would be already too much for low level kernel debugging.
Why do you want to put this into relayfs?
What are the _specific_ reasons you need these various modes, why can't 
you build any special requirements on top of a very lightweight relay
mechanism?

bye, Roman

Re: 2.6.11-rc1-mm1

2005-01-17 Thread Thomas Gleixner

On Sun, 2005-01-16 at 21:24 -0500, Karim Yaghmour wrote:

> > Sorting out disabled events in the hot path and moving the if
> > (pid/gid/grp) whatever stuff into userspace postprocessing is not an
> > alien request.
> 
> It is. Have you even read what I suggested to change in my other mail:
> if ((any_filtering) && !(ltt_filter(event_id, event_struct, data)))
>   return -EINVAL;

Sorting out disabled events is the filtering you have to do in the kernel,
and you should do it in the hot path or remove the unnecessary
tracepoints at compile time.

> > 4096kB/sec for  64 events/ms (event frequency  64kHz) (15 us)
> > 8192kB/sec for 128 events/ms (event frequency 128kHz) ( 8 us) 

> Actually, on a PII-350MHz, I was already generating 0.5MB/s of data
> just by running an X session. If we assume that a machine 10 times
> faster generates 10 times as many events, we've already got 5MB/s,
> and I'm sure that there are heavier cases than X.

You are not answering my argument. 8MB/sec is an event frequency of
128kHz when we assume 64 bytes/event. It's one event every 8us. So every
unnecessary computation, every leaving the hotpath for nothing, is just
giving you a performance loss.

> Not even Ingo hinted at getting rid of filtering. Remember the earlier
> e-mail I referred to? Here's what he was suggesting:

I said:
> > Sorting out disabled events in the hot path 

s/Sorting/Filtering/

I never said this should not be done.

> Like I said, we are willing to accommodate those who want to be able
> to use relayfs for kernel debugging purposes, but we can hardly
> be blamed for not making LTT a generic kernel debugging tool as this
> is exactly the excuse many kernel developers had for not including
> LTT to start with. It's just totally disingenuous to give us
> grief for claiming that we are doing something and then later turn
> around and blame us for not doing it ... sheesh ...

Separating layers as I suggested before is not making it a generic
debugging tool. It makes parts of those layers available for other usage
and gives us the chance to reuse the parts for cleaning up already
available code which has the same hardwired structure.

tglx




Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-17 Thread Thomas Gleixner

On Sun, 2005-01-16 at 20:54 -0500, Karim Yaghmour wrote:

> If you really want to define layers, then there are actually four
> layers:
> 1- hooking mechanism
> 2- event definition / registration
> 3- event management infrastructure
> 4- transport mechanism
> 
> LTT currently does 1, 2 & 3. Clearly, as in the mail I referred to
> earlier, there is code in the kernel that already does 1, 2, 3,
> and 4 in very hardwired/ad-hoc fashion and there isn't anyone asking
> for them to remove it. We're offering 4 separately and are putting
> LTT on top of it. If you want to get 1 & 2 separately, have a look
> at kernel hooks and genevent:

I know that there is enough code which does x,y,z hardcoded/hardwired
already. 

That's the point. Adding another hardwired implementation does not give
us a possibility to solve the hardwired problem of the already available
stuff.

tglx



Re: 2.6.11-rc1-mm1

2005-01-16 Thread Prasanna S Panchamukhi

Hi Karim,

> Thomas Gleixner wrote:
>> It's not only me, who needs constant time. Everybody interested in
>> tracing will need that. In my opinion its a principle of tracing.
> 
> relayfs is a generalized buffering mechanism. Tracing is one application
> it serves. Check out the web site: "high-speed data-relay filesystem."
> Fancy name huh ...
> 
>> The "lockless" mechanism is _FAKE_ as I already pointed out. It replaces
>> locks by do { } while loops. So what ?
> 

How about combining the "buffering mechanism of relayfs" and the
"kernel-> user space transport by debugfs"?
This would also remove lots of complicated code from relayfs.

Thanks
Prasanna
-- 

Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Ph: 91-80-25044636
<[EMAIL PROTECTED]>

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Karim Yaghmour

Thomas Gleixner wrote:
> Which is every 1.42 seconds on a 3GHz machine. I guess we don't have
> GB's of data when the 1.42 seconds elapse without an event.

My argument was about being able to browse the amount of data I was
referring to. The heartbeat thing was an aside to Roman as to the
fact that we already do what he's suggesting.

> I still don't see the point. The implicit ability of LTT to allow
> tracing of up to 8192 bytes user data, strings and XML makes this
> necessary. I do not see any necessity to integrate these special usage
> modes instead of a generically usable instrumentation implementation.

I've already clarified your mischaracterization of custom events;
you are being disingenuous here. If you want a generalized hooking
mechanism, feel free to ask Andrew to take kernel hooks:
http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/

> If relayfs is giving those users the ability to do so then they can do
> it, but I object to the fact that LTT/relayfs is occupying the place of a
> more generic implementation in the way it is implemented now.

Again, damned if we do, damned if we don't. LTT isn't meant for kernel
debugging per se, though you can use it to that end to a certain extent.
However, if you are kernel debugging, you will find the ad-hoc mode I'm
talking about adding to relayfs quite useful.

> For normal event tracing you have about 32-64 byte of data per event. So
> disabling interrupts in order to copy this amount of information into a
> buffer is cheaper on most architectures than doing the whole magic in
> LTT and relayfs. This also keeps your buffers consistent and does not
> need any magic for postprocessing. 

Oh, now you want to lighten the weight on postprocessing? Come on, Thomas,
please stop wasting my time.

Note, however, that we are thinking of dropping the lockless scheme
for now. We will pick up this discussion separately further down the
road.

> Sorting out disabled events in the hot path and moving the if
> (pid/gid/grp) whatever stuff into userspace postprocessing is not an
> alien request.

It is. Have you even read what I suggested to change in my other mail:
if ((any_filtering) && !(ltt_filter(event_id, event_struct, data)))
	return -EINVAL;

You're not honestly telling me that checking for any_filtering is
going to ruin your day.
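
To make the cost argument concrete, here is a rough user-space sketch (not LTT code) of such a check. It assumes a global any_filtering flag as in the snippet above; the two-argument ltt_filter(), the wanted_pid policy, the filter_calls counter and log_event() are hypothetical stand-ins, since the real structures are not shown in this thread. When no filter is installed, the && short-circuits and the filter function is never entered:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins: the real LTT data structures are not shown here. */
int any_filtering;      /* 0 = no filter installed */
int wanted_pid = -1;    /* assumed filter policy: match on one pid */
int filter_calls;       /* counts how often the slow path was entered */

/* Returns nonzero if the event should be logged. */
int ltt_filter(int event_id, int pid)
{
    (void)event_id;
    filter_calls++;
    return pid == wanted_pid;
}

/* Returns 0 if the event would be logged, -EINVAL if it was filtered out. */
int log_event(int event_id, int pid)
{
    assert(event_id >= 0);
    /* With any_filtering == 0 this costs one load and one branch. */
    if (any_filtering && !ltt_filter(event_id, pid))
        return -EINVAL;
    /* ... the event would be written to the trace buffer here ... */
    return 0;
}
```

With any_filtering left at 0, log_event() never calls ltt_filter(); the per-event overhead of the check is a single test of a global flag.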

> You are talking of Gigabytes of data. In what time ?
> 
> Let's do some math.
> 
> For simplicity all events use 64 Byte event space.
> 
> ~ 64kB/sec for 1000 events/s (event frequency   1kHz) ( 1 ms)
> 1024kB/sec for  16 events/ms (event frequency  16kHz) (62 us)
> 2048kB/sec for  32 events/ms (event frequency  32kHz) (31 us)
> 4096kB/sec for  64 events/ms (event frequency  64kHz) (15 us)
> 8192kB/sec for 128 events/ms (event frequency 128kHz) ( 8 us)
> 
> where a 100Mbit network can theoretically transport 10240kB/sec and
> practically does 4000-8000 kB/sec. 
> 
> An event interval of 8us even on a 3 GHz machine is a complete illusion,
> because we already spend a couple of usecs servicing the legacy 8254
> timer.
> 
> So the realistic assumption on a 3GHz machine is definitely below 64kHz,
> which means we have to handle max. 4MB of data per second. 

Actually, on a PII-350MHz, I was already generating 0.5MB/s of data
just by running an X session. If we assume that a machine 10 times
faster generates 10 times as many events, we've already got 5MB/s,
and I'm sure that there are heavier cases than X.

Here's the paper if you want to read it:
http://www.opersys.com/ftp/pub/LTT/Documentation/ltt-usenix.ps.gz

> I'm not impressed. Disabling interrupts for a couple of nanoseconds to
> store the trace data in the buffer does not hurt at all. Running through
> a big bunch of out of cache line instructions does.

Like I said above, fighting for/against lockless is not our immediate
goal, and we will likely remove it.

> If you try to trace more than this amount you are toast anyway.
> 
> Please spare me the "reality has bitten" arguments. The whole if(..)
> scenario in _ltt_event_log() is doing postprocessing, which can be done
> in userspace. I don't care about the required time as long as it does
> not introduce additional burden into the kernel.

Not even Ingo hinted at getting rid of filtering. Remember the earlier
e-mail I referred to? Here's what he was suggesting:
> void trace(event, data1, data2, data3)
> {
>         int cpu = smp_processor_id();
>         int idx, pending, *curr = curr_idx + cpu;
>         struct trace_event *t;
>         unsigned long flags;
> 
>         if (!event_wanted(current, event, data1, data2, data3))
>                 return;
> 
>         local_irq_save(flags);
> 
>         idx = ++curr_idx[cpu] & (NR_TRACE_ENTRIES - 1);
>         pending = ++curr_pending[cpu];
> 
>         t = trace_ring[cpu] + idx;
> 
>         t->event = event;
>         rdtscll(t->timestamp);
>         t->data1 = data1;
>         t->data2 = data2;
>         t->data3 = data3;
> 
>         if (curr_pending == TRACE_LOW_WATERMARK &

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-16 Thread Karim Yaghmour

Thomas Gleixner wrote:
> This implies separating 
> 
> - infrastructure 
> - event registration
> - transport mechanism

Like I said in my first response: we can't be everything for everybody,
the requirements are just too broad. ISO tried it with OSI. Have a
look at net/* for the result.

Currently, LTT provides the first two in one piece, and relayfs
provides the third. Like I acknowledged earlier, there is room for
generalizing the transport mechanism, and I'm thinking of amending
the relayfs API proposal further and rename the modes to make them
more straight-forward:
- Managed (locking or lockless.)
- Ad-Hoc (which works like Ingo, yourself, and others have requested.)

If you really want to define layers, then there are actually four
layers:
1- hooking mechanism
2- event definition / registration
3- event management infrastructure
4- transport mechanism

LTT currently does 1, 2 & 3. Clearly, as in the mail I referred to
earlier, there is code in the kernel that already does 1, 2, 3,
and 4 in very hardwired/ad-hoc fashion and there isn't anyone asking
for them to remove it. We're offering 4 separately and are putting
LTT on top of it. If you want to get 1 & 2 separately, have a look
at kernel hooks and genevent:
http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/
http://www.listserv.shafik.org/pipermail/ltt-dev/2003-January/000408.html

We'd gladly take a serious look at using the former if it was
included, and work is in progress on making the latter the standard
way of declaring LTT events instead of using a static ltt-events.h.

Five years ago, there was a discussion about integrating GKHI into
the kernel (the kernel hooks ancestor). Have a look for yourself
as to the response to this suggestion (basically people weren't
ready to accept a generalized hooking mechanism without a defined
set of hooks, and then others didn't like the idea at all because
creating general hooks in the kernel which anybody can register
to creates legal and maintenance problems ... basically it's a
can of worms):
http://marc.theaimsgroup.com/?l=linux-kernel&m=97371908916365&w=2

There's only so much we can push into the kernel at the same time.
Not to mention that before you can be generic, you've got to have
some specific implementation to start working from. I believe
that what we've ironed out through the discussion of the past
two days is a good basis.

There is some irony in all this. For years, we were told that we
couldn't make it into the kernel because we were perceived as
providing a kernel debugging tool, and now that we're starting
to get our things seriously reviewed we're being told that maybe
it ain't really that useful because those who want to do kernel
debugging can't use it as-is ... go figure.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Thomas Gleixner

On Sun, 2005-01-16 at 16:18 -0500, Karim Yaghmour wrote:

> We already do write a heartbeat event periodically to have readable
> traces in the case where the lower 32 bits of the TSC wrap-around.

Which is every 1.42 seconds on a 3GHz machine. I guess we don't have
GB's of data when the 1.42 seconds elapse without an event.
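
The wrap period quoted here is simply 2^32 ticks divided by the clock rate. A minimal sketch of that arithmetic, assuming the TSC ticks at the nominal CPU frequency (the function name is made up for illustration):

```c
#include <assert.h>

/* Seconds until the low 32 bits of a counter ticking at hz wrap around. */
double tsc32_wrap_seconds(double hz)
{
    assert(hz > 0.0);
    return 4294967296.0 / hz;   /* 2^32 ticks per wrap */
}
```

For 3e9 Hz this gives roughly 1.43 seconds, in line with the figure above; on a 350MHz machine the same counter would wrap about every 12.3 seconds.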

> > Userspace can then easily restore the original order of events.
> 
> As above, restoring the original order of events is fine if you are
> looking at mbs or kbs of data. It's just totally unrealistic for
> the amounts of data we want to handle.

I still don't see the point. The implicit ability of LTT to allow
tracing of up to 8192 bytes user data, strings and XML makes this
necessary. I do not see any necessity to integrate these special usage
modes instead of a generically usable instrumentation implementation.

If relayfs is giving those users the ability to do so then they can do
it, but I object to the fact that LTT/relayfs is occupying the place of a
more generic implementation in the way it is implemented now.

For normal event tracing you have about 32-64 bytes of data per event. So
disabling interrupts in order to copy this amount of information into a
buffer is cheaper on most architectures than doing the whole magic in
LTT and relayfs. This also keeps your buffers consistent and does not
need any magic for postprocessing. 

Sorting out disabled events in the hot path and moving the if
(pid/gid/grp) whatever stuff into userspace postprocessing is not an
alien request.

You are talking of Gigabytes of data. In what time ?

Let's do some math.

For simplicity all events use 64 Byte event space.

~ 64kB/sec for 1000 events/s (event frequency   1kHz) ( 1 ms)
1024kB/sec for  16 events/ms (event frequency  16kHz) (62 us)
2048kB/sec for  32 events/ms (event frequency  32kHz) (31 us)
4096kB/sec for  64 events/ms (event frequency  64kHz) (15 us)
8192kB/sec for 128 events/ms (event frequency 128kHz) ( 8 us)

where a 100Mbit network can theoretically transport 10240kB/sec and
practically does 4000-8000 kB/sec. 

An event interval of 8us even on a 3 GHz machine is a complete illusion,
because we already spend a couple of usecs servicing the legacy 8254
timer.

So the realistic assumption on a 3GHz machine is definitely below 64kHz,
which means we have to handle max. 4MB of data per second. 
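
The table above can be reproduced with a few lines of arithmetic. This is only a sketch, assuming 1 kB = 1000 bytes (which is what makes 16 events/ms at 64 bytes come out as 1024kB/sec); both function names are made up for illustration:

```c
#include <assert.h>

/* Trace data rate in kB/s (1 kB = 1000 bytes) for a given event rate. */
unsigned long trace_rate_kb_per_sec(unsigned long events_per_ms,
                                    unsigned long bytes_per_event)
{
    /* events/ms * 1000 = events/s; bytes/s / 1000 = kB/s */
    return events_per_ms * 1000UL * bytes_per_event / 1000UL;
}

/* Average time between two events, in microseconds. */
double event_interval_us(unsigned long events_per_ms)
{
    assert(events_per_ms > 0);
    return 1000.0 / (double)events_per_ms;
}
```

With 64-byte events, 64 events/ms works out to 4096 kB/s and one event every 15.625 us, which is the row used for the 3GHz estimate.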

I'm not impressed. Disabling interrupts for a couple of nanoseconds to
store the trace data in the buffer does not hurt at all. Running through
a big bunch of out of cache line instructions does.

If you try to trace more than this amount you are toast anyway.

Please spare me the "reality has bitten" arguments. The whole if(..)
scenario in _ltt_event_log() is doing postprocessing, which can be done
in userspace. I don't care about the required time as long as it does
not introduce additional burden into the kernel.

> Also note that there are people who currently use this already,
> so there will be some unhappy campers.

Be aware that there are some unhappy campers in the kernel community too
when special-purpose tracing is included instead of a generally usable
framework.

tglx


Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-16 Thread Thomas Gleixner

On Sat, 2005-01-15 at 23:23 -0500, Karim Yaghmour wrote:
> > Well, that's really a core problem. We don't want to duplicate 
> > infrastructure, which practically does the same. So if relayfs isn't 
> > usable in this kind of situation, it really raises the question whether 
> > relayfs is usable at all. We need to make relayfs generally usable, 
> > otherwise it will join the fate of devfs.
> 
> Hmm, coming from you I will take this as a pretty strong endorsement
> for what I was suggesting earlier: provide a basic buffering mode
> in relayfs to be used in kernel debugging. However, it must be
> understood that this is separate from the existing modes and ltt,
> for example, could not use such a basic infrastructure. If this is
> ok with you, and no one wants to complain too loudly about this, I
> will go ahead and add this to our to-do list for relayfs.

This implies separating 

- infrastructure 
- event registration
- transport mechanism

tglx



Re: 2.6.11-rc1-mm1

2005-01-16 Thread Arjan van de Ven

On Sun, 2005-01-16 at 16:06 -0500, Robert Wisniewski wrote:

> :-) - as above.  Furthermore, it seems that reducing the places where
> interrupts are disabled would be a good thing?  

Depends on the price. On several CPUs, disabling interrupts is hundreds
of times cheaper than doing an atomic op. 


Re: 2.6.11-rc1-mm1

2005-01-16 Thread Robert Wisniewski

Christoph Hellwig writes:
 > On Sun, Jan 16, 2005 at 03:11:00PM -0500, Robert Wisniewski wrote:
 > > int global_val;
 > > 
 > > modify_val_spin()
 > > {
 > >acquire_spin_lock()
 > >// calculate some_value based on global_val
 > >// for example c=global_val; if (c%0) some_value=10; else some_value=20;
 > >global_val = global_val + some_value
 > >release_spin_lock()
 > > }
 > > 
 > > modify_val_atomic()
 > > {
 > >do
 > >// calculate some_value based on global_val
 > >// for example c=global_val; if (c%0) some_value=10; else some_value=20;
 > >global_val = global_val + some_value
 > >while (compare_and_store(global_val, , ))
 > > }
 > > 
 > > What's the difference.  The deal is if two processes execute this code
 > > simultaneously and one gets interrupted in the middle of modify_val_spin,
 > > then the other wastes its entire quantum spinning for the lock.  In the
 > > modify_val_atomic if one process gets interrupted, no problem, the other
 > > process can proceed through, then when the first one runs again the CAS
 > > will fail, and it will go around the loop again.  Now imagine it was the
 > > kernel involved...
 > 
 > Just prevent that with spin_lock_irq.  But anyway this example doesn't
 > fit the ltt code.  cmpxchg loops can make lots of sense for such simple
 > loops, but as soon as you need to do significant work in the loop it
 > starts to get problematic.  Your example would btw be better off using

The loop in question is where we grab the current (old) index and perform a
computation (or three).  The only expensive operation is the timestamp
acquisition, which has been modified to use the cheaper rdtsc, so I still
think that's within the realm of a reasonably simple loop.  I think what
you want to avoid is starting to walk a (potentially indeterminate) data
structure in such an atomic op loop.
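
For reference, the kind of short compare-and-swap loop being discussed can be sketched in user space with C11 atomics. This is an illustration only, not the LTT code: it reuses the global_val and modify_val_atomic names from the pseudocode quoted above, and it assumes a modulus of 2 for the "calculate some_value" step, since the modulus in the quoted pseudocode is not usable as written.

```c
#include <assert.h>
#include <stdatomic.h>

atomic_int global_val;

/* Lock-free update: compute the new value from a snapshot and retry if
 * another thread changed global_val in the meantime. */
int modify_val_atomic(void)
{
    int old = atomic_load(&global_val);
    int new_val;

    do {
        /* Assumed computation: odd snapshot adds 10, even adds 20. */
        int some_value = (old % 2) ? 10 : 20;
        new_val = old + some_value;
        /* On failure, old is reloaded with the current value. */
    } while (!atomic_compare_exchange_weak(&global_val, &old, new_val));

    return new_val;
}
```

The body of the loop is a handful of instructions with no data-structure walk, so a failed compare-and-swap only costs one short recomputation.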

 > atomic_t and its primitives so you abstract away the actual implementation
 > and the architecture can choose the most efficient implementation.
 > 

That's an interesting thought because it might address Andrew's concern.
We'll investigate.  Thanks.

-bob


Re: 2.6.11-rc1-mm1

2005-01-16 Thread Karim Yaghmour

Hello Roman,

Roman Zippel wrote:
> It seems we first need to specify, what relayfs actually is supposed to 
> be. Is it a relaying mechanism for large amount of data from kernel to 
> user space or is it a general communication channel between kernel and 
> user space? You have to choose one, if you mix contradicting requirements, 
> you'll never get a simple abstraction layer and relayfs will always be a 
> pain to work with.

I think we want to concentrate on the former, though I suspect the latter
will happen eventually. But let's keep our focus on providing a mechanism
for relaying large amounts of data from the kernel to user-space.

> You can make it even simpler by dropping this completely. Every buffer is 
> simply a list of events and you can let ltt write periodically a timer 
> event. In userspace you can randomly seek at buffer boundaries and search 
> for the timer events. It will require a bit more work for userspace, but 
> even large amounts of tracing data stay manageable.

We already do write a heartbeat event periodically to have readable
traces in the case where the lower 32 bits of the TSC wrap-around.

As I mentioned elsewhere, please don't think of this in terms of
kbs or mbs of data. What we're talking about here is gbs if not
100gbs of data. Having to start reading each sub-buffer until you
hit a heartbeat really is a killer for such large traces. If there
was a significant impact on relayfs for having this I would have
understood the argument, but relayfs needs to do buffer-management
anyway, so I don't see that much complexity being added by allowing
the channel user to ask relayfs for delimiters.

> Userspace can then easily restore the original order of events.

As above, restoring the original order of events is fine if you are
looking at mbs or kbs of data. It's just totally unrealistic for
the amounts of data we want to handle.

But like I said earlier, the added relayfs mode (kdebug) would allow
for exactly what you are suggesting:
event_id = atomic_inc_return(&event_cnt);
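
As a rough user-space illustration of that idea (not relayfs code): every event takes its id from one shared atomic counter, so events collected from separate per-CPU buffers can be put back in global order by sorting on the id. The struct event layout and the next_event_id() and restore_order() helpers are hypothetical, and counter wrap-around is ignored in this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical event record; the real trace format is not shown here. */
struct event {
    unsigned int id;    /* global sequence number */
    int cpu;            /* buffer the event was logged to */
};

atomic_uint event_cnt;

/* User-space analogue of atomic_inc_return(): returns the new value. */
unsigned int next_event_id(void)
{
    return atomic_fetch_add(&event_cnt, 1u) + 1u;
}

static int cmp_event(const void *a, const void *b)
{
    const struct event *x = a, *y = b;
    return (x->id > y->id) - (x->id < y->id);
}

/* Restore the global order of events gathered from several buffers. */
void restore_order(struct event *ev, size_t n)
{
    assert(ev != NULL || n == 0);
    if (n == 0)
        return;
    qsort(ev, n, sizeof *ev, cmp_event);
}
```

Sorting is O(n log n) over the whole set, which is where the concern about very large traces comes from.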

So here's the new API based on input from Christoph and Tom:

rchan* relay_open(channel_path, bufsize, nbufs);
int    relay_close(*rchan);
int    relay_reset(*rchan);
int    relay_write(*rchan, *data_ptr, count, **wrote-pos);

int    relay_info(*rchan, *channel_info);
void   relay_set_property(*rchan, property, value);
void   relay_get_property(*rchan, property, *value);

For direct writing (currently already used by ltt, for example):

char*  relay_reserve(*rchan, len, *ts, *td, *err, *interrupting)
void   relay_commit(*rchan, *from, len, reserve_code, interrupting);
void   relay_buffers_consumed(*rchan, u32)

These are the related macros:

#define relay_write_direct(DEST, SRC, SIZE) \
#define relay_lock_channel(RCHAN, FLAGS) \
#define relay_unlock_channel(RCHAN, FLAGS) \

What we are dropping for later review: read/write semantics from
user-space. It has to be understood that we believe that this is
a major drawback. For one thing, you won't be able to do something
like:
$ cat /relayfs/xchg/my-file > ~/test-data

Instead, you will have to write a custom app that does open(),
mmap(), write(). We could still provide a small app/library that
did this automagically, but you've got to admit that nothing
beats the real thing.

Also note that there are people who currently use this already,
so there will be some unhappy campers.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Robert Wisniewski

Andrew Morton writes:
 > Robert Wisniewski <[EMAIL PROTECTED]> wrote:
 > >
 > > modify_val_spin()
 > >  {
 > >acquire_spin_lock()
 > >// calculate some_value based on global_val
 > >// for example c=global_val; if (c%0) some_value=10; else some_value=20;
 > >global_val = global_val + some_value
 > >release_spin_lock()
 > >  }
 > > 
 > >  modify_val_atomic()
 > >  {
 > >do
 > >// calculate some_value based on global_val
 > >// for example c=global_val; if (c%0) some_value=10; else some_value=20;
 > >global_val = global_val + some_value
 > >while (compare_and_store(global_val, , ))
 > >  }
 > > 
 > >  What's the difference.  The deal is if two processes execute this code
 > >  simultaneously and one gets interrupted in the middle of modify_val_spin,
 > >  then the other wastes its entire quantum spinning for the lock.  In the
 > >  modify_val_atomic if one process gets interrupted, no problem, the other
 > >  process can proceed through, then when the first one runs again the CAS
 > >  will fail, and it will go around the loop again.
 > 
 > One could use spin_lock_irq().  The performance would be similar.

Yes, on some architectures I think you're right (on some archs, though, I'm not
so sure) - Ingo and I had that debate a bit ago.  But as you astutely noted
or asked below, the original intent was to be able to use a single shared
buffer for user and kernel space.  In fact, the lockless design of tracing
in K42, which motivated this design does that.  For a couple of reasons we
have not (yet?) done that for LTT.  But, for example, NPTL could have made
use of it when they were investigating a tracing facility.  Recently,
another company using LTT for device driver and video debugging is very
interested in cheap user space tracing in conjunction with kernel tracing
because they need both sets of events to understand what is up.  The debate
is still open for the best way to get cheap user space logging, but there
seems to be an increasing need for it by the community.

 > 
 > > Now imagine it was the kernel involved...
 > 
 > Or are you saying that userspace does the above as well?

:-) - as above.  Furthermore, it seems that reducing the places where
interrupts are disabled would be a good thing?  By not introducing
additional disable interrupts tracing has less of an impact.  I was also
pointing out Christoph's statement that spin locks and atomic ops are the
same is not accurate (except for perhaps limited cases, but then you must
make such arguments - not necessarily good), and we had good reasons for
using an atomic op.

Thanks.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[EMAIL PROTECTED]

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Christoph Hellwig

On Sun, Jan 16, 2005 at 03:11:00PM -0500, Robert Wisniewski wrote:
> int global_val;
> 
> modify_val_spin()
> {
>   acquire_spin_lock()
>   // calculate some_value based on global_val
>   // for example c=global_val; if (c%0) some_value=10; else some_value=20;
>   global_val = global_val + some_value
>   release_spin_lock()
> }
> 
> modify_val_atomic()
> {
>   do
>   // calculate some_value based on global_val
>   // for example c=global_val; if (c%0) some_value=10; else some_value=20;
>   global_val = global_val + some_value
>   while (compare_and_store(global_val, , ))
> }
> 
> What's the difference.  The deal is if two processes execute this code
> simultaneously and one gets interrupted in the middle of modify_val_spin,
> then the other wastes its entire quantum spinning for the lock.  In the
> modify_val_atomic if one process gets interrupted, no problem, the other
> process can proceed through, then when the first one runs again the CAS
> will fail, and it will go around the loop again.  Now imagine it was the
> kernel involved...

Just prevent that with spin_lock_irq.  But anyway this example doesn't
fit the ltt code.  cmpxchg loops can make lots of sense for such simple
loops, but as soon as you need to do significant work in the loop it
starts to get problematic.  Your example would btw be better off using
atomic_t and its primitives so you abstract away the actual implementation
and the architecture can choose the most efficient implementation.


Re: 2.6.11-rc1-mm1

2005-01-16 Thread Andrew Morton

Robert Wisniewski <[EMAIL PROTECTED]> wrote:
>
> modify_val_spin()
>  {
>   acquire_spin_lock()
>   // calculate some_value based on global_val
>   // for example c=global_val; if (c%0) some_value=10; else some_value=20;
>   global_val = global_val + some_value
>   release_spin_lock()
>  }
> 
>  modify_val_atomic()
>  {
>   do
>   // calculate some_value based on global_val
>   // for example c=global_val; if (c%0) some_value=10; else some_value=20;
>   global_val = global_val + some_value
>   while (compare_and_store(global_val, , ))
>  }
> 
>  What's the difference.  The deal is if two processes execute this code
>  simultaneously and one gets interrupted in the middle of modify_val_spin,
>  then the other wastes its entire quantum spinning for the lock.  In the
>  modify_val_atomic if one process gets interrupted, no problem, the other
>  process can proceed through, then when the first one runs again the CAS
>  will fail, and it will go around the loop again.

One could use spin_lock_irq().  The performance would be similar.

> Now imagine it was the kernel involved...

Or are you saying that userspace does the above as well?

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Robert Wisniewski

Karim Yaghmour writes:
 > 
 > Christoph Hellwig wrote:
 > > the lockless mode is really just loops around cmpxchg.  It's spinlocks
 > > reinvented poorly.

Christoph,
 Sadly they're not the same, atomic operations provide a set of
functionality that simple spin locks do not give you.  Consider two
different processes each executing the following code

int global_val;

modify_val_spin()
{
acquire_spin_lock()
// calculate some_value based on global_val
// for example c=global_val; if (c%0) some_value=10; else some_value=20;
global_val = global_val + some_value
release_spin_lock()
}

modify_val_atomic()
{
do
// calculate some_value based on global_val
// for example c=global_val; if (c%0) some_value=10; else some_value=20;
global_val = global_val + some_value
while (compare_and_store(global_val, , ))
}

What's the difference.  The deal is if two processes execute this code
simultaneously and one gets interrupted in the middle of modify_val_spin,
then the other wastes its entire quantum spinning for the lock.  In the
modify_val_atomic if one process gets interrupted, no problem, the other
process can proceed through, then when the first one runs again the CAS
will fail, and it will go around the loop again.  Now imagine it was the
kernel involved...
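
For concreteness, the atomic variant can be sketched in user space with C11 atomics. This is only an illustration under assumptions: it takes the check in the pseudocode to be a parity test (c % 2), uses atomic_compare_exchange_weak() as a stand-in for compare_and_store(), and pick_increment() is a made-up helper.

```c
#include <stdatomic.h>

/* Shared counter, standing in for global_val in the pseudocode. */
static _Atomic int global_val;

/* Assumed rule: add 10 when the current value is odd, 20 when it is even. */
static int pick_increment(int current)
{
    return (current % 2) ? 10 : 20;
}

/*
 * Lock-free update. If another thread changes global_val between the
 * load and the compare-exchange, the exchange fails, 'old' is refreshed
 * with the current value, and the calculation is simply redone.
 */
int modify_val_atomic(void)
{
    int old = atomic_load(&global_val);
    int new_val;

    do {
        new_val = old + pick_increment(old);
    } while (!atomic_compare_exchange_weak(&global_val, &old, new_val));

    return new_val;
}
```

Note that the new value is computed into a local and only published by the compare-exchange, so a preempted caller never holds anything that others have to wait for.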

I don't claim to have all the answers and am happy to have discussion on
something, but the attitude expressed by "It's spinlocks reinvented
poorly."  is not conducive to a useful exchange even if you were correct.

 > 
 > I beg to differ. You have to use different spinlocks depending on
 > where you are:
 > - serving user-space
 > - bh-derivatives
 > - irq
 > 
 > lockless is the same primitive regardless of your current state,
 > it's not the same as spinlocks.
 > 
 > Karim
 > -- 
 > Author, Speaker, Developer, Consultant
 > Pushing Embedded and Real-Time Linux Systems Beyond the Limits
 > http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[EMAIL PROTECTED]

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Tom Zanussi

Christoph Hellwig writes:
 > On Fri, Jan 14, 2005 at 04:11:38PM -0500, Karim Yaghmour wrote:
 > >Where does this appear in relayfs and what rights do
 > >user-space apps have over it (rwx).
 > 
 > Why would you want anything but read access?

This would allow an application to write trace events of its own to a
trace stream for instance.  Also, I added a user-requested 'feature'
whereby write()s on a relayfs channel would be sent to a callback that
could be used to interpret 'out-of-band' commands sent from the
userspace application.  And if lockless logging were being used, this
could provide a cheaper way for applications to write to the trace
buffer than having to do it via syscall.

 > 
 > > bufsize, nbufs:
 > >Usually things have to be subdivided in sub-buffers to make
 > >both writing and reading simple. LTT uses this to allow,
 > >among other things, random trace access.
 > 
 > I think random access is overkill.  Keeping the code simple is more
 > important and user-space can post-process it.
 > 
 > > resize_min, resize_max:
 > >Allow for dynamic resizing of buffer.
 > 
 > Auto-resizing sounds like a really bad idea.

It also doesn't seem to be really useful to anyone, so we should
probably remove it.

Tom

 > 
 > > init_buf, init_buf_size:
 > >Is there an initial buffer containing some data that should
 > >be used to initialize the channel's content. If you're doing
 > >init-time tracing, for example, you need to have a pre-allocated
 > >static buffer that is copied to relayfs once relayfs is mounted.
 > 
 > And why can't you do this from that code?  It just needs an initcall-like
 > thing that runs after mounting of relayfs.
 > 

-- 
Regards,

Tom Zanussi <[EMAIL PROTECTED]>
IBM Linux Technology Center/RAS


Re: 2.6.11-rc1-mm1

2005-01-16 Thread Karim Yaghmour


Christoph Hellwig wrote:
> the lockless mode is really just loops around cmpxchg.  It's spinlocks
> reinvented poorly.

I beg to differ. You have to use different spinlocks depending on
where you are:
- serving user-space
- bh-derivatives
- irq

lockless is the same primitive regardless of your current state,
it's not the same as spinlocks.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Karim Yaghmour

Hello Christoph,

Christoph Hellwig wrote:
> Why would you want anything but read access?

Fine, we can put it read-only, we'll drop the "mode" field.

> I think random access is overkill.  Keeping the code simple is more
> important and user-space can post-process it.

It's overkill if you're thinking in terms of KBs or MBs of data.
It isn't if you're looking at GBs and 100 GBs. Please read my
other posting as to who is using this and how.

But regardless of access, you have to have some way of telling
relayfs the size of the channel you want. bufsize and nbufs
just tell relayfs the size of the buffers you want and how many
buffers there are in the ring, both of which are really basic
to any sort of buffering scheme.
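
To make the role of the two parameters concrete, here is a small C sketch of the arithmetic a ring of sub-buffers implies. The struct and function names are hypothetical and are not taken from the relayfs code; only bufsize and nbufs come from the discussion.

```c
#include <stddef.h>

/* Hypothetical description of a channel's geometry. */
struct chan_geometry {
    size_t bufsize; /* size of one sub-buffer, in bytes */
    size_t nbufs;   /* number of sub-buffers in the ring */
};

/* Total memory the channel needs. */
size_t chan_total_size(const struct chan_geometry *g)
{
    return g->bufsize * g->nbufs;
}

/* Sub-buffer that a monotonically growing write position falls in. */
size_t chan_subbuf_index(const struct chan_geometry *g, size_t pos)
{
    return (pos / g->bufsize) % g->nbufs;
}

/* Offset of that position inside its sub-buffer. */
size_t chan_subbuf_offset(const struct chan_geometry *g, size_t pos)
{
    return pos % g->bufsize;
}
```

With fixed-size sub-buffers, a reader can jump straight to any sub-buffer boundary, which is what makes random access into a large trace cheap.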

> Auto-resizing sounds like a really bad idea.

Ok, it will go.

> And why can't you do this from that code?  It just needs an initcall-like
> thing that runs after mounting of relayfs.

Ok, we'll leave it to the caller to do a relay_write() with his
init-bufs at startup.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-16 Thread William Lee Irwin III

Joseph Fannin wrote:
>>With this patch, initrds seem to get 'skipped'.  I think this is
>> probably the cause for the reports of problems with RAID too.

On Sun, Jan 16, 2005 at 07:09:31PM +, Daniel Drake wrote:
> This seems likely and is probably also the cause of wli's problems 
> mentioned elsewhere in this thread.
> I had overlooked the way that initrd's work in that part of the boot 
> sequence. Will investigate.

akpm suspected this immediately, and my tests confirmed it.

I should probably do the work to make the box boot with CONFIG_MODULES=n
as I don't like initrd's or modules anyway (new points of failure).


-- wli

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Daniel Drake

Hi,

Joseph Fannin wrote:
> On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
>>
>> waiting-10s-before-mounting-root-filesystem.patch
>>  retry mounting the root filesystem at boot time
>
> With this patch, initrds seem to get 'skipped'.  I think this is
> probably the cause for the reports of problems with RAID too.

This patch should do the job. Replaces the existing
waiting-10s-before-mounting-root-filesystem.patch in 2.6.11-rc1-mm1.

Daniel

Retry up to 20 times if mounting the root device fails.  This fixes booting
from usb-storage devices, which no longer make their partitions immediately
available. Also cleans up the mount_block_root() function.

Based on an earlier patch from William Park <[EMAIL PROTECTED]>

Signed-off-by: Daniel Drake <[EMAIL PROTECTED]>

--- linux-2.6.10/init/do_mounts.c.orig	2005-01-16 19:18:57.0 +
+++ linux-2.6.10/init/do_mounts.c	2005-01-16 21:04:29.198471440 +
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -261,6 +262,9 @@ static void __init get_fs_names(char *pa
 static int __init do_mount_root(char *name, char *fs, int flags, void *data)
 {
 	int err = sys_mount(name, "/root", fs, flags, data);
+	if (err == -EACCES && (flags & MS_RDONLY) == 0)
+		err = sys_mount(name, "/root", fs, flags | MS_RDONLY, data);
+
 	if (err)
 		return err;
 
@@ -273,38 +277,57 @@ static int __init do_mount_root(char *na
 	return 0;
 }
 
+static int __init mount_root_try_all_fs(char *name, char *fs_names, int flags, void *data)
+{
+	char *p;
+	int err = -EFAULT;
+
+	for (p = fs_names; *p; p += strlen(p)+1) {
+		err = do_mount_root(name, p, flags, root_mount_data);
+		if (err != -EINVAL)
+			break;
+	}
+
+	return err;
+}
+
 void __init mount_block_root(char *name, int flags)
 {
 	char *fs_names = __getname();
-	char *p;
 	char b[BDEVNAME_SIZE];
+	int tryagain = 20;
 
 	get_fs_names(fs_names);
-retry:
-	for (p = fs_names; *p; p += strlen(p)+1) {
-		int err = do_mount_root(name, p, flags, root_mount_data);
-		switch (err) {
-			case 0:
-				goto out;
-			case -EACCES:
-				flags |= MS_RDONLY;
-				goto retry;
-			case -EINVAL:
-				continue;
+
+	while (1) {
+		int err = mount_root_try_all_fs(name, fs_names, flags, root_mount_data);
+		if (err == 0)
+			break;
+
+		/*
+		 * The root device may not be ready yet, so we retry a number of times
+		 */
+		if (--tryagain) {
+			printk(KERN_WARNING "VFS: Waiting %dsec for root device...\n",
+			   tryagain);
+			ssleep(1);
+			if (!ROOT_DEV) {
+				ROOT_DEV = name_to_dev_t(saved_root_name);
+				create_dev(name, ROOT_DEV, root_device_name);
+			}
+			continue;
 		}
-	/*
+
+		/*
 		 * Allow the user to distinguish between failed sys_open
 		 * and bad superblock on root device.
 		 */
 		__bdevname(ROOT_DEV, b);
-		printk("VFS: Cannot open root device \"%s\" or %s\n",
-				root_device_name, b);
-		printk("Please append a correct \"root=\" boot option\n");
-
+		printk(KERN_CRIT "VFS: Cannot open root device \"%s\" or %s\n",
+		   root_device_name, b);
+		printk(KERN_CRIT "Please append a correct \"root=\" boot option\n");
 		panic("VFS: Unable to mount root fs on %s", b);
 	}
-	panic("VFS: Unable to mount root fs on %s", __bdevname(ROOT_DEV, b));
-out:
 	putname(fs_names);
 }

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Tom Zanussi

Karim Yaghmour writes:
 > 
 > What I'm dropping for now is all the functions that allow a
 > subsystem to read from a channel from within the kernel. So,
 > for example, if you want to obtain large amounts of data from
 > user-space via a relayfs channel you won't be able to. Here
 > are the functions that would go:
 > 
 > rchan_reader *add_rchan_reader(channel_id, auto_consume)
 > int    remove_rchan_reader(rchan_reader *reader)
 > rchan_reader *add_map_reader(channel_id)
 > int    remove_map_reader(rchan_reader *reader)
 > int    relay_read(reader, buf, count, wait, *actual_read_offset)
 > void   relay_buffers_consumed(reader, buffers_consumed)
 > void   relay_bytes_consumed(reader, bytes_consumed, read_offset)
 > int    relay_bytes_avail(reader)
 > int    rchan_full(reader)
 > int    rchan_empty(reader)
 > 
 > We could add these at a later time when/if needed. Removing
 > these changes nothing for ltt.

One of the things that uses these functions to read from a channel
from within the kernel is the relayfs code that implements read(2), so
taking them away means you wouldn't be able to use read() on a relayfs
file.  That wouldn't matter for ltt since it mmaps the file, but there
are existing users of relayfs that do use relayfs this way.  In fact,
most of the bug reports I've gotten are from people using it in this
mode.  That doesn't mean though that it's necessarily the right thing
for relayfs or these users to be doing if they have suitable
alternatives for passing lower-volume messages in this way.  As others
have mentioned, that seems to be the major question - should relayfs
concentrate on being solely a high-speed data relay mechanism or
should it try to be more, as it currently is implemented?  If the
former, then I wonder if you need a filesystem at all - all you have
is a collection of mmappable buffers and the only thing the filesystem
provides is the namespace.  Removing read()/write() and filesystem
support would of course greatly simplify the code; I'd like to hear
from any existing users though and see what they'd be missing.

ltt would still need at least relay_buffers_consumed() though.  This
is used to support the 'no-overwrite' option, which means that when
the buffers are full i.e. the daemon has fallen behind and needs to
catch up, channel writing is 'suspended' until it catches up.
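
The bookkeeping behind a 'no-overwrite' mode can be sketched roughly as follows. This is a simplified single-threaded model with hypothetical names, not the relayfs implementation; chan_buffers_consumed() merely plays the role that relay_buffers_consumed() has in the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-channel counters. */
struct chan_state {
    size_t nbufs;    /* sub-buffers in the ring */
    size_t produced; /* sub-buffers filled by the kernel writer */
    size_t consumed; /* sub-buffers the daemon has reported as read */
};

/* Writer side: is there a free sub-buffer to switch to? */
bool chan_can_write(const struct chan_state *c)
{
    return c->produced - c->consumed < c->nbufs;
}

/* Reader side: the daemon has finished with 'n' more sub-buffers. */
void chan_buffers_consumed(struct chan_state *c, size_t n)
{
    c->consumed += n;
}
```

When chan_can_write() is false the writer has to drop or suspend events instead of overwriting unread data, and it resumes once the daemon reports progress.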

 > 
 > Also, we should try to get rid of the following. They are there
 > for allowing dynamically-resizable buffers, but if we are to
 > make buffer-management opaque, then this should be done
 > internally (Tom: I can't remember the rationale for these. Let
 > me know if there's a reason why the must be kept.)
 > 
 > int    relay_realloc_buffer(*rchan, nbufs, async)
 > int    relay_replace_buffer(*rchan)

relay_realloc_buffer actually does the work of allocating the new
buffer space used for resizing, and since it can sleep, it's done
in the background using a work queue.  When everything's ready, the
channel buffer can then be replaced, thus relay_replace_buffer().

The only user of channel resizing that I know of is the 'dynamically
resizeable printk replacement' I posted awhile back, and that
apparently doesn't have any users, so I'd be happy to get rid of all
the resizing code.

Tom

 > 
 > I think this is a pretty major change and simplification of the
 > API along the lines of what others have asked for. Let me know
 > what you think.
 > 
 > Karim
 > -- 
 > Author, Speaker, Developer, Consultant
 > Pushing Embedded and Real-Time Linux Systems Beyond the Limits
 > http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

-- 
Regards,

Tom Zanussi <[EMAIL PROTECTED]>
IBM Linux Technology Center/RAS


Re: 2.6.11-rc1-mm1

2005-01-16 Thread Roman Zippel

Hi,

On Sun, 16 Jan 2005, Karim Yaghmour wrote:

> The per-cpu buffering issue is really specific to the client. It just
> so happens that LTT creates one channel for each CPU. Not everyone
> who needs to ship lots of data to user-space needs/wants one channel
> per cpu. You could, for example, use a relayfs channel as a big
> chunk of memory visible to both a user-space app and its kernel buddy
> in order to exchange data without ever using either needing more
> than one such channel for your entire subsystem.

It seems we first need to specify what relayfs is actually supposed to
be. Is it a relaying mechanism for large amounts of data from kernel to
user space, or is it a general communication channel between kernel and
user space? You have to choose one; if you mix contradicting requirements,
you'll never get a simple abstraction layer and relayfs will always be a
pain to work with.

> > Why not just move the ltt buffer management into relayfs and provide a 
> > small library, which extracts the event stream again? Otherwise you have 
> > to duplicate this work for every serious relayfs user anyway.
> 
> Ok, I've been meditating over what you say above for some time in order
> to understand how best to follow what you are suggesting. So here's
> what I've been able to come up with. Let me know if you have other
> suggestions:
> 
> Drop the buffer-start/end callbacks altogether. Instead, allow user
> to specify in the channel properties whether they want to have
> sub-buffer delimiters. If so, relayfs would automatically prepend
> and append the structures currently written by ltt:
> /* Start of trace buffer information */
> typedef struct _ltt_buffer_start {
>   struct timeval time;/* Time stamp of this buffer */
>   u32 tsc;/* TSC of this buffer, if applicable */
>   u32 id; /* Unique buffer ID */
> } LTT_PACKED_STRUCT ltt_buffer_start;
> 
> /* End of trace buffer information */
> typedef struct _ltt_buffer_end {
>   struct timeval time;/* Time stamp of this buffer */
>   u32 tsc;/* TSC of this buffer, if applicable */
> } LTT_PACKED_STRUCT ltt_buffer_end;

You can make it even simpler by dropping this completely. Every buffer is 
simply a list of events and you can let ltt write periodically a timer 
event. In userspace you can randomly seek at buffer boundaries and search 
for the timer events. It will require a bit more work for userspace, but 
even large amounts of tracing data stay manageable.

> As for lockless vs. locking there is a need for both. Not having
> to get locks has obvious advantages, but if you require strict
> timing you will want to use the locking scheme because its logging
> time is linear (see Thomas' complaints about lockless elsewhere
> in this thread, and Ingo's complaints about relayfs somewhere back
> in October.)

But why does it have to be done in relayfs? Simply leave it to the user to
write an extra id field:

event_id = atomic_inc_return(&event_cnt);

Userspace can then easily restore the original order of events.
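
On the user-space side, restoring the order could be as simple as sorting by that id. A minimal sketch in C, assuming a made-up record layout (struct event) and ignoring counter wrap-around:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record layout: 'id' is the value the kernel side stored. */
struct event {
    uint32_t id;
    uint32_t payload;
};

/* qsort() comparator ordering events by ascending id. */
static int cmp_event_id(const void *a, const void *b)
{
    const struct event *ea = a;
    const struct event *eb = b;

    return (ea->id > eb->id) - (ea->id < eb->id);
}

/* Put events gathered from several buffers back into logging order. */
void restore_event_order(struct event *events, size_t count)
{
    qsort(events, count, sizeof(*events), cmp_event_id);
}
```

A real post-processor would also have to cope with the counter wrapping and with merging per-CPU streams that are each already sorted, where a merge would be cheaper than a full sort.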

bye, Roman

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Daniel Drake

Joseph Fannin wrote:
> On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
>>
>> waiting-10s-before-mounting-root-filesystem.patch
>>  retry mounting the root filesystem at boot time
>
> With this patch, initrds seem to get 'skipped'.  I think this is
> probably the cause for the reports of problems with RAID too.

This seems likely and is probably also the cause of wli's problems mentioned
elsewhere in this thread.

I had overlooked the way that initrd's work in that part of the boot sequence.
Will investigate.

Daniel

Re: 2.6.11-rc1-mm1

2005-01-16 Thread Christoph Hellwig

On Fri, Jan 14, 2005 at 06:09:23PM -0500, Karim Yaghmour wrote:
> relayfs implements two schemes: lockless and locking. The latter uses
> standard linear locking mechanisms. If you need stringent constant
> time, you know what to do.

the lockless mode is really just loops around cmpxchg.  It's spinlocks
reinvented poorly.


Re: 2.6.11-rc1-mm1

2005-01-16 Thread Christoph Hellwig

On Sat, Jan 15, 2005 at 01:24:16AM +0100, Thomas Gleixner wrote:
> Putting a 200k patch into the kernel for limited usage and maybe
> restricting a generic simple non intrusive and more generic
> implementation by its mere presence is making it inapplicable enough.
> 
> Merge the instrumentation points from ltt and other projects like DSKI
> and the places where in kernel instrumentation for specific purposes is
> already available and use a simple and effective framework which moves
> the burden into postprocessing and provides a simple postmortem dump
> interface, is the goal IMHO.
> 
> When this is available, trace tool developers can concentrate on
> postprocessing improvement rather than moving postprocessing
> incapabilities into the kernel.

I completely agree with that statement.  We've been working in most
areas of the kernel to move or keep complexity and policy in userspace.
The same should be true for a tracing framework.


Re: 2.6.11-rc1-mm1

2005-01-16 Thread Christoph Hellwig

On Fri, Jan 14, 2005 at 04:11:38PM -0500, Karim Yaghmour wrote:
>   Where does this appear in relayfs and what rights do
>   user-space apps have over it (rwx).

Why would you want anything but read access?

> bufsize, nbufs:
>   Usually things have to be subdivided in sub-buffers to make
>   both writing and reading simple. LTT uses this to allow,
>   among other things, random trace access.

I think random access is overkill.  Keeping the code simple is more
important and user-space can post-process it.

> resize_min, resize_max:
>   Allow for dynamic resizing of buffer.

Auto-resizing sounds like a really bad idea.

> init_buf, init_buf_size:
>   Is there an initial buffer containing some data that should
>   be used to initialize the channel's content. If you're doing
>   init-time tracing, for example, you need to have a pre-allocated
>   static buffer that is copied to relayfs once relayfs is mounted.

And why can't you do this from that code?  It just needs an initcall-like
thing that runs after mounting of relayfs.


Re: 2.6.11-rc1-mm1

2005-01-16 Thread Robert Wisniewski

Karim Yaghmour writes:
 > 
 > Hello Thomas,
 > 
 > In the interest of avoiding spreading the thread too thin, I'm replying to
 > both emails at the same time.
 > 
 > Thomas Gleixner wrote:
 > >>relayfs is a generalized buffering mechanism. Tracing is one application
 > >>it serves. Check out the web site: "high-speed data-relay filesystem."
 > >>Fancy name huh ...
 > >
 > >
 > > I do not doubt that.
 > >
 > > But hardwiring an instrumentation framework on it is also hardwiring
 > > implicit restrictions on the usability of the instrumentation for
 > > certain purposes.
 > 
 > To a certain extent this is true. Please refer to my reply to your RFC
 > for a discussion of this.
 > 
 > >>Well for one thing, a portion of code running in user-context won't
 > >>disable interrupts while it's attempting to get buffer space, and
 > >>therefore won't impact on interrupt delivery.
 > >
 > >
 > > The do {} while loops are in the fast ltt_log_event path

As Greg's comments implicitly involved this issue as well, maybe it's worth
expanding on what is going on here.  The idea behind the lockless tracing
is for each process/thread to atomically reserve space in the buffer, then
write in the events.  Also note that buffers are per-processor.  So the do
{} while loop loads the current index, does a calculation and attempts to
use the calculated value (which is the old index + length of current event)
to atomically compare_and_swap with the actual index pointer.  As Karim
correctly notes, the only way this will fail is if an interrupt occurred
during the couple of instructions of the calculation, i.e., between when
the old value was loaded and when we do the CAS, so it's unlikely.  It is
even more unlikely that, as he notes, this process would be woken up for
only a couple of instructions and then re-interrupted.  Back to Greg's
volatile issue:
The reason the index needs to be volatile (or as was originally coded the
reason we clobbered the registers) is to make sure the compiler knows the
index value needs to get reloaded from memory each time around the loop.
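
The reservation step described here can be sketched in user space with C11 atomics. This is an illustration under assumptions, not the LTT code: trace_buf, trace_reserve() and BUF_SIZE are hypothetical, and the _Atomic index takes over the job of the volatile qualifier in making sure the value is re-read on each pass.

```c
#include <stdatomic.h>
#include <stddef.h>

#define BUF_SIZE 64u /* hypothetical per-CPU buffer size, in bytes */

struct trace_buf {
    _Atomic size_t index; /* offset of the next free byte */
    char data[BUF_SIZE];
};

/*
 * Atomically reserve 'len' bytes. Returns the start offset of the
 * reserved region, or (size_t)-1 when the event does not fit.
 */
size_t trace_reserve(struct trace_buf *buf, size_t len)
{
    size_t old = atomic_load(&buf->index);
    size_t next;

    do {
        if (len > BUF_SIZE - old)
            return (size_t)-1;
        next = old + len; /* old index + length of current event */
        /* On failure 'old' is reloaded and the calculation redone. */
    } while (!atomic_compare_exchange_weak(&buf->index, &old, next));

    return old;
}
```

The caller then writes its event into data[offset .. offset + len) without further synchronization, since no other writer can be handed the same range.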

Hope this helps.  I'm certainly happy to discuss at more length if there
are any concerns/questions.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[EMAIL PROTECTED]

 > 
 > You mean that it would impact interrupt delivery? This code's behavior
 > has actually been carefully studied, and what has been seen is that
 > the code almost never loops, and when it does, it very rarely does
 > it more than twice. In the case of an interrupt, you'd have to receive
 > an interrupt while reserving space for logging the current interrupt's
 > occurrence for the loop to be done twice. I've CC'ed Bob Wisniewski
 > on this as he's the one that implemented this code and studied its
 > behavior in depth.
 > 
 > > Yeah, did you answer one of my arguments except claiming that I'm too
 > > stupid to understand how it works ?
 > 
 > If I misspoke, then I apologize. For one thing, I've never thought
 > of you as stupid. I'm just trying to get specifics here.
 > 
 > > I just dont like the idea, that instrumentation is bound on relayfs and
 > > adds a feature to the kernel which fits for a restricted set of problems
 > > rather than providing a generic optimized instrumentation framework,
 > > where one can use relayfs as a backend, if it fits his needs. Making
 > > this less glued together leaves the possibility to use other backends. 
 > 
 > Yes, I understand and I hope my other mail properly addresses this issue.
 > 
 > > There is a loop in ltt_log_event, which enforces the processing of each
 > > event twice. Spliting traces is postprocessing and can be done
 > > elsewhere.
 > 
 > Sorry, this is not postprocessing. Let me explain:
 > 
 > Basically, the ltt framework allows only one tracing session to be active
 > at any time. IOW, if you were planning on starting a 2 week trace and
 > after doing so wanted to trace an application for a short 10s then you are
 > screwed, LTT won't allow you to do that. Currently this is a limitation
 > which we haven't heard any complaints about, so we're not going to
 > generalize it until there is proof that people really need this.
 > 
 > However, there are cases where you want to have tracing running at _all_
 > times in what is referred to as flight-recorder mode and only dump the
 > content of the buffers when something special happens. Yet, those who
 > are interested in having this 24x7 mode also know enough about tracing
 > that they do need to actually trace other things for short periods
 > without disrupting their flight-recording. That's why there's a loop.
 > An event will be processed twice only if you're tracing AND
 > flight-recording at the same time.
 > 
 > There is no way to do an equivalent of what I just described with any
 > form of postprocessing.
 > 
 > Here's the proper snippet from include/linux/ltt-events.h:
 > /* We currently support 2 traces, normal trace and flight recorder */
 > #define NR_TRACES

Re: 2.6.11-rc1-mm1 waiting-10s-before-mounting-root-....

2005-01-16 Thread syrius . ml

Daniel Kirsten <[EMAIL PROTECTED]> writes:

>> Are you using an initrd?
> yes.

Then read Documentation/initrd.txt ...
Your initrd must be outdated; I guess you have to use
root=/dev/whatever/your_final_root_fs with it, whereas it should be
root=/dev/ram0. (pretty sure it doesn't use pivot_root either :) )

FYI it works here with an updated initrd without reversing a patch...

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Breakage with raid in 2.6.11-rc1-mm1 [Regression in mm]

2005-01-16 Thread Reuben Farrelly

Hi,
Reuben Farrelly wrote:
At 12:58 a.m. 15/01/2005, Andrew Morton wrote:
Reuben Farrelly <[EMAIL PROTECTED]> wrote:
>
> Something seems to have broken with 2.6.11-rc1-mm1, which worked ok with
> 2.6.10-mm3.
>
> NET: Registered protocol family 17
> Starting balanced_irq
> BIOS EDD facility v0.16 2004-Jun-25, 2 devices found
> md: Autodetecting RAID arrays.
> md: autorun ...
> md: ... autorun DONE.

> Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
>
> The system is running 5 RAID-1 partitions, and md2 is the root as per
> grub.conf.  Problem seems to be that raid autodetection finds no raid
> partitions :(
>
> The two ST380013AS SATA drives are detected earlier in the boot, so I don't
> think that's the problem..

hm, the only raidy thing we have in there is the below.  Maybe you could
try reverting that?
--- 25/drivers/md/raid5.c~raid5-overlapping-read-hack   2005-01-09 22:20:40.211246912 -0800
+++ 25-akpm/drivers/md/raid5.c  2005-01-09 22:20:40.216246152 -0800
@@ -232,6 +232,7 @@ static struct stripe_head *__find_stripe
 }

 static void unplug_slaves(mddev_t *mddev);
+static void raid5_unplug_device(request_queue_t *q);
 static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector,
 int pd_idx, int noblock)

Ok the breakage occurred somewhere between 2.6.10-mm3 (works) and 
2.6.11-rc1 (doesn't work) ie wasn't introduced into the latest -mm 
patchset as I first thought.

Are there any other patches that might be worth a try backing out?
reuben
I did a full untar of the source and rebuilt my (crusty old) config file
from scratch, and it seems to have come right now.  Can't really explain
it though... but obviously it wasn't a problem with the -mm release as I
first thought.  Now running -rc1-mm1 with no problems and no other patches.
Thanks to those who helped on what turned out to be a false alarm.
reuben


Re: 2.6.11-rc1-mm1

2005-01-15 Thread Karim Yaghmour

Hello Roman,

Roman Zippel wrote:
> It's interesting to read more about ltt's requirements, but I still think 
> it's possible to leave this work to the relayfs layer.

Ok, I'm willing to play ball, but can you be a little bit more specific?

> Why not just move the ltt buffer management into relayfs and provide a 
> small library, which extracts the event stream again? Otherwise you have 
> to duplicate this work for every serious relayfs user anyway.

Ok, I've been meditating over what you say above for some time in order
to understand how best to follow what you are suggesting. So here's
what I've been able to come up with. Let me know if you have other
suggestions:

Drop the buffer-start/end callbacks altogether. Instead, allow user
to specify in the channel properties whether they want to have
sub-buffer delimiters. If so, relayfs would automatically prepend
and append the structures currently written by ltt:
/* Start of trace buffer information */
typedef struct _ltt_buffer_start {
        struct timeval time;    /* Time stamp of this buffer */
        u32 tsc;                /* TSC of this buffer, if applicable */
        u32 id;                 /* Unique buffer ID */
} LTT_PACKED_STRUCT ltt_buffer_start;

/* End of trace buffer information */
typedef struct _ltt_buffer_end {
        struct timeval time;    /* Time stamp of this buffer */
        u32 tsc;                /* TSC of this buffer, if applicable */
} LTT_PACKED_STRUCT ltt_buffer_end;

This would also allow dropping the start_reserve, end_reserve, and
channel_start_reserve. The latter can be added by ltt as its first
event.
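
To make the sub-buffer delimiter idea concrete, here is a minimal
user-space sketch, not relayfs code. The names subbuf_start, subbuf_end,
subbuf_open, subbuf_close and subbuf_payload_size are made up for
illustration, and struct timeval is replaced by a plain 64-bit timestamp
so the example stands alone. It only shows where the two delimiters
would sit in a sub-buffer and how much room is left for client events.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified delimiters modelled on ltt_buffer_start/end. */
struct subbuf_start { uint64_t time; uint32_t tsc; uint32_t id; };
struct subbuf_end   { uint64_t time; uint32_t tsc; };

/* Bytes left for client events once both delimiters are accounted for. */
size_t subbuf_payload_size(size_t subbuf_size)
{
    assert(subbuf_size >= sizeof(struct subbuf_start) + sizeof(struct subbuf_end));
    return subbuf_size - sizeof(struct subbuf_start) - sizeof(struct subbuf_end);
}

/* Write the start delimiter at the head of the sub-buffer.
 * Returns the address where the client payload begins. */
unsigned char *subbuf_open(unsigned char *buf, uint64_t now, uint32_t tsc, uint32_t id)
{
    struct subbuf_start s;

    memset(&s, 0, sizeof s);
    s.time = now;
    s.tsc = tsc;
    s.id = id;
    memcpy(buf, &s, sizeof s);
    return buf + sizeof s;
}

/* Write the end delimiter into the last bytes of the sub-buffer. */
void subbuf_close(unsigned char *buf, size_t subbuf_size, uint64_t now, uint32_t tsc)
{
    struct subbuf_end e;

    memset(&e, 0, sizeof e);
    e.time = now;
    e.tsc = tsc;
    memcpy(buf + subbuf_size - sizeof e, &e, sizeof e);
}
```

With something along these lines done by the buffering layer itself, a
client would only ever see the payload area between the two delimiters.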

Is this what you are looking for, and is there something else we should
be doing?

> Completely abstracting the buffer management would make the whole 
> interface simpler and it would be a lot easier to change without breaking 
> everything. E.g. it would be possible to use per cpu buffers and remove 
> the need for different locking mechanisms, for a good tracing mechanism 
> it's not just important that it's lockless, but also that the cpus don't 
> share cache lines in the fast path. In this regard relayfs/ltt has really 
> still too much overhead and the complex relayfs API isn't really making it 
> easy to fix this.

The per-cpu buffering issue is really specific to the client. It just
so happens that LTT creates one channel for each CPU. Not everyone
who needs to ship lots of data to user-space needs/wants one channel
per cpu. You could, for example, use a relayfs channel as a big
chunk of memory visible to both a user-space app and its kernel buddy
in order to exchange data, without either ever needing more
than one such channel for your entire subsystem.

As for lockless vs. locking there is a need for both. Not having
to get locks has obvious advantages, but if you require strict
timing you will want to use the locking scheme because its logging
time is linear (see Thomas' complaints about lockless elsewhere
in this thread, and Ingo's complaints about relayfs somewhere back
in October.)

But in trying to make things simpler, here's a reworked API:

rchan* relay_open(channel_path, mode, bufsize, nbufs);
int    relay_close(*rchan);
int    relay_reset(*rchan)
int    relay_write(*rchan, *data_ptr, count, **wrote-pos);

int    relay_info(*rchan, *channel_info)
void   relay_set_property(*rchan, property, value);
void   relay_get_property(*rchan, property, *value);

For direct writing (currently already used by ltt, for example):

char*  relay_reserve(*rchan, len, *ts, *td, *err, *interrupting)
void   relay_commit(*rchan, *from, len, reserve_code, interrupting);

These are the related macros:

#define relay_write_direct(DEST, SRC, SIZE) \
#define relay_lock_channel(RCHAN, FLAGS) \
#define relay_unlock_channel(RCHAN, FLAGS) \

As I hinted elsewhere, we would now have three modes for relayfs
channels:
- locking => relies on local_irq_save.
- lockless => relies on try_reserve/fail->retry (based on cmpxchg).
- kdebug => this is for kernel debugging.

The last one could be based on Ingo's tracing code, or any
implementation suggestions by Thomas. It wouldn't do all
the checks and provide all the capabilities of the other two
mechanisms, but would really be a hot-path logger with only
minimalistic provisions for content loss and other such things.
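
For readers unfamiliar with the try_reserve/fail->retry pattern behind
the lockless mode, the following is a minimal user-space sketch using
C11 atomics. It is an illustration only and not the relayfs
implementation: struct chan and chan_reserve are hypothetical names, and
the channel is reduced to a flat buffer with a single write offset. The
retry counter is there to show why the loop body rarely runs more than
once: it only repeats when another writer moved the offset between the
load and the compare-and-swap.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical channel: a flat buffer plus an atomically updated offset. */
struct chan {
    _Atomic size_t offset;   /* next free byte */
    size_t size;             /* total capacity in bytes */
    unsigned char *data;
};

/* Reserve len bytes without taking a lock.
 * Returns a pointer to the reserved slot, or NULL if there is no room.
 * If retries is non-NULL it receives the number of lost races. */
unsigned char *chan_reserve(struct chan *c, size_t len, unsigned *retries)
{
    size_t old = atomic_load(&c->offset);
    unsigned lost = 0;
    unsigned char *slot = NULL;

    for (;;) {
        assert(old <= c->size);
        if (len > c->size - old)
            break;                       /* buffer full, give up */
        /* On failure, old is refreshed with the current offset. */
        if (atomic_compare_exchange_strong(&c->offset, &old, old + len)) {
            slot = c->data + old;
            break;
        }
        lost++;                          /* someone else got in first */
    }
    if (retries)
        *retries = lost;
    return slot;
}
```

The locking mode would replace the compare-and-swap loop with an
interrupt-disabled critical section around a plain offset update, which
trades the retry for a fixed cost on every call.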

(note to Tom: time_delta_offset that used to be in relay_write
should be a property set using relay_set_property).

What I'm dropping for now is all the functions that allow a
subsystem to read from a channel from within the kernel. So,
for example, if you want to obtain large amounts of data from
user-space via a relayfs channel you won't be able to. Here
are the functions that would go:

rchan_reader *add_rchan_reader(channel_id, auto_consume)
int    remove_rchan_reader(rchan_reader *reader)
rchan_reader *add_map_reader(channel_id)
int    remove_map_reader(rchan_reader *reader)
int    relay_read(reader, buf, count, wait, *actual_read_offset)
void   relay_buffers_consumed

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-15 Thread Karim Yaghmour


Hello Roman,

Roman Zippel wrote:
> On Sat, 15 Jan 2005, Karim Yaghmour wrote:
>>In addition, and this is a very important issue, quite a few
>>kernel developers mistook LTT for a kernel debugging tool, which
>>it was never meant to be. When, in fact, if you ask those who have
>>looked at using it for that purpose (try Marcelo or Andrea) you will
>>see that they didn't find it to be appropriate for them. And
>>rightly so, it was never meant for that purpose. Even lately, when
>>I suggested Ingo try using relayfs instead of his custom tracing
>>code for his preemption work, he looked at it and said that it
>>wasn't suited, but would consider reusing parts of it if it were
>>in the kernel.
> 
> Well, that's really a core problem. We don't want to duplicate 
> infrastructure, which practically does the same. So if relayfs isn't 
> usable in this kind of situation, it really raises the question whether 
> relayfs is usable at all. We need to make relayfs generally usable, 
> otherwise it will join the fate of devfs.

Hmm, coming from you I will take this as a pretty strong endorsement
for what I was suggesting earlier: provide a basic buffering mode
in relayfs to be used in kernel debugging. However, it must be
understood that this is separate from the existing modes and ltt,
for example, could not use such a basic infrastructure. If this is
ok with you, and no one wants to complain too loudly about this, I
will go ahead and add this to our to-do list for relayfs.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Re: 2.6.11-rc1-mm1

2005-01-15 Thread Karim Yaghmour

Hello Thomas,

To avoid spreading the thread too thin, I'm replying to
both emails at the same time.

Thomas Gleixner wrote:
>>relayfs is a generalized buffering mechanism. Tracing is one application
>>it serves. Check out the web site: "high-speed data-relay filesystem."
>>Fancy name huh ...
>
>
> I do not doubt that.
>
> But hardwiring an instrumentation framework on it is also hardwiring
> implicit restrictions on the usability of the instrumentation for
> certain purposes.

To a certain extent this is true. Please refer to my reply to your RFC
for a discussion of this.

>>Well for one thing, a portion of code running in user-context won't
>>disable interrupts while it's attempting to get buffer space, and
>>therefore won't impact on interrupt delivery.
>
>
> The do {} while loops are in the fast ltt_log_event path

You mean that it would impact interrupt delivery? This code's behavior
has actually been carefully studied, and what has been seen is that
the code almost never loops, and when it does, it very rarely does
it more than twice. In the case of an interrupt, you'd have to receive
an interrupt while reserving space for logging the current interrupt's
occurrence for the loop to be done twice. I've CC'ed Bob Wisniewski
on this as he's the one that implemented this code and studied its
behavior in depth.

> Yeah, did you answer one of my arguments except claiming that I'm too
> stupid to understand how it works ? 

If I misspoke, then I apologize. For one thing, I've never thought
of you as stupid. I'm just trying to get specifics here.

> I just don't like the idea that instrumentation is bound to relayfs and
> adds a feature to the kernel which fits a restricted set of problems
> rather than providing a generic optimized instrumentation framework,
> where one can use relayfs as a backend, if it fits his needs. Making
> this less glued together leaves the possibility to use other backends. 

Yes, I understand and I hope my other mail properly addresses this issue.

> There is a loop in ltt_log_event, which enforces the processing of each
> event twice. Splitting traces is postprocessing and can be done
> elsewhere.

Sorry, this is not postprocessing. Let me explain:

Basically, the ltt framework allows only one tracing session to be active
at any time. IOW, if you were planning on starting a 2 week trace and
after doing so wanted to trace an application for a short 10s then you are
screwed, LTT won't allow you to do that. Currently this is a limitation
which we haven't heard any complaints about, so we're not going to
generalize it until there is proof that people really need this.

However, there are cases where you want to have tracing running at _all_
times in what is referred to as flight-recorder mode and only dump the
content of the buffers when something special happens. Yet, those who
are interested in having this 24x7 mode also know enough about tracing
that they do need to actually trace other things for short periods
without disrupting their flight-recording. That's why there's a loop.
An event will be processed twice only if you're tracing AND
flight-recording at the same time.

There is no way to do an equivalent of what I just described with any
form of postprocessing.

Here's the proper snippet from include/linux/ltt-events.h:
/* We currently support 2 traces, normal trace and flight recorder */
#define NR_TRACES       2
#define TRACE_HANDLE    0
#define FLIGHT_HANDLE   1
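
As a rough illustration of why an event is only processed twice when
both a normal trace and the flight recorder are active, here is a small
stand-alone sketch. It is not the ltt_log_event code: struct trace_state
and log_event_all are hypothetical names, and "logging" is reduced to
bumping a counter. Only the three defines are taken from the snippet
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_TRACES       2
#define TRACE_HANDLE    0
#define FLIGHT_HANDLE   1

/* Hypothetical per-trace state; a real tracer carries much more. */
struct trace_state {
    bool active;
    unsigned long events_logged;
};

/* Offer one event to every trace slot and log it to the active ones.
 * Returns the number of traces that recorded the event. */
int log_event_all(struct trace_state traces[NR_TRACES], int event_id)
{
    int logged = 0;

    assert(traces != NULL);
    (void)event_id;     /* a real logger would copy the event payload */

    for (int i = 0; i < NR_TRACES; i++) {
        if (!traces[i].active)
            continue;   /* inactive slot: costs one test, nothing more */
        traces[i].events_logged++;
        logged++;
    }
    return logged;
}
```

With only the flight recorder running, the loop body does real work
once per event; the second pass happens only while a short normal trace
is active alongside it.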

> In _ltt_log_event lives quite a bunch of if(...) processing decisions
> which have to be evaluated for _each_ event.

Correct, and I'm honest enough with myself to admit that this is the bit
of code that I think needs the most reviewing. So, in order to help
you help me, here's the various code snippets and things I can think
of which would help make the code faster/simpler:

Here's the preamble where we make some basic sanity checks:

	if (!trace)
		return -ENOMEDIUM;

	if (trace->paused)
		return -EBUSY;

	tracer_handle = trace->trace_handle;

	if (!trace->flight_recorder && (trace->daemon_task_struct == NULL))
		return -ENODEV;

	channel_handle = trace_channel_handle(tracer_handle, cpu_id);

	if ((trace->tracer_started == 1) || (event_id == LTT_EV_START) ||
	    (event_id == LTT_EV_BUFFER_START))
		goto trace_event;

	return -EBUSY;

trace_event:
	if (!ltt_test_bit(event_id, &trace->traced_events))
		return 0;

Basically, unless we've passed all those if's, we're not going to
write anything. I think we could get rid of the first four by simply
maintaining a state-machine for the tracer. Then we could either have
a single if or even use function pointers (though I think this costs
more) to call or not call _ltt_log_event. As for checking whether the
event has a certain ID (EV_START or EV_BUFFER_STAR
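
The state-machine idea mentioned above could look roughly like the
following sketch. The state names and tracer_may_log are hypothetical
and do not come from the LTT source; the mapping to the existing error
codes in the comments is only an assumption about how the current checks
would fold into states. The point is that the state is recomputed when
the tracer's configuration changes, so the per-event path is left with a
single test.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tracer states, updated whenever configuration changes. */
enum tracer_state {
    TRACER_ABSENT,      /* no trace allocated (today: -ENOMEDIUM) */
    TRACER_PAUSED,      /* trace->paused set (today: -EBUSY) */
    TRACER_NO_DAEMON,   /* no daemon and not flight recording (-ENODEV) */
    TRACER_STARTING,    /* not started yet: only start events pass */
    TRACER_RUNNING      /* started: every enabled event is logged */
};

/* Single hot-path decision replacing the chain of separate ifs. */
bool tracer_may_log(enum tracer_state state, bool is_start_event)
{
    assert((int)state >= TRACER_ABSENT && (int)state <= TRACER_RUNNING);

    switch (state) {
    case TRACER_RUNNING:
        return true;
    case TRACER_STARTING:
        return is_start_event;
    default:
        return false;
    }
}
```

The per-event mask test (ltt_test_bit in the snippet above) would still
be needed after this check, since it depends on the event and not on the
tracer state.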

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-15 Thread Roman Zippel

Hi,

On Sat, 15 Jan 2005, Karim Yaghmour wrote:

> In addition, and this is a very important issue, quite a few
> kernel developers mistook LTT for a kernel debugging tool, which
> it was never meant to be. When, in fact, if you ask those who have
> looked at using it for that purpose (try Marcelo or Andrea) you will
> see that they didn't find it to be appropriate for them. And
> rightly so, it was never meant for that purpose. Even lately, when
> I suggested Ingo try using relayfs instead of his custom tracing
> code for his preemption work, he looked at it and said that it
> wasn't suited, but would consider reusing parts of it if it were
> in the kernel.

Well, that's really a core problem. We don't want to duplicate 
infrastructure, which practically does the same. So if relayfs isn't 
usable in this kind of situation, it really raises the question whether 
relayfs is usable at all. We need to make relayfs generally usable, 
otherwise it will join the fate of devfs.

bye, Roman

[PATCH][2.6.11-rc1-mm1] relayfs - remove klog debugging channel

2005-01-15 Thread Tom Zanussi

Andrew,

This patch removes from relayfs the 'klog debugging channel', which is
a relayfs 'application' that doesn't belong in the main code.  Please
apply.

Signed-off-by: Tom Zanussi <[EMAIL PROTECTED]>

diff -urpN -X dontdiff linux-2.6.11-rc1-mm1-vanilla/fs/Kconfig linux-2.6.11-rc1-mm1-cur/fs/Kconfig
--- linux-2.6.11-rc1-mm1-vanilla/fs/Kconfig Fri Jan 14 06:13:12 2005
+++ linux-2.6.11-rc1-mm1-cur/fs/Kconfig Fri Jan 14 09:28:25 2005
@@ -923,8 +923,7 @@ config RELAYFS_FS
  an efficient mechanism for tools and facilities to relay large
  amounts of data from kernel space to user space.  It's not useful
  on its own, and should only be enabled if other facilities that
- need it are enabled, such as for example klog or the Linux Trace
- Toolkit.
+ need it are enabled, such as for example the Linux Trace Toolkit.
 
  See  for further
  information.
@@ -935,37 +934,6 @@ config RELAYFS_FS
  module, say M here and read .
 
  If unsure, say N.
-
-config KLOG_CHANNEL
-   bool "Enable klog debugging support"
-   depends on RELAYFS_FS
-   default n
-   help
- If you say Y to this, a relayfs channel named klog will be created
- in the root of the relayfs file system.  You can write to the klog
- channel using klog() or klog_raw() from within the kernel or
- kernel modules, and read from the klog channel by mounting relayfs
- and using read(2) to read from it (or using cat).  If you're not
- sure, say N.
-
-config KLOG_CHANNEL_AUTOENABLE
-   bool "Enable klog logging on startup"
-   depends on KLOG_CHANNEL
-   default y
-   help
- If you say Y to this, the klog channel will be automatically enabled
- on startup.  Otherwise, to turn klog logging on, you need use
- sysctl (fs.relayfs.klog_enabled).  This option is used in cases where
- you don't actually want the channel to be written to until it's
- enabled.  If you're not sure, say Y.
-
-config KLOG_CHANNEL_SHIFT
-   depends on KLOG_CHANNEL
-   int "klog debugging channel size (14 => 16KB, 22 => 4MB)"
-   range 14 22
-   default 21
-   help
- Select klog debugging channel size as a power of 2.
 
 endmenu
 
diff -urpN -X dontdiff linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/Makefile linux-2.6.11-rc1-mm1-cur/fs/relayfs/Makefile
--- linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/Makefile Fri Jan 14 06:13:13 2005
+++ linux-2.6.11-rc1-mm1-cur/fs/relayfs/Makefile Fri Jan 14 09:30:25 2005
@@ -5,4 +5,4 @@
 obj-$(CONFIG_RELAYFS_FS) += relayfs.o
 
 relayfs-y := relay.o relay_lockless.o relay_locking.o inode.o resize.o
-relayfs-$(CONFIG_KLOG_CHANNEL) += klog.o
+
diff -urpN -X dontdiff linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/inode.c linux-2.6.11-rc1-mm1-cur/fs/relayfs/inode.c
--- linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/inode.c Fri Jan 14 06:13:13 2005
+++ linux-2.6.11-rc1-mm1-cur/fs/relayfs/inode.c Fri Jan 14 09:29:17 2005
@@ -604,19 +604,12 @@ static int __init
 init_relayfs_fs(void)
 {
 	int err = register_filesystem(&relayfs_fs_type);
-#ifdef CONFIG_KLOG_CHANNEL
-	if (!err)
-		create_klog_channel();
-#endif
 	return err;
 }
 
 static void __exit
 exit_relayfs_fs(void)
 {
-#ifdef CONFIG_KLOG_CHANNEL
-	remove_klog_channel();
-#endif
 	unregister_filesystem(&relayfs_fs_type);
 }
 
diff -urpN -X dontdiff linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/klog.c linux-2.6.11-rc1-mm1-cur/fs/relayfs/klog.c
--- linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/klog.c  Fri Jan 14 06:13:13 2005
+++ linux-2.6.11-rc1-mm1-cur/fs/relayfs/klog.c  Wed Dec 31 18:00:00 1969
@@ -1,206 +0,0 @@
-/*
- * KLOG	Generic Logging facility built upon the relayfs infrastructure
- *
- * Authors:Hubertus Franke  ([EMAIL PROTECTED])
- * Tom Zanussi  ([EMAIL PROTECTED])
- *
- * Please direct all questions/comments to [EMAIL PROTECTED]
- *
- * Copyright (C) 2003, IBM Corp
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-/* klog channel id */
-static int klog_channel = -1;
-
-/* maximum size of klog formatting buffer beyond which truncation will occur */
-#define KLOG_BUF_SIZE (512)
-/* per-cpu klog formatting buffer */
-static char buf[NR_CPUS][KLOG_BUF_SIZE];
-
-/*
- * klog_enabled determines whether klog()/klog_raw() actually do write
- * to the klog channel at any given time. If klog_enabled == 1

Re: 2.6.11-rc1-mm1

2005-01-15 Thread Roman Zippel

Hi,

On Fri, 14 Jan 2005, Karim Yaghmour wrote:

> > Why should a subsystem care about the details of the buffer management?
> 
> Because it wants to enforce a data format on buffer boundaries.

It's interesting to read more about ltt's requirements, but I still think 
it's possible to leave this work to the relayfs layer.
Why not just move the ltt buffer management into relayfs and provide a 
small library, which extracts the event stream again? Otherwise you have 
to duplicate this work for every serious relayfs user anyway.
Completely abstracting the buffer management would make the whole 
interface simpler and it would be a lot easier to change without breaking 
everything. E.g. it would be possible to use per cpu buffers and remove 
the need for different locking mechanisms, for a good tracing mechanism 
it's not just important that it's lockless, but also that the cpus don't 
share cache lines in the fast path. In this regard relayfs/ltt has really 
still too much overhead and the complex relayfs API isn't really making it 
easy to fix this.

bye, Roman

Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-15 Thread Karim Yaghmour

Hello Thomas,

I don't mind having a general discussion about instrumentation, but
it has to be understood that the topic is so general and means so
many different things to different people that we are unlikely to
reach any useful consensus. Believe me, it's not for the lack of
trying. More below.

Thomas Gleixner wrote:
> 

:D

> One of those backends is LTT+relayfs. 
> I really respect the work you have done there, but please accept that I
> just see the limitations and try to figure out a way to make it more
> generic and flexible before it is cemented into the kernel and makes it
> hard to use for other interesting instrumentation aspects and maybe
> enforces redundant implementation of infrastructure related
> functionality.
> 
> E.g. tracking down timing-related issues can make use of such
> functionality if the infrastructure is provided separately.
> I guess a lot of developers would be happy to use it when it is already
> around in the kernel and it can help testers give better
> information to developers.

I would invite you to review the history behind LTT and the history
behind the efforts to get LTT integrated in the kernel (which are
two separate topics.) If you look back, you will see that I worked
very hard trying to get people to think about a common framework
and that I and others made numerous suggestions in this regard. Here
are a few examples:

- DProbes (kprobes ancestor):
Shortly after dprobes came out in 2000, I was one of the first to
suggest that there could be interfacing between both to allow
dynamically added trace points. We worked with, and eventually
joined forces with, the IBM team working on this and very early
on, LTT and DProbes were interfacing:
http://marc.theaimsgroup.com/?l=linux-kernel&m=97079714009328&w=2
- OProfile:
When time came to integrate oprofile in the kernel, I tried to push
for oprofile to use ltt as its logging engine (to John's utter
horror.) relayfs didn't exist at the time, and obviously oprofile
made it in without relying on ltt.
Here's a posting from July 2002 where I suggested oprofile rely on
ltt. In that same posting I listed a number of drivers/subsystems
that already contained tracing statements. Obviously I was pointing
out that there was an opportunity to create a common, uniform
infrastructure based on ltt:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102624656615567&w=2
- Syscalltrack:
In replying to a posting of someone looking for tracing info, there
was a brief discussion as to how syscalltrack could use ltt instead
of: a) redirecting the syscall table, b) have its own buffering
mechanism. Again, relayfs didn't exist at the time:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102822343523369&w=2
- Event logging:
When there was discussion about event logging, there was suggestion
to use ltt's engine. Again, relayfs wasn't there:
http://marc.theaimsgroup.com/?l=linux-kernel&m=101836133400796&w=2

And there are many other cases. As you can see, it's not as if
I didn't try to have this discussion before. Unfortunately, interest
in this was rather limited.

In addition, and this is a very important issue, quite a few
kernel developers mistook LTT for a kernel debugging tool, which
it was never meant to be. When, in fact, if you ask those who have
looked at using it for that purpose (try Marcelo or Andrea) you will
see that they didn't find it to be appropriate for them. And
rightly so, it was never meant for that purpose. Even lately, when
I suggested Ingo try using relayfs instead of his custom tracing
code for his preemption work, he looked at it and said that it
wasn't suited, but would consider reusing parts of it if it were
in the kernel.

So, in general, one thing I learned over the years is to not touch
the topic of kernel debugging even with a 10-foot pole when
discussing LTT.

What you are hinting at here (mention of developers vs. testers,
for example), and your stated preference for the type of ring-buffer
you described earlier clearly go in the direction I've learned to
avoid: buffering support for the general purpose of kernel debugging.

Let me say outright that I see the relevance of what you are looking
for, but let me also say that what we tried to achieve with relayfs
is to provide a general mechanism for kernel subsystems that need to
convey large amounts of data to user-space. We did not attempt to
solve the problem of providing a buffering framework for core kernel
debugging. As I mentioned to Ingo in the mail I referred to earlier
regarding the type of buffering you are looking for:
> The above tracer may indeed be very appropriate for kernel development,
> but it doesn't provide enough functionality for the requirements of
> mainstream users.

If there is interest in using relayfs and/or ltt for that
purpose, then this is an entirely different mandate and a few things
would need to be added for that to happen. For starters, we could
add another mode to relayfs. Currently, it supports a locking and

Re: 2.6.11-rc1-mm1

2005-01-15 Thread Joseph Fannin

On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/

> waiting-10s-before-mounting-root-filesystem.patch
>   retry mounting the root filesystem at boot time

With this patch, initrds seem to get 'skipped'.  I think this is
probably the cause for the reports of problems with RAID too.

Just after loading the initrd (RAMDISK: Loading 5284KiB [1 disk]
into ram disk...) the kernel tries to mount the real root fs -- if the
necessary drivers are built-in, it proceeds from there; if not, not.

I'm guessing that when the initrd code calls mount_block_root() to
mount the ramdisk, this bit makes it decide to try to mount the real
root instead:

	if (!ROOT_DEV) {
		ROOT_DEV = name_to_dev_t(saved_root_name);
		create_dev(name, ROOT_DEV, root_device_name);
	}

Perhaps this should not be done until after the first attempt to
mount fails?  Sorry, I haven't had nearly enough coffee today to
attempt to make a patch. :-)

-- 
Joseph Fannin
[EMAIL PROTECTED]

"Bull in pure form is rare; there is usually some contamination by data."
-- William Graves Perry Jr.


Re: 2.6.11-rc1-mm1 waiting-10s-before-mounting-root-....

2005-01-15 Thread Daniel Kirsten

> Are you using an initrd?

yes.



[RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)

2005-01-15 Thread Thomas Gleixner

On Fri, 2005-01-14 at 15:22 -0800, Tim Bird wrote:
>  but not 1) supporting infrastructure for timestamping, managing event
>  data, etc., and 2) a static list of generally useful tracepoints.

Both points are well taken. That's the essential minimum that
instrumentation needs.

I'd like to see this infrastructure usable for all kinds of
instrumentation mechanisms which are built in to the kernel already or
functions which are used for similar purposes in experimental trees and
other instrumentation related projects. 

This requires separating the backend from the infrastructure, so you
can choose from a set of backends the one which fits best for the intended use. 

One of those backends is LTT+relayfs. 
I really respect the work you have done there, but please accept that I
just see the limitations and try to figure out a way to make it more
generic and flexible before it is cemented into the kernel and makes it
hard to use for other interesting instrumentation aspects and maybe
enforces redundant implementation of infrastructure related
functionality.

E.g. tracking down timing-related issues can make use of such
functionality if the infrastructure is provided separately.
I guess a lot of developers would be happy to use it when it is already
around in the kernel and it can help testers give better
information to developers.

tglx


Re: Breakage with raid in 2.6.11-rc1-mm1 [Regression in mm]

2005-01-15 Thread Sander

Randy.Dunlap wrote (ao):
> Reuben Farrelly wrote:
> >At 12:58 a.m. 15/01/2005, Andrew Morton wrote:
> >
> >>Reuben Farrelly <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Something seems to have broken with 2.6.11-rc1-mm1, which worked ok with
> >>> 2.6.10-mm3.
> >>>
> >>> NET: Registered protocol family 17
> >>> Starting balanced_irq
> >>> BIOS EDD facility v0.16 2004-Jun-25, 2 devices found
> >>> md: Autodetecting RAID arrays.
> >>> md: autorun ...
> >>> md: ... autorun DONE.
> >>> VFS: Waiting 19sec for root device...

...

> >>> VFS: Waiting 1sec for root device...
> >>> VFS: Cannot open root device "md2" or unknown-block(0,0)
> >>> Please append a correct "root=" boot option
> >>> Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
> >>>
> >>> The system is running 5 RAID-1 partitions, and md2 is the root as
> >>> per grub.conf.  Problem seems to be that raid autodetection finds
> >>> no raid partitions :(
> >>>
> >>> The two ST380013AS SATA drives are detected earlier in the boot, so I don't
> >>> think that's the problem..
> >>
> >>hm, the only raidy thing we have in there is the below.  Maybe you could
> >>try reverting that?
> >>
> >>--- 25/drivers/md/raid5.c~raid5-overlapping-read-hack   2005-01-09 22:20:40.211246912 -0800
> >>+++ 25-akpm/drivers/md/raid5.c  2005-01-09 22:20:40.216246152 -0800

...

> >Ok, the breakage occurred somewhere between 2.6.10-mm3 (works) and
> >2.6.11-rc1 (doesn't work), i.e. it wasn't introduced into the latest -mm
> >patchset as I first thought.
> >
> >Are there any other patches that might be worth a try backing out?
> 
> Someone else reported that they had to back out this one:
> waiting-10s-before-mounting-root-filesystem.patch
> 
> Can you revert that one and let us know how it goes?

It Works For Me(tm). This is unpatched 2.6.11-rc1-mm1 (no patches
reverted either):

# uname -r
2.6.11-rc1-mm1
# cat /proc/mdstat 
Personalities : [raid0] [raid1] [raid5] [multipath] [raid10] 
Event: 2   
md1 : active raid10 sdd2[3] sdc2[2] sdb2[1] sda2[0]
  70684416 blocks 128K chunks 2 near-copies [4/4] []
  
md0 : active raid1 sdd1[3] sdc1[2] sdb1[1] sda1[0]
  500608 blocks [4/4] []
  
unused devices: 
# mount
/dev/md1 on / type reiser3 (rw,sync,data=journal,barrier=flush)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/md0 on /boot type ext2 (ro)
tmpfs on /tmp type tmpfs (rw)


So the problem depends on something. This system is SCSI, and I don't
use modules. I'm happy to provide more info if that would be of any
help.

-- 
Humilis IT Services and Solutions
http://www.humilis.net

Re: 2.6.11-rc1-mm1

2005-01-15 Thread Thomas Gleixner

On Fri, 2005-01-14 at 20:25 -0500, Karim Yaghmour wrote:
> Thomas Gleixner wrote:
>
> You have previously demonstrated that you do not understand the
> implementation you are criticizing. You keep repeating the size
> of the patch like a mantra, yet when pressed for actual bits of
> code that need fixing, you use a circular argument to slip away.

Yeah, did you answer any of my arguments other than by claiming that I'm too
stupid to understand how it works?

I completely understand what this code does and I don't beat on the
patch size. I beat on the timing burden and restrictions which are given
by the implementation.

I have no objection against relayfs itself. I can just leave the config
switch off, so it does not affect me.

Adding instrumentation to the kernel is a good thing. 

I just don't like the idea that instrumentation is bound to relayfs and
adds a feature to the kernel which fits a restricted set of problems
rather than providing a generic optimized instrumentation framework,
where one can use relayfs as a backend if it fits his needs. Making
this less glued together leaves the possibility to use other backends.

> If you feel that there is some unnecessary processing being done
> in the kernel, please show me the piece of code affected so that
> it can be fixed if it is broken.

Just doing codepath analysis shows me:

There is a loop in ltt_log_event, which enforces the processing of each
event twice. Splitting traces is postprocessing and can be done
elsewhere.

In _ltt_log_event lives quite a bunch of if(...) processing decisions
which have to be evaluated for _each_ event.

The relay_reserve code can loop in the do { } while() and even go into a
slow path where another do { } while() is found.
So it can not be used in fast paths and for timing related problem
tracking, because it adds variable time overhead.

Because the ltt_log_event path is not preempt safe, you can
actually hit the additional pass through the do { } while() loop.
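
To make the variable time overhead concrete, here is a minimal userspace
sketch of a compare-and-swap based reserve loop. It is not the relayfs
code; struct chan, chan_reserve and BUF_SIZE are hypothetical names, and
C11 atomics stand in for the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BUF_SIZE 4096u

/* Hypothetical channel: one shared buffer plus the next free offset. */
struct chan {
    _Atomic size_t offset;          /* next free byte in data[] */
    unsigned char data[BUF_SIZE];
};

/*
 * Reserve len bytes in the channel without taking a lock.
 * Returns the start offset of the reservation, or (size_t)-1 when the
 * buffer has no room left. *retries reports how often the loop had to
 * start over because another writer moved the offset first.
 */
static size_t chan_reserve(struct chan *c, size_t len, unsigned *retries)
{
    size_t old;
    int ok;

    assert(len > 0 && len <= BUF_SIZE);
    *retries = 0;
    do {
        old = atomic_load(&c->offset);
        if (old + len > BUF_SIZE)
            return (size_t)-1;
        /* Succeeds only if nobody changed the offset since the load. */
        ok = atomic_compare_exchange_strong(&c->offset, &old, old + len);
        if (!ok)
            (*retries)++;
    } while (!ok);
    return old;
}
```

Nothing in the loop itself bounds the value of *retries. It depends on how
often a concurrent or preempting writer gets in between the load and the
compare-and-swap, which is where the variable cost per event comes from.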

I pointed out before that it is not possible to select the
events which I'm interested in at compile time. I get either nothing
or everything. If I want to use instrumentation for a particular
problem, why must I process a loop of _ltt_log_event calls for stuff I
do not need instead of just compiling it away?

If I compile an event in, then adding a couple of checks into the
instrumentation macro itself does not hurt as much as leaving the
straight code path for a disabled event.
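
A minimal sketch of such a macro, assuming a compile time event mask in
addition to a runtime one. TRACE_COMPILED_EVENTS, TRACE_EVENT, trace_record
and the event IDs are all hypothetical names, not part of LTT.

```c
#include <assert.h>

/* Hypothetical event IDs. */
enum { EV_IRQ = 0, EV_SCHED = 1, EV_SYSCALL = 2, EV_MAX = 3 };

/*
 * Compile time selection: only events in this mask generate code.
 * Override with e.g. -DTRACE_COMPILED_EVENTS=0x4 on the command line.
 */
#ifndef TRACE_COMPILED_EVENTS
#define TRACE_COMPILED_EVENTS ((1u << EV_IRQ) | (1u << EV_SCHED))
#endif

/* Runtime selection among the events that were compiled in. */
static unsigned trace_runtime_mask = ~0u;

/* Stand-in for the real logging path: just count hits per event. */
static unsigned long trace_hits[EV_MAX];

static void trace_record(int ev, unsigned long arg)
{
    (void)arg;
    assert(ev >= 0 && ev < EV_MAX);
    trace_hits[ev]++;
}

/*
 * The first test is a constant expression when ev is a constant, so for
 * an event outside the compiled mask the compiler can usually drop the
 * whole statement. Only compiled-in events pay for the runtime check.
 */
#define TRACE_EVENT(ev, arg)                                  \
    do {                                                      \
        if ((TRACE_COMPILED_EVENTS & (1u << (ev))) &&         \
            (trace_runtime_mask & (1u << (ev))))              \
            trace_record((ev), (arg));                        \
    } while (0)
```

With the default mask above, TRACE_EVENT(EV_SYSCALL, x) never reaches
trace_record(), while EV_IRQ and EV_SCHED can still be switched off at
runtime through trace_runtime_mask.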

tglx


Re: 2.6.11-rc1-mm1

2005-01-15 Thread Thomas Gleixner

Hi Karim,

On Fri, 2005-01-14 at 20:14 -0500, Karim Yaghmour wrote:
> Gee Thomas, I guess you really want to take this one until the last
> man is standing. Feel free to use the ad-hominem tone if it suits
> you. Don't hold it against me though if I don't bite :)

No personal offence was intended.

> Thomas Gleixner wrote:
> > It's not only me who needs constant time. Everybody interested in
> > tracing will need that. In my opinion it's a principle of tracing.
> 
> relayfs is a generalized buffering mechanism. Tracing is one application
> it serves. Check out the web site: "high-speed data-relay filesystem."
> Fancy name huh ...

I do not doubt that. 

But hardwiring an instrumentation framework to it also hardwires
implicit restrictions on the usability of the instrumentation for
certain purposes.

> > The "lockless" mechanism is _FAKE_ as I already pointed out. It replaces
> > locks by do { } while loops. So what ?
> 
> Well for one thing, a portion of code running in user-context won't
> disable interrupts while it's attempting to get buffer space, and
> therefore won't impact on interrupt delivery.

The do {} while loops are in the fast ltt_log_event path.

> Clearly you haven't read the implementation and/or aren't familiar with
> its use. Usually, what you want to do is open(), mmap(), write(), there
> is no "conversion" to a file. The filesystem abstraction is just a
> namespace holder for us.

I have read it and tried it. I don't see why I can't map a
ringbuffer into user space.
I'm not beating on the ringbuffer, but I'm using it as an example. :)

> That's not the point. You're bending backwards as far as you can reach
> trying to raise as much mud as you can, but when pressed for actual
> constructive input you hide behind a strawman argument. If you don't
> have anything to say, then stop whining.

I gave constructive criticism, along with points where I just point out
the restrictions and weaknesses of the implementation.

tglx



Re: 2.6.11-rc1-mm1

2005-01-15 Thread Miklos Szeredi

Sorry about the missing quotes.  It should read:

You wrote:
> Some things I'd like to see (as I am currently using the KIO
> equivalent) implemented as FUSE fs:
> - "fish", virtual file access over ssh

This is already available here: 

  http://sourceforge.net/projects/fuse

You need to download fuse-2.2-pre3 and sshfs-1.0.  It should work on
any kernel including the 2.6.10-rc1-mm1 with FUSE compiled in.

Miklos

Re: 2.6.11-rc1-mm1

2005-01-15 Thread Miklos Szeredi

Some things I'd like to see (as I am currently using the KIO
equivalent) implemented as FUSE fs:
- "fish", virtual file access over ssh

This is already available here: 

  http://sourceforge.net/projects/fuse

You need to download fuse-2.2-pre3 and sshfs-1.0.  It should work on
any kernel including the 2.6.10-rc1-mm1 with FUSE compiled in.

Miklos
