Re: Kernel panic - r8169 on 2.6.11-rc1-mm1
Cameron Harris <[EMAIL PROTECTED]> :
[r8169 crash]
> Linux laptop 2.6.11-rc1-mm1 #2 SMP Sun Jan 16 22:36:26 GMT 2005 i686
^^
[...]
> I would try a newer kernel, but the command line options for
> specifying the framebuffer settings seems to have changed in the
> latest kernel and i haven't had enough time to work out how to specify
> it.

If you can not upgrade the kernel, I can not do anything for you since
several fixes went in after 2.6.11-rc1-mm1.

Regarding your r8169 issue, I suggest:
1) download linux kernel 2.6.12-rc1
2) apply on top of it:
   http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.11/r8169/patches/r8169-570.patch
3) avoid the proprietary tainting module

Please Cc: netdev@oss.sgi.com for issues related to network drivers.

--
Ueimor
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Kernel panic - r8169 on 2.6.11-rc1-mm1
Every time I try to use eth1, which is r8169, I get a kernel panic, but only
when actually using the interface, not when configuring it. E.g.:

laptop ~ # ifconfig eth1 up 192.168.1.1
laptop ~ # ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
Oops: [#1]
Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq_oss seq_midi_event
snd_seq snd_seq_device irtty_sir sir_dev irda pcspkr snd_intel8x0
snd_ac97_codec snd_pcm snd_timer snd snd_page_alloc wlan_wep fglrx sis_agp
psmouse r8169 ath_pci ath_rate_onoe wlan ath_hal
CPU:0
EIP:0060:[] Tainted: P VLI
EFLAGS: 00010206 (2.6.11-rc1-mm1)
EIP is at rtl8169_start_xmit+0x55/0x2b0 [r8169]
eax: 003f ebx: cf236140 ecx: cc9c6000 edx:
esi: cf236240 edi: cfd9b280 ebp: cfd9b280 esp: c0685e00
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=c0684000 task=c05b6ba0)
Stack: c0107e50 cf236140 cf935080 cfd9b280 d14da000 cc9c6000 8000 cf236140
cf935080 cf236000 cfd9b280 c049141e cfd9b280 cf236000 cf236000 cfd9b280
cf236140 c048387f cf236000 cf935080
<0>Kernel panic - not syncing: Fatal exception in interrupt

I never had time to write down the whole stack trace (and there was no core
file created).

This driver used to work in a previous kernel version (but it did get
"IRQ #x nobody cared" messages, usually when my ethernet cable was
disconnected for whatever reason).

This is always reproducible.

uname -a:
Linux laptop 2.6.11-rc1-mm1 #2 SMP Sun Jan 16 22:36:26 GMT 2005 i686
Intel(R) Pentium(R) 4 CPU 2.80GHz GenuineIntel GNU/Linux

I would try a newer kernel, but the command line options for specifying the
framebuffer settings seem to have changed in the latest kernel and I haven't
had enough time to work out how to specify them.

--
Cameron Harris
Re: 2.6.11-rc1-mm1
Roman Zippel wrote:
> Ok, great.
> BTW I don't really expect the first version to be fully optimized (unless
> you want to :) ), but once the basics are right, that can still be added
> later.

Agreed. Tom will post updated patches sometime this week. I'll follow up
with the LTT stuff separately as agreed.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
[2.6.11-rc1-mm1] Strange MCE errors
Hi all,

since I've replaced my Athlon XP 1800 with an Athlon XP 3000
(Barton/FSB333), my logs are flooded with these warnings:

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 1: d4004152
MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 2: d400417a

(If it has any importance, my motherboard is a Gigabyte 7VAXP-Ultra. I've
tried with another RAM module, but there was no change at all in behaviour.)

Apart from that, this system is running flawlessly, so I'm tempted to just
disable MCE in the kernel. But... I'd like to know if this is a kernel
mistake or if I really have some configuration/hardware problem. I could not
deduce anything meaningful from the parsemce (version 0.09) utility.

Any help or advice would be appreciated...

Vincent

P.S.: here follows more information on this cpu --

# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 10
model name : AMD Athlon(tm) XP 3000+
stepping: 0
cpu MHz : 2170.470
cache size : 512 KB
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse pni syscall mmxext 3dnowext 3dnow
bogomips: 4292.60

# cpuid
eax in eax ebx ecx edx 0001 68747541 444d4163 69746e65 0001 06a0 0383fbff 8000 8008 68747541 444d4163 69746e65 8001 07a0 c1c3fbff 8002 20444d41 6c687441 74286e6f 5820296d 8003 30332050 002b3030 8004 8005 0408ff08 ff20ff10 40020140 40020140 8006 41004100 02008140 8007 0001 8008 2022 Vendor ID: "AuthenticAMD"; CPUID level 1 AMD-specific functions Version 06a0: Family: 6 Model: 10 [Duron/Athlon model 10] Standard feature flags 0383fbff: Floating Point Unit Virtual Mode Extensions Debugging Extensions Page Size Extensions Time Stamp Counter (with RDTSC and CR4 disable bit) Model Specific Registers with RDMSR & WRMSR PAE - Page Address Extensions Machine Check Exception COMPXCHG8B Instruction APIC SYSCALL/SYSRET or
SYSENTER/SYSEXIT instructions MTRR - Memory Type Range Registers Global paging extension Machine Check Architecture Conditional Move Instruction PAT - Page Attribute Table PSE-36 - Page Size Extensions MMX instructions FXSAVE/FXRSTOR 25 - reserved Generation: 7 Model: 10 Extended feature flags c1c3fbff: Floating Point Unit Virtual Mode Extensions Debugging Extensions Page Size Extensions Time Stamp Counter (with RDTSC and CR4 disable bit) Model Specific Registers with RDMSR & WRMSR PAE - Page Address Extensions Machine Check Exception COMPXCHG8B Instruction APIC SYSCALL/SYSRET or SYSENTER/SYSEXIT instructions MTRR - Memory Type Range Registers Global paging extension Machine Check Architecture Conditional Move Instruction PAT - Page Attribute Table PSE-36 - Page Size Extensions AMD MMX Instruction Extensions MMX instructions FXSAVE/FXRSTOR 3DNow! Instruction Extensions 3DNow instructions Processor name string: AMD Athlon(tm) XP 3000+ L1 Cache Information: 2/4-MB Pages: Data TLB: associativity 4-way #entries 8 Instruction TLB: associativity 255-way #entries 8 4-KB Pages: Data TLB: associativity 255-way #entries 32 Instruction TLB: associativity 255-way #entries 16 L1 Data cache: size 64 KB associativity 2-way lines per tag 1 line size 64 L1 Instruction cache: size 64 KB associativity 2-way lines per tag 1 line size 64 L2 Cache Information: 2/4-MB Pages: Data TLB: associativity L2 off #entries 0 Instruction TLB: associativity L2 off #entries 0 4-KB Pages: Data TLB: associativity Direct mapped #entries 0 Instruction TLB: associativity Direct mapped #entries 0 size 2 KB associativity L2 off lines per tag 129 line size 64 Advanced Power Management Feature Flags Has temperature sensing diode Maximum linear address: 32; maximum phys address 34
Re: 2.6.11-rc1-mm1
Hi, On Sun, 23 Jan 2005, Karim Yaghmour wrote: > But how does relayfs organize the namespace then? What if I have > multiple channels per CPU, each for a different type of data, will > all channels for the same CPU be under the same directory or will > each type of data have its own directory with one entry per CPU? I'd say the latter, you already do this for ltt. > I don't have an answer to that, and I don't know that we should. Why > not just leave it to the client to organize his data as he wishes. > If we must assume that everyone will have at least one channel per > CPU, then why not provide helper functions built on top of very > basic functions instead of fixing the namespace in stone? How should simple do you want to have these helper functions, isn't something like relay_create(path, num_chan) simple enough? I don't think a directory structure is that bad, as that allows to add more control files to the relay stream and still leave the option to write out all buffers into one file. > > I have to modify it a little (only the if (!buffer) part is new): > > > > cpu = get_cpu(); > > buffer = relay_get_buffer(chan, cpu); > > while(1) { > > offset = local_add_return(buffer->offset, length); > > if (likely(offset + length <= buffer->size)) > > break; > > buffer = relay_switch_buffer(chan, buffer, offset); > > if (!buffer) { > > put_cpu(); > > return; > > } > > } > > memcpy(buffer->data + offset, data, length); > > put_cpu(); > > > > This has a very short fast path and I need very good reasons to change/add > > anything here. OTOH the slow path with relay_switch_buffer() is less > > critical and still leaves a lot of flexibility. > > This is not good for any client that doesn't know beforehand the exact > size of their data units, as in the case of LTT. If LTT has to use this > code that means we are going to loose performance because we will need to > fill an intermediate data structure which will only be used for relay_write(). 
> Instead of zero-copy, we would have an extra unnecessary copy. There has > got to be a way for clients to directly reserve and write as they wish. Ok, let's change it a little so it's more familiar. :) void *relay_reserve(chan, length, cpu) { buffer = relay_get_buffer(chan, cpu); while(1) { offset = local_add_return(buffer->offset, length); if (likely(offset + length <= buffer->size)) return buffer->data + offset; buffer = relay_switch_buffer(chan, buffer, offset); if (!buffer) return NULL; } } All you have to do is to put between get_cpu()/put_cpu(). The same is also possible as macro, which allows you to directly jump out of it to the failure code and avoid one test. > > Look closer, it's already interrupt safe, the synchronization for the > > buffer switch is left to relay_switch_buffer(). > > Sorry, I'm still missing something. What exactly does local_add_return() > do? I assume this code has got to be interrupt safe? Something like: > #define local_add_return(OFFSET, LEN) \ > do {\ > ... > local_irq_save(); \ > OFFSET += LEN; > local_irq_restore(); \ > ... > } while(0); > > I'm assuming local_irq_XXX because we were told by quite a few people > in the related thread to avoid atomic ops because they are more expensive > on most CPUs than cli/sti. That would be about the generic implementation, but it allows archs to provide more efficient implementations in , e.g. i386 can use xadd. > Also how does relay_get_buffer() operate? #define relay_get_buffer(chan, cpu) chan->buffer[cpu] > What if I'm writing an event > from within a system call and I'm about to switch buffers and get > an interrupt at the if(likely(...))? Isn't relay_get_buffer() going to > return the same pointer as the one obtained for the syscall, and aren't > both cases now going to effect relay_switch_buffer(), one of which will > be superfluous? The synchronization has to be done in relay_switch_buffer(), but catching it there is still cheaper as in the fast path. 
> > This adds a conditional and is not really needed. Above shows how to make > > it interrupt safe and if the clients wants to reuse the same buffer, leave > > the locking to the client. > > Fine, but how is the client going to be able to reuse the same buffer if > relayfs always assumes per-CPU buffer as you said above? This would be > solved if at its core relayfs' functions worked on single channels and > additional code provided helpers for making the SMP case very simple. What do you mean? Why not make SMP case simple (less to get wrong)? The client can still serialize everything with a simple spinlock. > > That's quite a lot of code with at least 14 conditions (or 13 conditions > > too much) and this is just
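The relay_reserve() outline in this message can be tried out in user space.
What follows is a minimal sketch under stated assumptions, not relayfs code:
C11 atomic_fetch_add() stands in for local_add_return() (it returns the old
offset, which is how the pseudocode uses the value), one channel with two
deliberately tiny sub-buffers replaces the per-CPU array, and
relay_chan_init() and switch_buffer() are hypothetical helpers invented for
the illustration. The synchronization that the thread assigns to
relay_switch_buffer() is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 16  /* tiny on purpose, so a switch is easy to trigger */
#define NUM_SUBBUFS 2

struct relay_buf {
    atomic_size_t offset;              /* next free byte in data[] */
    size_t size;
    unsigned char data[SUBBUF_SIZE];
};

struct relay_chan {
    struct relay_buf bufs[NUM_SUBBUFS];
    int current;                       /* index of the active sub-buffer */
};

/* Hypothetical setup helper: all sub-buffers start empty. */
static void relay_chan_init(struct relay_chan *chan)
{
    assert(chan != NULL);
    for (int i = 0; i < NUM_SUBBUFS; i++) {
        atomic_init(&chan->bufs[i].offset, 0);
        chan->bufs[i].size = SUBBUF_SIZE;
        memset(chan->bufs[i].data, 0, SUBBUF_SIZE);
    }
    chan->current = 0;
}

/* Hypothetical slow path: move to the next sub-buffer, or give up. */
static struct relay_buf *switch_buffer(struct relay_chan *chan)
{
    if (chan->current + 1 >= NUM_SUBBUFS)
        return NULL;                   /* nothing left: the event is dropped */
    chan->current++;
    return &chan->bufs[chan->current];
}

/* Fast path: one fetch-and-add plus one comparison when the slot fits. */
static void *relay_reserve(struct relay_chan *chan, size_t length)
{
    struct relay_buf *buf = &chan->bufs[chan->current];

    for (;;) {
        /* Claim [offset, offset + length) even if it turns out not to fit. */
        size_t offset = atomic_fetch_add(&buf->offset, length);

        if (offset + length <= buf->size)
            return buf->data + offset;
        buf = switch_buffer(chan);
        if (!buf)
            return NULL;
    }
}
```

A caller would reserve a slot, check for NULL, and then write into the
returned pointer directly. Note that a reservation that does not fit still
advances the old sub-buffer's offset past its end; in the thread's version
that overshooting offset is handed to relay_switch_buffer(), which this
sketch does not model.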
Re: 2.6.11-rc1-mm1
Karim Yaghmour wrote:
> This is not good for any client that doesn't know beforehand the exact
> size of their data units, as in the case of LTT. If LTT has to use this
> code that means we are going to lose performance because we will need to
> fill an intermediate data structure which will only be used for relay_write().
> Instead of zero-copy, we would have an extra unnecessary copy. There has
> got to be a way for clients to directly reserve and write as they wish.
> Even Zach Brown recognized this in his tracepipe proposal, here's from
> his patch:
> + * - let caller reserve space and get a pointer into buf

Also, if the reserve is exported, then a client that chooses so can do
something like:

    local_irq_save();
    relay_reserve();
    write();
    write();
    write();
    ...
    local_irq_restore();

And therefore enforce in-order events if he so chooses.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
Karim Yaghmour wrote:
> This is not good for any client that doesn't know beforehand the exact
> size of their data units, as in the case of LTT. If LTT has to use this
> code that means we are going to lose performance because we will need to
> fill an intermediate data structure which will only be used for relay_write().
> Instead of zero-copy, we would have an extra unnecessary copy. There has
> got to be a way for clients to directly reserve and write as they wish.
> Even Zach Brown recognized this in his tracepipe proposal, here's from
> his patch:
> + * - let caller reserve space and get a pointer into buf

Actually, come to think of it, this code is not good for any client that
needs to fill complex data structures, whether they be fixed-size or not,
because it requires having a prepackaged structure already available. Any
client that wants to have zero-copying will want to write data directly
into the buffer instead of filling an intermediate buffer first. And this
requires being able to atomically reserve.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
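The difference between the two approaches can be shown with a small
user-space sketch. Everything in it is an assumption made for illustration:
struct sample_event, the four-slot ring, and the helpers ring_init(),
ring_reserve(), log_with_copy() and log_in_place() are hypothetical and do
not come from relayfs or LTT. C11 atomic_fetch_add() plays the role of the
atomic reservation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 4

/* Hypothetical fixed-size event record (no padding: 4 + 4 + 8 bytes). */
struct sample_event {
    uint32_t id;
    uint32_t cpu;
    uint64_t timestamp;
};

struct ring {
    atomic_size_t next;                     /* index of the next free slot */
    struct sample_event slots[RING_SLOTS];
};

static void ring_init(struct ring *r)
{
    assert(r != NULL);
    atomic_init(&r->next, 0);
    memset(r->slots, 0, sizeof r->slots);
}

/* Atomically claim one slot; NULL once the ring is full. */
static struct sample_event *ring_reserve(struct ring *r)
{
    size_t idx = atomic_fetch_add(&r->next, 1);

    return idx < RING_SLOTS ? &r->slots[idx] : NULL;
}

/* relay_write() style: build the record elsewhere, then copy it in. */
static int log_with_copy(struct ring *r, uint32_t id, uint32_t cpu, uint64_t ts)
{
    struct sample_event tmp = { id, cpu, ts };   /* intermediate structure */
    struct sample_event *slot = ring_reserve(r);

    if (!slot)
        return -1;
    memcpy(slot, &tmp, sizeof tmp);              /* the extra copy */
    return 0;
}

/* Reserve style: fill the fields directly in the reserved slot. */
static int log_in_place(struct ring *r, uint32_t id, uint32_t cpu, uint64_t ts)
{
    struct sample_event *ev = ring_reserve(r);

    if (!ev)
        return -1;
    ev->id = id;
    ev->cpu = cpu;
    ev->timestamp = ts;
    return 0;
}
```

Both functions leave the same bytes in the ring; the second one simply
never materializes the temporary. With a record this small a compiler may
well optimize the copy away, so the sketch shows the shape of the two APIs
rather than a measurable cost.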
Re: 2.6.11-rc1-mm1
Hello Roman, Roman Zippel wrote: > Well, let's concentrate for a moment on the last thing and check later > if and how they fit into relayfs. Since ltt will be first main user, let's > optimize it for this. > Also since relayfs is intended for large, fast data transfers, per cpu > buffers are pretty much always required, so it would make sense to leave > this to relayfs (less to get wrong for the client). But how does relayfs organize the namespace then? What if I have multiple channels per CPU, each for a different type of data, will all channels for the same CPU be under the same directory or will each type of data have its own directory with one entry per CPU? I don't have an answer to that, and I don't know that we should. Why not just leave it to the client to organize his data as he wishes. If we must assume that everyone will have at least one channel per CPU, then why not provide helper functions built on top of very basic functions instead of fixing the namespace in stone? > I have to modify it a little (only the if (!buffer) part is new): > > cpu = get_cpu(); > buffer = relay_get_buffer(chan, cpu); > while(1) { > offset = local_add_return(buffer->offset, length); > if (likely(offset + length <= buffer->size)) > break; > buffer = relay_switch_buffer(chan, buffer, offset); > if (!buffer) { > put_cpu(); > return; > } > } > memcpy(buffer->data + offset, data, length); > put_cpu(); > > This has a very short fast path and I need very good reasons to change/add > anything here. OTOH the slow path with relay_switch_buffer() is less > critical and still leaves a lot of flexibility. This is not good for any client that doesn't know beforehand the exact size of their data units, as in the case of LTT. If LTT has to use this code that means we are going to loose performance because we will need to fill an intermediate data structure which will only be used for relay_write(). Instead of zero-copy, we would have an extra unnecessary copy. 
There has got to be a way for clients to directly reserve and write as they wish. Even Zach Brown recognized this in his tracepipe proposal, here's from his patch: + * - let caller reserve space and get a pointer into buf >>1) get_cpu() and put_cpu() won't do. You need to outright disable >>interrupts because you may be called from an interrupt handler. > > > Look closer, it's already interrupt safe, the synchronization for the > buffer switch is left to relay_switch_buffer(). Sorry, I'm still missing something. What exactly does local_add_return() do? I assume this code has got to be interrupt safe? Something like: #define local_add_return(OFFSET, LEN) \ do {\ ... local_irq_save(); \ OFFSET += LEN; local_irq_restore(); \ ... } while(0); I'm assuming local_irq_XXX because we were told by quite a few people in the related thread to avoid atomic ops because they are more expensive on most CPUs than cli/sti. Also how does relay_get_buffer() operate? What if I'm writing an event from within a system call and I'm about to switch buffers and get an interrupt at the if(likely(...))? Isn't relay_get_buffer() going to return the same pointer as the one obtained for the syscall, and aren't both cases now going to effect relay_switch_buffer(), one of which will be superfluous? > This adds a conditional and is not really needed. Above shows how to make > it interrupt safe and if the clients wants to reuse the same buffer, leave > the locking to the client. Fine, but how is the client going to be able to reuse the same buffer if relayfs always assumes per-CPU buffer as you said above? This would be solved if at its core relayfs' functions worked on single channels and additional code provided helpers for making the SMP case very simple. > That's quite a lot of code with at least 14 conditions (or 13 conditions > too much) and this is just relayfs. I believe Tom has refactored the code with your comments in mind, and has something ready for review. 
I just want to clear up the above before we make this final. Among other things, he just dropped all modes, and there's only a basic relay_write() that closely resembles what you have above. > That's not always true, where performance matters we provide different > functions (e.g. spinlocks), so having an alternative version of > relay_write is a possibility (although I'd like to see the user first). Sure, see above in the case of LTT. Karim -- Author, Speaker, Developer, Consultant Pushing Embedded and Real-Time Linux Systems Beyond the Limits http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
Hi, On Fri, 21 Jan 2005, Karim Yaghmour wrote: > I should have avoided earlier confusing the use of a certain type of > relayfs channel for a given purpose (i.e. LTT should not necessarily > depend on the managed mode.) I believe that there is a need for > more than one mode in relayfs independently of LTT. There are users > who want to be able to manage the data in a buffer (by manage I mean: > receive notification of important buffer events, be able to insert > important data at boundaries, etc.), and there are users who just > want to dump as much information as possible in as fast a way as > possible without having to deal with non-essential codepaths. Well, let's concentrate for a moment on the last thing and check later if and how they fit into relayfs. Since ltt will be first main user, let's optimize it for this. Also since relayfs is intended for large, fast data transfers, per cpu buffers are pretty much always required, so it would make sense to leave this to relayfs (less to get wrong for the client). > looking at this code: I have to modify it a little (only the if (!buffer) part is new): cpu = get_cpu(); buffer = relay_get_buffer(chan, cpu); while(1) { offset = local_add_return(buffer->offset, length); if (likely(offset + length <= buffer->size)) break; buffer = relay_switch_buffer(chan, buffer, offset); if (!buffer) { put_cpu(); return; } } memcpy(buffer->data + offset, data, length); put_cpu(); This has a very short fast path and I need very good reasons to change/add anything here. OTOH the slow path with relay_switch_buffer() is less critical and still leaves a lot of flexibility. > 1) get_cpu() and put_cpu() won't do. You need to outright disable > interrupts because you may be called from an interrupt handler. Look closer, it's already interrupt safe, the synchronization for the buffer switch is left to relay_switch_buffer(). 
> 3) I'm unclear about the need for local_add_return(), why not > just: > if (likely(buffer->offset + length <= buffer->size) > In any case, here's what we do in relay_write(): > write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting); Ok, let's take a closer look at the fast path of relay_write (via relay_managed.c): > rchan_get(rchan); This is not needed, it's the responsibility of the client to keep a reference to the channel. A synchronize_kernel() is enough to get rid of current users of the channel on other cpus. > relay_lock_channel(rchan, flags); what becomes: > FLAGS = 0; > if (RCHAN->flags & RELAY_USAGE_SMP) local_irq_save(FLAGS); > else spin_lock_irqsave(&(RCHAN)->mode.managed.lock, FLAGS); This adds a conditional and is not really needed. Above shows how to make it interrupt safe and if the clients wants to reuse the same buffer, leave the locking to the client. > write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting); what becomes: > if (rchan == NULL) ... Is this really needed? > if (slot_len >= rchan->buf_size) ... You can leave it to caller to check for this, a BUG_ON should be enough here. > if (rchan->initialized == 0) ... Does this really have to be in the fast path? > if (in_progress_event_size(rchan)) ... What's the point of this? You already disable interrupts, so how can anything else be in progress? > if (cur_write_pos(rchan) + slot_len > write_limit(rchan)) ... Ok. This leads to the slow path and not interesting right now. > if (likely(write_pos != NULL)) { After 7 conditions we finally have a valid write position (and that's without ltt). > relay_write_direct(write_pos, data_ptr, count); If write_pos is just a normal memory pointer, why not also just use memcpy? > relay_commit(rchan, write_pos, count, reserve_code, interrupting); what becomes: > if (rchan == NULL) > return; Hopefully no comment needed. > if (interrupting) ... Same comment as above for in_progress_event_size(). > if (deliver) ... > ... 
> if (deliver && waitqueue_active(&rchan->mmap_read_wait)) Why is that hook needed here? Why can't this be done by the client? A buffer switch notification can be done somewhere else. > relay_unlock_channel(rchan, flags); > rchan_put(rchan); Same comment as above. That's quite a lot of code with at least 14 conditions (or 13 conditions too much) and this is just relayfs. > The difference between these modes is akin the > difference between GFP_KERNEL, GFP_ATOMIC, GFP_USER, etc.: same API, > different underlying functionality. That's not always true, where perfomance matters we provide different functions (e.g. spinlocks), so having an alternative version of relay_write is a possibility (although I'd like
Re: 2.6.11-rc1-mm1
OK, I finally come around to answering this ... Roman Zippel wrote: > Sorry, you missunderstood me. At the moment I'm only secondarily > interested in the API details, primarily I want to work out the details of > what exactly relayfs/ltt are supposed to do. One main question here I > can't answer yet, why you insist on multiple relayfs modes. I should have avoided earlier confusing the use of a certain type of relayfs channel for a given purpose (i.e. LTT should not necessarily depend on the managed mode.) I believe that there is a need for more than one mode in relayfs independently of LTT. There are users who want to be able to manage the data in a buffer (by manage I mean: receive notification of important buffer events, be able to insert important data at boundaries, etc.), and there are users who just want to dump as much information as possible in as fast a way as possible without having to deal with non-essential codepaths. > This is what I basically have in mind for the relay_write function: > > cpu = get_cpu(); > buffer = relay_get_buffer(chan, cpu); > while(1) { > offset = local_add_return(buffer->offset, length); > if (likely(offset + length <= buffer->size)) > break; > buffer = relay_switch_buffer(chan, buffer, offset); > } > memcpy(buffer->data + offset, data, length); > put_cpu(); looking at this code: 1) get_cpu() and put_cpu() won't do. You need to outright disable interrupts because you may be called from an interrupt handler. 2) You assume that relayfs creates one buffer per cpu for each channel. We think this is wrong. Relayfs should not need to care about the number of CPUs, it's the clients' responsibility to create as many channels as they see fit, whether it be one channel per CPU or 10 channels per CPU or 1 channel per interrupt, etc. 
3) I'm unclear about the need for local_add_return(), why not just:

    if (likely(buffer->offset + length <= buffer->size))

In any case, here's what we do in relay_write():

    write_pos = relay_reserve(rchan, count, &reserve_code, &interrupting);

If there's any buffer switching required, that will be done in relay_reserve. This has the added advantage that clients that want to write directly to the buffer without using relay_write() can do so by calling relay_reserve() and not care about required buffer switching.

4) After securing the area, you simply go ahead and do a memcpy() and leave. We think that this is insufficient. Here's what we do:

    if (likely(write_pos != NULL)) {
        relay_write_direct(write_pos, data_ptr, count);
        relay_commit(rchan, write_pos, count, reserve_code, interrupting);
        *wrote_pos = write_pos;

The relay_write_direct() is basically a memcpy(). We also do a relay_commit(). This actually effects the delivery of the event. If, for example, there had been a buffer switch at the previous relay_reserve(), then this call to relay_commit() will generate a call to the client's deliver() callback function. In the case of LTT, for example, this is how it knows that it's got to notify the user-space daemon that there are buffers to consume (i.e. write to disk.)

> ltt_log_event should only be a few lines more (for writing header and
> event data).

Actually no, you don't want ltt_log_event using relay_write(), for one thing because it can generate variable size events. Instead, ltt_log_event does (basically):

    data_size = sizeof(event_id) + sizeof(time_delta) + sizeof(data_size);
    relay_lock_channel();
    relay_reserve();
    relay_write_direct(&event_id, sizeof(event_id));
    relay_write_direct(&time_delta, sizeof(time_delta));
    if (var_data) {
        relay_write_direct(var_data, var_data_len);
        data_size += var_data_len;
    }
    relay_write_direct(&data_size, sizeof(data_size));
    relay_commit();
    relay_unlock_channel();

> What I'd like to know now are the reasons why you need more than this.
I hope the above explanation clarifies things. > It's not the amount of data and any timing requirements have to be done by > the caller. During processing you either take the events in the order they > were recorded (often that's good enough) or you sort them which is not > that difficult. Ordering is a non-issue to be honest. Unless you've got some hardware scope in there, it's almost impossible to pinpoint exactly when an event occurred. There is no single line of code where an event occurs, so it's all an educated guess anyway. You want things to resemble what really happened in as much as possible though. > I know you don't want to touch the topic of kernel debugging, but its > requirements greatly overlap with what you want to do with ltt, e.g. one > needs very often information about scheduling events as many kernel > processes rely more and more on kernel threads. The only real
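The variable-size record that the ltt_log_event outline in this message
produces (event id, time delta, optional payload, trailing size field) can
be illustrated with a self-contained sketch. The field widths (one byte for
the id, four for the delta, two for the size), the function names
write_direct() and encode_event(), and the single flat slot are assumptions
made here, not LTT's actual on-disk format or API. Unlike the outline, the
sketch computes the full size up front, because it needs it to check that
the record fits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append len bytes at pos and return the advanced write position. */
static unsigned char *write_direct(unsigned char *pos, const void *src, size_t len)
{
    memcpy(pos, src, len);
    return pos + len;
}

/*
 * Encode one event into an already reserved slot of cap bytes.
 * Returns the number of bytes written, or 0 if the event does not fit.
 */
static size_t encode_event(unsigned char *slot, size_t cap,
                           uint8_t event_id, uint32_t time_delta,
                           const void *var_data, size_t var_data_len)
{
    uint16_t data_size;
    size_t total = sizeof event_id + sizeof time_delta + sizeof data_size;

    if (var_data) {
        if (var_data_len > cap)
            return 0;
        total += var_data_len;
    }
    if (total > cap || total > UINT16_MAX)
        return 0;
    data_size = (uint16_t)total;

    unsigned char *pos = slot;
    pos = write_direct(pos, &event_id, sizeof event_id);
    pos = write_direct(pos, &time_delta, sizeof time_delta);
    if (var_data)
        pos = write_direct(pos, var_data, var_data_len);
    /* The size goes last, as in the outline. */
    pos = write_direct(pos, &data_size, sizeof data_size);

    assert((size_t)(pos - slot) == total);
    return total;
}
```

Because the payload length is only known per event, a fixed-size
relay_write(data, length) call would force the caller to assemble these
fields in a temporary buffer first, which is the extra copy the message
objects to.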
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
Werner Almesberger wrote: > - if the probe target is an instruction long enough, replace it with >a jump or call (that's what I think the kprobes folks are working >on. I remember for sure that they were thinking about it.) I heard about this years ago, but I don't know that anything came of it. I suspect that this is not as simple as it looks and that the only reliable way to do it is with a trap. > Probably because everybody saw that it was good :-) Great, thanks. That's what we'll aim for then. We've already got the "disable" and "static" implemented, so now we need to figure out how do we best implement this tagging. IBM's kernel hooks allowed the NOP solution, so I'm guessing it shouldn't be that much of a stretch to extend it for marking up the code for kprobes and friends. I don't know whether this code is still maintained or not, but I'd like to hear input as to whether this is a good basis, or whether you're thinking of something like your uml-sim hooks? > So you need seeking, even in the presence of fine-grained control > over what gets traced in the first place ? (As opposed to extracting > the interesting data from the full trace, given that the latter > shouldn't contain too much noise.) The problem is that you don't necessarily know beforehand what's the problem. So here's an actual example: I had a client who had this box on which a task was always getting picked up by the OOM killer. Try as they might, the development team couldn't figure out which part of the code was causing this. So we put LTT in there and in less than 5 minutes we found the problem. It turned out that a user-space access to a memory-mapped FPGA caused an unexpected FP interrupt to occur, and the application found itself in a recursive signal handler. In this case there was an application symptom, but it was a hardware problem. 
This is just a simple example, but there are plenty of other cases
where a sysadmin experiences some weird, hard-to-reproduce bugs on some
of his systems and spends a considerable amount of time trying to guess
what's happening. This is especially complicated when there's no
indication as to the root of the problem. At that point, being able to
log everything and to browse through it rapidly is critical. Once
you've done such a first trace you _may_ _possibly_ be able to refine
your search requirements and relog with that in mind, but that's after
the fact.

> Or that they have been consumed. My question is just whether this
> kind of aggregation is something you need.

Absolutely. If you're thinking about short traces of 100KB or a few MB,
then a simpler scheme would be possible. But when we're talking about
GBs and 100s of GBs spanning days, there's got to be a managed way of
doing it.

>>I have nothing against kprobes. People keep referring to it as if
>>it magically made all the related problems go away, and it doesn't.
>
> Yes, I know just too well :-) In umlsim, I have pretty much the
> same problems, and the solutions aren't always nice. So far, I've
> been lucky enough that I could almost always find a suitable
> function entry to abuse.

Glad you acknowledge as much.

> However, since a kprobes-based mechanism is - in the worst case,
> i.e. when needing markup - as good as direct calls to LTT, and gives
> you a lot more flexibility if things aren't quite as hostile, I
> think it makes sense to focus on such a solution.

You certainly have a lot more experience than I do with that, so I'd
like to solicit your help. As above: what's the best way to provide
this in addition to the static and disable points?

> Yup, but you could move even more intelligence outside the kernel.
> All you really need in the kernel is a place to put the probe,
> plus some debugging information to tell you where you find the
> data (the latter possibly combined with gently coercing the
> compiler to put it at some accessible place).

Right, but then you end up with a mechanism with generalized hooks.
Actually there was a time when LTT was a driver and you could either
build it as a module or keep it built-in. However, when we published
patches to get LTT accepted in 2.5 we were told on LKML to move LTT
into kernel/ and avoid all this driver stuff. Having it, or parts of
it, in the kernel makes it much simpler, and much more likely that the
existing ad-hoc tracing code spread across the sources will be removed
in exchange for a single agreed-upon way of doing things.

It must be said that, as I did with relayfs, I will put the LTT patch
through a major rework and post the patches for review on LKML as
before.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
[ 3rd try. Apologies to Karim, Thomas, and Roman, who apparently also received my previous attempts. For some reason, one of my upstream DNS servers decided to send me highly bogus MX records. ] Karim Yaghmour wrote: > Might I add that this is part of the problem ... No personal > offence intended, but there's been _A LOT_ of things said about > LTT that were based on third-hand account and no direct contact > with the toolset/code. Sigh, yes, guilty as charged ... At least today, I have a good excuse: my cable modem died, and I couldn't possibly have download things to look at :) > > As far as kprobes go, then you still need to have some form or another > > of marking the code for key events, unless you keep maintaining a set > > of kprobes-able points separately, which really makes it unusable for > > the rest of us, as the users of LTT have discovered over time (having > > to create a new patch for every new kernel that comes out.) Yes, I think you will need some set of "pads" in the code, where you can attach probes. I'm not sure how many, though. An alternative, at least in some cases, would be to move such things into separate functions, so that you could put the probe just at function entry. Then add a comment that this function isn't supposed to be torn apart without dire need. > > Generating new interrupts is simply unacceptable for LTT's functionality. Absolutely. If I remember correctly, this is in the process of being addressed in kprobes. You basically have the following choices: - if the probe target is an instruction long enough, replace it with a jump or call (that's what I think the kprobes folks are working on. I remember for sure that they were thinking about it.) - if the probe target is in a basic block with enough room after the target, see above (needs feedback from compiler or assembler) - if all else fails, add some NOPs (i.e. 
the marker approach) > I have received very little feedback on this suggestion, Probably because everybody saw that it was good :-) > As for the location of ltt trace points, then they are very rarely > at function boundaries. Here's a classic: > prepare_arch_switch(rq, next); > ltt_ev_schedchange(prev, next); > prev = context_switch(rq, prev, next); Yes, in some cases, you don't have a choice but to add some marker. > > Removing this data would require more data for each event to > > be logged, and require parsing through the trace before reading it in > > order to obtain markers allowing random access. So you need seeking, even in the presence of fine-grained control over what gets traced in the first place ? (As opposed to extracting the interesting data from the full trace, given that the latter shouldn't contain too much noise.) > If I understand you correctly, you are talking about the fact that > the transport layer's management of the buffers is syncrhonized > with some user-space entity that consumes the buffers produced > and talks back to relayfs (albeit indirectly) to let it know that > said buffers are now available? Or that they have been consumed. My question is just whether this kind of aggregation is something you need. > I have nothing against kprobes. People keep refering to it as if > it magically made all the related problems go away, and it doesn't. Yes, I know just too well :-) In umlsim, I have pretty much the same problems, and the solutions aren't always nice. So far, I've been lucky enough that I could almost always find a suitable function entry to abuse. However, since a kprobes-based mechanism is - in the worst case, i.e. when needing markup - as good as direct calls to LTT, and gives you a lot more flexibility if things aren't quite as hostile, I think it makes sense to focus on such a solution. 
> Nothing precludes us to move in this direction once something is
> in the kernel, it's all currently hidden away in a .h, and it would
> be the same with this.

Yup, but you could move even more intelligence outside the kernel.
All you really need in the kernel is a place to put the probe,
plus some debugging information to tell you where you find the
data (the latter possibly combined with gently coercing the
compiler to put it at some accessible place).

- Werner

-- 
Werner Almesberger, Buenos Aires, Argentina [EMAIL PROTECTED]
http://www.almesberger.net/
Re: 2.6.11-rc1-mm1
On Wed, Jan 19, 2005 at 11:06:10PM +, Marcos D. Marado Torres wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Fri, 14 Jan 2005, Barry K. Nathan wrote:
>
> >This isn't new to 2.6.11-rc1-mm1, but it has the infamous (to Fedora
> >users) "ACPI shutdown bug" -- poweroff hangs instead of actually turning
> >the computer off, on some computers. Here's the RH Bugzilla report where
> >most of the discussion took place:
> >
> >https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=132761
>
> This is the same bug I've talked here:
> http://lkml.org/lkml/2005/1/11/88

FWIW the RH Bugzilla bug is (unfortunately) discussing several different
similar but not identical bugs, as far as I can tell.

> This only happens with -mm and not with vanilla sources.
>
> I'm reporting about this issue in an ASUS M3N laptop with Debian.
>
> Best regards,
> Mind Booster Noori

FWIW my report against -mm (where I narrowed it down to one of the
kexec patches in particular) is here:
http://bugme.osdl.org/show_bug.cgi?id=4041

-Barry K. Nathan <[EMAIL PROTECTED]>
Re: 2.6.11-rc1-mm1
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 14 Jan 2005, Barry K. Nathan wrote:

> This isn't new to 2.6.11-rc1-mm1, but it has the infamous (to Fedora
> users) "ACPI shutdown bug" -- poweroff hangs instead of actually turning
> the computer off, on some computers. Here's the RH Bugzilla report where
> most of the discussion took place:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=132761

This is the same bug I've talked about here:
http://lkml.org/lkml/2005/1/11/88

This only happens with -mm and not with vanilla sources.

I'm reporting this issue on an ASUS M3N laptop with Debian.

Best regards,
Mind Booster Noori

> In the Fedora kernels it turned out to be due to kexec. I'll see if I
> can narrow it down further.
>
> -Barry K. Nathan <[EMAIL PROTECTED]>

- -- 
/* *** */
Marcos Daniel Marado Torres AKA Mind Booster Noori
http://student.dei.uc.pt/~marado - [EMAIL PROTECTED]
()  Join the ASCII ribbon campaign against html email, Microsoft
/\  attachments and Software patents. They endanger the World.
Sign a petition against patents: http://petition.eurolinux.org
/* *** */
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Made with pgp4pine 1.76

iD8DBQFB7ufzmNlq8m+oD34RAmsIAKDM55tzy957YqEXtNkz9l2O3O7V1ACeKXQB
v2LuSPMWch9A7NQApq6Bm8c=
=F7on
-----END PGP SIGNATURE-----
Re: 2.6.11-rc1-mm1 (and others): heavy disk I/O -> poor performance
On Wednesday 19 January 2005 at 13:42, bert hubert wrote:
> On Tue, Jan 18, 2005 at 10:39:35PM +0100, Fabio Coatti wrote:
> > vmstat under load is the following, and config.gz attached. Of course I
> > can provide any other needed detail; many thanks for any hint.
>
> Looks mightily like DMA is not on, even though you compiled the PIIX driver
> in, which lists
>
> > :00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) IDE
> > Controller
>
> Can you show the output of hdparm /dev/hda ? Can you show dmesg?

Sure, here it is:

/dev/hda:
 multcount    = 16 (on)
 IO_support   =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    = 256 (on)
 geometry     = 65535/16/63, sectors = 60040544256, start = 0

I've cut dmesg down to the IDE-relevant part; please let me know if more
details are needed.

Jan 19 21:43:53 kefk Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
Jan 19 21:43:53 kefk ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Jan 19 21:43:53 kefk ICH5: IDE controller at PCI slot :00:1f.1
Jan 19 21:43:53 kefk ACPI: PCI interrupt :00:1f.1[A] -> GSI 18 (level, low) -> IRQ 169
Jan 19 21:43:53 kefk ICH5: chipset revision 2
Jan 19 21:43:53 kefk ICH5: not 100% native mode: will probe irqs later
Jan 19 21:43:53 kefk ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:pio
Jan 19 21:43:53 kefk ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio, hdd:pio
Jan 19 21:43:53 kefk Probing IDE interface ide0...
Jan 19 21:43:53 kefk hda: MAXTOR 6L060J3, ATA DISK drive
Jan 19 21:43:53 kefk ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Jan 19 21:43:53 kefk Probing IDE interface ide1...
Jan 19 21:43:53 kefk hdc: TEAC DV-W58G, ATAPI CD/DVD-ROM drive
Jan 19 21:43:53 kefk ide1 at 0x170-0x177,0x376 on irq 15
Jan 19 21:43:53 kefk Probing IDE interface ide2...
Jan 19 21:43:53 kefk ide2: Wait for ready failed before probe !
Jan 19 21:43:53 kefk Probing IDE interface ide3...
Jan 19 21:43:53 kefk ide3: Wait for ready failed before probe !
Jan 19 21:43:53 kefk Probing IDE interface ide4...
Jan 19 21:43:53 kefk ide4: Wait for ready failed before probe !
Jan 19 21:43:53 kefk Probing IDE interface ide5...
Jan 19 21:43:53 kefk ide5: Wait for ready failed before probe !
Jan 19 21:43:53 kefk hda: max request size: 128KiB
Jan 19 21:43:53 kefk hda: 117266688 sectors (60040 MB) w/1819KiB Cache, CHS=65535/16/63, UDMA(100)
Jan 19 21:43:53 kefk hda: cache flushes supported
Jan 19 21:43:53 kefk hda: hda1 hda2 < hda5 hda6 hda7 > hda3 hda4
Jan 19 21:43:53 kefk PCI: :03:06.0 has unsupported PM cap regs version (1)
Jan 19 21:43:53 kefk ACPI: PCI interrupt :03:06.0[A] -> GSI 22 (level, low) -> IRQ 177
Jan 19 21:43:53 kefk PCI: :03:06.0 has unsupported PM cap regs version (1)
Jan 19 21:43:53 kefk ahc_pci:3:6:0: Host Adapter Bios disabled. Using default SCSI device parameters
Jan 19 21:43:53 kefk scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
Jan 19 21:43:53 kefk
Jan 19 21:43:53 kefk aic7850: Single Channel A, SCSI Id=7, 3/253 SCBs
Jan 19 21:43:53 kefk
Jan 19 21:43:53 kefk Vendor: Nikon Model: COOLSCANIII Rev: 1.31
Jan 19 21:43:53 kefk Type: Scanner ANSI SCSI revision: 02
Jan 19 21:43:53 kefk (scsi0:A:3): 10.000MB/s transfers (10.000MHz, offset 15)
Jan 19 21:43:53 kefk Vendor: PLEXTOR Model: CD-ROM PX-40TS Rev: 1.01
Jan 19 21:43:53 kefk Type: CD-ROM ANSI SCSI revision: 02
Jan 19 21:43:53 kefk (scsi0:A:5): 10.000MB/s transfers (10.000MHz, offset 15)
Jan 19 21:43:53 kefk Vendor: YAMAHA Model: CRW6416S Rev: 1.0c
Jan 19 21:43:53 kefk Type: CD-ROM ANSI SCSI revision: 02
Jan 19 21:43:53 kefk libata version 1.10 loaded.
Jan 19 21:43:53 kefk ata_piix version 1.03
Jan 19 21:43:53 kefk ACPI: PCI interrupt :00:1f.2[A] -> GSI 18 (level, low) -> IRQ 169
Jan 19 21:43:53 kefk PCI: Setting latency timer of device :00:1f.2 to 64
Jan 19 21:43:53 kefk ata1: SATA max UDMA/133 cmd 0xC000 ctl 0xC402 bmdma 0xD000 irq 169
Jan 19 21:43:53 kefk ata2: SATA max UDMA/133 cmd 0xC800 ctl 0xCC02 bmdma 0xD008 irq 169
Jan 19 21:43:53 kefk ata1: dev 0 cfg 49:2f00 82:7c6b 83:7f09 84:4003 85:7c69 86:3e01 87:4003 88:207f
Jan 19 21:43:53 kefk ata1: dev 0 ATA, max UDMA/133, 320173056 sectors: lba48
Jan 19 21:43:53 kefk ata1: dev 0 configured for UDMA/133
Jan 19 21:43:53 kefk scsi1 : ata_piix
Jan 19 21:43:53 kefk ata2: SATA port has no device.
Jan 19 21:43:53 kefk scsi2 : ata_piix
Jan 19 21:43:53 kefk Vendor: ATA Model: Maxtor 6Y160M0 Rev: YAR5
Jan 19 21:43:53 kefk Type: Direct-Access ANSI SCSI revision: 05
Jan 19 21:43:53 kefk SCSI device sda: 320173056 512-byte hdwr sectors (163929 MB)
Jan 19 21:43:53 kefk SCSI device sda: drive cache: write back
Jan 19 21:43:53 kefk SCSI device sda: 320173056 512-byte hdwr sectors (163929 MB)
Jan 19 21:43:53 kefk SCSI device
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
Werner Almesberger wrote:
> From all I've heard and seen of LTT (and I have to admit that most
> of it comes from reading this thread, not from reading the code),

Might I add that this is part of the problem ... No personal
offence intended, but there's been _A LOT_ of things said about
LTT that were based on third-hand account and no direct contact
with the toolset/code.

And part of the problem is that _many_ people on this list, and
elsewhere, have done some form of tracing or another as part of
their development, so they all have their idea of how this is
best done. Yet, while such experience can help provide additional
ideas to LTT's development, it also often requires re-explaining
to every new suggestor why we added features he couldn't imagine
would be useful to any of his/her own tracing needs ... Sometimes
I wish my interests lay in some arcane feature that few had ever
played with ;)

IOW, while I don't discount anybody else's experience with tracing,
please give us at least the benefit of the doubt by actually:
a) Looking at the code
b) Looking at the mailing list archives
c) Asking us questions directly related to the code

> I have the impression that it may try to be a bit too specialized,
> and thus might miss opportunities for synergy.

Bear with me on this one ...

> You must be getting tired of people trying to redesign things from
> scratch, but maybe you'll humor me anyway ;-)

Hey, from you Werner I'll take anything. It's always a pleasure
talking with you :)

> Karim Yaghmour wrote:
>
>>If you really want to define layers, then there are actually four
>>layers:
>>1- hooking mechanism
>>2- event definition / registration
>>3- event management infrastructure
>>4- transport mechanism
>
>
> For 1, kprobes would seem largely sufficient. In cases where you
> don't have a usable attachment point (e.g.
in the middle of a > function and you need access to variables with unknown location), > you can add lightweight instrumentation that arranges the code > flow suitably. [1, 2] Let me say outright, as I said to Andi early on in the sister thread, that I have no problems with having the trace points being fed by kprobes. In fact, in 2000, way back before kprobes even existed, LTT was already interfacing with DProbes for dynamic insertion of trace points. ... There I said it ... now watch me have to repeat this yet again later on ... :/ However, kprobes is not magic: a) Like I said to Andi: > As far as kprobes go, then you still need to have some form or another > of marking the code for key events, unless you keep maintaining a set > of kprobes-able points separately, which really makes it unusable for > the rest of us, as the users of LTT have discovered over time (having > to create a new patch for every new kernel that comes out.) b) Like I said to Andrew back in July: > I've double-checked what I already knew about kprobes and have looked again > at the site and the patch, and unless there's some feature of kprobes I don't > know about that allows using something else than the debug interrupt to add > hooks, ... > Generating new interrupts is simply unacceptable for LTT's functionality. > Not to mention that it breaks LTT because tracing something will generate > events of its own, which will generating tracing events of their own ... > recursion. Ok, you can argue about the recursion thing with an "if()", but you'll have to admit that like in the case I described to Roman: > ... Say you're getting > 2MB/s of data (which is not unrealistic on a loaded system.) That means > that if I'm tracing for 2 days, I've got 345GB of data (~7.5GB/hour). IOW, something like 200,000events/s (average of 10bytes/event). Do I really need to explain that 200,000 traps/interrupts per second is not something you want ... ? 
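
The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), a constant data rate, and the 10-byte average event size mentioned in the message; the function names are made up for illustration and are not part of LTT.

```python
# Back-of-the-envelope check of the trace volume figures quoted above.
# Assumptions (not from LTT itself): decimal units, constant data rate.

MB = 10**6
GB = 10**9


def trace_volume_gb(rate_mb_per_s: float, seconds: float) -> float:
    """Total trace size in GB for a constant data rate."""
    return rate_mb_per_s * MB * seconds / GB


def events_per_second(rate_mb_per_s: float, avg_event_bytes: float) -> float:
    """Event rate implied by a data rate and an average event size."""
    return rate_mb_per_s * MB / avg_event_bytes


if __name__ == "__main__":
    two_days = 2 * 24 * 3600
    print(trace_volume_gb(2, two_days))   # total over two days, in GB
    print(trace_volume_gb(2, 3600))       # per hour, in GB
    print(events_per_second(2, 10))       # events per second
```

Under these assumptions the two-day total comes out at about 345.6 GB and the event rate at 200,000 events/s, matching the message; the hourly figure works out to 7.2 GB, slightly under the rounded ~7.5GB/hour quoted.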
But don't despair, like I said to Andi: > So lately I've been thinking that there may be a middle-ground here > where everyone could be happy. Define three states for the hooks: > disabled, static, marker. The third one just adds some info into > System.map for allowing the automation of the insertion of kprobes > hooks (though you would still need the debugging info to find the > values of the variables that you want to log.) Hence, you get to > choose which type of poison you prefer. For my part, I think the > noop/early-check should be sufficient to get better performance from > the existing hook-set. I have received very little feedback on this suggestion, though I really think it's worth entertaining, especially with your mention of uml-sim markers further below. As for the location of ltt trace points, then they are very rarely at function boundaries. Here's a classic: prepare_arch_switch(rq, next); ltt_ev_schedchange(prev, next); prev = context_switch(rq, prev, next); > 2 and 3 should be the main domain of LTT, with 2 sitting on top > of kprobes. kprobes currently doesn't have a nice way for > describing handlers, but that c
Re: 2.6.11-rc1-mm1
Christoph Hellwig wrote:
> On Sun, Jan 16, 2005 at 01:05:19PM -0600, Tom Zanussi wrote:
>
>> One of the things that uses these functions to read from a channel
>> from within the kernel is the relayfs code that implements read(2), so
>> taking them away means you wouldn't be able to use read() on a relayfs
>> file.
>
> Removing them from the public API is different from disallowing the
> read operation.

Right, but we were planning on removing all that code in the interest
of stripping relayfs down to its bare minimum as a high-speed data
transfer mechanism.

>> That wouldn't matter for ltt since it mmaps the file, but there
>> are existing users of relayfs that do use relayfs this way. In fact,
>> most of the bug reports I've gotten are from people using it in this
>> mode. That doesn't mean though that it's necessarily the right thing
>> for relayfs or these users to be doing if they have suitable
>> alternatives for passing lower-volume messages in this way. As others
>> have mentioned, that seems to be the major question - should relayfs
>> concentrate on being solely a high-speed data relay mechanism or
>> should it try to be more, as it currently is implemented?
>
> I'd say let it do one thing well, that is high-volume data transfer.

Yes, I think that's the one thing everyone's agreed on.

>> If the
>> former, then I wonder if you need a filesystem at all - all you have
>> is a collection of mmappable buffers and the only thing the filesystem
>> provides is the namespace. Removing read()/write() and filesystem
>> support would of course greatly simplify the code; I'd like to hear
>> from any existing users though and see what they'd be missing.
>
> What else would manage the namespace?

I have to confess I haven't had the time to look at it in detail, but
I previously suggested that we might be able to recover the read()
operations by providing them in userspace on top of the mmapped relayfs
buffer, using FUSE. If we did that, our FUSE filesystem could also
provide the namespace, I assume.
Anyway, I don't think I've seen any objections in principle to the
filesystem part of relayfs, so maybe it's not an issue - any other
suggestions would be welcome, of course...

Tom
Re: 2.6.11-rc1-mm1 (and others): heavy disk I/O -> poor performance
On Tue, Jan 18, 2005 at 10:39:35PM +0100, Fabio Coatti wrote:
> vmstat under load is the following, and config.gz attached. Of course I can
> provide any other needed detail; many thanks for any hint.

Looks mightily like DMA is not on, even though you compiled the PIIX driver
in, which lists

> :00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) IDE
> Controller

Can you show the output of hdparm /dev/hda ? Can you show dmesg?

-- 
http://www.PowerDNS.com  Open source, database driven DNS Software
http://lartc.org  Linux Advanced Routing & Traffic Control HOWTO
Re: 2.6.11-rc1-mm1
On Sun, Jan 16, 2005 at 01:05:19PM -0600, Tom Zanussi wrote:
> One of the things that uses these functions to read from a channel
> from within the kernel is the relayfs code that implements read(2), so
> taking them away means you wouldn't be able to use read() on a relayfs
> file.

Removing them from the public API is different from disallowing the
read operation.

> That wouldn't matter for ltt since it mmaps the file, but there
> are existing users of relayfs that do use relayfs this way. In fact,
> most of the bug reports I've gotten are from people using it in this
> mode. That doesn't mean though that it's necessarily the right thing
> for relayfs or these users to be doing if they have suitable
> alternatives for passing lower-volume messages in this way. As others
> have mentioned, that seems to be the major question - should relayfs
> concentrate on being solely a high-speed data relay mechanism or
> should it try to be more, as it currently is implemented?

I'd say let it do one thing well, that is high-volume data transfer.

> If the
> former, then I wonder if you need a filesystem at all - all you have
> is a collection of mmappable buffers and the only thing the filesystem
> provides is the namespace. Removing read()/write() and filesystem
> support would of course greatly simplify the code; I'd like to hear
> from any existing users though and see what they'd be missing.

What else would manage the namespace?
Re: 2.6.11-rc1-mm1
On Sun, Jan 16, 2005 at 02:30:33PM -0600, Tom Zanussi wrote:
> This would allow an application to write trace events of its own to a
> trace stream for instance.

I don't think this is a good idea. Userspace could just as easily write
its trace into shared memory segments.

> Also, I added a user-requested 'feature'
> whereby write()s on a relayfs channel would be sent to a callback that
> could be used to interpret 'out-of-band' commands sent from the
> userspace application.

Now write as a control channel makes lots of sense, but I'd encapsulate
that differently. Basically a net ctl file for each stream (and get rid
of ioctl in favour of this one while we're at it).
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
From all I've heard and seen of LTT (and I have to admit that most
of it comes from reading this thread, not from reading the code),
I have the impression that it may try to be a bit too specialized,
and thus might miss opportunities for synergy.

You must be getting tired of people trying to redesign things from
scratch, but maybe you'll humor me anyway ;-)

Karim Yaghmour wrote:
> If you really want to define layers, then there are actually four
> layers:
> 1- hooking mechanism
> 2- event definition / registration
> 3- event management infrastructure
> 4- transport mechanism

For 1, kprobes would seem largely sufficient. In cases where you
don't have a usable attachment point (e.g. in the middle of a
function and you need access to variables with unknown location),
you can add lightweight instrumentation that arranges the code
flow suitably. [1, 2]

2 and 3 should be the main domain of LTT, with 2 sitting on top
of kprobes. kprobes currently doesn't have a nice way for
describing handlers, but that can be fixed [3]. But you probably
don't need a "nice" interface right now, but might be satisfied
with one that works and is fast (?)

From the discussion, it seems that the management is partially
done by relayfs. I find this a little strange. E.g. instead of
filtering events, you may just not generate them in the first
place, e.g. by not placing a probe, or by filtering in LTT,
before submitting the event.

Timestamps may be fine either way. Restoring sequence should be
a task user-space can handle: in the worst case, you'd have to
read and merge from #cpus streams. Seeking works in that context,
too.

Last but not least, 4 should be simple. Particularly since you're
worried about extreme speeds, there should be as little processing
as you can afford. If you need to seek efficiently (do you,
really ?), you may not even want message boundaries at that
level.

Something that isn't entirely clear to me is if you also need to
aggregate information in buffers. E.g. by updating a record until
it has been retrieved by user space, or by updating a record when
there is no space to create a new one. Such functionality would
add complexity and needs tight synchronization with the transport.

[1] I've seen the argument that kprobes aren't portable. This
    strikes me as highly questionable. Even if an architecture
    doesn't have a trap instruction (or equivalent code sequence)
    that is at least as short as the shortest instruction, you
    can always fall back to adding instrumentation [2]. Also, if
    you know where your basic blocks are, you may be able to
    use traps that span multiple instructions. I recall that
    things of this kind are already planned for kprobes.

[2] See the "reliable markers" of umlsim from umlsim.sf.net.
    Implementation: cd umlsim/lib; make; tail -50 markers_kernel.h
    Examples: cd umlsim/sim/tests; cat sbug.marker
    They're basically extra-light markup in the source code.
    Works on ia32, but I haven't found a way to get the assembler
    to cooperate for amd64, yet.

[3] I've already solved this problem in umlsim: there, I have a
    Perl/C-like scripting language that allows handlers to do
    pretty much anything they want. Of course, kprobes would
    want pre-compiled C code, not some scripts, but I think
    the design could be developed in a direction that would
    allow both. Will take a while, but since I'll eventually
    have to rewrite the "microcode" anyway, ...

So my comments are basically as follows:

1) kprobes seems like a suitable and elegant mechanism for
   placing all the hooks LTT needs, so I think that it would
   be better to build on this basis, and extend it where
   necessary, than to build yet another specialized variant
   in parallel.

2) LTT should do what it is good at, and not have to worry
   about the rest (i.e. supporting infrastructure).

3) relayfs should be lean and fast, as you intend it to be, so
   that non-LTT tracing or fnord debugging fnord code may find it
   useful, too.
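
The suggestion above that user space can restore event order by merging one stream per CPU can be sketched roughly as follows. This is an illustration only, not LTT or relayfs code: the `Event` record and `merge_cpu_streams` are hypothetical names, and the sketch assumes each per-CPU stream is already sorted by a timestamp taken from a clock that is comparable across CPUs.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    """Hypothetical trace record; real LTT events look different."""
    timestamp: int   # assumed comparable across CPUs
    cpu: int
    payload: str


def merge_cpu_streams(streams: Iterable[Iterable[Event]]) -> Iterator[Event]:
    """Merge per-CPU streams, each already sorted by timestamp, into one.

    heapq.merge reads lazily, so only one pending event per stream is
    held in memory at a time.
    """
    return heapq.merge(*streams, key=lambda ev: ev.timestamp)


if __name__ == "__main__":
    cpu0 = [Event(10, 0, "syscall_entry"), Event(40, 0, "syscall_exit")]
    cpu1 = [Event(20, 1, "irq_entry"), Event(30, 1, "irq_exit")]
    for ev in merge_cpu_streams([cpu0, cpu1]):
        print(ev.timestamp, ev.cpu, ev.payload)
```

Because the merge is lazy, the same approach would in principle work on streams far larger than memory, which is the case the thread is worried about.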
- Werner

-- 
Werner Almesberger, Buenos Aires, Argentina [EMAIL PROTECTED]
http://www.almesberger.net/
2.6.11-rc1-mm1 (and others): heavy disk I/O -> poor performance
Under heavy disk I/O, the system becomes very unresponsive (i.e. even a drop down menu takes several seconds to open). I've noticed this under 2.6.11-rc1-mm1 and 2.6.10-mm2, but I can try whatever version is suggested. The way to reproduce this is quite simple: I'm using gentoo, when emerge --sync rebuilds cache the systems slows like a crawl; the same behaviour can be seen during a updatedb operation. with top, bdflush is often stuck in "D" state, as well the I/O bound process (say, emerge or updatedb). vmstat under load is the following, and config.gz attached. Of course I can provide any other needed detail; many thanks for any hint. [EMAIL PROTECTED] ~ $ vmstat 1 procs ---memory-- ---swap-- -io --system-- cpu r b swpd free buff cache si sobibo incs us sy id wa 1 0628 5252 499696 217712001914 8060 3 1 95 1 0 1628 25764 498764 20538400 444 1252 2121 943 7 6 48 39 0 1628 24412 498812 20662800 596 948 2032 1634 11 5 58 27 0 1628 23584 498816 20737200 380 2604 2045 1408 6 5 70 18 0 1628 23360 498816 2075760056 1528 1982 559 3 2 50 45 0 1628 22292 498820 20859200 496 980 2092 1120 11 5 51 33 0 1628 20372 498856 21012000 772 1504 2293 1621 21 9 49 21 0 1628 18964 498912 21135600 620 1432 2170 1615 13 7 53 28 0 1628 18340 498920 21189200 292 2924 2137 883 5 4 57 34 0 0628 17636 498956 21253600 264 712 2018 954 5 3 65 28 0 1628 17316 498968 21279600 148 1096 1983 607 2 3 51 44 0 1628 16356 499032 21354800 416 952 2061 1417 7 3 58 32 0 0628 15708 499060 21413200 256 1912 1993 1409 4 4 53 38 1 0628 14804 499068 21473600 352 2644 2136 1475 7 4 72 16 0 1628 14548 499076 21513600 196 1676 2046 526 4 2 49 45 0 1628 13972 499104 21585600 384 816 2062 1033 9 4 51 37 0 1628 12916 499172 21680800 504 1056 2135 1311 14 5 51 30 0 1628 12020 499236 21756000 448 1044 2111 1280 17 5 51 27 0 0628 11380 499268 21807200 256 2048 2039 838 10 4 62 24 1 0628 11060 499288 21839200 156 2436 2043 832 7 4 83 5 0 1628 10612 499328 21869200 124 2180 1899 442 5 2 50 44 1 0628 10292 499336 2100 104 
368 1883 599 2 2 50 47 0 1628 8292 499384 22054000 788 1536 2283 1524 18 8 49 27 0 0628 7652 499388 22108000 276 2044 2039 796 5 4 69 22 0 1628 6948 499392 22168800 288 2352 2086 783 6 4 52 38 1 0628 6308 499396 2800 256 356 2008 797 7 3 50 41 1 0628 5024 499404 22310400 476 1012 2092 983 13 5 49 32 0 1628 9848 498300 22393600 420 1096 2075 1243 8 4 53 34 0 1628 9344 498312 22440000 236 3744 2097 1181 5 4 73 19 To be honest I can't say when this started, I've installed gentoo and seen emerge --sync load only with 2.6.10-mm2 system: P4 IV 2.8/1Gb ram/i875p MB (abit IC7-g) ide: hda: MAXTOR 6L060J3 hdc: TEAC DV-W58G scsi/Sata: PLEXTOR CD-ROM PX-40TS 1.01 YAMAHA CRW6416S1.0c ATA Maxtor 6Y160M0 YAR5 lspci -v: kefk ide # lspci -v :00:00.0 Host bridge: Intel Corp. 82875P/E7210 Memory Controller Hub (rev 02) Subsystem: ABIT Computer Corp.: Unknown device 1014 Flags: bus master, fast devsel, latency 0 Memory at d000 (32-bit, prefetchable) Capabilities: [e4] #09 [2106] Capabilities: [a0] AGP version 3.0 :00:01.0 PCI bridge: Intel Corp. 82875P Processor to AGP Controller (rev 02) (prog-if 00 [Normal decode]) Flags: bus master, 66Mhz, fast devsel, latency 64 Bus: primary=00, secondary=01, subordinate=01, sec-latency=32 Memory behind bridge: f000-f1ff Prefetchable memory behind bridge: e800-efff :00:03.0 PCI bridge: Intel Corp. 82875P/E7210 Processor to PCI to CSA Bridge (rev 02) (prog-if 00 [Normal decode]) Flags: bus master, 66Mhz, fast devsel, latency 32 Bus: primary=00, secondary=02, subordinate=02, sec-latency=0 I/O behind bridge: 9000-9fff Memory behind bridge: f200-f20f Expansion ROM at 9000 [disabled] [size=4K] :00:1d.0 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02) (prog-if 00 [UHCI]) Subsystem: ABIT Computer Corp.: Unknown device 1014 Flags: bus master, medium devsel, latency 0, IRQ 193 I/O ports at bc00 [size=32] :00:1d.1 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller
Re: 2.6.11-rc1-mm1
Karim Yaghmour writes: > > Tom Zanussi wrote: > > I have to disagree. Awhile back, if you remember, I posted a patch to > > the LTT daemon that would monitor the trace stream in real time, and > > process it using an embedded Perl interpreter, no less: > > > > http://marc.theaimsgroup.com/?l=linux-kernel&m=109405724500237&w=2 > > > > It didn't seem to have any problems keeping up with the trace stream > > even though it was monitoring all LTT event types (and a couple of > > others - custom events injected using kprobes) and not doing any > > filtering in the kernel, through kernel compiles, normal X traffic, > > etc. I don't know what volume of event traffic would cause this model > > to break down, but I think it shows that at least some level of > > non-trivial live processing is possible... > > Good Point. > > My bad. Thanks for bringing this up. Obviously this didn't get as > much attention as it should've had the last time it was posted, > especially as it allows very easy scripting of filtering in userspace. > That email you refer to is pretty loaded and I'm sure those who > are interested will dig through it. But in the interest of helping > everyone get a rapid understanding of what it does and how it does it, > can you break it down in to a short description, possibly with a > diagram? I'm sure many will find this very interesting. It's so simple it doesn't really deserve a diagram, which I'm pretty bad at anyway... Basically all it does is loop around the received buffer, reading each event and sending it off to a handler. In this case the handler massages the data into a form that allows it to be passed to the Perl interpreter as arguments to a Perl function that in turn acts as callback handler in the Perl interpreter. At that point, the Perl callback can do whatever it wants with the data - save events matching a certain pid and discard everything else, keep running counts or time totals e.g. 
total syscall counts for each pid, function call tracing (if you dynamically instrumented function call entry/exit with kprobes for example), etc, etc, etc. Probably even more useful is the ability to monitor the event stream looking for sporadically occurring events, again under the control of the Perl interpreter, so your criteria for deciding what an 'important event' is can be arbitrarily complex and incorporate past history. It also means that you don't have to save anything at all to disk until you detect your specified condition (which makes tracing for days or weeks on end more practical), at which point you can dump out the currently mapped buffer containing the last bufsize number of events most likely to be of interest anyway. Perl makes this kind of quick and dirty processing extremely easy and it has a lot of powerful language features such as nested hashes built in, which is why I chose it, but you could of course avoid the extra layer and the interpreter and do your filtering in straight C, or create a binding for any language you want. IMHO being able to do most of the filtering in user space like this opens up a lot of avenues for not only one-off problem determination hacks, but a proliferation of more substantial tools, considering how easy it is to put together applications using for instance the copious number of Perl modules available. Tom > > Thanks, > > Karim > -- > Author, Speaker, Developer, Consultant > Pushing Embedded and Real-Time Linux Systems Beyond the Limits > http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546 -- Regards, Tom Zanussi <[EMAIL PROTECTED]> IBM Linux Technology Center/RAS
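The loop Tom describes (walk the received buffer, hand each event to a handler, let the handler keep running counts) can be sketched in straight C, which he mentions as an alternative to the Perl layer. This is only an illustration: the record layout (`struct event_hdr`), the event type values, and the names `dispatch_events`, `count_syscalls` and `put_event` are assumptions made for the example, not the LTT trace format or the daemon's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header; the real LTT event format is different. */
struct event_hdr {
    uint16_t type;   /* event type id */
    uint16_t size;   /* total record size in bytes, header included */
    uint32_t pid;    /* process the event belongs to */
};

enum { EV_SYSCALL = 1, EV_OTHER = 2, MAX_PID = 64 };

typedef void (*event_handler)(const struct event_hdr *ev, void *ctx);

/* Walk a buffer of variable-sized records and pass each header to the
 * handler. Returns the number of events dispatched. */
static size_t dispatch_events(const unsigned char *buf, size_t len,
                              event_handler handler, void *ctx)
{
    size_t off = 0, n = 0;

    while (off + sizeof(struct event_hdr) <= len) {
        struct event_hdr ev;

        memcpy(&ev, buf + off, sizeof(ev));   /* copy out: avoids unaligned reads */
        if (ev.size < sizeof(ev) || ev.size > len - off)
            break;                            /* truncated or corrupt record */
        handler(&ev, ctx);
        off += ev.size;                       /* skip header and payload */
        n++;
    }
    return n;
}

/* Example handler: running syscall count per pid, everything else dropped. */
static void count_syscalls(const struct event_hdr *ev, void *ctx)
{
    unsigned *counts = ctx;

    if (ev->type == EV_SYSCALL && ev->pid < MAX_PID)
        counts[ev->pid]++;
}

/* Test helper: append one record carrying `payload` zero bytes after the
 * header; returns the new end offset. */
static size_t put_event(unsigned char *buf, size_t off,
                        uint16_t type, uint32_t pid, uint16_t payload)
{
    struct event_hdr ev = { type, (uint16_t)(sizeof(ev) + payload), pid };

    memcpy(buf + off, &ev, sizeof(ev));
    memset(buf + off + sizeof(ev), 0, payload);
    return off + ev.size;
}
```

A handler that watches for a rare condition instead of counting would have the same shape: it keeps its history in `ctx` and only triggers a dump of the mapped buffer when its criteria match.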
Re: 2.6.11-rc1-mm1
Tom Zanussi wrote: > I have to disagree. Awhile back, if you remember, I posted a patch to > the LTT daemon that would monitor the trace stream in real time, and > process it using an embedded Perl interpreter, no less: > > http://marc.theaimsgroup.com/?l=linux-kernel&m=109405724500237&w=2 > > It didn't seem to have any problems keeping up with the trace stream > even though it was monitoring all LTT event types (and a couple of > others - custom events injected using kprobes) and not doing any > filtering in the kernel, through kernel compiles, normal X traffic, > etc. I don't know what volume of event traffic would cause this model > to break down, but I think it shows that at least some level of > non-trivial live processing is possible... Good Point. My bad. Thanks for bringing this up. Obviously this didn't get as much attention as it should've had the last time it was posted, especially as it allows very easy scripting of filtering in userspace. That email you refer to is pretty loaded and I'm sure those who are interested will dig through it. But in the interest of helping everyone get a rapid understanding of what it does and how it does it, can you break it down into a short description, possibly with a diagram? I'm sure many will find this very interesting. Thanks, Karim -- Author, Speaker, Developer, Consultant Pushing Embedded and Real-Time Linux Systems Beyond the Limits http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
Thomas, Thomas Gleixner wrote: > Yes, I did already start cleaning > > cat ../broken-out/ltt* | patch -p1 -R :D If it gives you a warm and fuzzy feeling to have the last cheap-shot, then I'm all for it, it is of no consequence anyway. And _please_ don't forget to answer this very email with something of the same substance. For my part I consider that I've invested a substantial amount of time in responding to both your conceptual and practical feedback, as the archives clearly show. That being said, I have to thank you for making sure that all the obvious questions have been asked. I now have more than a dozen archive links of my answers to those. They'll sure come in handy when writing an FAQ. Thanks again, Karim -- Author, Speaker, Developer, Consultant Pushing Embedded and Real-Time Linux Systems Beyond the Limits http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
Hi,

On Mon, 17 Jan 2005, Karim Yaghmour wrote:

> With that said, I hope we've agreed that we'll have a callback for
> letting relayfs clients know that they need to write the begining of
> the buffer event. There won't be any associated reserve. Conversly,
> I hope it is not too much to ask to have an end-of-buffer callback.

There of course has to be some kind of end marker, but that's less critical as it's not the active buffer anymore.

> Roman, of all people I've been more than happy to change my stuff following
> your recommendations. Do I have to list how far down relayfs has been
> stripped down?

Sorry, you misunderstood me. At the moment I'm only secondarily interested in the API details; primarily I want to work out the details of what exactly relayfs/ltt are supposed to do. One main question here I can't answer yet is why you insist on multiple relayfs modes. This is what I basically have in mind for the relay_write function:

	cpu = get_cpu();
	buffer = relay_get_buffer(chan, cpu);
	while(1) {
		offset = local_add_return(buffer->offset, length);
		if (likely(offset + length <= buffer->size))
			break;
		buffer = relay_switch_buffer(chan, buffer, offset);
	}
	memcpy(buffer->data + offset, data, length);
	put_cpu();

ltt_log_event should only be a few lines more (for writing header and event data). What I'd like to know now are the reasons why you need more than this. It's not the amount of data, and any timing requirements have to be handled by the caller. During processing you either take the events in the order they were recorded (often that's good enough) or you sort them, which is not that difficult.

> You ask what compromises can be found from both sides to obtain a
> single implementation. I have looked at this, and given how
> stripped down it has become, anything less from relayfs will make
> it useless for LTT. IOW, I would have to reimplement a buffering
> scheme within LTT outside of relayfs.
I know you don't want to touch the topic of kernel debugging, but its requirements greatly overlap with what you want to do with ltt, e.g. one very often needs information about scheduling events, as many kernel processes rely more and more on kernel threads. The only real requirement for kernel debugging is low runtime overhead, which you certainly would like to have as well. So what exactly are these requirements, and why can there be no reasonable alternative? bye, Roman
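For readers who want to see the reserve / switch / copy flow of Roman's relay_write proposal run outside the kernel, here is a minimal userspace sketch. It is an analogue under stated assumptions, not relayfs code: it models a single writer per channel (the counterpart of a per-CPU buffer with preemption disabled and no nested writers), so a plain offset stands in for local_add_return(), and `BUF_SIZE`, `NBUFS`, `switch_buffer` and `relay_write_sketch` are names invented for the example. Overwriting of unread sub-buffers when the ring wraps is deliberately not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64   /* bytes per sub-buffer (illustrative value) */
#define NBUFS    4    /* sub-buffers per channel (illustrative value) */

struct relay_buf {
    size_t offset;                 /* next free byte in this sub-buffer */
    unsigned char data[BUF_SIZE];
};

struct relay_chan {
    struct relay_buf bufs[NBUFS];
    unsigned cur;                  /* index of the active sub-buffer */
    unsigned switches;             /* number of buffer switches so far */
};

/* Move to the next sub-buffer in the ring and start it empty.
 * A real implementation would notify the reader here. */
static struct relay_buf *switch_buffer(struct relay_chan *ch)
{
    ch->cur = (ch->cur + 1) % NBUFS;
    ch->switches++;
    ch->bufs[ch->cur].offset = 0;
    return &ch->bufs[ch->cur];
}

/* Reserve `len` bytes in the active sub-buffer, switching buffers if the
 * record does not fit, then copy the record in. Returns a pointer to the
 * stored record, or NULL if `len` can never fit in one sub-buffer. */
static void *relay_write_sketch(struct relay_chan *ch, const void *src, size_t len)
{
    struct relay_buf *b = &ch->bufs[ch->cur];
    size_t off;

    if (len > BUF_SIZE)
        return NULL;

    for (;;) {
        off = b->offset;
        if (off + len <= BUF_SIZE)
            break;                 /* the record fits: slot is [off, off+len) */
        b = switch_buffer(ch);     /* no room left: retry in a fresh buffer */
    }
    b->offset = off + len;         /* commit the reservation */
    memcpy(b->data + off, src, len);
    return b->data + off;
}
```

The fast path is one comparison and one copy; everything else happens only on a buffer switch, which is the property the thread is arguing about.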
Re: [Lkst-develop] Re: 2.6.11-rc1-mm1
Hi, Andi Kleen wrote: > On Tue, Jan 18, 2005 at 08:19:18PM +0900, Masami Hiramatsu wrote: > > Hello, > > > > I'm a developer of yet another kernel tracer, LKST. I and co-developers > > are very glad to hear that LTT was merged into -mm tree and to talk > > about the kernel tracer on this ML. Because we think that the kernel > > event tracer is useful to debug Linux systems, and to improve the kernel > > reliability. > > I haven't looked at your code, but I would suggest you also post for > review it so that it can be evaluated in the same way as other more > noisy proposals. Perhaps Andrew can test both for some time in MM like > he used to do for the various schedulers. Thanks for your advice. The latest release package of LKST based on linux-2.6.9 can be downloaded from http://sourceforge.net/projects/lkst/ I'll release the LKST based on the latest kernel as soon as possible. Regards, -- Masami HIRAMATSU Hitachi, Ltd., Systems Development Laboratory E-mail: [EMAIL PROTECTED]
Re: 2.6.11-rc1-mm1
On Tue, Jan 18, 2005 at 08:19:18PM +0900, Masami Hiramatsu wrote: > Hello, > > I'm a developer of yet another kernel tracer, LKST. I and co-developers > are very glad to hear that LTT was merged into -mm tree and to talk > about the kernel tracer on this ML. Because we think that the kernel > event tracer is useful to debug Linux systems, and to improve the kernel > reliability. I haven't looked at your code, but I would suggest you also post it for review so that it can be evaluated in the same way as other more noisy proposals. Perhaps Andrew can test both for some time in MM like he used to do for the various schedulers. -Andi
Re: 2.6.11-rc1-mm1
Hello, I’m a developer of yet another kernel tracer, LKST. I and co-developers are very glad to hear that LTT was merged into -mm tree and to talk about the kernel tracer on this ML. Because we think that the kernel event tracer is useful to debug Linux systems, and to improve the kernel reliability. Andi Kleen wrote: Andrew Morton <[EMAIL PROTECTED]> writes: - Added the Linux Trace Toolkit (and hence relayfs). Mainly because I haven't yet taken as close a look at LTT as I should have. Probably neither have you. I think it would be better to have a standard set of kprobes instead of all the ugly LTT hooks. kprobes could then log to relayfs or another fast logging mechanism. I agree. I’m interested in kprobes. Currently, LKST can switch off and on each hook. But, even if a hook was disabled, there is a little overhead-time (one conditional-jump instruction should be executed). I think kprobes-based hooks can completely remove this overhead-time. Moreover, kprobes-based hooks can be inserted dynamically into the code-point specified by user. This feature is greatly useful for debugging. So, I have an idea to renew LKST to kprobes-based hooks. Also, I’m developing a prototype implementation. The problem relayfs has IMHO is that it is too complicated. It seems to either suffer from a overfull specification or second system effect. There are lots of different options to do everything, instead of a nice simple fast path that does one thing efficiently. IMHO before merging it should go through a diet and only keep the paths that are actually needed and dropping a lot of the current baggage. Preferably that would be only the fastest options (extremly simple per CPU buffer with inlined fast path that drop data on buffer overflow), with leaving out anything more complicated. My ideal is something like the old SGI ktrace which was an extremly simple mechanism to do lockless per CPU logging of binary data efficiently and reading that from a user daemon. 
LKST’s logging buffer is (much) simpler than relayfs. It is just the linked-perCPU-buffer. If you are interested in this, please try LKST. -- Masami HIRAMATSU Hitachi, Ltd., Systems Development Laboratory E-mail: [EMAIL PROTECTED]
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
On Mon, 2005-01-17 at 18:57 -0500, Karim Yaghmour wrote: > Thomas Gleixner wrote: > > If we add another hardwired implementation then we do not have said > > benefits. > > Please stop handwaving. Folks like Andrew, Christoph, Zwane, Roman, > and others actually made specific requests for changes in the code. > What makes you think you're so special that you think you are > entitled to stay on the side and handwave about concepts. So the points you added to your todo list which were brought up by me are worthless? I'm not handwaving. I started this RFC to move the discussion into a general discussion about instrumentation. A couple of people are seriously interested in doing this. If you are not interested then ignore the thread, but you're way not in a position to tell me to shut up. You turned this thread into your LTT prayer wheel. Roman pointed out your unwillingness to create a common framework before. But I have to disagree with him on one point. It's not amazing, it's annoying. > If there is a limitation with the code, please present actual > snippets that need to be changed and suggest alternatives. That's > what everyone else does on this list. I pointed you to actually broken code and you accused me of throwing mud. > Save the bandwidth Please remove me from cc, it's a good start to save bandwidth. > and start cleaning. Yes, I did already start cleaning cat ../broken-out/ltt* | patch -p1 -R tglx
Re: 2.6.11-rc1-mm1
Karim Yaghmour writes: > > Aaron Cohen wrote: > > I've got a quick question and I just want to be clear that it > > doesn't have a political agenda behind it. > > :) > > > Here goes, why can't LTT and/or relayfs, work similar to the way > > syslog does and just fill a buffer (aka ring-buffer or whatever is > > appropriate), while a userspace daemon of some kind periodically reads > > that buffer and massages it. I'm probably being naive but if the > > difficulty is with huge several hundred-gig files, the daemon if it > > monitors the buffer often enough could stuff it into a database or > > whatever high-performance format you need. > > Because of the bandwidth it is not possible to do any sort of live > processing of any kind. The only thing the daemon can possibly do > is write large blocks of tracing info to disk as rapidly as possible. I have to disagree. Awhile back, if you remember, I posted a patch to the LTT daemon that would monitor the trace stream in real time, and process it using an embedded Perl interpreter, no less: http://marc.theaimsgroup.com/?l=linux-kernel&m=109405724500237&w=2 It didn't seem to have any problems keeping up with the trace stream even though it was monitoring all LTT event types (and a couple of others - custom events injected using kprobes) and not doing any filtering in the kernel, through kernel compiles, normal X traffic, etc. I don't know what volume of event traffic would cause this model to break down, but I think it shows that at least some level of non-trivial live processing is possible... Tom > > > It also seems to me that Linus' nascent "splice and tee" work would > > be really useful for something like this to avoid a lot of unnecessary > > copying by the userspace daemon. > > There is no copying by the userspace daemon. All it does is open(), > then mmap(), and then it sleeps until it is woken up by the ltt > kernel subsystem. 
When that happens, it only does a write() on the > mmaped area, tells the ltt subsystem that it commited X number of > sub-buffers and goes back asleep. This is all zero-copy. > > Karim > -- > Author, Speaker, Developer, Consultant > Pushing Embedded and Real-Time Linux Systems Beyond the Limits > http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546 -- Regards, Tom Zanussi <[EMAIL PROTECTED]> IBM Linux Technology Center/RAS
Re: 2.6.11-rc1-mm1
Aaron Cohen wrote: > I've got a quick question and I just want to be clear that it > doesn't have a political agenda behind it. :) > Here goes, why can't LTT and/or relayfs, work similar to the way > syslog does and just fill a buffer (aka ring-buffer or whatever is > appropriate), while a userspace daemon of some kind periodically reads > that buffer and massages it. I'm probably being naive but if the > difficulty is with huge several hundred-gig files, the daemon if it > monitors the buffer often enough could stuff it into a database or > whatever high-performance format you need. Because of the bandwidth it is not possible to do any sort of live processing of any kind. The only thing the daemon can possibly do is write large blocks of tracing info to disk as rapidly as possible. > It also seems to me that Linus' nascent "splice and tee" work would > be really useful for something like this to avoid a lot of unnecessary > copying by the userspace daemon. There is no copying by the userspace daemon. All it does is open(), then mmap(), and then it sleeps until it is woken up by the ltt kernel subsystem. When that happens, it only does a write() on the mmaped area, tells the ltt subsystem that it committed X number of sub-buffers and goes back to sleep. This is all zero-copy. Karim -- Author, Speaker, Developer, Consultant Pushing Embedded and Real-Time Linux Systems Beyond the Limits http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
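The write() step Karim describes, where the daemon hands the mapped sub-buffers directly to write() without an intermediate user-space copy, can be sketched roughly as follows. This covers only that one step, under assumptions: `flush_subbufs` and its parameters are made up for illustration, the mapped area is passed in as a plain pointer, and the open()/mmap() setup, the wake-up, and the "committed X sub-buffers" notification to ltt are left out because the mail does not spell out their interface.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `count` sub-buffers of `subbuf_size` bytes each, starting at
 * sub-buffer index `first` and wrapping around a ring of `nbufs`, directly
 * from the mapped trace area `map` to `out_fd`.
 * Returns how many sub-buffers were written completely. */
static unsigned flush_subbufs(int out_fd, const unsigned char *map,
                              size_t subbuf_size, unsigned nbufs,
                              unsigned first, unsigned count)
{
    unsigned done;

    for (done = 0; done < count; done++) {
        const unsigned char *p =
            map + (size_t)((first + done) % nbufs) * subbuf_size;
        size_t left = subbuf_size;

        while (left > 0) {
            ssize_t n = write(out_fd, p, left);  /* source is the mapping itself */

            if (n <= 0)
                return done;   /* sketch: no EINTR retry or error reporting */
            p += n;
            left -= (size_t)n;
        }
    }
    return done;
}
```

The value returned is what such a daemon would report back as the number of consumed sub-buffers before going back to sleep.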
Re: 2.6.11-rc1-mm1
Hi, I'm very much a newbie to all of this, but I'm finding this discussion fairly interesting. I've got a quick question and I just want to be clear that it doesn't have a political agenda behind it. Here goes, why can't LTT and/or relayfs, work similar to the way syslog does and just fill a buffer (aka ring-buffer or whatever is appropriate), while a userspace daemon of some kind periodically reads that buffer and massages it. I'm probably being naive but if the difficulty is with huge several hundred-gig files, the daemon if it monitors the buffer often enough could stuff it into a database or whatever high-performance format you need. It also seems to me that Linus' nascent "splice and tee" work would be really useful for something like this to avoid a lot of unnecessary copying by the userspace daemon. On Mon, 17 Jan 2005 23:03:46 -0500, Karim Yaghmour <[EMAIL PROTECTED]> wrote: > > Hello Roman, > > Roman Zippel wrote: > > Why is so important that it's at the start of the buffer? What's wrong > > with a special event _near_ the start of a buffer? > [snip] > > What gives you the idea, that you can't do this with what I proposed? > > You can still seek freely within the data at buffer boundaries and you > > only have to search a little into the buffer to find the delimiter. Events > > are not completely at random, so that the little reordering can be done at > > runtime. Sorry, but I don't get what kind of unsolvable problems you see > > here. > > Actually I just checked the code and this is a non-issue. The callback > can only be called when the condition is met, which itself happens only > on buffer switch, which itself only happens when we try to reserve > something bigger than what is left in the buffer. IOW, there is no need > for reserving anything. Here's what the code does: > if (!finalizing) { > bytes_written = rchan->callbacks->buffer_start ... 
> cur_write_pos(rchan) += bytes_written; > } > > With that said, I hope we've agreed that we'll have a callback for > letting relayfs clients know that they need to write the begining of > the buffer event. There won't be any associated reserve. Conversly, > I hope it is not too much to ask to have an end-of-buffer callback. > > > Wrong question. What compromises can be made on both sides to create a > > common simple framework? Your unwillingness to compromise a little on the > > ltt requirements really amazes me. > > Roman, of all people I've been more than happy to change my stuff following > your recommendations. Do I have to list how far down relayfs has been > stripped down? I mean, we got rid of the lockless scheme (which was > one of ltt's explicit requirements), we got rid of the read/write capabilities > for user-space, etc. And we are now only left with the bare-bones API: > rchan* relay_open(channel_path, bufsize, nbufs, flags, *callbacks); > intrelay_close(*rchan); > intrelay_reset(*rchan); > intrelay_write(*rchan, *data_ptr, count, **wrote-pos); > > char* relay_reserve(*rchan, len, *ts, *td, *err, *interrupting); > void relay_commit(*rchan, *from, len, reserve_code, interrupting); > void relay_buffers_consumed(*rchan, u32); > > #define relay_write_direct(DEST, SRC, SIZE) \ > #define relay_lock_channel(RCHAN, FLAGS) \ > #define relay_unlock_channel(RCHAN, FLAGS) \ > > This is a far-cry from what we had before, have a look at the > relayfs.txt file in 2.6.11-rc1-mm1's Documentation/filesystems if > you want to compare. Please at least acknowledge as much. > > I'm more than willing to compromise, but at least give me something > substantive to feed on. I've explained why I believe there needs to be > two modes for relayfs. If you don't think they are appropriate, then > please explain why. Either my experience blinds me or it rightly > compels me to continue defending it. 
> > You ask what compromises can be found from both sides to obtain a > single implementation. I have looked at this, and given how > stripped down it has become, anything less from relayfs will make > it useless for LTT. IOW, I would have to reimplement a buffering > scheme within LTT outside of relayfs. > > Can't you see that not all buffering schemes are adapted to all > applications and that it's preferable to have a single API > transparently providing separate mechanisms instead of a single > mechanism that doesn't satisfy any of its users? > > If I can't convince you of the concept, can I at least convince > you to withhold your final judgement until you actually see the > code f
Re: 2.6.11-rc1-mm1
Hello Roman, Roman Zippel wrote: > Why is so important that it's at the start of the buffer? What's wrong > with a special event _near_ the start of a buffer? [snip] > What gives you the idea, that you can't do this with what I proposed? > You can still seek freely within the data at buffer boundaries and you > only have to search a little into the buffer to find the delimiter. Events > are not completely at random, so that the little reordering can be done at > runtime. Sorry, but I don't get what kind of unsolvable problems you see > here. Actually I just checked the code and this is a non-issue. The callback can only be called when the condition is met, which itself happens only on buffer switch, which itself only happens when we try to reserve something bigger than what is left in the buffer. IOW, there is no need for reserving anything. Here's what the code does: if (!finalizing) { bytes_written = rchan->callbacks->buffer_start ... cur_write_pos(rchan) += bytes_written; } With that said, I hope we've agreed that we'll have a callback for letting relayfs clients know that they need to write the begining of the buffer event. There won't be any associated reserve. Conversly, I hope it is not too much to ask to have an end-of-buffer callback. > Wrong question. What compromises can be made on both sides to create a > common simple framework? Your unwillingness to compromise a little on the > ltt requirements really amazes me. Roman, of all people I've been more than happy to change my stuff following your recommendations. Do I have to list how far down relayfs has been stripped down? I mean, we got rid of the lockless scheme (which was one of ltt's explicit requirements), we got rid of the read/write capabilities for user-space, etc. 
And we are now only left with the bare-bones API:

rchan* relay_open(channel_path, bufsize, nbufs, flags, *callbacks);
int    relay_close(*rchan);
int    relay_reset(*rchan);
int    relay_write(*rchan, *data_ptr, count, **wrote-pos);

char*  relay_reserve(*rchan, len, *ts, *td, *err, *interrupting);
void   relay_commit(*rchan, *from, len, reserve_code, interrupting);
void   relay_buffers_consumed(*rchan, u32);

#define relay_write_direct(DEST, SRC, SIZE) \
#define relay_lock_channel(RCHAN, FLAGS) \
#define relay_unlock_channel(RCHAN, FLAGS) \

This is a far-cry from what we had before, have a look at the relayfs.txt file in 2.6.11-rc1-mm1's Documentation/filesystems if you want to compare. Please at least acknowledge as much.

I'm more than willing to compromise, but at least give me something substantive to feed on. I've explained why I believe there needs to be two modes for relayfs. If you don't think they are appropriate, then please explain why. Either my experience blinds me or it rightly compels me to continue defending it.

You ask what compromises can be found from both sides to obtain a single implementation. I have looked at this, and given how stripped down it has become, anything less from relayfs will make it useless for LTT. IOW, I would have to reimplement a buffering scheme within LTT outside of relayfs.

Can't you see that not all buffering schemes are adapted to all applications and that it's preferable to have a single API transparently providing separate mechanisms instead of a single mechanism that doesn't satisfy any of its users?

If I can't convince you of the concept, can I at least convince you to withhold your final judgement until you actually see the code for the managed vs. ad-hoc schemes?
Karim -- Author, Speaker, Developer, Consultant Pushing Embedded and Real-Time Linux Systems Beyond the Limits http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
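The buffer-start behaviour described in this message (on a buffer switch, unless the channel is being finalized, the client's callback writes its begin-of-buffer event and the write position advances by the bytes it reports) can be illustrated with a small stand-alone model. It is a sketch only: `struct chan`, `switch_buffers` and `write_start_event` are hypothetical names, the two fixed 32-byte sub-buffers are an arbitrary choice, and none of this is the relayfs implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct chan;

struct chan_callbacks {
    /* Called at the top of each fresh sub-buffer; returns the bytes it wrote. */
    size_t (*buffer_start)(struct chan *ch, unsigned char *start, uint32_t buf_id);
};

struct chan {
    unsigned char buf[2][32];        /* two small sub-buffers */
    unsigned cur;                    /* index of the active sub-buffer */
    uint32_t buf_id;                 /* running count of sub-buffers started */
    size_t write_pos;                /* offset into the active sub-buffer */
    const struct chan_callbacks *cb;
};

/* Switch to the other sub-buffer. Unless we are finalizing, let the client
 * write its start-of-buffer event and advance past whatever it wrote. No
 * separate reserve is needed: the callback runs before any other writer can
 * touch the new sub-buffer. */
static void switch_buffers(struct chan *ch, int finalizing)
{
    ch->cur ^= 1;
    ch->buf_id++;
    ch->write_pos = 0;
    if (!finalizing && ch->cb && ch->cb->buffer_start)
        ch->write_pos += ch->cb->buffer_start(ch, ch->buf[ch->cur], ch->buf_id);
}

/* Example client callback: stamp the sub-buffer with its sequence number. */
static size_t write_start_event(struct chan *ch, unsigned char *start, uint32_t buf_id)
{
    (void)ch;
    memcpy(start, &buf_id, sizeof(buf_id));
    return sizeof(buf_id);
}
```

An end-of-buffer callback, as requested in the message, would be the mirror image: invoked on the sub-buffer being left, before `cur` changes.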
Re: 2.6.11-rc1-mm1
Thomas Gleixner wrote: > Provide a hook, export it and load your filters as a module, but keep > the filters out of the mainline kernel code. Great idea! I will do exactly that. Thanks, Karim -- Author, Speaker, Developer, Consultant Pushing Embedded and Real-Time Linux Systems Beyond the Limits http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
Hello Roman, Roman Zippel wrote: > An additional comment about the order of events. What you're doing in > lockless_reserve is bogus anyway. There is no single correct time to > write into the event. By artificially synchronizing event order and event > time you only cheat yourself. You either take it into account during > postprocessing that events can be interrupted or the time stamp doesn't > seem to be that important, but there is nothing you can do during the > recording of the event except of completely disabling interrupts. Correct and like I said before, we are dropping the lockless scheme. Ergo, disabling interrupts we will. Karim -- Author, Speaker, Developer, Consultant Pushing Embedded and Real-Time Linux Systems Beyond the Limits http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
Hi, On Mon, 17 Jan 2005, Karim Yaghmour wrote: > a) create indexes, b) reorder events, and likely c) have to rewrite An additional comment about the order of events. What you're doing in lockless_reserve is bogus anyway. There is no single correct time to write into the event. By artificially synchronizing event order and event time you only cheat yourself. Either you take into account during postprocessing that events can be interrupted, or the time stamp isn't that important; there is nothing you can do during the recording of the event except completely disabling interrupts. bye, Roman
Re: 2.6.11-rc1-mm1
J.A. Magallon wrote: > This does not patch against -mm1. -mm1 looks like a mix of plain 2.6.10 > and your code. Could you revamp it against -mm1, please ? > I looked at it but seems out of my understanding... My patch replaces the one in -mm1. Just revert the waiting-10s-... patch that is in 2.6.11-rc1-mm1 using patch -p1 -R Then apply the one I attached to the last mail normally. I'll also be sending in a cleaner version of the patch shortly. Daniel
Re: 2.6.11-rc1-mm1
On Mon, 2005-01-17 at 18:41 -0500, Karim Yaghmour wrote:
> Thomas Gleixner wrote:
> > I know, what I have said. I said reduce the filtering to the absolute
> > minimum and do the rest in userspace.
>
> You keep adopting the interpretation which best suits you, taking
> quotes out of context, and keep repeating things that have already
> been answered. There are limits to one's patience.

I said before:

"Sorting out disabled events is the filtering you have to do in kernel and you should do it in the hot path or remove the unnecessary tracepoints at compile time."

This is exactly what I stated above. I omitted the addition of "do the rest in userspace", as it was obvious enough.

> What you did is change your position twice. It's there for anyone to see.

Sorry, I didn't know that you are representing anyone.

> > The now builtin filters are defined to fit somebody's needs or idea of
> > what the user should / wants to see. They will not fit everybody's
> > needs / ideas. So we start modifying, adding and #ifdefing kernel
> > filters, which is a scary vision.
>
> Ah, finally. Here's an actual suggestion. _IF_ you want, I'll just
> export a ltt_set_filter(*callback) and rewrite the if in
> _ltt_log_event() to:
>     if ((ltt_filter != NULL) && !( [...]
>         return -EINVAL;
>
> You're always welcome to do the following from anywhere in your code:
>     ltt_set_filter(NULL);

Provide a hook, export it and load your filters as a module, but keep the filters out of the mainline kernel code.

> > Enabling and disabling events is a valid basic filter request, which
> > should live in the kernel. Anything else should go into userspace, IMO.
>
> What you are suggesting is that a system administrator that wants to
> monitor his sendmail server over a period of three weeks should
> just postprocess 1.8TB (1MB/s) of data because Thomas Gleixner didn't
> like the idea of kernel event filtering based on anything but events.

A real common scenario with a broad range of users. And everybody has to like the idea of hardwired filters in the kernel code to make the life of this sysadmin easier.

See above about hooks. Maybe some simple pipe would be helpful too:

    read_stream | prefilter | buildbuffers | storeit

tglx
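For readers following the hook discussion, here is a rough user-space sketch of what "provide a hook and load your filters as a module" could look like. It is not LTT code. Only the name ltt_set_filter comes from the thread; ltt_log_event is modelled on the _ltt_log_event() mentioned there, and the callback signature, the sample_event struct, pid_filter and the events_logged counter are assumptions invented for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: return nonzero to keep the event. */
typedef int (*ltt_filter_fn)(int event_id, const void *event_data);

static ltt_filter_fn ltt_filter;     /* NULL means "no filtering" */
static unsigned long events_logged;  /* stand-in for the real buffer write */

/* Install a filter, or pass NULL to remove it. */
void ltt_set_filter(ltt_filter_fn fn)
{
    ltt_filter = fn;
}

/* Logging path: a single pointer test when no filter is installed. */
int ltt_log_event(int event_id, const void *event_data)
{
    if (ltt_filter != NULL && !ltt_filter(event_id, event_data))
        return -EINVAL;
    events_logged++;
    return 0;
}

/* Example policy a module might supply: keep events from one pid only. */
struct sample_event {
    int pid;
};

static int wanted_pid = 42;

int pid_filter(int event_id, const void *event_data)
{
    const struct sample_event *ev = event_data;

    (void)event_id;          /* this policy ignores the event type */
    assert(ev != NULL);
    return ev->pid == wanted_pid;
}
```

With this shape the core code carries no policy at all: calling ltt_set_filter(pid_filter) turns the policy on, ltt_set_filter(NULL) turns it off again, and everything specific to pids lives in whatever supplies the callback.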
Re: 2.6.11-rc1-mm1
Hi,

On Mon, 17 Jan 2005, Karim Yaghmour wrote:

> > Periodically can also mean a buffer start call back from relayfs
> > (although that would mean the first entry is not guaranteed) or a
> > (per cpu) eventcnt from the subsystem. The amount of needed search would
> > be limited. The main point is from the relayfs POV the buffer structure
> > has always the same (simple) structure.
>
> But two e-mails ago, you told us to drop the start_reserve and end_reserve
> and move the details of the buffer management into relayfs and out of
> ltt? Either we have a callback, like you suggest, and then we need to
> reserve some space to make sure that the callback is guaranteed to have
> the first entry, or we drop the callback and provide an option to the
> user for relayfs to write this first entry for him. Providing a callback
> without reservation is no different than relying purely on the heartbeat,
> which, like I said before and for the reasons illustrated below, is
> unrealistic.

Why is it so important that it's at the start of the buffer? What's wrong with a special event _near_ the start of a buffer?

> > Why is it "totally unrealistic"?
>
> Ok, let's expand a little here on the amount of data. Say you're getting
> 2MB/s of data (which is not unrealistic on a loaded system.) That means
> that if I'm tracing for 2 days, I've got 345GB of data (~7.2GB/hour).
> In practice, users aren't necessarily interested in plowing through the
> entire 345GB, they just want to view a given portion of it. Now, if I
> follow what you are suggesting, I have to go through the entire 345GB to:
> a) create indexes, b) reorder events, and likely c) have to rewrite
> another 345GB of data. And I haven't yet discussed the kind of problems
> you would encounter in trying to reorder such a beast that contains,
> by definition, variable-sized events. For one thing, if event N+1 doesn't
> follow N, then you would be forced to browse forward until you actually
> found it before you could write a properly ordered trace. And it just
> takes a few processes that are interrupted and forced to sleep here and
> there to make this unusable. That's without the RAM or fs space required
> to store those index tables ... At 3 to 12 bytes per event, that's a lot
> of space for indexes ...
>
> If I keep things as they are with ordered events and delimiters on buffer
> boundaries, I can skip to any place within this 345GB and start processing
> from there.

What gives you the idea that you can't do this with what I proposed? You can still seek freely within the data at buffer boundaries and you only have to search a little into the buffer to find the delimiter. Events are not completely at random, so that the little reordering can be done at runtime.
Sorry, but I don't get what kind of unsolvable problems you see here.

> Rhetorical: Couldn't the ad-hoc mode case be a special case of the
> managed mode?

Wrong question. What compromises can be made on both sides to create a common simple framework?
Your unwillingness to compromise a little on the ltt requirements really amazes me.

bye, Roman
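The "seek to a buffer boundary, then search a little for the delimiter" idea from the message above can be sketched roughly as follows. This is an illustration only, not relayfs or LTT code. The event header layout, the EV_TIMER id and both function names are made up, and the sketch assumes that events never straddle a sub-buffer boundary, so that every sub-buffer starts with an event header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EV_TIMER 1u  /* hypothetical id of the periodic timer event */

/* Hypothetical event header; size covers header plus payload. */
struct ev_hdr {
    uint16_t id;
    uint16_t size;
};

/*
 * Start at the beginning of sub-buffer `index` and walk the events until
 * the first timer event. Returns its byte offset within the trace, or -1
 * if that sub-buffer holds none.
 */
long find_first_timer(const unsigned char *trace, size_t trace_len,
                      size_t subbuf_size, size_t index)
{
    size_t pos = index * subbuf_size;
    size_t end = pos + subbuf_size;

    assert(subbuf_size >= sizeof(struct ev_hdr));
    if (end > trace_len)
        end = trace_len;

    while (pos + sizeof(struct ev_hdr) <= end) {
        struct ev_hdr h;

        memcpy(&h, trace + pos, sizeof h);  /* avoid unaligned reads */
        if (h.size < sizeof h)
            break;                          /* padding or corrupt data */
        if (h.id == EV_TIMER)
            return (long)pos;
        pos += h.size;
    }
    return -1;
}

/* Helper for building sample traces: append one event with a zeroed
 * payload and return the offset just past it. */
size_t put_event(unsigned char *buf, size_t pos, uint16_t id, uint16_t size)
{
    struct ev_hdr h = { id, size };

    memset(buf + pos, 0, size);
    memcpy(buf + pos, &h, sizeof h);
    return pos + size;
}
```

The cost of the search is bounded by the number of events logged between two timer events, not by the size of the trace, which is the point being argued. Whether that bound is small enough for traces of several hundred GB is exactly what the two sides disagree on.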
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
Thomas Gleixner wrote:
> If we add another hardwired implementation then we do not have said
> benefits.

Please stop handwaving. Folks like Andrew, Christoph, Zwane, Roman, and others actually made specific requests for changes in the code. What makes you think you're so special that you are entitled to stay on the side and handwave about concepts? If there is a limitation with the code, please present actual snippets that need to be changed and suggest alternatives. That's what everyone else does on this list.

If you want to clean-up the existing tracing code in the kernel, then here are some ltt calls you may be interested in:

    int ltt_create_event(char *event_type,
                         char *event_desc,
                         int format_type,
                         char *format_data);
    int ltt_log_raw_event(int event_id, int event_size, void *event_data);

And here's an actual example:

    ...
    delta_id = ltt_create_event("Delta",
                                NULL,
                                CUSTOM_EVENT_FORMAT_TYPE_HEX,
                                NULL);
    ...
    ltt_log_raw_event(delta_id, sizeof(a_delta_event), &a_delta_event);
    ...
    ltt_destroy_event(delta_id);

You can then use LibLTT to read the trace and extract your custom events and format your binary data as it suits you.

Save the bandwidth and start cleaning.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
On Mon, 2005-01-17 at 17:42 -0500, Robert Wisniewski wrote:
> I believe (and Karim can correct me if I'm wrong) the idea is to have
> groups of events that can be disabled and enabled via a one word mask. No
> checking multiple variables, no #ifdefing, something very streamlined. By
> userspace I assume you mean post-processing, i.e., if the user/library/etc
> needs to log events they use the same simple facility.

Yes, I was talking about postprocessing in userspace.

The logging of userspace events is a completely separate issue. You have to solve the timestamp problem and do the correlation to kernel events in the postprocessing.

> I think we agree to optimize/streamline performance for the gathering and
> do work in the post processing. There is an outstanding patch that makes
> strides in this direction.

Ack. Do you have any plans to separate the layers into different pieces, so they provide better reusability?

tglx
Re: 2.6.11-rc1-mm1
Thomas Gleixner wrote:
> I know, what I have said. I said reduce the filtering to the absolute
> minimum and do the rest in userspace.

You keep adopting the interpretation which best suits you, taking quotes out of context, and keep repeating things that have already been answered. There are limits to one's patience.

What you did is change your position twice. It's there for anyone to see.

> The now builtin filters are defined to fit somebodys needs or idea of
> what the user should / wants to see. They will not fit everybodys
> needs / ideas. So we start modifying, adding and #ifdefing kernel
> filters, which is a scary vision.

Ah, finally. Here's an actual suggestion. _IF_ you want, I'll just export a ltt_set_filter(*callback) and rewrite the if in _ltt_log_event() to:

    if ((ltt_filter != NULL) && !( [...]
        return -EINVAL;

You're always welcome to do the following from anywhere in your code:

    ltt_set_filter(NULL);

> Enabling and disabling events is a valid basic filter request, which
> should live in the kernel. Anything else should go into userspace, IMO.

What you are suggesting is that a system administrator that wants to monitor his sendmail server over a period of three weeks should just postprocess 1.8TB (1MB/s) of data because Thomas Gleixner didn't like the idea of kernel event filtering based on anything but events.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
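As a quick check of the data volumes quoted in this thread, the arithmetic can be written down as a small helper. The function name and the unit macros are only illustrative, and decimal units (1 MB = 1,000,000 bytes) are assumed because that is what makes the quoted figures line up.

```c
#include <assert.h>
#include <stdint.h>

#define MB   1000000ull
#define HOUR 3600ull
#define DAY  (24 * HOUR)
#define WEEK (7 * DAY)

/* Bytes produced at `bytes_per_sec` sustained over `seconds`. */
uint64_t trace_volume(uint64_t bytes_per_sec, uint64_t seconds)
{
    /* Refuse inputs whose product would overflow 64 bits. */
    assert(seconds == 0 || bytes_per_sec <= UINT64_MAX / seconds);
    return bytes_per_sec * seconds;
}
```

trace_volume(1 * MB, 3 * WEEK) comes to 1,814,400,000,000 bytes, i.e. the 1.8TB in the sendmail example. The 2MB/s over two days used elsewhere in the thread gives 345,600,000,000 bytes (about 345GB), or 7.2GB per hour.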
Re: 2.6.11-rc1-mm1
On 2005.01.16, Daniel Drake wrote:
> Hi,
>
> Joseph Fannin wrote:
> > On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
> >
> >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
> >
> >>waiting-10s-before-mounting-root-filesystem.patch
> >> retry mounting the root filesystem at boot time
> >
> > With this patch, initrds seem to get 'skipped'. I think this is
> > probably the cause for the reports of problems with RAID too.
>
> This patch should do the job. Replaces the existing
> waiting-10s-before-mounting-root-filesystem.patch in 2.6.11-rc1-mm1.
>
> Daniel
>
> Retry up to 20 times if mounting the root device fails. This fixes booting
> from usb-storage devices, which no longer make their partitions immediately
> available. Also cleans up the mount_block_root() function.
>
> Based on an earlier patch from William Park <[EMAIL PROTECTED]>
>
> Signed-off-by: Daniel Drake <[EMAIL PROTECTED]>
>

This does not patch against -mm1. -mm1 looks like a mix of plain 2.6.10 and your code. Could you revamp it against -mm1, please ? I looked at it but seems out of my understanding...

TIA

-- 
J.A. Magallon       \ Software is like sex:
werewolf!able!es    \ It's better when it's free
Mandrakelinux release 10.2 (Cooker) for i586
Linux 2.6.10-jam4 (gcc 3.4.3 (Mandrakelinux 10.2 3.4.3-3mdk)) #2
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
On Mon, 2005-01-17 at 15:34 -0500, Karim Yaghmour wrote:
> Thomas Gleixner wrote:
> > Thats the point. Adding another hardwired implementation does not give
> > us a possibility to solve the hardwired problem of the already available
> > stuff.
>
> Well then, like I said before, you know what you need to do:
> http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/

Oh, I guess my English must be really bad.

I was talking about separation of layers, so why do I need kernelhooks?

The separation of layers makes it possible to actually reuse functionality and gives the possibility that existing hardwired stuff can be cleaned up to use the new functionality too.

If we add another hardwired implementation then we do not have said benefits.

tglx
Re: 2.6.11-rc1-mm1
Thomas Gleixner writes:
> On Mon, 2005-01-17 at 15:32 -0500, Karim Yaghmour wrote:
> > You're either on crack or I don't know how to read english. Here's what
> > you said:
>
> Maybe you should read your own comment about ad-hominem attacks earlier
> in this thread and consider if it might apply to you.
>
> I know, what I have said. I said reduce the filtering to the absolute
> minimum and do the rest in userspace.
>
> The now builtin filters are defined to fit somebodys needs or idea of
> what the user should / wants to see. They will not fit everybodys
> needs / ideas. So we start modifying, adding and #ifdefing kernel
> filters, which is a scary vision.
>
> Enabling and disabling events is a valid basic filter request, which
> should live in the kernel. Anything else should go into userspace, IMO.
>
> tglx

I believe (and Karim can correct me if I'm wrong) the idea is to have groups of events that can be disabled and enabled via a one word mask. No checking multiple variables, no #ifdefing, something very streamlined. By userspace I assume you mean post-processing, i.e., if the user/library/etc needs to log events they use the same simple facility.

I think we agree to optimize/streamline performance for the gathering and do work in the post processing. There is an outstanding patch that makes strides in this direction.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[EMAIL PROTECTED]
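To make the "one word mask" idea concrete, here is a minimal sketch. The group names, the function names and the logged counter are hypothetical and only serve the illustration; this is not the patch being referred to.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event groups, one bit each. */
enum {
    GRP_SCHED   = 1u << 0,
    GRP_IRQ     = 1u << 1,
    GRP_SYSCALL = 1u << 2,
    GRP_FS      = 1u << 3
};

static uint32_t trace_mask;   /* the single word tested on the hot path */
static unsigned long logged;  /* stand-in for writing to the trace buffer */

void trace_enable(uint32_t groups)
{
    trace_mask |= groups;
}

void trace_disable(uint32_t groups)
{
    trace_mask &= ~groups;
}

/* Hot path: one load, one AND and one branch when the group is off.
 * Returns 1 if the event was logged, 0 if it was dropped. */
int trace_event(uint32_t group)
{
    assert(group != 0);
    if (!(trace_mask & group))
        return 0;
    logged++;
    return 1;
}
```

Anything finer grained than a group (pid, uid, and so on) is deliberately absent here; in this scheme that selection would happen in post-processing.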
Re: 2.6.11-rc1-mm1
On Mon, 2005-01-17 at 15:32 -0500, Karim Yaghmour wrote:
> You're either on crack or I don't know how to read english. Here's what
> you said:

Maybe you should read your own comment about ad-hominem attacks earlier in this thread and consider if it might apply to you.

I know what I have said. I said reduce the filtering to the absolute minimum and do the rest in userspace.

The now builtin filters are defined to fit somebody's needs or idea of what the user should / wants to see. They will not fit everybody's needs / ideas. So we start modifying, adding and #ifdefing kernel filters, which is a scary vision.

Enabling and disabling events is a valid basic filter request, which should live in the kernel. Anything else should go into userspace, IMO.

tglx
Re: 2.6.11-rc1-mm1
On Fri, Jan 14, 2005 at 06:58:10PM -0800, William Lee Irwin III wrote:
> No idea what hit me just yet. x86-64 doesn't boot. Still going through
> the various architectures. The same system (including the initrd FPOS
> bullcrap, though, of course, I'm using an initrd built just for this
> kernel) boots various 2.6.x up to 2.6.10-mm1. There are vague indications
> something in/around SCSI and/or initrd's has violently exploded in my face.

With the waiting 10s patch backed out, things seem to be going well:

$ ssh analyticity
Last login: Mon Jan 17 14:03:13 2005 from meromorphy
Linux analyticity 2.6.11-rc1-mm1 #5 SMP Sat Jan 15 01:25:23 PST 2005 sparc64 GNU/Linux
$ uptime
 14:10:55 up 10 min, 7 users, load average: 0.10, 0.40, 0.31

Now I just have to remember to set up

    ip route add 192.168.1.0/24 dev eth3 via 192.168.1.1

instead of just

    ip route add 192.168.1.0/24 dev eth3

so I can tftpboot the thing (well, it took all of 10s to figure out, but it may not next time). Routing changes are painful.

-- wli
Re: 2.6.11-rc1-mm1
Hello Christoph,

Christoph Hellwig wrote:
> The thing I'm unhappy with is what the code does currently. I haven't
> looked at the code enough nor thought about the problem enough to tell
> you what's the right thing to do. Knowing that will involve review of
> the architecture and serious benchmarking on a few platforms.

Like I was saying elsewhere, we are likely going to drop the lockless code for now (i.e. the code that does the cmpxchg). Instead we will depend on normal cli/sti abstractions.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
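For context, the kind of cmpxchg-based reservation being dropped here generally has the shape sketched below. This is a user-space illustration using C11 atomics, not the relayfs code. The buffer size, the function name and the "return (size_t)-1 when full" convention are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BUF_SIZE 4096u

static _Atomic size_t write_index;  /* next free byte in the buffer */

/*
 * Reserve `len` bytes. Returns the start offset of the reservation, or
 * (size_t)-1 if the buffer cannot hold it. The loop runs again whenever
 * another writer moved write_index between our read and our
 * compare-and-exchange, so the time spent here is not bounded.
 */
size_t reserve_lockless(size_t len)
{
    size_t old = atomic_load(&write_index);
    size_t new_index;

    assert(len > 0);
    do {
        if (len > BUF_SIZE - old)
            return (size_t)-1;
        new_index = old + len;
        /* On failure, `old` is refreshed with the current index. */
    } while (!atomic_compare_exchange_weak(&write_index, &old, new_index));

    return old;
}
```

The interrupt-disabling variant would do the same index arithmetic once, inside a section where interrupts are off on the local CPU, trading the retry loop for a short fixed-length critical section. Which of the two is cheaper depends on the architecture, which is what the atomic-op versus cli/sti exchange elsewhere in the thread is about.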
Re: 2.6.11-rc1-mm1
Hello Roman,

Roman Zippel wrote:
> Periodically can also mean a buffer start call back from relayfs
> (although that would mean the first entry is not guaranteed) or a
> (per cpu) eventcnt from the subsystem. The amount of needed search would
> be limited. The main point is from the relayfs POV the buffer structure
> has always the same (simple) structure.

But two e-mails ago, you told us to drop the start_reserve and end_reserve and move the details of the buffer management into relayfs and out of ltt? Either we have a callback, like you suggest, and then we need to reserve some space to make sure that the callback is guaranteed to have the first entry, or we drop the callback and provide an option to the user for relayfs to write this first entry for him. Providing a callback without reservation is no different than relying purely on the heartbeat, which, like I said before and for the reasons illustrated below, is unrealistic.

> You have to be more specific, what's so special about this amount of data.
> You likely want to (incrementally) build an index file, so you don't have
> to repeat the searches, but even with your current format you would
> benefit from such an index file.
[snip]
>>As above, restoring the original order of events is fine if you are
>>looking at mbs or kbs of data. It's just totally unrealistic for
>>the amounts of data we want to handle.
>
> Why is it "totally unrealistic"?

Ok, let's expand a little here on the amount of data. Say you're getting 2MB/s of data (which is not unrealistic on a loaded system.) That means that if I'm tracing for 2 days, I've got 345GB of data (~7.2GB/hour). In practice, users aren't necessarily interested in plowing through the entire 345GB, they just want to view a given portion of it. Now, if I follow what you are suggesting, I have to go through the entire 345GB to: a) create indexes, b) reorder events, and likely c) have to rewrite another 345GB of data. And I haven't yet discussed the kind of problems you would encounter in trying to reorder such a beast that contains, by definition, variable-sized events. For one thing, if event N+1 doesn't follow N, then you would be forced to browse forward until you actually found it before you could write a properly ordered trace. And it just takes a few processes that are interrupted and forced to sleep here and there to make this unusable. That's without the RAM or fs space required to store those index tables ... At 3 to 12 bytes per event, that's a lot of space for indexes ...

If I keep things as they are with ordered events and delimiters on buffer boundaries, I can skip to any place within this 345GB and start processing from there.

And that's for two days. If you're a sysadmin encountering a transient problem on a server, you may actually want more than that.

>>But like I said earlier, the added relayfs mode (kdebug) would allow
>>for exactly what you are suggesting:
>>    event_id = atomic_inc_return(&event_cnt);
>
> Actually that would be already too much for low level kernel debugging.
> Why do you want to put this into relayfs?

I don't. I was just saying that with the ad-hoc mode, a relayfs client could use the code snippet you were suggesting.

> What are the _specific_ reasons you need these various modes, why can't
> you build any special requirements on top of a very light weight relay
> mechanism?

Because of the opposite requirements. Here are the two modes I'm suggesting in relayfs and how they operate:

Managed:
- Presumes active user-space daemon interested in catching _all_ events.
- Allows N buffers in buffer ring
- Provides limit-checking (callback on end of sub-buffer)
- Provides buffer delimiters (writes timestamp at beg and end)
- Suited for all types of event sizes (both fixed and variable) at very high frequency.
- Daemon is woken up when buffer is ready for writing, executes a write() on an mmaped area and notifies relevant kernel subsystem, which in turn notifies relayfs that buffer can now be reused.
- Relies on proper abstraction of cli/sti.

Ad-Hoc:
- Presumes transient userspace tool interested in event snapshots.
- Single circular buffer.
- No limits checking (or very basic: as in stop if overwrite).
- No buffer delimiters.
- Best suited for fixed-size events at extreme high frequency.
- User-space tool simply does a write() on an mmaped area and exits or goes back to sleep.
- Relies on proper abstraction of cli/sti.

Basically, the ad-hoc mode abides by the principles of KISS, whereas the managed mode is more elaborate, for clients like LTT.

Rhetorical: Couldn't the ad-hoc mode case be a special case of the managed mode? In theory yes, in practice no. The various conditionals and code paths for switching buffers, invoking callbacks, writing delimiters and the likes, which make this mode useful to clients like LTT, will always be a problem for those seeking the shortest path to buffer committal. In the case of Ingo, for example, I'm sure he'd
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
Thomas Gleixner wrote:
> Thats the point. Adding another hardwired implementation does not give
> us a possibility to solve the hardwired problem of the already available
> stuff.

Well then, like I said before, you know what you need to do:
http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
Thomas Gleixner wrote:
> Sorting out disabled events is the filtering you have to do in kernel
> and you should do it in the hot path or remove the unnecessary
> tracepoints at compiletime.

Do you actually read my replies or do you just grep for something you can object to? If you care to read my replies you will see that this has already been answered.

> You are not answering my argument. 8MB/sec is an event frequency of
> 128kHz when we assume 64byte/event. It's one event every 8us. So every
> unnecessary computation, every leaving the hotpath for nothing is just
> giving you performance loss.

I have, you just choose not to read. Here's what I said earlier:
> Note, however, that we are thinking of dropping the lockless scheme
> for now. We will pick up this discussion separately further down the
> road.

IOW, we will be using cli/sti. So there is no "leaving the hotpath".

> I said:
>
>>>Sorting out disabled events in the hot path
>
> s/Sorting/Filtering/
>
> I never said this should not be done.

You're either on crack or I don't know how to read english. Here's what you said:
> Sorting out disabled events in the hot path and moving the if
> (pid/gid/grp) whatever stuff into userspace postprocessing is not an
> alien request.

Clearly you are suggesting moving the filtering into user-space.

> Separating layers as I suggested before is not making it a generic
> debugging tool. It makes parts of those layers available for other usage
> and gives us the chance to reuse the parts for cleaning up already
> available code which has the same hardwired structure.

This has already been answered.

Karim
-- 
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
Hi,

Andrew Morton wrote on Fri, 14 Jan 2005 10:35:34 -0800:

> What filesystem(s) do you use, and why?

sshfs (best idea for file access through firewalls).
gmailfs (best free off-site backup facility).

Will use encfs as soon as FUSE is in mainline (I'm using cryptoloop now, but that's not sanely backupable.)

-- 
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | [EMAIL PROTECTED]
Re: 2.6.11-rc1-mm1
Karim Yaghmour writes:
>
> Hello Roman,
>
> What we are dropping for later review: read/write semantics from
> user-space. It has to be understood that we believe that this is
> a major drawback. For one thing, you won't be able to do something
> like:
> $ cat /relayfs/xchg/my-file > ~/test-data
>
> Instead, you will have to write a custom app that does open(),
> mmap(), write(). We could still provide a small app/library that
> did this automagically, but you've got to admit that nothing
> beats the real thing.
>

Maybe we could use FUSE to provide read()/write() for relayfs files - opening a FUSE relayfs file would open and mmap the actual relayfs file, read() would move around in the buffer using basically the current relayfs read logic moved down into the FUSE filesystem read fileop, and write() could write directly to the buffer...

Tom

> Also note that there are people who currently use this already,
> so there will be some unhappy campers.
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

-- 
Regards,
Tom Zanussi <[EMAIL PROTECTED]>
IBM Linux Technology Center/RAS
Re: 2.6.11-rc1-mm1
On Mon, Jan 17, 2005 at 10:48:52AM -0500, Robert Wisniewski wrote:
> Wow - disabling interrupts is handfuls to tens of cycles, so that means
> some architectures take thousands of cycles to do atomic operations. Then
> I would definitely agree we should not be using atomic operations on those,
> fwiw, out of curiosity, what archs make atomic ops so expensive.
>
> Andrew, on the broader note. If the community feels disabling interrupts
> is the better way to go for the variables (I think it's index and count) we
> were protecting with atomic ops then as the code stands things should be
> fine with that approach and we can make that change.

The thing I'm unhappy with is what the code does currently. I haven't looked at the code enough nor thought about the problem enough to tell you what's the right thing to do. Knowing that will involve review of the architecture and serious benchmarking on a few platforms.
Re: 2.6.11-rc1-mm1
Arjan van de Ven writes:
> On Sun, 2005-01-16 at 16:06 -0500, Robert Wisniewski wrote:
> >
> > :-) - as above. Furthermore, it seems that reducing the places where
> > interrupts are disabled would be a good thing?
>
> depends at the price. On several cpus, disabling interrupts is hundreds
> of times cheaper than doing an atomic op.

Wow - disabling interrupts is handfuls to tens of cycles, so that means some architectures take thousands of cycles to do atomic operations. Then I would definitely agree we should not be using atomic operations on those. FWIW, out of curiosity, what archs make atomic ops so expensive?

Andrew, on the broader note. If the community feels disabling interrupts is the better way to go for the variables (I think it's index and count) we were protecting with atomic ops then as the code stands things should be fine with that approach and we can make that change.

Thanks for your attention to looking through this.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[EMAIL PROTECTED]
Re: 2.6.11-rc1-mm1
Hi,

On Sun, 16 Jan 2005, Karim Yaghmour wrote:

> > You can make it even simpler by dropping this completely. Every buffer is
> > simply a list of events and you can let ltt write periodically a timer
> > event. In userspace you can randomly seek at buffer boundaries and search
> > for the timer events. It will require a bit more work for userspace, but
> > even large amount of tracing data stays manageable.
>
> We already do write a heartbeat event periodically to have readable
> traces in the case where the lower 32 bits of the TSC wrap-around.
>
> As I mentioned elsewhere, please don't think of this in terms of
> kbs or mbs of data. What we're talking about here is gbs if not
> 100gbs of data. Having to start reading each sub-buffer until you
> hit a heartbeat really is a killer for such large traces. If there
> was a significant impact on relayfs for having this I would have
> understood the argument, but relayfs needs to do buffer-management
> anyway, so I don't see that much complexity being added by allowing
> the channel user to ask relayfs for delimiters.

Periodically can also mean a buffer start call back from relayfs (although that would mean the first entry is not guaranteed) or a (per cpu) eventcnt from the subsystem. The amount of needed search would be limited. The main point is that from the relayfs POV the buffer always has the same (simple) structure.
You have to be more specific about what's so special about this amount of data. You likely want to (incrementally) build an index file, so you don't have to repeat the searches, but even with your current format you would benefit from such an index file.

> > Userspace can then easily restore the original order of events.
>
> As above, restoring the original order of events is fine if you are
> looking at mbs or kbs of data. It's just totally unrealistic for
> the amounts of data we want to handle.

Why is it "totally unrealistic"?

> But like I said earlier, the added relayfs mode (kdebug) would allow
> for exactly what you are suggesting:
>     event_id = atomic_inc_return(&event_cnt);

Actually that would be already too much for low level kernel debugging. Why do you want to put this into relayfs?
What are the _specific_ reasons you need these various modes? Why can't you build any special requirements on top of a very lightweight relay mechanism?

bye, Roman
Re: 2.6.11-rc1-mm1
On Sun, 2005-01-16 at 21:24 -0500, Karim Yaghmour wrote:
> > Sorting out disabled events in the hot path and moving the if
> > (pid/gid/grp) whatever stuff into userspace postprocessing is not an
> > alien request.
>
> It is. Have you even read what I suggested to change in my other mail:
>     if ((any_filtering) && !(ltt_filter(event_id, event_struct, data)))
>         return -EINVAL;

Sorting out disabled events is the filtering you have to do in the
kernel, and you should do it in the hot path or remove the unnecessary
tracepoints at compile time.

> > 4096kB/sec for  64 events/ms (event frequency  64kHz) (15 us)
> > 8192kB/sec for 128 events/ms (event frequency 128kHz) ( 8 us)
>
> Actually, on a PII-350MHz, I was already generating 0.5MB/s of data
> just by running an X session. If we assume that a machine 10 times
> faster generates 10 times as many events, we've already got 5MB/s,
> and I'm sure that there are heavier cases than X.

You are not answering my argument. 8MB/sec is an event frequency of
128kHz when we assume 64 bytes/event. It's one event every 8us. So every
unnecessary computation, every departure from the hot path for nothing,
just costs you performance.

> Not even Ingo hinted at getting rid of filtering. Remember the earlier
> e-mail I referred to? Here's what he was suggesting:

I said:
> > Sorting out disabled events in the hot path

s/Sorting/Filtering/

I never said this should not be done.

> Like I said, we are willing to accommodate those who want to be able
> to use relayfs for kernel debugging purposes, but we can hardly
> be blamed for not making LTT a generic kernel debugging tool as this
> is exactly the excuse many kernel developers had for not including
> LTT to start with. It's just totally disingenuous to give us
> grief for claiming that we are doing something and then later turn
> around and blame us for not doing it ... sheesh ...

Separating layers as I suggested before is not making it a generic
debugging tool. It makes parts of those layers available for other usage
and gives us the chance to reuse the parts for cleaning up already
available code which has the same hardwired structure.

tglx
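The hot-path check both sides are arguing about can be illustrated with a small userspace sketch. It is not LTT or relayfs code: `enabled_mask`, `enable_event()`, `event_enabled()`, `log_event()` and the 64-event limit are all hypothetical names and values chosen for the illustration. The point it tries to show is that a disabled event can be rejected after a single load and bit test, before any record is built.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: one bit per event type, set while the event is enabled. */
static uint64_t enabled_mask;

/* Hypothetical counter standing in for "copy the record into the buffer". */
static unsigned long events_logged;

static void enable_event(unsigned int event_id)
{
    if (event_id < 64)
        enabled_mask |= UINT64_C(1) << event_id;
}

static void disable_event(unsigned int event_id)
{
    if (event_id < 64)
        enabled_mask &= ~(UINT64_C(1) << event_id);
}

/* Hot-path test: a bounds check, one load, one shift and one AND. */
static bool event_enabled(unsigned int event_id)
{
    return event_id < 64 && ((enabled_mask >> event_id) & 1u);
}

/* Returns 0 if the event was recorded, -1 if it was filtered out. */
static int log_event(unsigned int event_id)
{
    if (!event_enabled(event_id))
        return -1;      /* disabled: leave before doing any real work */
    events_logged++;    /* a real tracer would write the record here */
    return 0;
}
```

Anything more selective than this (per-pid, per-gid and similar conditions) is the part tglx proposes to move into userspace postprocessing.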
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
On Sun, 2005-01-16 at 20:54 -0500, Karim Yaghmour wrote:
> If you really want to define layers, then there are actually four
> layers:
> 1- hooking mechanism
> 2- event definition / registration
> 3- event management infrastructure
> 4- transport mechanism
>
> LTT currently does 1, 2 & 3. Clearly, as in the mail I referred to
> earlier, there is code in the kernel that already does 1, 2, 3,
> and 4 in very hardwired/ad-hoc fashion and there isn't anyone asking
> for them to remove it. We're offering 4 separately and are putting
> LTT on top of it. If you want to get 1 & 2 separately, have a look
> at kernel hooks and genevent:

I know that there is enough code which already does x, y, z in a
hardcoded/hardwired way. That's the point. Adding another hardwired
implementation does not give us a chance to solve the hardwired problem
of the already available stuff.

tglx
Re: 2.6.11-rc1-mm1
Hi Karim,

> Thomas Gleixner wrote:
>> It's not only me, who needs constant time. Everybody interested in
>> tracing will need that. In my opinion it's a principle of tracing.
>
> relayfs is a generalized buffering mechanism. Tracing is one application
> it serves. Check out the web site: "high-speed data-relay filesystem."
> Fancy name huh ...
>
>> The "lockless" mechanism is _FAKE_ as I already pointed out. It replaces
>> locks by do { } while loops. So what ?
>

How about combining the buffering mechanism of relayfs with the
kernel-to-user-space transport of debugfs? This would also remove a lot
of complicated code from relayfs.

Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Ph: 91-80-25044636
<[EMAIL PROTECTED]>
Re: 2.6.11-rc1-mm1
Thomas Gleixner wrote:
> Which is every 1.42 seconds on a 3GHz machine. I guess we don't have
> GB's of data when the 1.42 seconds elapse without an event.

My argument was about being able to browse the amount of data I was
referring to. The heartbeat thing was an aside to Roman as to the fact
that we already do what he's suggesting.

> I still don't see the point. The implicit ability of LTT to allow
> tracing of up to 8192 bytes user data, strings and XML makes this
> necessary. I do not see any necessity to integrate these special usage
> modes instead of a generically usable instrumentation implementation.

I've already clarified your mischaracterization of custom events, you
are being disingenuous here. If you want a generalized hooking
mechanism, feel free to ask Andrew to take kernel hooks:
http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/

> If relayfs is giving those users the ability to do so then they can do
> it, but I object to the fact that LTT/relayfs is occupying the place of a
> more generic implementation in the way it is implemented now.

Again, damned if we do, damned if we don't. LTT isn't meant for kernel
debugging per se, though you can use it to that end to a certain extent.
However, if you are kernel debugging, you will find the ad-hoc mode I'm
talking about adding to relayfs quite useful.

> For normal event tracing you have about 32-64 bytes of data per event. So
> disabling interrupts in order to copy this amount of information into a
> buffer is cheaper on most architectures than doing the whole magic in
> LTT and relayfs. This also keeps your buffers consistent and does not
> need any magic for postprocessing.

Oh, now you want to lighten the weight on postprocessing? Come on,
Thomas, please stop wasting my time.

Note, however, that we are thinking of dropping the lockless scheme for
now. We will pick up this discussion separately further down the road.

> Sorting out disabled events in the hot path and moving the if
> (pid/gid/grp) whatever stuff into userspace postprocessing is not an
> alien request.

It is. Have you even read what I suggested to change in my other mail:
    if ((any_filtering) && !(ltt_filter(event_id, event_struct, data)))
        return -EINVAL;

You're not honestly telling me that checking for any_filtering is going
to ruin your day.

> You are talking of Gigabytes of data. In what time ?
>
> Let's do some math.
>
> For simplicity all events use 64 Byte event space.
>
> ~ 64kB/sec for 1000 events/s  (event frequency   1kHz) ( 1 ms)
> 1024kB/sec for  16 events/ms (event frequency  16kHz) (62 us)
> 2048kB/sec for  32 events/ms (event frequency  32kHz) (31 us)
> 4096kB/sec for  64 events/ms (event frequency  64kHz) (15 us)
> 8192kB/sec for 128 events/ms (event frequency 128kHz) ( 8 us)
>
> where a 100Mbit network can theoretically transport 10240kB/sec and
> practically does 4000-8000 kB/sec.
>
> An event interval of 8us even on a 3 GHz machine is complete illusion,
> because we spend already a couple of usecs in servicing the legacy 8254
> timer.
>
> So the realistic assumption on a 3GHz machine is definitely below 64kHz,
> which means we have to handle max. 4MB of data per second.

Actually, on a PII-350MHz, I was already generating 0.5MB/s of data
just by running an X session. If we assume that a machine 10 times
faster generates 10 times as many events, we've already got 5MB/s,
and I'm sure that there are heavier cases than X. Here's the paper if
you want to read it:
http://www.opersys.com/ftp/pub/LTT/Documentation/ltt-usenix.ps.gz

> I'm not impressed. Disabling interrupts for a couple of nanoseconds to
> store the trace data in the buffer does not hurt at all. Running through
> a big bunch of out of cache line instructions does.

Like I said above, fighting for/against lockless is not our immediate
goal, and we will likely remove it.

> If you try to trace more than this amount you are toast anyway.
>
> Please spare me the "reality has bitten" arguments. The whole if(..)
> scenario in _ltt_event_log() is doing postprocessing, which can be done
> in userspace. I don't care about the required time as long as it does
> not introduce additional burden into the kernel.

Not even Ingo hinted at getting rid of filtering. Remember the earlier
e-mail I referred to? Here's what he was suggesting:
> void trace(event, data1, data2, data3)
> {
>     int cpu = smp_processor_id();
>     int idx, pending, *curr = curr_idx + cpu;
>     struct trace_event *t;
>     unsigned long flags;
>
>     if (!event_wanted(current, event, data1, data2, data3))
>         return;
>
>     local_irq_save(flags);
>
>     idx = ++curr_idx[cpu] & (NR_TRACE_ENTRIES - 1);
>     pending = ++curr_pending[cpu];
>
>     t = trace_ring[cpu] + idx;
>
>     t->event = event;
>     rdtscll(t->timestamp);
>     t->data1 = data1;
>     t->data2 = data2;
>     t->data3 = data3;
>
>     if (curr_pending == TRACE_LOW_WATERMARK &
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
Thomas Gleixner wrote:
> This implies to separate
>
> - infrastructure
> - event registration
> - transport mechanism

Like I said in my first response: we can't be everything for everybody,
the requirements are just too broad. ISO tried it with OSI. Have a look
at net/* for the result.

Currently, LTT provides the first two in one piece, and relayfs provides
the third. Like I acknowledged earlier, there is room for generalizing
the transport mechanism, and I'm thinking of amending the relayfs API
proposal further and renaming the modes to make them more
straightforward:
- Managed (locking or lockless.)
- Ad-Hoc (which works like Ingo, yourself, and others have requested.)

If you really want to define layers, then there are actually four
layers:
1- hooking mechanism
2- event definition / registration
3- event management infrastructure
4- transport mechanism

LTT currently does 1, 2 & 3. Clearly, as in the mail I referred to
earlier, there is code in the kernel that already does 1, 2, 3,
and 4 in very hardwired/ad-hoc fashion and there isn't anyone asking
for them to remove it. We're offering 4 separately and are putting
LTT on top of it. If you want to get 1 & 2 separately, have a look
at kernel hooks and genevent:
http://www-124.ibm.com/developerworks/oss/linux/projects/kernelhooks/
http://www.listserv.shafik.org/pipermail/ltt-dev/2003-January/000408.html

We'd gladly take a serious look at using the former if it was included,
and there is work in progress on making the latter the standard way of
declaring LTT events instead of using a static ltt-events.h.

Five years ago, there was a discussion about integrating GKHI into the
kernel (the kernel hooks ancestor). Have a look for yourself as to the
response to this suggestion (basically people weren't ready to accept a
generalized hooking mechanism without a defined set of hooks, and then
others didn't like the idea at all because creating general hooks in the
kernel which anybody can register to creates legal and maintenance
problems ... basically it's a can of worms):
http://marc.theaimsgroup.com/?l=linux-kernel&m=97371908916365&w=2

There's only so much we can push into the kernel at the same time. Not
to mention that before you can be generic, you've got to have some
specific implementation to start working from. I believe that what we've
ironed out through the discussion of the past two days is a good basis.

There is some irony in all this. For years, we were told that we
couldn't make it into the kernel because we were perceived as providing
a kernel debugging tool, and now that we're starting to get our things
seriously reviewed we're being told that maybe it ain't really that
useful because those who want to do kernel debugging can't use it
as-is ... go figure.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
On Sun, 2005-01-16 at 16:18 -0500, Karim Yaghmour wrote:
> We already do write a heartbeat event periodically to have readable
> traces in the case where the lower 32 bits of the TSC wrap-around.

Which is every 1.42 seconds on a 3GHz machine. I guess we don't have
GB's of data when the 1.42 seconds elapse without an event.

> > Userspace can then easily restore the original order of events.
>
> As above, restoring the original order of events is fine if you are
> looking at mbs or kbs of data. It's just totally unrealistic for
> the amounts of data we want to handle.

I still don't see the point. The implicit ability of LTT to allow
tracing of up to 8192 bytes user data, strings and XML makes this
necessary. I do not see any necessity to integrate these special usage
modes instead of a generically usable instrumentation implementation.

If relayfs is giving those users the ability to do so then they can do
it, but I object to the fact that LTT/relayfs is occupying the place of
a more generic implementation in the way it is implemented now.

For normal event tracing you have about 32-64 bytes of data per event.
So disabling interrupts in order to copy this amount of information into
a buffer is cheaper on most architectures than doing the whole magic in
LTT and relayfs. This also keeps your buffers consistent and does not
need any magic for postprocessing.

Sorting out disabled events in the hot path and moving the if
(pid/gid/grp) whatever stuff into userspace postprocessing is not an
alien request.

You are talking of Gigabytes of data. In what time ?

Let's do some math.

For simplicity all events use 64 Byte event space.

~ 64kB/sec for 1000 events/s  (event frequency   1kHz) ( 1 ms)
1024kB/sec for  16 events/ms (event frequency  16kHz) (62 us)
2048kB/sec for  32 events/ms (event frequency  32kHz) (31 us)
4096kB/sec for  64 events/ms (event frequency  64kHz) (15 us)
8192kB/sec for 128 events/ms (event frequency 128kHz) ( 8 us)

where a 100Mbit network can theoretically transport 10240kB/sec and
practically does 4000-8000 kB/sec.

An event interval of 8us even on a 3 GHz machine is complete illusion,
because we spend already a couple of usecs in servicing the legacy 8254
timer.

So the realistic assumption on a 3GHz machine is definitely below 64kHz,
which means we have to handle max. 4MB of data per second.

I'm not impressed. Disabling interrupts for a couple of nanoseconds to
store the trace data in the buffer does not hurt at all. Running through
a big bunch of out of cache line instructions does.

If you try to trace more than this amount you are toast anyway.

Please spare me the "reality has bitten" arguments. The whole if(..)
scenario in _ltt_event_log() is doing postprocessing, which can be done
in userspace. I don't care about the required time as long as it does
not introduce additional burden into the kernel.

> Also note that there are people who currently use this already,
> so there will be some unhappy campers.

Be aware that there are some unhappy campers in the kernel community too
when special-purpose tracing is included instead of a generally usable
framework.

tglx
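The arithmetic behind the table in the message above can be checked with a few lines of C. This is only a sketch: the helper names are made up for the illustration, kB is taken as 1000 bytes (which is what makes the table's figures come out exactly), and the 3 GHz clock is the figure assumed in the thread.

```c
#include <stdint.h>

/* Bytes per second produced by `events_per_sec` events of `event_size` bytes. */
static uint64_t trace_bytes_per_sec(uint64_t events_per_sec, uint64_t event_size)
{
    return events_per_sec * event_size;
}

/* Average gap between two events in nanoseconds (0 if the rate is 0). */
static uint64_t event_interval_ns(uint64_t events_per_sec)
{
    return events_per_sec ? UINT64_C(1000000000) / events_per_sec : 0;
}

/* Seconds until the low 32 bits of a cycle counter wrap at `hz` cycles/s. */
static double tsc32_wrap_seconds(double hz)
{
    return 4294967296.0 / hz;   /* 2^32 cycles */
}
```

With 64-byte events, 128000 events/s gives 8192000 bytes/s and a gap of a little under 8 us, and 64000 events/s gives 4096000 bytes/s, matching the last two rows. At 3e9 Hz the 32-bit wrap comes out at a little over 1.4 s.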
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
On Sat, 2005-01-15 at 23:23 -0500, Karim Yaghmour wrote:
> > Well, that's really a core problem. We don't want to duplicate
> > infrastructure, which practically does the same. So if relayfs isn't
> > usable in this kind of situation, it really raises the question whether
> > relayfs is usable at all. We need to make relayfs generally usable,
> > otherwise it will join the fate of devfs.
>
> Hmm, coming from you I will take this as a pretty strong endorsement
> for what I was suggesting earlier: provide a basic buffering mode
> in relayfs to be used in kernel debugging. However, it must be
> understood that this is separate from the existing modes and ltt,
> for example, could not use such a basic infrastructure. If this is
> ok with you, and no one wants to complain too loudly about this, I
> will go ahead and add this to our to-do list for relayfs.

This implies separating

- infrastructure
- event registration
- transport mechanism

tglx
Re: 2.6.11-rc1-mm1
On Sun, 2005-01-16 at 16:06 -0500, Robert Wisniewski wrote:
> :-) - as above. Furthermore, it seems that reducing the places where
> interrupts are disabled would be a good thing?

Depends on the price. On several CPUs, disabling interrupts is hundreds
of times cheaper than doing an atomic op.
Re: 2.6.11-rc1-mm1
Christoph Hellwig writes:
> On Sun, Jan 16, 2005 at 03:11:00PM -0500, Robert Wisniewski wrote:
> > int global_val;
> >
> > modify_val_spin()
> > {
> >     acquire_spin_lock()
> >     // calculate some_value based on global_val
> >     // for example c=global_val; if (c%2) some_value=10; else some_value=20;
> >     global_val = global_val + some_value
> >     release_spin_lock()
> > }
> >
> > modify_val_atomic()
> > {
> >     do
> >         // calculate some_value based on global_val
> >         // for example c=global_val; if (c%2) some_value=10; else some_value=20;
> >         global_val = global_val + some_value
> >     while (compare_and_store(global_val, , ))
> > }
> >
> > What's the difference? The deal is if two processes execute this code
> > simultaneously and one gets interrupted in the middle of modify_val_spin,
> > then the other wastes its entire quantum spinning for the lock. In the
> > modify_val_atomic if one process gets interrupted, no problem, the other
> > process can proceed through, then when the first one runs again the CAS
> > will fail, and it will go around the loop again. Now imagine it was the
> > kernel involved...
>
> Just prevent that with spin_lock_irq. But anyway this example doesn't
> fit the ltt code. cmpxchg loops can make lots of sense for such simple
> loops, but as soon as you need to do significant work in the loop it
> starts to get problematic. Your example would btw be better off using

The loop in question is where we grab the current (old) index and
perform a computation (or three). The only expensive operation is the
timestamp acquisition, which has been modified to use the cheaper rdtsc,
so I still think that's within the realm of a reasonably simple loop. I
think what you want to avoid is starting to walk a (potentially
indeterminate) data structure in such an atomic op loop.

> atomic_t and its primitives so you abstract away the actual implementation
> and the architecture can choose the most efficient implementation.

That's an interesting thought because it might address Andrew's concern.
We'll investigate. Thanks.

-bob
Re: 2.6.11-rc1-mm1
Hello Roman,

Roman Zippel wrote:
> It seems we first need to specify, what relayfs actually is supposed to
> be. Is it a relaying mechanism for large amount of data from kernel to
> user space or is it a general communication channel between kernel and
> user space? You have to choose one, if you mix contradicting requirements,
> you'll never get a simple abstraction layer and relayfs will always be a
> pain to work with.

I think we want to concentrate on the former, though I suspect the
latter will happen eventually. But let's keep our focus on providing
a mechanism for relaying large amounts of data from the kernel to
user-space.

> You can make it even simpler by dropping this completely. Every buffer is
> simply a list of events and you can let ltt write periodically a timer
> event. In userspace you can randomly seek at buffer boundaries and search
> for the timer events. It will require a bit more work for userspace, but
> even large amount of tracing data stays manageable.

We already do write a heartbeat event periodically to have readable
traces in the case where the lower 32 bits of the TSC wrap around.

As I mentioned elsewhere, please don't think of this in terms of kbs
or mbs of data. What we're talking about here is gbs if not 100gbs of
data. Having to start reading each sub-buffer until you hit a heartbeat
really is a killer for such large traces. If there was a significant
impact on relayfs for having this I would have understood the argument,
but relayfs needs to do buffer-management anyway, so I don't see that
much complexity being added by allowing the channel user to ask relayfs
for delimiters.

> Userspace can then easily restore the original order of events.

As above, restoring the original order of events is fine if you are
looking at mbs or kbs of data. It's just totally unrealistic for
the amounts of data we want to handle.

But like I said earlier, the added relayfs mode (kdebug) would allow
for exactly what you are suggesting:
    event_id = atomic_inc_return(&event_cnt);

So here's the new API based on input from Christoph and Tom:

rchan* relay_open(channel_path, bufsize, nbufs);
int    relay_close(*rchan);
int    relay_reset(*rchan)
int    relay_write(*rchan, *data_ptr, count, **wrote-pos);
int    relay_info(*rchan, *channel_info)
void   relay_set_property(*rchan, property, value);
void   relay_get_property(*rchan, property, *value);

For direct writing (currently already used by ltt, for example):

char*  relay_reserve(*rchan, len, *ts, *td, *err, *interrupting)
void   relay_commit(*rchan, *from, len, reserve_code, interrupting);
void   relay_buffers_consumed(*rchan, u32)

These are the related macros:

#define relay_write_direct(DEST, SRC, SIZE) \
#define relay_lock_channel(RCHAN, FLAGS) \
#define relay_unlock_channel(RCHAN, FLAGS) \

What we are dropping for later review: read/write semantics from
user-space. It has to be understood that we believe that this is a
major drawback. For one thing, you won't be able to do something like:
$ cat /relayfs/xchg/my-file > ~/test-data

Instead, you will have to write a custom app that does open(), mmap(),
write(). We could still provide a small app/library that did this
automagically, but you've got to admit that nothing beats the real
thing.

Also note that there are people who currently use this already,
so there will be some unhappy campers.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
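The `event_id = atomic_inc_return(&event_cnt);` idea mentioned in the message above, together with Roman's point that userspace can restore the event order afterwards, can be sketched in userspace C11. This is an assumption-laden illustration, not relayfs or LTT code: `struct event`, `next_event_id()` and `restore_order()` are hypothetical, C11 `atomic_fetch_add` stands in for the kernel's `atomic_inc_return()`, and counter wrap-around is ignored.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical trace record: a global sequence number plus a payload. */
struct event {
    uint32_t id;
    uint32_t payload;
};

/* Shared counter; static storage, so it starts at zero. */
static atomic_uint event_cnt;

/* Userspace stand-in for atomic_inc_return(): returns the incremented value. */
static uint32_t next_event_id(void)
{
    return atomic_fetch_add(&event_cnt, 1u) + 1u;
}

/* qsort() comparator on the sequence number (wrap-around not handled). */
static int cmp_event_id(const void *a, const void *b)
{
    const struct event *ea = a, *eb = b;
    return (ea->id > eb->id) - (ea->id < eb->id);
}

/* Postprocessing step: put records from several buffers back in order. */
static void restore_order(struct event *ev, size_t n)
{
    qsort(ev, n, sizeof(*ev), cmp_event_id);
}
```

Sorting like this is cheap for a handful of records; Karim's objection in the thread is about doing it over traces of many gigabytes, which a sketch of this size says nothing about.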
Re: 2.6.11-rc1-mm1
Andrew Morton writes:
> Robert Wisniewski <[EMAIL PROTECTED]> wrote:
> >
> > modify_val_spin()
> > {
> >     acquire_spin_lock()
> >     // calculate some_value based on global_val
> >     // for example c=global_val; if (c%2) some_value=10; else some_value=20;
> >     global_val = global_val + some_value
> >     release_spin_lock()
> > }
> >
> > modify_val_atomic()
> > {
> >     do
> >         // calculate some_value based on global_val
> >         // for example c=global_val; if (c%2) some_value=10; else some_value=20;
> >         global_val = global_val + some_value
> >     while (compare_and_store(global_val, , ))
> > }
> >
> > What's the difference? The deal is if two processes execute this code
> > simultaneously and one gets interrupted in the middle of modify_val_spin,
> > then the other wastes its entire quantum spinning for the lock. In the
> > modify_val_atomic if one process gets interrupted, no problem, the other
> > process can proceed through, then when the first one runs again the CAS
> > will fail, and it will go around the loop again.
>
> One could use spin_lock_irq(). The performance would be similar.

Yes, on some architectures I think you're right (on some archs, though,
I'm not so sure) - Ingo and I had that debate a bit ago. But as you
astutely noted or asked below, the original intent was to be able to use
a single shared buffer for user and kernel space. In fact, the lockless
design of tracing in K42, which motivated this design, does that. For a
couple of reasons we have not (yet?) done that for LTT. But, for
example, NPTL could have made use of it when they were investigating a
tracing facility. Recently, another company using LTT for device driver
and video debugging is very interested in cheap user space tracing in
conjunction with kernel tracing because they need both sets of events to
understand what is up. The debate is still open on the best way to get
cheap user space logging, but there seems to be an increasing need for
it in the community.

> > Now imagine it was the kernel involved...
>
> Or are you saying that userspace does the above as well?

:-) - as above. Furthermore, it seems that reducing the places where
interrupts are disabled would be a good thing? By not introducing
additional places where interrupts are disabled, tracing has less of an
impact.

I was also pointing out that Christoph's statement that spin locks and
atomic ops are the same is not accurate (except for perhaps limited
cases, but then you must make such arguments - not necessarily good),
and we had good reasons for using an atomic op.

Thanks.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[EMAIL PROTECTED]
Re: 2.6.11-rc1-mm1
On Sun, Jan 16, 2005 at 03:11:00PM -0500, Robert Wisniewski wrote:
> int global_val;
>
> modify_val_spin()
> {
>     acquire_spin_lock()
>     // calculate some_value based on global_val
>     // for example c=global_val; if (c%2) some_value=10; else some_value=20;
>     global_val = global_val + some_value
>     release_spin_lock()
> }
>
> modify_val_atomic()
> {
>     do
>         // calculate some_value based on global_val
>         // for example c=global_val; if (c%2) some_value=10; else some_value=20;
>         global_val = global_val + some_value
>     while (compare_and_store(global_val, , ))
> }
>
> What's the difference? The deal is if two processes execute this code
> simultaneously and one gets interrupted in the middle of modify_val_spin,
> then the other wastes its entire quantum spinning for the lock. In the
> modify_val_atomic if one process gets interrupted, no problem, the other
> process can proceed through, then when the first one runs again the CAS
> will fail, and it will go around the loop again. Now imagine it was the
> kernel involved...

Just prevent that with spin_lock_irq. But anyway this example doesn't
fit the ltt code. cmpxchg loops can make lots of sense for such simple
loops, but as soon as you need to do significant work in the loop it
starts to get problematic. Your example would btw be better off using
atomic_t and its primitives so you abstract away the actual
implementation and the architecture can choose the most efficient
implementation.
Re: 2.6.11-rc1-mm1
Robert Wisniewski <[EMAIL PROTECTED]> wrote:
>
> modify_val_spin()
> {
>     acquire_spin_lock()
>     // calculate some_value based on global_val
>     // for example c=global_val; if (c%2) some_value=10; else some_value=20;
>     global_val = global_val + some_value
>     release_spin_lock()
> }
>
> modify_val_atomic()
> {
>     do
>         // calculate some_value based on global_val
>         // for example c=global_val; if (c%2) some_value=10; else some_value=20;
>         global_val = global_val + some_value
>     while (compare_and_store(global_val, , ))
> }
>
> What's the difference? The deal is if two processes execute this code
> simultaneously and one gets interrupted in the middle of modify_val_spin,
> then the other wastes its entire quantum spinning for the lock. In the
> modify_val_atomic if one process gets interrupted, no problem, the other
> process can proceed through, then when the first one runs again the CAS
> will fail, and it will go around the loop again.

One could use spin_lock_irq(). The performance would be similar.

> Now imagine it was the kernel involved...

Or are you saying that userspace does the above as well?
Re: 2.6.11-rc1-mm1
Karim Yaghmour writes:
>
> Christoph Hellwig wrote:
> > the lockless mode is really just loops around cmpxchg. It's spinlocks
> > reinvented poorly.

Christoph,

Sadly they're not the same; atomic operations provide a set of
functionality that simple spin locks do not give you. Consider two
different processes each executing the following code:

int global_val;

modify_val_spin()
{
    acquire_spin_lock()
    // calculate some_value based on global_val
    // for example c=global_val; if (c%2) some_value=10; else some_value=20;
    global_val = global_val + some_value
    release_spin_lock()
}

modify_val_atomic()
{
    do
        // calculate some_value based on global_val
        // for example c=global_val; if (c%2) some_value=10; else some_value=20;
        global_val = global_val + some_value
    while (compare_and_store(global_val, , ))
}

What's the difference? The deal is if two processes execute this code
simultaneously and one gets interrupted in the middle of
modify_val_spin, then the other wastes its entire quantum spinning for
the lock. In modify_val_atomic, if one process gets interrupted, no
problem: the other process can proceed through, then when the first one
runs again the CAS will fail, and it will go around the loop again. Now
imagine it was the kernel involved...

I don't claim to have all the answers and am happy to have a discussion
on something, but the attitude expressed by "It's spinlocks reinvented
poorly." is not conducive to a useful exchange even if you were correct.

> I beg to differ. You have to use different spinlocks depending on
> where you are:
> - serving user-space
> - bh-derivatives
> - irq
>
> lockless is the same primitive regardless of your current state,
> it's not the same as spinlocks.
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[EMAIL PROTECTED]
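For readers who want something compilable next to the pseudocode in the message above, here is a userspace sketch using C11 `<stdatomic.h>`. It is an illustration under assumptions, not the LTT/relayfs code and not the kernel's spinlock or cmpxchg API: `atomic_flag` stands in for a spin lock, `atomic_compare_exchange_weak` stands in for `compare_and_store`, a separate `locked_val` is used for the locked variant, and the odd/even rule for `some_value` is only the toy computation from the example.

```c
#include <stdatomic.h>

static atomic_int  global_val;                 /* updated lock-free */
static atomic_flag val_lock = ATOMIC_FLAG_INIT;
static int         locked_val;                 /* updated under val_lock */

/* Spin-lock variant: a waiter makes no progress while the holder is stalled. */
static int modify_val_spin(void)
{
    int result;

    while (atomic_flag_test_and_set_explicit(&val_lock, memory_order_acquire))
        ;   /* busy-wait until the current holder clears the flag */
    locked_val += (locked_val % 2) ? 10 : 20;
    result = locked_val;
    atomic_flag_clear_explicit(&val_lock, memory_order_release);
    return result;
}

/* CAS variant: compute from a snapshot and retry if someone else got in first. */
static int modify_val_atomic(void)
{
    int old = atomic_load(&global_val);
    int new_val;

    do {
        int some_value = (old % 2) ? 10 : 20;
        new_val = old + some_value;
        /* On failure, `old` is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(&global_val, &old, new_val));

    return new_val;
}
```

The property Wisniewski describes is visible in the structure: in the CAS variant no thread ever waits on another thread's progress, it only redoes its own small computation. Christoph's caveat from the thread also applies, since everything inside the retry loop is repeated on contention and therefore has to stay short.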
Re: 2.6.11-rc1-mm1
Christoph Hellwig writes:
> On Fri, Jan 14, 2005 at 04:11:38PM -0500, Karim Yaghmour wrote:
> >   Where does this appear in relayfs and what rights do
> >   user-space apps have over it (rwx).
>
> Why would you want anything but read access?

This would allow an application to write trace events of its own to a
trace stream, for instance. Also, I added a user-requested 'feature'
whereby write()s on a relayfs channel would be sent to a callback that
could be used to interpret 'out-of-band' commands sent from the
userspace application. And if lockless logging were being used, this
could provide a cheaper way for applications to write to the trace
buffer than having to do it via syscall.

> > bufsize, nbufs:
> >   Usually things have to be subdivided in sub-buffers to make
> >   both writing and reading simple. LTT uses this to allow,
> >   among other things, random trace access.
>
> I think random access is overkill. Keeping the code simple is more
> important and user-space can post-process it.
>
> > resize_min, resize_max:
> >   Allow for dynamic resizing of buffer.
>
> Auto-resizing sounds like a really bad idea.

It also doesn't seem to be really useful to anyone, so we should
probably remove it.

Tom

> > init_buf, init_buf_size:
> >   Is there an initial buffer containing some data that should
> >   be used to initialize the channel's content. If you're doing
> >   init-time tracing, for example, you need to have a pre-allocated
> >   static buffer that is copied to relayfs once relayfs is mounted.
>
> And why can't you do this from that code? It just needs an initcall-like
> thing that runs after mounting of relayfs.
>

--
Regards,

Tom Zanussi <[EMAIL PROTECTED]>
IBM Linux Technology Center/RAS
Re: 2.6.11-rc1-mm1
Christoph Hellwig wrote:
> the lockless mode is really just loops around cmpxchg. It's spinlocks
> reinvented poorly.

I beg to differ. You have to use different spinlocks depending on where you are:
- serving user-space
- bh-derivatives
- irq

Lockless is the same primitive regardless of your current state; it's not the same as spinlocks.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
Hello Christoph,

Christoph Hellwig wrote:
> Why would you want anything but read access?

Fine, we can make it read-only and drop the "mode" field.

> I think random access is overkill. Keeping the code simple is more
> important and user-space can post-process it.

It's overkill if you're thinking in terms of KBs or MBs of data. It isn't if you're looking at GBs and 100GBs. Please read my other posting as to who is using this and how. But regardless of access, you have to have some way of telling relayfs the size of the channel you want. bufsize and nbufs just tell relayfs the size of the buffers you want and how many buffers there are in the ring, both of which are really basic to any sort of buffering scheme.

> Auto-resizing sounds like a really bad idea.

Ok, it will go.

> And why can't you do this from that code? It just needs an initcall-like
> thing that runs after mounting of relayfs.

Ok, we'll leave it to the caller to do a relay_write() with his init-bufs at startup.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
Joseph Fannin wrote:
>> With this patch, initrds seem to get 'skipped'. I think this is
>> probably the cause for the reports of problems with RAID too.

On Sun, Jan 16, 2005 at 07:09:31PM +, Daniel Drake wrote:
> This seems likely and is probably also the cause of wli's problems
> mentioned elsewhere in this thread.
> I had overlooked the way that initrd's work in that part of the boot
> sequence. Will investigate.

akpm suspected this immediately, and my tests confirmed it. I should probably do the work to make the box boot with CONFIG_MODULES=n as I don't like initrd's or modules anyway (new points of failure).

-- wli
Re: 2.6.11-rc1-mm1
Hi,

Joseph Fannin wrote:
> On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
> >
> > waiting-10s-before-mounting-root-filesystem.patch
> >   retry mounting the root filesystem at boot time
>
> With this patch, initrds seem to get 'skipped'. I think this is
> probably the cause for the reports of problems with RAID too.

This patch should do the job. Replaces the existing waiting-10s-before-mounting-root-filesystem.patch in 2.6.11-rc1-mm1.

Daniel

Retry up to 20 times if mounting the root device fails. This fixes booting from usb-storage devices, which no longer make their partitions immediately available. Also cleans up the mount_block_root() function.

Based on an earlier patch from William Park <[EMAIL PROTECTED]>

Signed-off-by: Daniel Drake <[EMAIL PROTECTED]>

--- linux-2.6.10/init/do_mounts.c.orig 2005-01-16 19:18:57.0 +
+++ linux-2.6.10/init/do_mounts.c 2005-01-16 21:04:29.198471440 +
@@ -6,6 +6,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -261,6 +262,9 @@ static void __init get_fs_names(char *pa
 static int __init do_mount_root(char *name, char *fs, int flags, void *data)
 {
 	int err = sys_mount(name, "/root", fs, flags, data);
+	if (err == -EACCES && (flags | MS_RDONLY) == 0)
+		err = sys_mount(name, "/root", fs, flags | MS_RDONLY, data);
+
 	if (err)
 		return err;
 
@@ -273,38 +277,57 @@ static int __init do_mount_root(char *na
 	return 0;
 }
 
+static int __init mount_root_try_all_fs(char *name, char *fs_names, int flags, void *data)
+{
+	char *p;
+	int err = -EFAULT;
+
+	for (p = fs_names; *p; p += strlen(p)+1) {
+		err = do_mount_root(name, p, flags, root_mount_data);
+		if (err != -EINVAL)
+			break;
+	}
+
+	return err;
+}
+
 void __init mount_block_root(char *name, int flags)
 {
 	char *fs_names = __getname();
-	char *p;
 	char b[BDEVNAME_SIZE];
+	int tryagain = 20;
 
 	get_fs_names(fs_names);
-retry:
-	for (p = fs_names; *p; p += strlen(p)+1) {
-		int err = do_mount_root(name, p, flags, root_mount_data);
-		switch (err) {
-			case 0:
-				goto out;
-			case -EACCES:
-				flags |= MS_RDONLY;
-				goto retry;
-			case -EINVAL:
-				continue;
+
+	while (1) {
+		int err = mount_root_try_all_fs(name, fs_names, flags, root_mount_data);
+		if (err == 0)
+			break;
+
+		/*
+		 * The root device may not be ready yet, so we retry a number of times
+		 */
+		if (--tryagain) {
+			printk(KERN_WARNING "VFS: Waiting %dsec for root device...\n",
+				tryagain);
+			ssleep(1);
+			if (!ROOT_DEV) {
+				ROOT_DEV = name_to_dev_t(saved_root_name);
+				create_dev(name, ROOT_DEV, root_device_name);
+			}
+			continue;
 		}
-		/*
+
+		/*
 		 * Allow the user to distinguish between failed sys_open
 		 * and bad superblock on root device.
 		 */
 		__bdevname(ROOT_DEV, b);
-		printk("VFS: Cannot open root device \"%s\" or %s\n",
-			root_device_name, b);
-		printk("Please append a correct \"root=\" boot option\n");
-
+		printk(KERN_CRIT "VFS: Cannot open root device \"%s\" or %s\n",
+			root_device_name, b);
+		printk(KERN_CRIT "Please append a correct \"root=\" boot option\n");
 		panic("VFS: Unable to mount root fs on %s", b);
 	}
-	panic("VFS: Unable to mount root fs on %s", __bdevname(ROOT_DEV, b));
-out:
 	putname(fs_names);
 }
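One detail in the do_mount_root() hunk above may be worth a second look: the new test is written as (flags | MS_RDONLY) == 0. OR-ing in a non-zero bit can never produce zero, so as posted the read-only retry would apparently never run, whereas the code it replaces retried unconditionally on -EACCES. The sketch below is plain userspace C, not kernel code. DEMO_MS_RDONLY and both helper names are made up for illustration, and the value 1 is assumed to match the kernel's MS_RDONLY.

```c
#include <stdbool.h>

/* Illustrative stand-in for the kernel's MS_RDONLY (assumed to be 1). */
#define DEMO_MS_RDONLY 1UL

/* The test as written in the posted patch: OR with a non-zero bit is never 0. */
bool demo_rdonly_clear_or(unsigned long flags)
{
	return (flags | DEMO_MS_RDONLY) == 0;
}

/* The conventional bit test: true only when the read-only bit is not set. */
bool demo_rdonly_clear_and(unsigned long flags)
{
	return (flags & DEMO_MS_RDONLY) == 0;
}
```

With the AND form, a read-write mount attempt (read-only bit clear) that fails with -EACCES would get one read-only retry, which seems to be what the hunk intends.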
Re: 2.6.11-rc1-mm1
Karim Yaghmour writes:
>
> What I'm dropping for now is all the functions that allow a
> subsystem to read from a channel from within the kernel. So,
> for example, if you want to obtain large amounts of data from
> user-space via a relayfs channel you won't be able to. Here
> are the functions that would go:
>
> rchan_reader *add_rchan_reader(channel_id, auto_consume)
> int remove_rchan_reader(rchan_reader *reader)
> rchan_reader *add_map_reader(channel_id)
> int remove_map_reader(rchan_reader *reader)
> int relay_read(reader, buf, count, wait, *actual_read_offset)
> void relay_buffers_consumed(reader, buffers_consumed)
> void relay_bytes_consumed(reader, bytes_consumed, read_offset)
> int relay_bytes_avail(reader)
> int rchan_full(reader)
> int rchan_empty(reader)
>
> We could add these at a later time when/if needed. Removing
> these changes nothing for ltt.

One of the things that uses these functions to read from a channel from within the kernel is the relayfs code that implements read(2), so taking them away means you wouldn't be able to use read() on a relayfs file. That wouldn't matter for ltt since it mmaps the file, but there are existing users of relayfs that do use relayfs this way. In fact, most of the bug reports I've gotten are from people using it in this mode. That doesn't mean though that it's necessarily the right thing for relayfs or these users to be doing if they have suitable alternatives for passing lower-volume messages in this way.

As others have mentioned, that seems to be the major question: should relayfs concentrate on being solely a high-speed data relay mechanism, or should it try to be more, as it currently is implemented? If the former, then I wonder if you need a filesystem at all. All you have is a collection of mmappable buffers, and the only thing the filesystem provides is the namespace. Removing read()/write() and filesystem support would of course greatly simplify the code; I'd like to hear from any existing users though and see what they'd be missing.

ltt would still need at least relay_buffers_consumed() though. This is used to support the 'no-overwrite' option, which means that when the buffers are full, i.e. the daemon has fallen behind and needs to catch up, channel writing is 'suspended' until it catches up.

> Also, we should try to get rid of the following. They are there
> for allowing dynamically-resizable buffers, but if we are to
> make buffer-management opaque, then this should be done
> internally (Tom: I can't remember the rationale for these. Let
> me know if there's a reason why the must be kept.)
>
> int relay_realloc_buffer(*rchan, nbufs, async)
> int relay_replace_buffer(*rchan)

relay_realloc_buffer actually does the work of allocating the new buffer space used for resizing, and since it can sleep, it's done in the background using a work queue. When everything's ready, the channel buffer can then be replaced, thus relay_replace_buffer(). The only user of channel resizing that I know of is the 'dynamically resizeable printk replacement' I posted a while back, and that apparently doesn't have any users, so I'd be happy to get rid of all the resizing code.

Tom

> I think this is a pretty major change and simplification of the
> API along the lines of what others have asked for. Let me know
> what you think.
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546

--
Regards,
Tom Zanussi <[EMAIL PROTECTED]>
IBM Linux Technology Center/RAS
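The 'no-overwrite' behaviour Tom describes above can be sketched with two counters over a ring of sub-buffers. This is a userspace illustration under stated assumptions, not relayfs code: struct demo_channel and the demo_* functions are hypothetical names, and demo_buffers_consumed() only mimics the role relay_buffers_consumed() is said to play in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical channel bookkeeping; not the relayfs rchan structure. */
struct demo_channel {
	size_t nbufs;     /* number of sub-buffers in the ring */
	size_t produced;  /* sub-buffers filled by the writer so far */
	size_t consumed;  /* sub-buffers the reader has reported as consumed */
};

/* The writer may move on only while at least one sub-buffer is free. */
bool demo_can_switch_buffer(const struct demo_channel *ch)
{
	return ch->produced - ch->consumed < ch->nbufs;
}

/*
 * Called when the writer wants to fill the next sub-buffer.
 * Returns false when the ring is full, i.e. writing is 'suspended'.
 */
bool demo_buffer_filled(struct demo_channel *ch)
{
	if (!demo_can_switch_buffer(ch))
		return false;
	ch->produced++;
	return true;
}

/* Called on behalf of the reader (the daemon) once it has caught up by n. */
void demo_buffers_consumed(struct demo_channel *ch, size_t n)
{
	size_t outstanding = ch->produced - ch->consumed;

	/* Never account for more sub-buffers than were actually produced. */
	ch->consumed += (n > outstanding) ? outstanding : n;
}
```

With nbufs = 2, two fills succeed, a third is refused, and one call to demo_buffers_consumed(&ch, 1) lets the writer proceed again. That is the reason a consumer-side notification has to survive even if the rest of the in-kernel read API goes.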
Re: 2.6.11-rc1-mm1
Hi,

On Sun, 16 Jan 2005, Karim Yaghmour wrote:
> The per-cpu buffering issue is really specific to the client. It just
> so happens that LTT creates one channel for each CPU. Not everyone
> who needs to ship lots of data to user-space needs/wants one channel
> per cpu. You could, for example, use a relayfs channel as a big
> chunk of memory visible to both a user-space app and its kernel buddy
> in order to exchange data without ever using either needing more
> than one such channel for your entire subsystem.

It seems we first need to specify what relayfs is actually supposed to be. Is it a relaying mechanism for large amounts of data from kernel to user space, or is it a general communication channel between kernel and user space? You have to choose one. If you mix contradicting requirements, you'll never get a simple abstraction layer and relayfs will always be a pain to work with.

> > Why not just move the ltt buffer management into relayfs and provide a
> > small library, which extracts the event stream again? Otherwise you have
> > to duplicate this work for every serious relayfs user anyway.
>
> Ok, I've been meditating over what you say above for some time in order
> to understand how best to follow what you are suggesting. So here's
> what I've been able to come up with. Let me know if you have other
> suggestions:
>
> Drop the buffer-start/end callbacks altogether. Instead, allow user
> to specify in the channel properties whether they want to have
> sub-buffer delimiters. If so, relayfs would automatically prepend
> and append the structures currently written by ltt:
> /* Start of trace buffer information */
> typedef struct _ltt_buffer_start {
> 	struct timeval time;	/* Time stamp of this buffer */
> 	u32 tsc;		/* TSC of this buffer, if applicable */
> 	u32 id;			/* Unique buffer ID */
> } LTT_PACKED_STRUCT ltt_buffer_start;
>
> /* End of trace buffer information */
> typedef struct _ltt_buffer_end {
> 	struct timeval time;	/* Time stamp of this buffer */
> 	u32 tsc;		/* TSC of this buffer, if applicable */
> } LTT_PACKED_STRUCT ltt_buffer_end;

You can make it even simpler by dropping this completely. Every buffer is simply a list of events, and you can let ltt periodically write a timer event. In userspace you can randomly seek at buffer boundaries and search for the timer events. It will require a bit more work for userspace, but even a large amount of tracing data stays manageable.

> As for lockless vs. locking there is a need for both. Not having
> to get locks has obvious advantages, but if you require strict
> timing you will want to use the locking scheme because its logging
> time is linear (see Thomas' complaints about lockless elsewhere
> in this thread, and Ingo's complaints about relayfs somewhere back
> in October.)

But why does it have to be done in relayfs? Simply leave it to the user to write an extra id field:

	event_id = atomic_inc_return(&event_cnt);

Userspace can then easily restore the original order of events.

bye, Roman
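Roman's suggestion can be sketched in userspace C as follows: each event takes its id from one shared atomic counter when it is logged, and the reader sorts by that id afterwards to recover the global order across per-CPU buffers. C11 atomic_fetch_add stands in for the kernel's atomic_inc_return here; struct demo_event and the demo_* functions are hypothetical, and counter wraparound is ignored for brevity.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical event record carrying the extra id field. */
struct demo_event {
	unsigned int id;  /* global sequence number taken at log time */
	int cpu;          /* which per-CPU buffer the event came from */
};

static atomic_uint demo_event_cnt;

/* Mirrors atomic_inc_return(): increment, then return the new value. */
unsigned int demo_next_event_id(void)
{
	return atomic_fetch_add(&demo_event_cnt, 1) + 1;
}

/* qsort comparator ordering events by ascending id. */
static int demo_cmp_id(const void *a, const void *b)
{
	const struct demo_event *x = a;
	const struct demo_event *y = b;

	return (x->id > y->id) - (x->id < y->id);
}

/* Userspace side: merge events from all buffers back into logging order. */
void demo_restore_order(struct demo_event *ev, size_t n)
{
	qsort(ev, n, sizeof(*ev), demo_cmp_id);
}
```

The trade-off is that every CPU now touches one shared counter on the logging path, which is presumably the kind of cache-line sharing the per-CPU buffer argument elsewhere in the thread is trying to avoid.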
Re: 2.6.11-rc1-mm1
Joseph Fannin wrote:
> On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/
> >
> > waiting-10s-before-mounting-root-filesystem.patch
> >   retry mounting the root filesystem at boot time
>
> With this patch, initrds seem to get 'skipped'. I think this is
> probably the cause for the reports of problems with RAID too.

This seems likely and is probably also the cause of wli's problems mentioned elsewhere in this thread. I had overlooked the way that initrd's work in that part of the boot sequence. Will investigate.

Daniel
Re: 2.6.11-rc1-mm1
On Fri, Jan 14, 2005 at 06:09:23PM -0500, Karim Yaghmour wrote:
> relayfs implements two schemes: lockless and locking. The later uses
> standard linear locking mechanisms. If you need stringent constant
> time, you know what to do.

The lockless mode is really just loops around cmpxchg. It's spinlocks reinvented poorly.
Re: 2.6.11-rc1-mm1
On Sat, Jan 15, 2005 at 01:24:16AM +0100, Thomas Gleixner wrote:
> Putting a 200k patch into the kernel for limited usage and maybe
> restricting a generic simple non intrusive and more generic
> implementation by its mere presence is making it inapplicable enough.
>
> Merge the instrumentation points from ltt and other projects like DSKI
> and the places where in kernel instrumentation for specific purposes is
> already available and use a simple and effective framework which moves
> the burden into postprocessing and provides a simple postmortem dump
> interface, is the goal IMHO.
>
> When this is available, trace tool developers can concentrate on
> postprocessing improvement rather than moving postprocessing
> incapabilities into the kernel.

I completely agree with that statement. We've been working in most areas of the kernel to move or keep complexity and policy in userspace. The same should be true for a tracing framework.
Re: 2.6.11-rc1-mm1
On Fri, Jan 14, 2005 at 04:11:38PM -0500, Karim Yaghmour wrote:
> Where does this appear in relayfs and what rights do
> user-space apps have over it (rwx).

Why would you want anything but read access?

> bufsize, nbufs:
> Usually things have to be subdivided in sub-buffers to make
> both writing and reading simple. LTT uses this to allow,
> among other things, random trace access.

I think random access is overkill. Keeping the code simple is more important and user-space can post-process it.

> resize_min, resize_max:
> Allow for dynamic resizing of buffer.

Auto-resizing sounds like a really bad idea.

> init_buf, init_buf_size:
> Is there an initial buffer containing some data that should
> be used to initialize the channel's content. If you're doing
> init-time tracing, for example, you need to have a pre-allocated
> static buffer that is copied to relayfs once relayfs is mounted.

And why can't you do this from that code? It just needs an initcall-like thing that runs after mounting of relayfs.
Re: 2.6.11-rc1-mm1
Karim Yaghmour writes:
>
> Hello Thomas,
>
> In the interest of avoiding expanding the thread too thin, I'm replying to
> both emails in the same time.
>
> Thomas Gleixner wrote:
> >>relayfs is a generalized buffering mechanism. Tracing is one application
> >>it serves. Check out the web site: "high-speed data-relay filesystem."
> >>Fancy name huh ...
> >
> > I do not doubt that.
> >
> > But hardwiring an instrumentation framework on it is also hardwiring
> > implicit restrictions on the usability of the instrumentation for
> > certain purposes.
>
> To a certain extent this is true. Please refer to my reply to your RFC
> for a discussion of this.
>
> >>Well for one thing, a portion of code running in user-context won't
> >>disable interrupts while it's attempting to get buffer space, and
> >>therefore won't impact on interrupt delivery.
> >
> > The do {} while loops are in the fast ltt_log_event path

As Greg's comments implicitly involved this issue as well, maybe it's worth expanding on what is going on here. The idea behind the lockless tracing is for each process/thread to atomically reserve space in the buffer, then write in the events. Also note that buffers are per-processor. So the do {} while loop loads the current index, does a calculation, and attempts to use the calculated value (which is the old index + length of the current event) to atomically compare_and_swap with the actual index pointer. As Karim correctly notes, the only way this will fail is if an interrupt occurred during the couple of instructions of calculation, i.e., between when the old value was loaded and when we do the CAS. So it's unlikely, and it's even less likely that, as he notes, this process would be woken up only for a couple of instructions and re-interrupted.

Back to Greg's volatile issue: the reason the index needs to be volatile (or, as was originally coded, the reason we clobbered the registers) is to make sure the compiler knows the index value needs to get reloaded from memory each time around the loop.

Hope this helps. I'm certainly happy to discuss at more length if there are any concerns/questions.

-bob

Robert Wisniewski
The K42 MP OS Project
http://www.research.ibm.com/K42/
[EMAIL PROTECTED]

>
> You mean that it would impact on interrupt deliver? This code's behavior
> has actually been carefully studied, and what has been seen is that
> there code almost never loops, and when it does, it very rarely does
> it more than twice. In the case of an interrupt, you'd have to receive
> an interrupt while reserving space for logging a current's interrupt
> occurrence for the loop to be done twice. I've CC'ed Bob Wisniewski
> on this as he's the one that implemented this code and studied its
> behavior in depth.
>
> > Yeah, did you answer one of my arguments except claiming that I'm to
> > stupid to understand how it works ?
>
> If I miss-spoke, then I appologize. For one thing, I've never thought
> of you as stupid. I'm just trying to get specifics here.
>
> > I just dont like the idea, that instrumentation is bound on relayfs and
> > adds a feature to the kernel which fits for a restricted set of problems
> > rather than providing a generic optimized instrumentation framework,
> > where one can use relayfs as a backend, if it fits his needs. Making
> > this less glued together leaves the possibility to use other backends.
>
> Yes, I understand and I hope my other mail properly addresses this issue.
>
> > There is a loop in ltt_log_event, which enforces the processing of each
> > event twice. Spliting traces is postprocessing and can be done
> > elsewhere.
>
> Sorry, this is not postprocessing. Let me explain:
>
> Basically, the ltt framework allows only one tracing session to be active
> at all times. IOW, if you were planning on starting a 2 week trace and
> after doing so wanted to trace a short 10s on an application then you are
> screwed, LTT won't allow you to do that. Currently this is a limitation
> which we haven't heard any complaints about, so we're not going to
> generalize it until there is proof that people really need this.
>
> However, there are cases where you want to have tracing running at _all_
> times in what is refered to as flight-recorder mode and only dump the
> content of the buffers when something special happens. Yet, those who
> are interested in having this 24x7 mode also know enough about tracing
> that they do need to actually trace other things for short periods
> without disrupting their flight-recording. That's why there's a loop.
> An event will be processed twice only if you're tracing AND flight-
> recording in the same time.
>
> There is no way to do an equivalent of what I just described with any
> form of postprocessing.
>
> Here's the proper snippet from include/linux/ltt-events.h:
> /* We currently support 2 traces, normal trace and flight recorder */
> #define NR_TRACES
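The reservation loop Bob describes in this message can be sketched in userspace C as below. This is only an illustration of the load / compute / compare-and-swap pattern, not the LTT or relayfs code: C11 atomics stand in for the kernel's cmpxchg, and struct demo_buf, DEMO_BUF_SIZE and demo_reserve() are made-up names.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_BUF_SIZE 64

/* Hypothetical per-CPU buffer: a write index plus the storage itself. */
struct demo_buf {
	_Atomic size_t index;       /* next free offset in data[] */
	char data[DEMO_BUF_SIZE];
};

/*
 * Atomically reserve len bytes. Returns the start offset of the reserved
 * slot, or (size_t)-1 when the buffer has no room left.
 */
size_t demo_reserve(struct demo_buf *b, size_t len)
{
	size_t old_idx = atomic_load(&b->index);
	size_t new_idx;

	do {
		if (len > DEMO_BUF_SIZE - old_idx)
			return (size_t)-1;
		new_idx = old_idx + len;
		/*
		 * On failure the compare-exchange stores the current index
		 * into old_idx, so each retry works from a fresh value.
		 */
	} while (!atomic_compare_exchange_weak(&b->index, &old_idx, new_idx));

	return old_idx;
}
```

The failed compare-exchange writing the current index back into old_idx plays the role of the reload that the volatile qualifier (or register clobber) is there to guarantee in the kernel version. Once the slot is reserved, the caller can copy its event into data[offset .. offset+len) without further synchronisation on the index.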
Re: 2.6.11-rc1-mm1 waiting-10s-before-mounting-root-....
Daniel Kirsten <[EMAIL PROTECTED]> writes:
>> Are you using an initrd?
> yes.

Then read Documentation/initrd.txt ... Your initrd must be of the deprecated kind. I guess you have to use root=/dev/whatever/your_final_root_fs with it, while it should be root=/dev/ram0. (Pretty sure it doesn't use pivot_root either :) )

FYI it works here with an updated initrd without reversing a patch...
Re: Breakage with raid in 2.6.11-rc1-mm1 [Regression in mm]
Hi,

Reuben Farrelly wrote:
> At 12:58 a.m. 15/01/2005, Andrew Morton wrote:
> > Reuben Farrelly <[EMAIL PROTECTED]> wrote:
> > >
> > > Something seems to have broken with 2.6.11-rc1-mm1, which worked ok with
> > > 2.6.10-mm3.
> > >
> > > NET: Registered protocol family 17
> > > Starting balanced_irq
> > > BIOS EDD facility v0.16 2004-Jun-25, 2 devices found
> > > md: Autodetecting RAID arrays.
> > > md: autorun ...
> > > md: ... autorun DONE.
> > > Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
> > >
> > > The system is running 5 RAID-1 partitions, and md2 is the root as per
> > > grub.conf. Problem seems to be that raid autodetection finds no raid
> > > partitions :(
> > >
> > > The two ST380013AS SATA drives are detected earlier in the boot, so I don't
> > > think that's the problem..
> >
> > hm, the only raidy thing we have in there is the below. Maybe you could
> > try reverting that?
> >
> > --- 25/drivers/md/raid5.c~raid5-overlapping-read-hack 2005-01-09 22:20:40.211246912 -0800
> > +++ 25-akpm/drivers/md/raid5.c 2005-01-09 22:20:40.216246152 -0800
> > @@ -232,6 +232,7 @@ static struct stripe_head *__find_stripe
> >  }
> >
> >  static void unplug_slaves(mddev_t *mddev);
> > +static void raid5_unplug_device(request_queue_t *q);
> >
> >  static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector, int pd_idx, int noblock)
>
> Ok the breakage occurred somewhere between 2.6.10-mm3 (works) and
> 2.6.11-rc1 (doesn't work) ie wasn't introduced into the latest -mm patchset
> as I first thought. Are there any other patches that might be worth a try
> backing out?
>
> reuben

I did a full untar of the source and rebuilt my (crusty old) config file from scratch, and it seems to have come right now. Can't really explain it though... but obviously it wasn't a problem with the -mm release as I first thought.

Now running -rc1-mm1 with no problems and no other patches. Thanks to those who helped on what turned out to be a false alarm.

reuben
Re: 2.6.11-rc1-mm1
Hello Roman,

Roman Zippel wrote:
> It's interesting to read more about ltt's requirements, but I still think
> it's possible to leave this work to the relayfs layer.

Ok, I'm willing to play ball, but can you be a little bit more specific?

> Why not just move the ltt buffer management into relayfs and provide a
> small library, which extracts the event stream again? Otherwise you have
> to duplicate this work for every serious relayfs user anyway.

Ok, I've been meditating over what you say above for some time in order to understand how best to follow what you are suggesting. So here's what I've been able to come up with. Let me know if you have other suggestions:

Drop the buffer-start/end callbacks altogether. Instead, allow the user to specify in the channel properties whether they want to have sub-buffer delimiters. If so, relayfs would automatically prepend and append the structures currently written by ltt:

	/* Start of trace buffer information */
	typedef struct _ltt_buffer_start {
		struct timeval time;	/* Time stamp of this buffer */
		u32 tsc;		/* TSC of this buffer, if applicable */
		u32 id;			/* Unique buffer ID */
	} LTT_PACKED_STRUCT ltt_buffer_start;

	/* End of trace buffer information */
	typedef struct _ltt_buffer_end {
		struct timeval time;	/* Time stamp of this buffer */
		u32 tsc;		/* TSC of this buffer, if applicable */
	} LTT_PACKED_STRUCT ltt_buffer_end;

This would also allow dropping the start_reserve, end_reserve, and channel_start_reserve. The latter can be added by ltt as its first event.

Is this what you are looking for, and is there something else we should be doing?

> Completely abstracting the buffer management would the make whole
> interface simpler and it would be a lot easier to change without breaking
> everything. E.g. it would be possible to use per cpu buffers and remove
> the need for different locking mechanisms, for a good tracing mechanism
> it's not just important that it's lockless, but also that the cpus don't
> share cache lines in the fast path. In this regard relayfs/ltt has really
> still too much overhead and the complex relayfs API isn't really making it
> easy to fix this.

The per-cpu buffering issue is really specific to the client. It just so happens that LTT creates one channel for each CPU. Not everyone who needs to ship lots of data to user-space needs/wants one channel per cpu. You could, for example, use a relayfs channel as a big chunk of memory visible to both a user-space app and its kernel buddy in order to exchange data without either ever needing more than one such channel for your entire subsystem.

As for lockless vs. locking, there is a need for both. Not having to get locks has obvious advantages, but if you require strict timing you will want to use the locking scheme because its logging time is linear (see Thomas' complaints about lockless elsewhere in this thread, and Ingo's complaints about relayfs somewhere back in October.)

But in trying to make things simpler, here's a reworked API:

	rchan* relay_open(channel_path, mode, bufsize, nbufs);
	int    relay_close(*rchan);
	int    relay_reset(*rchan)
	int    relay_write(*rchan, *data_ptr, count, **wrote-pos);
	int    relay_info(*rchan, *channel_info)
	void   relay_set_property(*rchan, property, value);
	void   relay_get_property(*rchan, property, *value);

For direct writing (currently already used by ltt, for example):

	char*  relay_reserve(*rchan, len, *ts, *td, *err, *interrupting)
	void   relay_commit(*rchan, *from, len, reserve_code, interrupting);

These are the related macros:

	#define relay_write_direct(DEST, SRC, SIZE) \
	#define relay_lock_channel(RCHAN, FLAGS) \
	#define relay_unlock_channel(RCHAN, FLAGS) \

As I hinted elsewhere, we would now have three modes for relayfs channels:
- locking => relies on local_irq_save.
- lockless => relies on try_reserve/fail->retry (based on cmpxchg).
- kdebug => this is for kernel debugging.

The last one could be based on Ingo's tracing code, or any implementation suggestions by Thomas. It wouldn't do all the checks and provide all the capabilities of the other two mechanisms, but would really be a hot-path logger with only minimalistic provisions for content loss and other such things.

(Note to Tom: time_delta_offset that used to be in relay_write should be a property set using relay_set_property.)

What I'm dropping for now is all the functions that allow a subsystem to read from a channel from within the kernel. So, for example, if you want to obtain large amounts of data from user-space via a relayfs channel you won't be able to. Here are the functions that would go:

	rchan_reader *add_rchan_reader(channel_id, auto_consume)
	int remove_rchan_reader(rchan_reader *reader)
	rchan_reader *add_map_reader(channel_id)
	int remove_map_reader(rchan_reader *reader)
	int relay_read(reader, buf, count, wait, *actual_read_offset)
	void relay_buffers_consumed
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
Hello Roman,

Roman Zippel wrote:
> On Sat, 15 Jan 2005, Karim Yaghmour wrote:
>>In addition, and this is a very important issue, quite a few
>>kernel developers mistook LTT for a kernel debugging tool, which
>>it was never meant to be. When, in fact, if you ask those who have
>>looked at using it for that purpose (try Marcelo or Andrea) you will
>>see that they didn't find it to be appropriate for them. And
>>rightly so, it was never meant for that purpose. Even lately, when
>>I suggested Ingo try using relayfs instead of his custom tracing
>>code for his preemption work, he looked at it and said that it
>>wasn't suited, but would consider reusing parts of it if it were
>>in the kernel.
>
> Well, that's really a core problem. We don't want to duplicate
> infrastructure, which practically does the same. So if relayfs isn't
> usable in this kind of situation, it really raises the question whether
> relayfs is usable at all. We need to make relayfs generally usable,
> otherwise it will join the fate of devfs.

Hmm, coming from you I will take this as a pretty strong endorsement for what I was suggesting earlier: provide a basic buffering mode in relayfs to be used in kernel debugging. However, it must be understood that this is separate from the existing modes and that ltt, for example, could not use such a basic infrastructure.

If this is ok with you, and no one wants to complain too loudly about this, I will go ahead and add this to our to-do list for relayfs.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || [EMAIL PROTECTED] || 1-866-677-4546
Re: 2.6.11-rc1-mm1
Hello Thomas, In the interest of avoiding expanding the thread too thin, I'm replying to both emails in the same time. Thomas Gleixner wrote: >>relayfs is a generalized buffering mechanism. Tracing is one application >>it serves. Check out the web site: "high-speed data-relay filesystem." >>Fancy name huh ... > > > I do not doubt that. > > But hardwiring an instrumentation framework on it is also hardwiring > implicit restrictions on the usability of the instrumentation for > certain purposes. To a certain extent this is true. Please refer to my reply to your RFC for a discussion of this. >>Well for one thing, a portion of code running in user-context won't >>disable interrupts while it's attempting to get buffer space, and >>therefore won't impact on interrupt delivery. > > > The do {} while loops are in the fast ltt_log_event path You mean that it would impact on interrupt deliver? This code's behavior has actually been carefully studied, and what has been seen is that there code almost never loops, and when it does, it very rarely does it more than twice. In the case of an interrupt, you'd have to receive an interrupt while reserving space for logging a current's interrupt occurrence for the loop to be done twice. I've CC'ed Bob Wisniewski on this as he's the one that implemented this code and studied its behavior in depth. > Yeah, did you answer one of my arguments except claiming that I'm to > stupid to understand how it works ? If I miss-spoke, then I appologize. For one thing, I've never thought of you as stupid. I'm just trying to get specifics here. > I just dont like the idea, that instrumentation is bound on relayfs and > adds a feature to the kernel which fits for a restricted set of problems > rather than providing a generic optimized instrumentation framework, > where one can use relayfs as a backend, if it fits his needs. Making > this less glued together leaves the possibility to use other backends. 
Yes, I understand and I hope my other mail properly addresses this issue.

> There is a loop in ltt_log_event, which enforces the processing of each
> event twice. Spliting traces is postprocessing and can be done
> elsewhere.

Sorry, this is not postprocessing. Let me explain: Basically, the ltt
framework allows only one tracing session to be active at any time.
IOW, if you were planning on starting a 2 week trace and after doing so
wanted to trace an application for a short 10s, then you are screwed;
LTT won't allow you to do that. Currently this is a limitation which we
haven't heard any complaints about, so we're not going to generalize it
until there is proof that people really need this.

However, there are cases where you want to have tracing running at _all_
times in what is referred to as flight-recorder mode and only dump the
content of the buffers when something special happens. Yet, those who
are interested in having this 24x7 mode also know enough about tracing
that they do need to actually trace other things for short periods
without disrupting their flight-recording. That's why there's a loop.
An event will be processed twice only if you're tracing AND
flight-recording at the same time. There is no way to do an equivalent
of what I just described with any form of postprocessing.

Here's the proper snippet from include/linux/ltt-events.h:

    /* We currently support 2 traces, normal trace and flight recorder */
    #define NR_TRACES      2
    #define TRACE_HANDLE   0
    #define FLIGHT_HANDLE  1

> In _ltt_log_event lives quite a bunch of if(...) processing decisions
> which have to be evaluated for _each_ event.

Correct, and I'm honest enough with myself to admit that this is the bit
of code that I think needs the most reviewing.
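The two-handle scheme described above can be sketched in user-space C. This is only an illustration of why an event is handled twice when both a normal trace and the flight recorder are active; apart from the three `#define` values taken from the quoted header, every name here (`struct trace_slot`, `traces`, `log_event`) is hypothetical and not LTT's actual code.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_TRACES     2
#define TRACE_HANDLE  0
#define FLIGHT_HANDLE 1

/* Hypothetical per-trace state; the real LTT structures are richer. */
struct trace_slot {
    bool active;          /* is this trace currently recording? */
    unsigned long count;  /* events recorded so far */
};

static struct trace_slot traces[NR_TRACES];

/* Deliver one event to every active trace and return how many traces
 * recorded it. The result is 2 only when both traces are active. */
static int log_event(int event_id)
{
    int recorded = 0;

    (void)event_id; /* a real tracer would copy the event payload */

    for (size_t i = 0; i < NR_TRACES; i++) {
        if (!traces[i].active)
            continue;
        traces[i].count++;
        recorded++;
    }
    return recorded;
}
```

With only the flight recorder active the loop body does real work once per event; the second pass costs one test of `active`.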
So, in order to help you help me, here are the various code snippets and
things I can think of which would help make the code faster/simpler:

Here's the preamble where we make some basic sanity checks:

    if (!trace)
        return -ENOMEDIUM;

    if (trace->paused)
        return -EBUSY;

    tracer_handle = trace->trace_handle;

    if (!trace->flight_recorder && (trace->daemon_task_struct == NULL))
        return -ENODEV;

    channel_handle = trace_channel_handle(tracer_handle, cpu_id);

    if ((trace->tracer_started == 1) || (event_id == LTT_EV_START) || (event_id == LTT_EV_BUFFER_START))
        goto trace_event;

    return -EBUSY;

trace_event:
    if (!ltt_test_bit(event_id, &trace->traced_events))
        return 0;

Basically, unless we've succeeded in all those if's, we're not going to
write anything. I think we could get rid of the first four by simply
maintaining a state-machine for the tracer. Then we could either have a
single if or even use function pointers (though I think this costs more)
to call or not call _ltt_log_event.

As for checking whether the event has a certain ID (EV_START or EV_BUFFER_STAR
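The state-machine idea in the message above might look roughly like the following. This is a sketch under assumptions, not LTT code: the `tracer_state` enum, the event constants and `tracer_should_log` are all hypothetical names, and the state is assumed to be updated only on the slow control paths (start, stop, pause) so that the logging path tests a single value.

```c
#include <stdbool.h>

/* Hypothetical tracer states, changed only by control operations,
 * never from the logging path itself. */
enum tracer_state {
    TRACER_OFF,       /* no trace, or no daemon and no flight recorder */
    TRACER_PAUSED,
    TRACER_STARTING,  /* only start/buffer-start events are accepted */
    TRACER_RUNNING,
};

/* Stand-ins for LTT_EV_START and LTT_EV_BUFFER_START; values are made up. */
enum { EV_START = 0, EV_BUFFER_START = 1, EV_SYSCALL = 2 };

/* Decide whether an event should be written. In the common case
 * (TRACER_RUNNING) this is one comparison plus the event-mask test. */
static bool tracer_should_log(enum tracer_state state, int event_id,
                              unsigned long traced_events)
{
    if (state != TRACER_RUNNING) {
        /* Slow path: everything except the start-up events is dropped. */
        if (state != TRACER_STARTING)
            return false;
        if (event_id != EV_START && event_id != EV_BUFFER_START)
            return false;
    }
    return (traced_events >> event_id) & 1UL;
}
```

The design choice is to move the pointer, pause and daemon checks out of the per-event path and into whichever code changes the tracer's state.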
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
Hi,

On Sat, 15 Jan 2005, Karim Yaghmour wrote:

> In addition, and this is a very important issue, quite a few
> kernel developers mistook LTT for a kernel debugging tool, which
> it was never meant to be. When, in fact, if you ask those who have
> looked at using it for that purpose (try Marcelo or Andrea) you will
> see that they didn't find it to be appropriate for them. And
> rightly so, it was never meant for that purpose. Even lately, when
> I suggested Ingo try using relayfs instead of his custom tracing
> code for his preemption work, he looked at it and said that it
> wasn't suited, but would consider reusing parts of it if it were
> in the kernel.

Well, that's really a core problem. We don't want to duplicate
infrastructure, which practically does the same. So if relayfs isn't
usable in this kind of situation, it really raises the question whether
relayfs is usable at all. We need to make relayfs generally usable,
otherwise it will join the fate of devfs.

bye, Roman
[PATCH][2.6.11-rc1-mm1] relayfs - remove klog debugging channel
Andrew,

This patch removes from relayfs the 'klog debugging channel', which is
a relayfs 'application' that doesn't belong in the main code. Please
apply.

Signed-off-by: Tom Zanussi <[EMAIL PROTECTED]>

diff -urpN -X dontdiff linux-2.6.11-rc1-mm1-vanilla/fs/Kconfig linux-2.6.11-rc1-mm1-cur/fs/Kconfig
--- linux-2.6.11-rc1-mm1-vanilla/fs/Kconfig     Fri Jan 14 06:13:12 2005
+++ linux-2.6.11-rc1-mm1-cur/fs/Kconfig Fri Jan 14 09:28:25 2005
@@ -923,8 +923,7 @@ config RELAYFS_FS
          an efficient mechanism for tools and facilities to relay large
          amounts of data from kernel space to user space. It's not useful
          on its own, and should only be enabled if other facilities that
-         need it are enabled, such as for example klog or the Linux Trace
-         Toolkit.
+         need it are enabled, such as for example the Linux Trace Toolkit.

          See for further information.

@@ -935,37 +934,6 @@ config RELAYFS_FS
          module, say M here and read .

          If unsure, say N.
-
-config KLOG_CHANNEL
-       bool "Enable klog debugging support"
-       depends on RELAYFS_FS
-       default n
-       help
-         If you say Y to this, a relayfs channel named klog will be created
-         in the root of the relayfs file system. You can write to the klog
-         channel using klog() or klog_raw() from within the kernel or
-         kernel modules, and read from the klog channel by mounting relayfs
-         and using read(2) to read from it (or using cat). If you're not
-         sure, say N.
-
-config KLOG_CHANNEL_AUTOENABLE
-       bool "Enable klog logging on startup"
-       depends on KLOG_CHANNEL
-       default y
-       help
-         If you say Y to this, the klog channel will be automatically enabled
-         on startup. Otherwise, to turn klog logging on, you need use
-         sysctl (fs.relayfs.klog_enabled). This option is used in cases where
-         you don't actually want the channel to be written to until it's
-         enabled. If you're not sure, say Y.
-
-config KLOG_CHANNEL_SHIFT
-       depends on KLOG_CHANNEL
-       int "klog debugging channel size (14 => 16KB, 22 => 4MB)"
-       range 14 22
-       default 21
-       help
-         Select klog debugging channel size as a power of 2.

 endmenu

diff -urpN -X dontdiff linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/Makefile linux-2.6.11-rc1-mm1-cur/fs/relayfs/Makefile
--- linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/Makefile    Fri Jan 14 06:13:13 2005
+++ linux-2.6.11-rc1-mm1-cur/fs/relayfs/Makefile        Fri Jan 14 09:30:25 2005
@@ -5,4 +5,4 @@
 obj-$(CONFIG_RELAYFS_FS) += relayfs.o

 relayfs-y := relay.o relay_lockless.o relay_locking.o inode.o resize.o
-relayfs-$(CONFIG_KLOG_CHANNEL) += klog.o
+
diff -urpN -X dontdiff linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/inode.c linux-2.6.11-rc1-mm1-cur/fs/relayfs/inode.c
--- linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/inode.c     Fri Jan 14 06:13:13 2005
+++ linux-2.6.11-rc1-mm1-cur/fs/relayfs/inode.c Fri Jan 14 09:29:17 2005
@@ -604,19 +604,12 @@ static int __init init_relayfs_fs(void)
 {
        int err = register_filesystem(&relayfs_fs_type);
-#ifdef CONFIG_KLOG_CHANNEL
-       if (!err)
-               create_klog_channel();
-#endif
        return err;
 }

 static void __exit exit_relayfs_fs(void)
 {
-#ifdef CONFIG_KLOG_CHANNEL
-       remove_klog_channel();
-#endif
        unregister_filesystem(&relayfs_fs_type);
 }

diff -urpN -X dontdiff linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/klog.c linux-2.6.11-rc1-mm1-cur/fs/relayfs/klog.c
--- linux-2.6.11-rc1-mm1-vanilla/fs/relayfs/klog.c      Fri Jan 14 06:13:13 2005
+++ linux-2.6.11-rc1-mm1-cur/fs/relayfs/klog.c  Wed Dec 31 18:00:00 1969
@@ -1,206 +0,0 @@
-/*
- * KLOG        Generic Logging facility built upon the relayfs infrastructure
- *
- * Authors:    Hubertus Franke ([EMAIL PROTECTED])
- *             Tom Zanussi ([EMAIL PROTECTED])
- *
- * Please direct all questions/comments to [EMAIL PROTECTED]
- *
- * Copyright (C) 2003, IBM Corp
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-
-/* klog channel id */
-static int klog_channel = -1;
-
-/* maximum size of klog formatting buffer beyond which truncation will occur */
-#define KLOG_BUF_SIZE (512)
-/* per-cpu klog formatting buffer */
-static char buf[NR_CPUS][KLOG_BUF_SIZE];
-
-/*
- * klog_enabled determines whether klog()/klog_raw() actually do write
- * to the klog channel at any given time. If klog_enabled == 1
Re: 2.6.11-rc1-mm1
Hi,

On Fri, 14 Jan 2005, Karim Yaghmour wrote:

> > Why should a subsystem care about the details of the buffer management?
>
> Because it wants to enforce a data format on buffer boundaries.

It's interesting to read more about ltt's requirements, but I still
think it's possible to leave this work to the relayfs layer. Why not
just move the ltt buffer management into relayfs and provide a small
library, which extracts the event stream again? Otherwise you have to
duplicate this work for every serious relayfs user anyway.

Completely abstracting the buffer management would make the whole
interface simpler and it would be a lot easier to change without
breaking everything. E.g. it would be possible to use per cpu buffers
and remove the need for different locking mechanisms. For a good
tracing mechanism it's not just important that it's lockless, but also
that the cpus don't share cache lines in the fast path. In this regard
relayfs/ltt really still has too much overhead, and the complex relayfs
API isn't making it easy to fix this.

bye, Roman
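The per-cpu buffer point can be illustrated with a small user-space sketch. It is not relayfs code: `NCPUS`, `CACHE_LINE`, `struct cpu_buf` and `percpu_write` are assumed names and values, and the sketch assumes each buffer has exactly one writer that is never interrupted by another writer to the same buffer (in a kernel that would need preemption and interrupt handling that is not shown here).

```c
#include <stddef.h>
#include <string.h>

#define NCPUS      4     /* assumed CPU count for the sketch */
#define CACHE_LINE 64    /* assumed cache-line size in bytes */
#define BUF_SIZE   4096

/* One buffer per CPU. Aligning the struct to a cache line keeps one
 * CPU's write index off the lines holding another CPU's data. */
struct cpu_buf {
    _Alignas(CACHE_LINE) size_t head;  /* next free byte in data[] */
    char data[BUF_SIZE];
};

static struct cpu_buf bufs[NCPUS];

/* Append len bytes to the buffer owned by cpu. Returns 0 on success,
 * -1 if the record does not fit. No lock is taken because only the
 * owning CPU is assumed to write here. */
static int percpu_write(int cpu, const void *rec, size_t len)
{
    struct cpu_buf *b = &bufs[cpu];

    if (len > BUF_SIZE - b->head)
        return -1;
    memcpy(b->data + b->head, rec, len);
    b->head += len;
    return 0;
}
```

A reader that wants a single event stream then has to merge the per-cpu buffers, for example by timestamp, which is the job the "small library" in the message would take on.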
Re: [RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
Hello Thomas,

I don't mind having a general discussion about instrumentation, but it
has to be understood that the topic is so general and means so many
different things to different people that we are unlikely to reach any
useful consensus. Believe me, it's not for the lack of trying. More
below.

Thomas Gleixner wrote:
> :D
> One of those backends is LTT+relayfs.
> I really respect the work you have done there, but please accept that I
> just see the limitations and try to figure out a way to make it more
> generic and flexible before it is cemented into the kernel and makes it
> hard to use for other interesting instrumentation aspects and maybe
> enforces redundant implementation of infrastructure related
> functionality.
>
> E.g. tracking down timing related issues can make use from such
> functionality if the infrastructure is provided seperately.
> I guess a lot of developers would be happy to use it when it is already
> around in the kernel and it can help testers for giving better
> information to developers.

I would invite you to review the history behind LTT and the history
behind the efforts to get LTT integrated in the kernel (which are two
separate topics.) If you look back, you will see that I worked very hard
trying to get people to think about a common framework and that I and
others made numerous suggestions in this regard. Here are a few examples:

- DProbes (kprobes ancestor):
Shortly after dprobes came out in 2000, I was one of the first to
suggest that there could be interfacing between both to allow
dynamically added trace points. We worked with, and eventually joined
forces with, the IBM team working on this and very early on, LTT and
DProbes were interfacing:
http://marc.theaimsgroup.com/?l=linux-kernel&m=97079714009328&w=2

- OProfile:
When time came to integrate oprofile in the kernel, I tried to push for
oprofile to use ltt as its logging engine (to John's utter horror.)
relayfs didn't exist at the time, and obviously oprofile made it in
without relying on ltt. Here's a posting from July 2002 where I
suggested oprofile rely on ltt. In that same posting I listed a number
of drivers/subsystems that already contained tracing statements.
Obviously I was pointing out that there was an opportunity to create a
common, uniform infrastructure based on ltt:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102624656615567&w=2

- Syscalltrack:
In replying to a posting of someone looking for tracing info, there was
a brief discussion as to how syscalltrack could use ltt instead of: a)
redirecting the syscall table, b) having its own buffering mechanism.
Again, relayfs didn't exist at the time:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102822343523369&w=2

- Event logging:
When there was discussion about event logging, there was a suggestion
to use ltt's engine. Again, relayfs wasn't there:
http://marc.theaimsgroup.com/?l=linux-kernel&m=101836133400796&w=2

And there are many other cases. As you can see, it's not as if I didn't
try to have this discussion before. Unfortunately, interest in this was
rather limited.

In addition, and this is a very important issue, quite a few
kernel developers mistook LTT for a kernel debugging tool, which
it was never meant to be. When, in fact, if you ask those who have
looked at using it for that purpose (try Marcelo or Andrea) you will
see that they didn't find it to be appropriate for them. And
rightly so, it was never meant for that purpose. Even lately, when
I suggested Ingo try using relayfs instead of his custom tracing
code for his preemption work, he looked at it and said that it
wasn't suited, but would consider reusing parts of it if it were
in the kernel.

So, in general, one thing I learned over the years is to not touch the
topic of kernel debugging even with a 10-foot pole when discussing LTT.
What you are hinting at here (mention of developers vs. testers, for
example), and your stated preference for the type of ring-buffer you
described earlier clearly go in the direction I've learned to avoid:
buffering support for the general purpose of kernel debugging.

Let me say outright that I see the relevance of what you are looking
for, but let me also say that what we tried to achieve with relayfs is
to provide a general mechanism for kernel subsystems that need to convey
large amounts of data to user-space. We did not attempt to solve the
problem of providing a buffering framework for core kernel debugging.
As I mentioned to Ingo in the mail I referred to earlier regarding the
type of buffering you are looking for:
> The above tracer may indeed be very appropriate for kernel development,
> but it doesn't provide enough functionality for the requirements of
> mainstream users.

If there is interest in using either relayfs and/or ltt for that
purpose, then this is an entirely different mandate and a few things
would need to be added for that to happen.

For starters, we could add another mode to relayfs. Currently, it supports a locking and
Re: 2.6.11-rc1-mm1
On Fri, Jan 14, 2005 at 12:23:52AM -0800, Andrew Morton wrote:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/

> waiting-10s-before-mounting-root-filesystem.patch
>   retry mounting the root filesystem at boot time

With this patch, initrds seem to get 'skipped'. I think this is
probably the cause for the reports of problems with RAID too.

Just after loading the initrd (RAMDISK: Loading 5284KiB [1 disk] into
ram disk...) the kernel tries to mount the real root fs -- if the
necessary drivers are built-in, it proceeds from there; if not, not.

I'm guessing that when the initrd code calls mount_block_root() to
mount the ramdisk, this bit makes it decide to try to mount the real
root instead:

    if (!ROOT_DEV) {
        ROOT_DEV = name_to_dev_t(saved_root_name);
        create_dev(name, ROOT_DEV, root_device_name);
    }

Perhaps this should not be done until after the first attempt to mount
fails?

Sorry, I haven't had nearly enough coffee today to attempt to make
a patch. :-)

--
Joseph Fannin
[EMAIL PROTECTED]

"Bull in pure form is rare; there is usually some contamination by data."
    -- William Graves Perry Jr.
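The ordering suggested at the end of that message (try what the caller asked for first, and only fall back to the saved `root=` name if that fails) can be sketched outside the kernel. This is purely illustrative and is not the kernel's `mount_block_root()`: `mount_fn`, `mount_with_fallback` and the `only_ram0` stub are hypothetical names invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mount callback: returns true if name could be mounted. */
typedef bool (*mount_fn)(const char *name);

/* Try the requested device first (e.g. the initrd ramdisk) and fall
 * back to the saved root= name only if that attempt fails. Returns
 * the name that was mounted, or NULL if both attempts failed. */
static const char *mount_with_fallback(const char *requested,
                                       const char *saved_root_name,
                                       mount_fn try_mount)
{
    if (try_mount(requested))
        return requested;
    if (saved_root_name && try_mount(saved_root_name))
        return saved_root_name;
    return NULL;
}

/* Stub for demonstration: pretend only the ramdisk can be mounted. */
static bool only_ram0(const char *name)
{
    return strcmp(name, "/dev/ram0") == 0;
}
```

Called as `mount_with_fallback("/dev/ram0", "/dev/md2", only_ram0)`, the sketch mounts the ramdisk and never touches the saved root name, which is the behaviour the message asks for.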
Re: 2.6.11-rc1-mm1 waiting-10s-before-mounting-root-....
> Are you using an initrd?

yes.
[RFC] Instrumentation (was Re: 2.6.11-rc1-mm1)
On Fri, 2005-01-14 at 15:22 -0800, Tim Bird wrote:
> but not 1) supporting infrastructure for timestamping, managing event
> data, etc., and 2) a static list of generally useful tracepoints.

Both points are well taken. That's the essential minimum that
instrumentation needs.

I'd like to see this infrastructure usable for all kinds of
instrumentation mechanisms which are built into the kernel already, or
functions which are used for similar purposes in experimental trees and
other instrumentation related projects.

This requires separating the backend from the infrastructure, so you
can choose from a set of backends the one which fits best for the
intended use.

One of those backends is LTT+relayfs.
I really respect the work you have done there, but please accept that I
just see the limitations and try to figure out a way to make it more
generic and flexible before it is cemented into the kernel and makes it
hard to use for other interesting instrumentation aspects and maybe
enforces redundant implementation of infrastructure related
functionality.

E.g. tracking down timing related issues can make use of such
functionality if the infrastructure is provided separately.
I guess a lot of developers would be happy to use it when it is already
around in the kernel, and it can help testers give better information
to developers.

tglx
Re: Breakage with raid in 2.6.11-rc1-mm1 [Regression in mm]
Randy.Dunlap wrote (ao):
> Reuben Farrelly wrote:
> >At 12:58 a.m. 15/01/2005, Andrew Morton wrote:
> >
> >>Reuben Farrelly <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Something seems to have broken with 2.6.11-rc1-mm1, which worked ok with
> >>> 2.6.10-mm3.
> >>>
> >>> NET: Registered protocol family 17
> >>> Starting balanced_irq
> >>> BIOS EDD facility v0.16 2004-Jun-25, 2 devices found
> >>> md: Autodetecting RAID arrays.
> >>> md: autorun ...
> >>> md: ... autorun DONE.
> >>> VFS: Waiting 19sec for root device...
...
> >>> VFS: Waiting 1sec for root device...
> >>> VFS: Cannot open root device "md2" or unknown-block(0,0)
> >>> Please append a correct "root=" boot option
> >>> Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
> >>>
> >>> The system is running 5 RAID-1 partitions, and md2 is the root as
> >>> per grub.conf. Problem seems to be that raid autodetection finds
> >>> no raid partitions :(
> >>>
> >>> The two ST380013AS SATA drives are detected earlier in the boot, so I don't
> >>> think that's the problem..
> >>
> >>hm, the only raidy thing we have in there is the below. Maybe you could
> >>try reverting that?
> >>
> >>--- 25/drivers/md/raid5.c~raid5-overlapping-read-hack 2005-01-09 22:20:40.211246912 -0800
> >>+++ 25-akpm/drivers/md/raid5.c 2005-01-09 22:20:40.216246152 -0800
...
> >Ok the breakage occurred somewhere between 2.6.10-mm3 (works) and
> >2.6.11-rc1 (doesn't work) ie wasn't introduced into the latest -mm
> >patchset as I first thought.
> >
> >Are there any other patches that might be worth a try backing out?
>
> Someone else reported that they had to back out this one:
> waiting-10s-before-mounting-root-filesystem.patch
>
> Can you revert that one and let us know how it goes?

It Works For Me(tm).
This is unpatched 2.6.11-rc1-mm1 (no patches reverted either):

# uname -r
2.6.11-rc1-mm1

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid5] [multipath] [raid10]
Event: 2
md1 : active raid10 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      70684416 blocks 128K chunks 2 near-copies [4/4] []
md0 : active raid1 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      500608 blocks [4/4] []
unused devices:

# mount
/dev/md1 on / type reiser3 (rw,sync,data=journal,barrier=flush)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/md0 on /boot type ext2 (ro)
tmpfs on /tmp type tmpfs (rw)

So the problem depends on something. This system is SCSI, and I don't
use modules. I'm happy to provide more info if that would be of any
help.

--
Humilis IT Services and Solutions
http://www.humilis.net
Re: 2.6.11-rc1-mm1
On Fri, 2005-01-14 at 20:25 -0500, Karim Yaghmour wrote:
> Thomas Gleixner wrote:
>
> You have previously demonstrated that you do not understand the
> implementation you are criticizing. You keep repeating the size
> of the patch like a mantra, yet when pressed for actual bits of
> code that need fixing, you use a circular argument to slip away.

Yeah, did you answer one of my arguments except claiming that I'm too
stupid to understand how it works?

I completely understand what this code does and I don't beat on the
patch size. I beat on the timing burden and restrictions which are
imposed by the implementation.

I have no objection against relayfs itself. I can just leave the config
switch off, so it does not affect me.

Adding instrumentation to the kernel is a good thing.

I just don't like the idea that instrumentation is bound to relayfs and
adds a feature to the kernel which fits a restricted set of problems
rather than providing a generic optimized instrumentation framework,
where one can use relayfs as a backend, if it fits his needs. Making
this less glued together leaves the possibility to use other backends.

> If you feel that there is some unncessary processing being done
> in the kernel, please show me the piece of code affected so that
> it can be fixed if it is broken.

Just doing codepath analysis shows me:

There is a loop in ltt_log_event, which enforces the processing of each
event twice. Splitting traces is postprocessing and can be done
elsewhere.

In _ltt_log_event lives quite a bunch of if(...) processing decisions
which have to be evaluated for _each_ event.

The relay_reserve code can loop in the do { } while() and even go into a
slow path where another do { } while() is found.
So it can not be used in fast paths and for timing related problem
tracking, because it adds variable time overhead.

Due to the fact that the ltt_log_event path is not preempt safe, you
can actually hit the additional go in the do { } while() loop.
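The retry behaviour being argued about can be illustrated with a generic lock-free reservation loop. This is a minimal user-space sketch using C11 atomics, not the relayfs `relay_reserve` code; `write_pos`, `reserve` and `BUF_SIZE` are names made up for the example. It shows both sides of the argument: no lock is taken and nothing is disabled, but the number of passes through the loop depends on how often another writer gets in between.

```c
#include <stdatomic.h>
#include <stddef.h>

#define BUF_SIZE 4096

static _Atomic size_t write_pos;  /* next free offset in the buffer */

/* Reserve len bytes without taking a lock. Returns the start offset of
 * the reserved region, or (size_t)-1 if it does not fit. *retries is
 * set to the number of times the compare-and-swap had to be repeated
 * because another writer moved write_pos in the meantime. */
static size_t reserve(size_t len, unsigned *retries)
{
    size_t old = atomic_load(&write_pos);

    *retries = 0;
    do {
        if (len > BUF_SIZE - old)
            return (size_t)-1;
        if (atomic_compare_exchange_strong(&write_pos, &old, old + len))
            return old;
        /* CAS failed: old now holds the value the other writer stored. */
        (*retries)++;
    } while (1);
}
```

With a single writer the loop body runs once; the cost only becomes variable when a second writer (another CPU, an interrupt, or a preempting task) updates `write_pos` between the load and the compare-and-swap.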
I pointed out before that it is not possible to select, at compile
time, only the events which I'm interested in. I get either nothing or
everything. If I want to use instrumentation for a particular problem,
why must I process a loop of _ltt_log_event calls for stuff I do not
need instead of just compiling it away?

If I compile an event in, then adding a couple of checks into the
instrumentation macro itself does not hurt as much as leaving the
straight code path for a disabled event.

tglx
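Compile-time selection of the kind asked for here is commonly done with per-event preprocessor switches. The following is a sketch under assumptions, not LTT's instrumentation macros: `TRACE_EV_SYSCALL`, `TRACE_EV_IRQ`, `TRACE_SYSCALL`, `TRACE_IRQ` and `log_event` are hypothetical names, and in a kernel the switches would presumably be Kconfig options.

```c
#include <stdio.h>

/* Hypothetical per-event switches. Set one to 0 to compile that
 * trace point away entirely. */
#define TRACE_EV_SYSCALL 1
#define TRACE_EV_IRQ     0

static unsigned long events_logged;

/* Stand-in for the real logging backend. */
static void log_event(const char *name, int arg)
{
    events_logged++;
    printf("event %s arg=%d\n", name, arg);
}

#if TRACE_EV_SYSCALL
#define TRACE_SYSCALL(arg) log_event("syscall", (arg))
#else
#define TRACE_SYSCALL(arg) ((void)0)
#endif

#if TRACE_EV_IRQ
#define TRACE_IRQ(arg) log_event("irq", (arg))
#else
#define TRACE_IRQ(arg) ((void)0)
#endif
```

A disabled trace point expands to `((void)0)`, so its argument is not even evaluated and no call or test remains in the instrumented code path.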
Re: 2.6.11-rc1-mm1
Hi Karim,

On Fri, 2005-01-14 at 20:14 -0500, Karim Yaghmour wrote:
> Gee Thomas, I guess you really want to take this one until the last
> man is standing. Feel free to use the ad-hominem tone if it suits
> you. Don't hold it against me though if I don't bite :)

No personal offence was intended.

> Thomas Gleixner wrote:
> > It's not only me, who needs constant time. Everybody interested in
> > tracing will need that. In my opinion its a principle of tracing.
>
> relayfs is a generalized buffering mechanism. Tracing is one application
> it serves. Check out the web site: "high-speed data-relay filesystem."
> Fancy name huh ...

I do not doubt that.

But hardwiring an instrumentation framework on it is also hardwiring
implicit restrictions on the usability of the instrumentation for
certain purposes.

> > The "lockless" mechanism is _FAKE_ as I already pointed out. It replaces
> > locks by do { } while loops. So what ?
>
> Well for one thing, a portion of code running in user-context won't
> disable interrupts while it's attempting to get buffer space, and
> therefore won't impact on interrupt delivery.

The do {} while loops are in the fast ltt_log_event path.

> Clearly you haven't read the implementation and/or aren't familiar with
> its use. Usually, what you want to do is open(), mmap(), write(), there
> is no "conversion" to a file. The filesystem abstraction is just a
> namespace holder for us.

I have read it and tried it.

I don't see why I can't map a ringbuffer into user space. I'm not
beating on the ringbuffer, but I'm using it as an example. :)

> That's not the point. You're bending backwards as far as you can reach
> trying to raise as much mud as you can, but when pressed for actual
> constructive input you hide behind a strawman argument. If you don't
> have anything to say, then stop whining.

I gave constructive criticism along with points where I just point out
the restrictions and weaknesses of the implementation.

tglx
Re: 2.6.11-rc1-mm1
Sorry about the missing quotes. It should read:

You wrote:
> Some things I'd like to see (as I am currently using the KIO
> equivalent) implemented as FUSE fs:
> - "fish", virtual file access over ssh

This is already available here:

  http://sourceforge.net/projects/fuse

You need to download fuse-2.2-pre3 and sshfs-1.0. It should work on any
kernel including the 2.6.10-rc1-mm1 with FUSE compiled in.

Miklos
Re: 2.6.11-rc1-mm1
Some things I'd like to see (as I am currently using the KIO
equivalent) implemented as FUSE fs:
- "fish", virtual file access over ssh

This is already available here:

  http://sourceforge.net/projects/fuse

You need to download fuse-2.2-pre3 and sshfs-1.0. It should work on any
kernel including the 2.6.10-rc1-mm1 with FUSE compiled in.

Miklos