Duplicated functions (was: fix memory corruption from misinterpreted bad_inode_ops return values)
Al Viro wrote: > On Thu, Jan 04, 2007 at 03:21:06PM -0800, Mitchell Blank Jr wrote: > > The real (but harder) fix for the wasted space issue > > would be to get the toolchain to automatically combine functions that > > end up compiling into identical assembly. > > Can't do. [...] > Comparing pointers to > functions for equality is a well-defined operation and it's not > to be messed with. I agree -- it definitely could never be an "always on" optimization. What I was postulating would be a special post-processing step that could be run by people interested in squeezing the kernel as small as possible. I'd bet that there's not many functions in the kernel that rely on having a unique address. If some exist they could be put in their own ELF section via a "__unique_function_address_needed" macro or something. > You _can_ compile g into jump to f, but that's it. Actually if you don't mind sacrificing alignment (which we do anyway with -Os, I think) you could give a function two unique addresses just by prepending it with a single-byte noop instruction. But anyway... how about some actual data? Just out of curiosity I threw together a quick script (attached) that parses the output of "objdump --disassemble vmlinux" and tries to find functions that share the same opcodes. The short summary of the results: a bit of .text savings but not an amazing amount -- on my tiny nfsroot test kernel it found 4034 bytes worth of possible savings (out of a total of 1460830 bytes of opcodes in .text) About what I expected, actually. One line in its output really stands out, however: | 218 bytes: __copy_from_user_ll_nozero, __copy_from_user_ll, __copy_to_user_ll Those three big functions are exactly the same, which actually makes sense if you think about how they're implemented. That's a bunch of *very* hot Icache lines we're wasting there! Anyway, hopefully some folks will find the script interesting. In particular maybe the janitors can play with it. I'd wager some of the things it turns up are good candidates for code re-use. -Mitch Script's output: Duplicated functions: 98 bytes: nlmsvc_proc_granted, nlm4svc_proc_granted 10 bytes: return_EIO, hung_up_tty_write 10 bytes: free_pid_ns, mempool_kfree, free_bd_holder, nfs4_free_open_state, ipc_immediate_free, dir_release, pci_release_bus_bridge_dev, put_tty_driver, device_create_release, class_create_release, class_device_create_release, isa_dev_release, neigh_parms_destroy, unx_destroy_cred, pmap_map_release, rpc_free_iostats, free 15 bytes: xor_64, xor_64 7 bytes: native_set_pte, native_set_pmd, debugfs_u32_set 24 bytes: part_uevent_store, store_uevent 54 bytes: atm_init_aal34, atm_init_aal5 22 bytes: nfs4_encode_void, nlmsvc_encode_void, nlm4svc_encode_void 42 bytes: seq_release_private, content_release 10 bytes: profile_event_register, profile_event_unregister, nocrypt, nocrypt_iv 7 bytes: nfs_proc_commit_setup, udp_lib_hash, udp_lib_hash 179 bytes: nlmsvc_proc_sm_notify, nlm4svc_proc_sm_notify 29 bytes: nfs4_decode_void, nlmsvc_decode_void, nlm4svc_decode_void 62 bytes: nfs3_commit_done, nfs3_write_done 7 bytes: print_trace_stack, i8237A_suspend, store, module_frob_arch_sections, get_gate_vma, in_gate_area, in_gate_area_no_task, no_blink, noop_ret, no_action, next_online_pgdat, __pud_alloc, __pmd_alloc, generic_pipe_buf_pin, simple_sync_file, nfs4_callback_null, fuse_prepare_write, default_read_file, dummycon_dummy, vgacon_dummy, read_null, hung_up_tty_read, con_chars_in_buffer, class_device_create_uevent, anon_transport_dummy_function, ramdisk_writepages, sock_no_poll, noop_dequeue, ipv4_dst_check, rpcproc_encode_null, rpcproc_decode_null, pvc_shutdown, svc_shutdown 20 bytes: fn_show_ptregs, sysrq_handle_showregs 29 bytes: part_next, diskstats_next 10 bytes: atomic_notifier_call_chain, raw_notifier_call_chain 107 bytes: nlm_decode_cookie, nlm4_decode_cookie 5 bytes: do_spurious_interrupt_bug, c_stop, machine_shutdown, machine_halt, syscall_vma_close, native_nop, sighand_ctor, r_stop, s_stop, audit_log_task_context, noop, default_unplug_io_fn, frag_stop, single_stop, t_stop, devinfo_stop, int_seq_stop, parse_solaris_x86, parse_freebsd, parse_netbsd, parse_openbsd, parse_unixware, parse_minix, nfs_server_list_stop, nfs_volume_list_stop, nfs3_forget_cached_acls, fuse_read_inode, ipc_lock_by_ptr, ipc_unlock, crypto_exit_cipher_ops, crypto_exit_digest_ops, ioport_unmap, pci_remove_legacy_files, k_ignore, con_throttle, nv_vlan_rx_kill_vid, input_devices_seq_stop, input_handlers_seq_stop, proto_seq_stop, dev_seq_stop, softnet_seq_stop, dev_mc_seq_stop, neigh_stat_seq_stop, netlink_seq_stop, rt_cpu_seq_stop, tcp_twsk_destructor, raw_seq_stop, udp_seq_stop, icmp_address, icmp_discard, fib_seq_stop, unix_seq_stop, packet_seq_stop, rpc_default_callback, nul_destroy, nul_destroy_cred, c_stop, vcc_seq_stop, smp_setup_pr
Re: [UPDATED PATCH] fix memory corruption from misinterpreted bad_inode_ops return values
Linus Torvalds wrote: > Well, that probably would work, but it's also true that returning a 64-bit > value on a 32-bit platform really _does_ depend on more than the size. Yeah, obviously this is restricted to the signed-integer case. My point was just that you could have the compiler figure out which variant to pick for loff_t automatically. > "let's not play tricks with function types at all". I think I agree. The real (but harder) fix for the wasted space issue would be to get the toolchain to automatically combine functions that end up compiling into identical assembly. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [UPDATED PATCH] fix memory corruption from misinterpreted bad_inode_ops return values
Al Viro wrote: > At least 3 versions, unless you want to mess with ifdefs to reduce them to > two. I don't think you need to do fancy #ifdef's: static s32 return_eio_32(void) { return -EIO; } static s64 return_eio_64(void) { return -EIO; } extern void return_eio_unknown(void); /* Doesn't exist */ #define return_eio(type)((sizeof(type) == 4)\ ? ((void *) return_eio_32) \ : ((sizeof(type) == 8) \ ? ((void *) return_eio_64) \ : ((void *) return_eio_unknown))) -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [UPDATED PATCH] fix memory corruption from misinterpreted bad_inode_ops return values
Al Viro wrote: At least 3 versions, unless you want to mess with ifdefs to reduce them to two. I don't think you need to do fancy #ifdef's: static s32 return_eio_32(void) { return -EIO; } static s64 return_eio_64(void) { return -EIO; } extern void return_eio_unknown(void); /* Doesn't exist */ #define return_eio(type)((sizeof(type) == 4)\ ? ((void *) return_eio_32) \ : ((sizeof(type) == 8) \ ? ((void *) return_eio_64) \ : ((void *) return_eio_unknown))) -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [UPDATED PATCH] fix memory corruption from misinterpreted bad_inode_ops return values
Linus Torvalds wrote: Well, that probably would work, but it's also true that returning a 64-bit value on a 32-bit platform really _does_ depend on more than the size. Yeah, obviously this is restricted to the signed-integer case. My point was just that you could have the compiler figure out which variant to pick for loff_t automatically. let's not play tricks with function types at all. I think I agree. The real (but harder) fix for the wasted space issue would be to get the toolchain to automatically combine functions that end up compiling into identical assembly. -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Duplicated functions (was: fix memory corruption from misinterpreted bad_inode_ops return values)
Al Viro wrote: On Thu, Jan 04, 2007 at 03:21:06PM -0800, Mitchell Blank Jr wrote: The real (but harder) fix for the wasted space issue would be to get the toolchain to automatically combine functions that end up compiling into identical assembly. Can't do. [...] Comparing pointers to functions for equality is a well-defined operation and it's not to be messed with. I agree -- it definitely could never be an always on optimization. What I was postulating would be a special post-processing step that could be run by people interested in squeezing the kernel as small as possible. I'd bet that there's not many functions in the kernel that rely on having a unique address. If some exist they could be put in their own ELF section via a __unique_function_address_needed macro or something. You _can_ compile g into jump to f, but that's it. Actually if you don't mind sacrificing alignment (which we do anyway with -Os, I think) you could give a function two unique addresses just by prepending it with a single-byte noop instruction. But anyway... how about some actual data? Just out of curiosity I threw together a quick script (attached) that parses the output of objdump --disassemble vmlinux and tries to find functions that share the same opcodes. The short summary of the results: a bit of .text savings but not an amazing amount -- on my tiny nfsroot test kernel it found 4034 bytes worth of possible savings (out of a total of 1460830 bytes of opcodes in .text) About what I expected, actually. One line in its output really stands out, however: | 218 bytes: __copy_from_user_ll_nozero, __copy_from_user_ll, __copy_to_user_ll Those three big functions are exactly the same, which actually makes sense if you think about how they're implemented. That's a bunch of *very* hot Icache lines we're wasting there! Anyway, hopefully some folks will find the script interesting. In particular maybe the janitors can play with it. I'd wager some of the things it turns up are good candidates for code re-use. -Mitch Script's output: Duplicated functions: 98 bytes: nlmsvc_proc_granted, nlm4svc_proc_granted 10 bytes: return_EIO, hung_up_tty_write 10 bytes: free_pid_ns, mempool_kfree, free_bd_holder, nfs4_free_open_state, ipc_immediate_free, dir_release, pci_release_bus_bridge_dev, put_tty_driver, device_create_release, class_create_release, class_device_create_release, isa_dev_release, neigh_parms_destroy, unx_destroy_cred, pmap_map_release, rpc_free_iostats, free 15 bytes: xor_64, xor_64 7 bytes: native_set_pte, native_set_pmd, debugfs_u32_set 24 bytes: part_uevent_store, store_uevent 54 bytes: atm_init_aal34, atm_init_aal5 22 bytes: nfs4_encode_void, nlmsvc_encode_void, nlm4svc_encode_void 42 bytes: seq_release_private, content_release 10 bytes: profile_event_register, profile_event_unregister, nocrypt, nocrypt_iv 7 bytes: nfs_proc_commit_setup, udp_lib_hash, udp_lib_hash 179 bytes: nlmsvc_proc_sm_notify, nlm4svc_proc_sm_notify 29 bytes: nfs4_decode_void, nlmsvc_decode_void, nlm4svc_decode_void 62 bytes: nfs3_commit_done, nfs3_write_done 7 bytes: print_trace_stack, i8237A_suspend, store, module_frob_arch_sections, get_gate_vma, in_gate_area, in_gate_area_no_task, no_blink, noop_ret, no_action, next_online_pgdat, __pud_alloc, __pmd_alloc, generic_pipe_buf_pin, simple_sync_file, nfs4_callback_null, fuse_prepare_write, default_read_file, dummycon_dummy, vgacon_dummy, read_null, hung_up_tty_read, con_chars_in_buffer, class_device_create_uevent, anon_transport_dummy_function, ramdisk_writepages, sock_no_poll, noop_dequeue, ipv4_dst_check, rpcproc_encode_null, rpcproc_decode_null, pvc_shutdown, svc_shutdown 20 bytes: fn_show_ptregs, sysrq_handle_showregs 29 bytes: part_next, diskstats_next 10 bytes: atomic_notifier_call_chain, raw_notifier_call_chain 107 bytes: nlm_decode_cookie, nlm4_decode_cookie 5 bytes: do_spurious_interrupt_bug, c_stop, machine_shutdown, machine_halt, syscall_vma_close, native_nop, sighand_ctor, r_stop, s_stop, audit_log_task_context, noop, default_unplug_io_fn, frag_stop, single_stop, t_stop, devinfo_stop, int_seq_stop, parse_solaris_x86, parse_freebsd, parse_netbsd, parse_openbsd, parse_unixware, parse_minix, nfs_server_list_stop, nfs_volume_list_stop, nfs3_forget_cached_acls, fuse_read_inode, ipc_lock_by_ptr, ipc_unlock, crypto_exit_cipher_ops, crypto_exit_digest_ops, ioport_unmap, pci_remove_legacy_files, k_ignore, con_throttle, nv_vlan_rx_kill_vid, input_devices_seq_stop, input_handlers_seq_stop, proto_seq_stop, dev_seq_stop, softnet_seq_stop, dev_mc_seq_stop, neigh_stat_seq_stop, netlink_seq_stop, rt_cpu_seq_stop, tcp_twsk_destructor, raw_seq_stop, udp_seq_stop, icmp_address, icmp_discard, fib_seq_stop, unix_seq_stop, packet_seq_stop, rpc_default_callback, nul_destroy, nul_destroy_cred, c_stop, vcc_seq_stop, smp_setup_processor_id, remapped_pgdat_init, pre_setup_arch_hook, trap_init_hook, sched_init_smp
Re: [PATCH 2.6.20-rc2] [BUGFIX] drivers/atm/firestream.c: Fix infinite recursion when alignment passed is 0.
First, if you want to get patches merged you should send them to the subsystem maintained (in this case Chas, who I've cc:'ed) Also if you feel it needs to be sent to mailing list you should usually use a more specific list first (like the ATM list or maybe netdev) Please see Documentation/SubmittingPatches Amit Choudhary wrote: > Description: Fix infinite recursion when alignment passed is 0 in function > aligned_kmalloc(), I'm curious how you hit this -- the only caller to aligned_kmalloc() passes a constant "0x10" Looking at aligned_kmalloc() it seems to be pretty badly broken -- its fallback if it gets a non-aligned buffer is to just try a larger size which doesn't necessarily fix the problem. It looks like explicitly aligning the buffer is a better solution. Could you test this patch? If it works feel free to forward it on to Chas. -Mitch [PATCH] [ATM] remove firestream.c's aligned_kmalloc() Signed-off-by: Mitchell Blank Jr <[EMAIL PROTECTED]> diff --git a/drivers/atm/firestream.c b/drivers/atm/firestream.c index 9c67df5..df8b0c0 100644 --- a/drivers/atm/firestream.c +++ b/drivers/atm/firestream.c @@ -1379,38 +1379,22 @@ static void reset_chip (struct fs_dev *d } } -static void __devinit *aligned_kmalloc (int size, gfp_t flags, int alignment) -{ - void *t; - - if (alignment <= 0x10) { - t = kmalloc (size, flags); - if ((unsigned long)t & (alignment-1)) { - printk ("Kmalloc doesn't align things correctly! %p\n", t); - kfree (t); - return aligned_kmalloc (size, flags, alignment * 4); - } - return t; - } - printk (KERN_ERR "Request for > 0x10 alignment not yet implemented (hard!)\n"); - return NULL; -} - static int __devinit init_q (struct fs_dev *dev, struct queue *txq, int queue, int nentries, int is_rq) { - int sz = nentries * sizeof (struct FS_QENTRY); + unsigned sz = nentries * sizeof (struct FS_QENTRY); struct FS_QENTRY *p; + void *vp; func_enter (); fs_dprintk (FS_DEBUG_INIT, "Inititing queue at %x: %d entries:\n", queue, nentries); - - p = aligned_kmalloc (sz, GFP_KERNEL, 0x10); + vp = kmalloc(sz + 0xF, GFP_KERNEL); + p = (struct FS_QENTRY *) ALIGN((unsigned long) vp, 0x10); fs_dprintk (FS_DEBUG_ALLOC, "Alloc queue: %p(%d)\n", p, sz); - if (!p) return 0; + if (!vp) return 0; write_fs (dev, Q_SA(queue), virt_to_bus(p)); write_fs (dev, Q_EA(queue), virt_to_bus(p+nentries-1)); @@ -1423,8 +1407,7 @@ static int __devinit init_q (struct fs_d write_fs (dev, Q_CNF(queue), 0 ); } - txq->sa = p; - txq->ea = p; + txq->buf = vp; txq->offset = queue; func_exit (); @@ -1529,8 +1512,8 @@ static void __devexit free_queue (struct write_fs (dev, Q_WP(txq->offset), 0); /* Configuration ? */ - fs_dprintk (FS_DEBUG_ALLOC, "Free queue: %p\n", txq->sa); - kfree (txq->sa); + fs_dprintk (FS_DEBUG_ALLOC, "Free queue: %p\n", txq->buf); + kfree (txq->buf); func_exit (); } diff --git a/drivers/atm/firestream.h b/drivers/atm/firestream.h index 49e783e..5408766 100644 --- a/drivers/atm/firestream.h +++ b/drivers/atm/firestream.h @@ -455,7 +455,7 @@ struct fs_vcc { struct queue { - struct FS_QENTRY *sa, *ea; + void *buf; int offset; }; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.20-rc2] [BUGFIX] drivers/atm/firestream.c: Fix infinite recursion when alignment passed is 0.
First, if you want to get patches merged you should send them to the subsystem maintained (in this case Chas, who I've cc:'ed) Also if you feel it needs to be sent to mailing list you should usually use a more specific list first (like the ATM list or maybe netdev) Please see Documentation/SubmittingPatches Amit Choudhary wrote: Description: Fix infinite recursion when alignment passed is 0 in function aligned_kmalloc(), I'm curious how you hit this -- the only caller to aligned_kmalloc() passes a constant 0x10 Looking at aligned_kmalloc() it seems to be pretty badly broken -- its fallback if it gets a non-aligned buffer is to just try a larger size which doesn't necessarily fix the problem. It looks like explicitly aligning the buffer is a better solution. Could you test this patch? If it works feel free to forward it on to Chas. -Mitch [PATCH] [ATM] remove firestream.c's aligned_kmalloc() Signed-off-by: Mitchell Blank Jr [EMAIL PROTECTED] diff --git a/drivers/atm/firestream.c b/drivers/atm/firestream.c index 9c67df5..df8b0c0 100644 --- a/drivers/atm/firestream.c +++ b/drivers/atm/firestream.c @@ -1379,38 +1379,22 @@ static void reset_chip (struct fs_dev *d } } -static void __devinit *aligned_kmalloc (int size, gfp_t flags, int alignment) -{ - void *t; - - if (alignment = 0x10) { - t = kmalloc (size, flags); - if ((unsigned long)t (alignment-1)) { - printk (Kmalloc doesn't align things correctly! %p\n, t); - kfree (t); - return aligned_kmalloc (size, flags, alignment * 4); - } - return t; - } - printk (KERN_ERR Request for 0x10 alignment not yet implemented (hard!)\n); - return NULL; -} - static int __devinit init_q (struct fs_dev *dev, struct queue *txq, int queue, int nentries, int is_rq) { - int sz = nentries * sizeof (struct FS_QENTRY); + unsigned sz = nentries * sizeof (struct FS_QENTRY); struct FS_QENTRY *p; + void *vp; func_enter (); fs_dprintk (FS_DEBUG_INIT, Inititing queue at %x: %d entries:\n, queue, nentries); - - p = aligned_kmalloc (sz, GFP_KERNEL, 0x10); + vp = kmalloc(sz + 0xF, GFP_KERNEL); + p = (struct FS_QENTRY *) ALIGN((unsigned long) vp, 0x10); fs_dprintk (FS_DEBUG_ALLOC, Alloc queue: %p(%d)\n, p, sz); - if (!p) return 0; + if (!vp) return 0; write_fs (dev, Q_SA(queue), virt_to_bus(p)); write_fs (dev, Q_EA(queue), virt_to_bus(p+nentries-1)); @@ -1423,8 +1407,7 @@ static int __devinit init_q (struct fs_d write_fs (dev, Q_CNF(queue), 0 ); } - txq-sa = p; - txq-ea = p; + txq-buf = vp; txq-offset = queue; func_exit (); @@ -1529,8 +1512,8 @@ static void __devexit free_queue (struct write_fs (dev, Q_WP(txq-offset), 0); /* Configuration ? */ - fs_dprintk (FS_DEBUG_ALLOC, Free queue: %p\n, txq-sa); - kfree (txq-sa); + fs_dprintk (FS_DEBUG_ALLOC, Free queue: %p\n, txq-buf); + kfree (txq-buf); func_exit (); } diff --git a/drivers/atm/firestream.h b/drivers/atm/firestream.h index 49e783e..5408766 100644 --- a/drivers/atm/firestream.h +++ b/drivers/atm/firestream.h @@ -455,7 +455,7 @@ struct fs_vcc { struct queue { - struct FS_QENTRY *sa, *ea; + void *buf; int offset; }; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] IBM HDAPS accelerometer driver, with probing.
Andi Kleen wrote: > - it doesn't seem to help that much on modern CPUs with good > branch prediction and big icaches anyways. Really? I would think that as pipelines get deeper (although that trend seems to have stopped, thankfully) and Icache-miss penalties get relatively larger we'd see unlikely() becoming MORE of a benefit, not less. Storing the used part of a "hot" function in 1 Icacheline instead of 4 seems like an obvious win. Personally I've never found unlikely() to be ugly; if anything I think it serves as a nice little human-readable comment about whats going on in the control-flow. I guess I'm in the minority on that one, though. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux-2.6.13-rc7
Richard Henderson wrote: > Because I use "extern inline" in the proper way. That is, I have both > inline and out-of-line versions of some routines. Is there any reason not to just make the out-of-line version explicit? i.e.: /* in some .h file: */ static /*(always!)*/inline int my_func(void) { return FOO; } extern int OOL_my_func(void); /* in some .c file: */ int OOL_my_func(void) { return my_func(); } It's a little ugly but there really aren't that many cases of this, right? Or is this just the principal of the thing? :-) -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] IBM HDAPS accelerometer driver, with probing.
Andi Kleen wrote: - it doesn't seem to help that much on modern CPUs with good branch prediction and big icaches anyways. Really? I would think that as pipelines get deeper (although that trend seems to have stopped, thankfully) and Icache-miss penalties get relatively larger we'd see unlikely() becoming MORE of a benefit, not less. Storing the used part of a hot function in 1 Icacheline instead of 4 seems like an obvious win. Personally I've never found unlikely() to be ugly; if anything I think it serves as a nice little human-readable comment about whats going on in the control-flow. I guess I'm in the minority on that one, though. -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux-2.6.13-rc7
Richard Henderson wrote: Because I use extern inline in the proper way. That is, I have both inline and out-of-line versions of some routines. Is there any reason not to just make the out-of-line version explicit? i.e.: /* in some .h file: */ static /*(always!)*/inline int my_func(void) { return FOO; } extern int OOL_my_func(void); /* in some .c file: */ int OOL_my_func(void) { return my_func(); } It's a little ugly but there really aren't that many cases of this, right? Or is this just the principal of the thing? :-) -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] driver core: Add the ability to unbind drivers to devices from userspace
Still one nitpick: Jon Smirl wrote: > + while (isspace(*x) && (x - buffer->page < count)) > + x++; I think you can just do: if (count > 0) while (isspace(*x)) x++; If the passed-in string was fully whitespace then the trailing-whitespace stripping would already reduce it to count==0. So as long as you check for that case you can be sure that the while() will stop before x falls off the end of the string. Other than that it looks fine to me now. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] driver core: Add the ability to unbind drivers to devices from userspace
Jon Smirl wrote: > Do we need to deal with UTF8 here? I did the forward loop because you > can't parse UTF8 backwards. If UTF8 is possible I need to change the > pointer inc function. As others have mentioned there shouldn't be a UTF8 problem with isspace(). However, even if you wanted to scan going forwards you can do it without the complexity of a nested loop: char *real_end; for (real_end = x; x - buffer->page < count; x++) if (!isspace(*x)) real_end = x + 1; *real_end = '\0'; // then fix "count" BTW, my code I posted yesterday is incomplete since I neglected to notice that the "count = z - x" at the end is used to repair "count" damage from both the leading and trailing whitespace stripping. Its actually easier to strip the trailing whitespace first, like: while (count > 0 && isspace(buffer->page[count - 1])) count--;/* strip trailing whitespace */ if (count > 0 && isspace(buffer->page[0])) { /* * We have leading whitespace but also at least one * non-whitespace character */ const char *x = buffer->page; do { x++; count--; } while (isspace(*x)); memmove(buffer->page, x, count); } buffer->page[count] = '\0'; -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
Steven Rostedt wrote: > In the thread "[RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO > configurable" I discovered that a C version of find_first_bit is faster > than the asm version There are probably other cases of this in asm-i386/bitopts.h. For instance I think the "btl" instruction is pretty slow on modern CPUs so constant_test_bit() will probably outperform variable_test_bit() even if you feed it a non-constant "nr". I'd be happy to be proven wrong, though :-) When testing these optimizations you should also probably check the resulting vmlinux size. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] driver core: Add the ability to unbind drivers to devices from userspace
Greg KH wrote: > > + /* locate trailng white space */ > > + z = y = x; > > + while (y - buffer->page < count) { > > + y++; > > + z = y; > > + while (isspace(*y) && (y - buffer->page < count)) { > > + y++; > > + } > > + } > > + count = z - x; > > Hm, I _think_ this works, but I need someone else to verify this... > Anyone else? It looks sane-ish to me, but also more complicated than need be. Why can't you just do something like: while (count > 0 && isspace(x[count - 1])) count--; -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] driver core: Add the ability to unbind drivers to devices from userspace
Greg KH wrote: + /* locate trailng white space */ + z = y = x; + while (y - buffer-page count) { + y++; + z = y; + while (isspace(*y) (y - buffer-page count)) { + y++; + } + } + count = z - x; Hm, I _think_ this works, but I need someone else to verify this... Anyone else? It looks sane-ish to me, but also more complicated than need be. Why can't you just do something like: while (count 0 isspace(x[count - 1])) count--; -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
Steven Rostedt wrote: In the thread [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable I discovered that a C version of find_first_bit is faster than the asm version There are probably other cases of this in asm-i386/bitopts.h. For instance I think the btl instruction is pretty slow on modern CPUs so constant_test_bit() will probably outperform variable_test_bit() even if you feed it a non-constant nr. I'd be happy to be proven wrong, though :-) When testing these optimizations you should also probably check the resulting vmlinux size. -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] driver core: Add the ability to unbind drivers to devices from userspace
Jon Smirl wrote: Do we need to deal with UTF8 here? I did the forward loop because you can't parse UTF8 backwards. If UTF8 is possible I need to change the pointer inc function. As others have mentioned there shouldn't be a UTF8 problem with isspace(). However, even if you wanted to scan going forwards you can do it without the complexity of a nested loop: char *real_end; for (real_end = x; x - buffer-page count; x++) if (!isspace(*x)) real_end = x + 1; *real_end = '\0'; // then fix count BTW, my code I posted yesterday is incomplete since I neglected to notice that the count = z - x at the end is used to repair count damage from both the leading and trailing whitespace stripping. Its actually easier to strip the trailing whitespace first, like: while (count 0 isspace(buffer-page[count - 1])) count--;/* strip trailing whitespace */ if (count 0 isspace(buffer-page[0])) { /* * We have leading whitespace but also at least one * non-whitespace character */ const char *x = buffer-page; do { x++; count--; } while (isspace(*x)); memmove(buffer-page, x, count); } buffer-page[count] = '\0'; -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] driver core: Add the ability to unbind drivers to devices from userspace
Still one nitpick: Jon Smirl wrote: + while (isspace(*x) (x - buffer-page count)) + x++; I think you can just do: if (count 0) while (isspace(*x)) x++; If the passed-in string was fully whitespace then the trailing-whitespace stripping would already reduce it to count==0. So as long as you check for that case you can be sure that the while() will stop before x falls off the end of the string. Other than that it looks fine to me now. -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [CHECKER] possible missing capability check in ioctl function, drivers/net/cris/eth_v10.c, kernel 2.6.11
Ted Kremenek wrote: > Currently we are looking primarily into the > ioctls in drivers/net, Just as a small aside, a little over five years ago (wow does time fly!) I did a manual audit for mistakes like this: http://lkml.org/lkml/2000/3/7/156 Not sure if that's relevant to your work your not...I'm not sure how you would can catch most of these errors statistically since a lot of the dangerous ioctl's were only for a single device. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [CHECKER] possible missing capability check in ioctl function, drivers/net/cris/eth_v10.c, kernel 2.6.11
Ted Kremenek wrote: Currently we are looking primarily into the ioctls in drivers/net, Just as a small aside, a little over five years ago (wow does time fly!) I did a manual audit for mistakes like this: http://lkml.org/lkml/2000/3/7/156 Not sure if that's relevant to your work your not...I'm not sure how you would can catch most of these errors statistically since a lot of the dangerous ioctl's were only for a single device. -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ext3: Fix sparse -Wbitwise warnings.
Stephen C. Tweedie wrote: > If we want to fix this, let's fix the macros: for example, convert > EXT3_HAS_COMPAT_FEATURE to be > > ( le32_to_cpu(EXT3_SB(sb)->s_es->s_feature_compat) & (mask) ) Of course that's less efficient though since "mask" is probably constant.. so now the endian conversion changed from compile-time to run-time. Would something like ( ( EXT3_SB(sb)->s_es->s_feature_compat & cpu_to_le32(mask) ) != 0) be enough to satisfy sparse? -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ext3: Fix sparse -Wbitwise warnings.
Stephen C. Tweedie wrote: If we want to fix this, let's fix the macros: for example, convert EXT3_HAS_COMPAT_FEATURE to be ( le32_to_cpu(EXT3_SB(sb)-s_es-s_feature_compat) (mask) ) Of course that's less efficient though since mask is probably constant.. so now the endian conversion changed from compile-time to run-time. Would something like ( ( EXT3_SB(sb)-s_es-s_feature_compat cpu_to_le32(mask) ) != 0) be enough to satisfy sparse? -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Lockup in 2.4.2 kernel ADSL PCI card ATM driver module
[EMAIL PROTECTED] wrote: > SUBJECT: > Lockup in 2.4.2 kernel ADSL PCI card ATM driver module > > > DRIVER RESULTS: > Works fine in 2.4.0 kernel. > Locks up system (no messages/oops/etc.) in 2.4.2-2 kernel (rh 7.1). > Locks up system (no messages/oops/etc.) in 2.4.2 kernel (w/ or w/o kgdb). > Locks up system with "int 3" ??? message in 2.4.2-ac28 kernel. First, please try with a recent 2.4.5ac kernel - a number of ATM bugfixes have been made recently and they've been merged first with alan's branch. Anyway, there's a chance that you might just be hitting one of those bugs that have already been fixed, so it'd be best to try that first so we can exclude that possibility. > MESSAGE GENERATED IN 2.4.2-ac28: > int 3: > CPU: 0 > EIP: 0010:[] Please read the REPORTING-BUGS and Documentation/oops-tracing.txt files in the linux source tree - especially the parts about using the "ksymoops" tool. These numbers are useless to us unless you run it through that tool first. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Lockup in 2.4.2 kernel ADSL PCI card ATM driver module
[EMAIL PROTECTED] wrote: SUBJECT: Lockup in 2.4.2 kernel ADSL PCI card ATM driver module DRIVER RESULTS: Works fine in 2.4.0 kernel. Locks up system (no messages/oops/etc.) in 2.4.2-2 kernel (rh 7.1). Locks up system (no messages/oops/etc.) in 2.4.2 kernel (w/ or w/o kgdb). Locks up system with int 3 ??? message in 2.4.2-ac28 kernel. First, please try with a recent 2.4.5ac kernel - a number of ATM bugfixes have been made recently and they've been merged first with alan's branch. Anyway, there's a chance that you might just be hitting one of those bugs that have already been fixed, so it'd be best to try that first so we can exclude that possibility. MESSAGE GENERATED IN 2.4.2-ac28: int 3: CPU: 0 EIP: 0010:[c885478b] Please read the REPORTING-BUGS and Documentation/oops-tracing.txt files in the linux source tree - especially the parts about using the ksymoops tool. These numbers are useless to us unless you run it through that tool first. -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: symlink_prefix
Alexander Viro wrote: > I can see how to implement per-mountpoint variant. However, I'm > less than enthusiastic about the API side of that and about the > ugliness it will lead to. It smells like a wrong approach. And > no, I don't see a good one right now. Aren't we one day going to have stackable filesystems? This looks like an idea for a trivial translation filesystem to me. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: symlink_prefix
Alexander Viro wrote: I can see how to implement per-mountpoint variant. However, I'm less than enthusiastic about the API side of that and about the ugliness it will lead to. It smells like a wrong approach. And no, I don't see a good one right now. Aren't we one day going to have stackable filesystems? This looks like an idea for a trivial translation filesystem to me. -Mitch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Larger dev_t
Wichert Akkerman wrote: > You are just delaying the problem then, at some point your uptime will > be large enough that you have run through all 64bit pids for example. 64 bits is enough to fork 1 million processes per second for over 500,000 years. I think that's putting the problem off far enough. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Larger dev_t
Wichert Akkerman wrote: You are just delaying the problem then, at some point your uptime will be large enough that you have run through all 64bit pids for example. 64 bits is enough to fork 1 million processes per second for over 500,000 years. I think that's putting the problem off far enough. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: spinlock usage - ext2_get_block, lru_list_lock
Andreas Dilger wrote: > With per-group (or maybe per-bitmap) locking, files could be created in > parallel with only a small amount of global locking if they are in different > groups. ...and then we can let the disc go nuts seeking to actually commit all these new blocks. I suspect that this change would only be a win for memory-based discs where seek time is zero. I think that before anyone starts modifying the kernel for this they should benchmark how often the free block checker blocks on lock contention _AND_ how often the thread its contending with is looking for a free block in a _different_ allocation group. I bet it's not often at all. > It may also be > possible to have lazy updating of the superblock counts, and depend on > e2fsck to update the superblock counts on a crash. That sounds more promising. > , and only moving the deltas from the groups > to the superblock on sync or similar. If we're going to assume that e2fsck will correct the numbers anyway then there's really no reason to update them any time except when marking the filesystem clean (umount, remount-ro) As a bonus, we have to update the superblock then anyway. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: spinlock usage - ext2_get_block, lru_list_lock
Andreas Dilger wrote: With per-group (or maybe per-bitmap) locking, files could be created in parallel with only a small amount of global locking if they are in different groups. ...and then we can let the disc go nuts seeking to actually commit all these new blocks. I suspect that this change would only be a win for memory-based discs where seek time is zero. I think that before anyone starts modifying the kernel for this they should benchmark how often the free block checker blocks on lock contention _AND_ how often the thread its contending with is looking for a free block in a _different_ allocation group. I bet it's not often at all. It may also be possible to have lazy updating of the superblock counts, and depend on e2fsck to update the superblock counts on a crash. That sounds more promising. , and only moving the deltas from the groups to the superblock on sync or similar. If we're going to assume that e2fsck will correct the numbers anyway then there's really no reason to update them any time except when marking the filesystem clean (umount, remount-ro) As a bonus, we have to update the superblock then anyway. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [CHECKER] 120 potential dereference to invalid pointers errors for linux 2.4.1
Junfeng Yang wrote: > [BUG] fore200e_kmalloc can return NULL > /u2/acc/oses/linux/2.4.1/drivers/atm/fore200e.c:2032:fore200e_get_esi: >ERROR:NULL:2020:2032: Using unknown ptr "prom" illegally! set by >'fore200e_kmalloc':2020 I don't see the bug - there is an explicit "if(!prom) return -ENOMEM;" after the allocation. It looks fine to me. > [BUG] break the while loop, but not the for loop > /u2/acc/oses/linux/2.4.1/drivers/atm/zatm.c:1817:zatm_detect: ERROR:NULL:1804:1817: >Using NULL ptr "zatm_dev" illegally! set by 'kmalloc':1804 Ah, good catch. It'd be almost impossible to actually trigger this since you'd need multiple cards of different types (all of which are rare) and end up with really bad allocation luck, but it is technically a bug. Really line 1829 should be "if(!zatm_dev) return devs;" > [BUG] at line 1796 > /u2/acc/oses/linux/2.4.1/net/atm/lec.c:1799:lec_arp_update: ERROR:NULL:1798:1799: >Using unknown ptr "entry" illegally! set by 'make_entry':1798 Yep, all three of the catches in lec.c are real bugs - great work as always. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [CHECKER] 120 potential dereference to invalid pointers errors for linux 2.4.1
Junfeng Yang wrote: [BUG] fore200e_kmalloc can return NULL /u2/acc/oses/linux/2.4.1/drivers/atm/fore200e.c:2032:fore200e_get_esi: ERROR:NULL:2020:2032: Using unknown ptr "prom" illegally! set by 'fore200e_kmalloc':2020 I don't see the bug - there is an explicit "if(!prom) return -ENOMEM;" after the allocation. It looks fine to me. [BUG] break the while loop, but not the for loop /u2/acc/oses/linux/2.4.1/drivers/atm/zatm.c:1817:zatm_detect: ERROR:NULL:1804:1817: Using NULL ptr "zatm_dev" illegally! set by 'kmalloc':1804 Ah, good catch. It'd be almost impossible to actually trigger this since you'd need multiple cards of different types (all of which are rare) and end up with really bad allocation luck, but it is technically a bug. Really line 1829 should be "if(!zatm_dev) return devs;" [BUG] at line 1796 /u2/acc/oses/linux/2.4.1/net/atm/lec.c:1799:lec_arp_update: ERROR:NULL:1798:1799: Using unknown ptr "entry" illegally! set by 'make_entry':1798 Yep, all three of the catches in lec.c are real bugs - great work as always. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel 2.4.2 SMP + ATM hangs
Miguel Armas wrote: > A couple days ago we installed a Fore 200E ATM card and after getting the > ATM address using ilmid the machine hangs. The kernel still respond to > pings, but the userspace is dead. > > If we remove SMP support in the kernel everything works fine (but with > only one CPU)... You probably need the patch that Chas Williams came up with in January. I've been meaning to forward it, but I haven't yet. Please try it and see if it fixes your problem. -Mitch --- net/atm/addr.c.000 Thu Jan 25 16:26:24 2001 +++ net/atm/addr.c Thu Jan 25 16:26:10 2001 @@ -57,7 +57,6 @@ { struct atm_dev_addr *this; - spin_lock (_dev_lock); down(_lock); while (dev->local) { this = dev->local; @@ -65,7 +64,6 @@ kfree(this); } up(_lock); - spin_unlock (_dev_lock); notify_sigd(dev); } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dentry cache order 7 is broken
David S. Miller wrote: > - hash = hash ^ (hash >> D_HASHBITS) ^ (hash >> D_HASHBITS*2); > + hash = hash ^ (hash >> D_HASHBITS) ^ > + (hash >> (D_HASHBITS+(D_HASHBITS/2))); Two comments: 1. The inode-cache has the exact same problem, but it'll require a lot of RAM to run into it. The buffer and page caches don't have the same problem. 2. Given that D_HASHBITS is not a constant I wonder if there isn't a more efficient hash to be found. But I guess I'll leave that to the hashing experts. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: dentry cache order 7 is broken
David S. Miller wrote: - hash = hash ^ (hash D_HASHBITS) ^ (hash D_HASHBITS*2); + hash = hash ^ (hash D_HASHBITS) ^ + (hash (D_HASHBITS+(D_HASHBITS/2))); Two comments: 1. The inode-cache has the exact same problem, but it'll require a lot of RAM to run into it. The buffer and page caches don't have the same problem. 2. Given that D_HASHBITS is not a constant I wonder if there isn't a more efficient hash to be found. But I guess I'll leave that to the hashing experts. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]]
Ingo Molnar wrote: > - probably the most radical solution is what i suggested, to completely > avoid the unique-mapping of file structures to an integer range, and use > the address of the file structure (and some cookies) as an identification. IMO... gross. We do pretty much this exact thing in the ATM code (for the signalling daemon and the kernel exchainging status on VCCs) and it's pretty disgusting. I want to make it go away. > - a less radical solution would be to still map file structures to an > integer range (file descriptors) and usage-maintain files per processes, > but relax the 'allocate first non-allocated integer in the range' rule. [...] > fd = open(...,O_ANY); Yeah, this gets talked about, but I don't think a new flag for open is a good way to do this, because open() isn't the only thing that returns a new fd. What about socket()? pipe()? Maybe we could have a new prctl() control that turns this behavior on and off. Then you'd just have to be careful to turn it back off before calling any library functions that require ordering (like popen). Other than that, I think it'd be a good idea, especially if it could be implemented clean enough to make it CONFIG_'urable. That can't really be fairly judged until someone produces the code. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]]
Ingo Molnar wrote: - probably the most radical solution is what i suggested, to completely avoid the unique-mapping of file structures to an integer range, and use the address of the file structure (and some cookies) as an identification. IMO... gross. We do pretty much this exact thing in the ATM code (for the signalling daemon and the kernel exchainging status on VCCs) and it's pretty disgusting. I want to make it go away. - a less radical solution would be to still map file structures to an integer range (file descriptors) and usage-maintain files per processes, but relax the 'allocate first non-allocated integer in the range' rule. [...] fd = open(...,O_ANY); Yeah, this gets talked about, but I don't think a new flag for open is a good way to do this, because open() isn't the only thing that returns a new fd. What about socket()? pipe()? Maybe we could have a new prctl() control that turns this behavior on and off. Then you'd just have to be careful to turn it back off before calling any library functions that require ordering (like popen). Other than that, I think it'd be a good idea, especially if it could be implemented clean enough to make it CONFIG_'urable. That can't really be fairly judged until someone produces the code. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: include conventions /usr/include/linux/sys ?
H. Peter Anvin wrote: > I suggested include/kernel and include/arch with include/linux and > include/asm being reserved for the kernel interfaces (ioctl and their > structures mostly.) That sounds good. One other refinement I would like to see is that architecture specific but always present header files don't get used directly in architecture-independant .c flies. Some common examples of this are and The problem is that often there is some code that is identical between some or all of the archs. Then one of two things happen: * The code gets duplicated in many *.h files with all the bad things that comde duplication causes - especially since they're all under seperate maintainership * We have a big ugly conversion like what happened with 'asm/spinlock.h' -> 'linux/spinlock.h' If we only have arch-independant *.c files only include things out of (which may, of course, include things in ) we can avoid these conversions in the future and promote code reuse between the architectures. Also we should probably consider implementing a reasonalbe hierarchy to the proposed directory - for instance . -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [NEW DRIVER] firestream
First, I'd like to make a couple points about driver style that I'm trying to move towards with the ATM drivers. You're free to take them or leave them, but I want to eventually move the tree in this direction. * I don't like header files that define the registers of the chip - since the header file is only included in the driver's .c file you might as well just put the definitions there (unless, of course, there is good reason to think that the registers will be used in multiple drivers - unlikely in this case) Having a seperate header file just serves to hamper searching around the driver and cluttering the directory. * Please use the new PCI interface for new drivers (i.e. MODULE_DEVICE_TABLE and all that) > +int loopback = 0; > +int num=0x5a; These should be defined static. > +char *irq_bitname[] = { Should be static. > +#ifdef DEBUG > +#define fs_dprintk(f, str...) if (fs_debug & f) printk (str) Hint - you could use "(fs_debug & FS_DEBUG_##f)" in order to clean up the callers of this function. Also you might want to wrap this in a "do { ... } while(0)" in order to avoid expansion problems in the future. > +#ifdef DEBUG > +/* I didn't forget to set this to zero before shipping. Hit me with a stick > + if you get this with the debug default not set to zero again. -- REW */ > +int fs_debug = 0; Again, should be static. > +#ifdef MODULE > +#ifdef DEBUG > +MODULE_PARM(fs_debug, "i"); There's no reason to wrap these "MODULE_PARM"s inside an "#ifdef MODULE". > +#define MIN(a,b) (((a)<(b))?(a):(b)) You don't seem to ever use this definition. > +#else /* DEBUG */ > +static void my_hd (void *addr, int len){} > +#endif /* DEBUG */ You might as well make this a null #define in this case. > +/* Hmm. If this is ATM specific, why isn't there an ATM routine for this? > + * I copied it over from the ambassador driver. -- REW */ > + > +static inline void fs_kfree_skb (struct sk_buff * skb) Yes, in the 2.5 timeframe this wart will be cleaned up (by always requiring vcc->pop to be !=NULL) > +/* It seems the ATM forum recomends this horribly complicated 16bit > + * floating point format. Turns out the Ambassador uses the exact same > + * encoding. I just copied it over. If Mitch agrees, I'll move it over > + * to the atm_misc file or something like that. (and remove it from > + * here and the ambassador driver) -- REW > + */ Yep, in 2.5 we can do something like that (although I want to look at it a bit further) > +void write_fs (struct fs_dev *dev, int offset, int val) Should be static. And probably inline. "val" should be a u32. > +unsigned int read_fs (struct fs_dev *dev, int offset) Ditto. Return value should be u32. > +struct FS_QENTRY *get_qentry (struct fs_dev *dev, struct queue *q) Should be static > +void submit_qentry (struct fs_dev *dev, struct queue *q, struct FS_QENTRY *qe) Should be static > +void submit_queue (struct fs_dev *dev, struct queue *q, > +u32 cmd, u32 p1, u32 p2, u32 p3) Should be static > +void submit_command (struct fs_dev *dev, struct queue *q, Should be static > +void process_return_queue (struct fs_dev *dev, struct queue *q) Should be static > +void process_txdone_queue (struct fs_dev *dev, struct queue *q) Should be static > +void process_incoming (struct fs_dev *dev, struct queue *q) Should be static > +static int fs_send (struct atm_vcc *atm_vcc, struct sk_buff *skb) [...] > + vcc->last_skb = skb; I'm really leary of this... it looks to me that if we already had been in the process of sending an skb then sends after that will lose that skb (and leak the associated memory). Please reference the recent thread on the linux-atm mailing list about how to deal with TX backlog. also: > + td = kmalloc (sizeof (struct FS_BPENTRY), GFP_ATOMIC); > + fs_dprintk (FS_DEBUG_ALLOC, "Alloc transd: %p(%d)\n", td, sizeof (struct >FS_BPENTRY)); > + if (!td) { > + /* Oops out of mem */ > + return -ENOMEM; > + } What frees the skb in this case? > +void undocumented_pci_fix (struct pci_dev *pdev) Should be static > +void write_phy (struct fs_dev *dev, int regnum, int val) Should be static > +int init_phy (struct fs_dev *dev, struct reginit_item *reginit) Should be static > +void reset_chip (struct fs_dev *dev) Should be static > +void *aligned_kmalloc (int size, int flags, int alignment) Should be static > +{ > + void *t; > + > + if (alignment <= 0x10) { > + t = kmalloc (size, flags); > + if ((unsigned int)t & (alignment-1)) { > + printk ("Kmalloc doesn't align things correctly! %p\n", t); > + kfree (t); > + return aligned_kmalloc (size, flags, alignment * 4); Uh, ok I'd prefer if you just died here - there shouldn't be any way that kmalloc is going to return something that isn't 16-byte aligned. It's wise to double check this when it's important but I
Re: [PATCH] atmrefcount
Patrick van de Lageweg wrote: > This patch contains the fix for the atmrefcount problem (noted as a > critical problem in Ted's todo list). It also has the makefile > modifications for the firestream driver in a separate email. Could you please split the two patches so that they're actually independant (i.e. put all the stuff for the new driver in its own patch). I'll comment on that driver seperately. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] atmrefcount
Patrick van de Lageweg wrote: This patch contains the fix for the atmrefcount problem (noted as a critical problem in Ted's todo list). It also has the makefile modifications for the firestream driver in a separate email. Could you please split the two patches so that they're actually independant (i.e. put all the stuff for the new driver in its own patch). I'll comment on that driver seperately. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [NEW DRIVER] firestream
First, I'd like to make a couple points about driver style that I'm trying to move towards with the ATM drivers. You're free to take them or leave them, but I want to eventually move the tree in this direction. * I don't like header files that define the registers of the chip - since the header file is only included in the driver's .c file you might as well just put the definitions there (unless, of course, there is good reason to think that the registers will be used in multiple drivers - unlikely in this case) Having a seperate header file just serves to hamper searching around the driver and cluttering the directory. * Please use the new PCI interface for new drivers (i.e. MODULE_DEVICE_TABLE and all that) +int loopback = 0; +int num=0x5a; These should be defined static. +char *irq_bitname[] = { Should be static. +#ifdef DEBUG +#define fs_dprintk(f, str...) if (fs_debug f) printk (str) Hint - you could use "(fs_debug FS_DEBUG_##f)" in order to clean up the callers of this function. Also you might want to wrap this in a "do { ... } while(0)" in order to avoid expansion problems in the future. +#ifdef DEBUG +/* I didn't forget to set this to zero before shipping. Hit me with a stick + if you get this with the debug default not set to zero again. -- REW */ +int fs_debug = 0; Again, should be static. +#ifdef MODULE +#ifdef DEBUG +MODULE_PARM(fs_debug, "i"); There's no reason to wrap these "MODULE_PARM"s inside an "#ifdef MODULE". +#define MIN(a,b) (((a)(b))?(a):(b)) You don't seem to ever use this definition. +#else /* DEBUG */ +static void my_hd (void *addr, int len){} +#endif /* DEBUG */ You might as well make this a null #define in this case. +/* Hmm. If this is ATM specific, why isn't there an ATM routine for this? + * I copied it over from the ambassador driver. -- REW */ + +static inline void fs_kfree_skb (struct sk_buff * skb) Yes, in the 2.5 timeframe this wart will be cleaned up (by always requiring vcc-pop to be !=NULL) +/* It seems the ATM forum recomends this horribly complicated 16bit + * floating point format. Turns out the Ambassador uses the exact same + * encoding. I just copied it over. If Mitch agrees, I'll move it over + * to the atm_misc file or something like that. (and remove it from + * here and the ambassador driver) -- REW + */ Yep, in 2.5 we can do something like that (although I want to look at it a bit further) +void write_fs (struct fs_dev *dev, int offset, int val) Should be static. And probably inline. "val" should be a u32. +unsigned int read_fs (struct fs_dev *dev, int offset) Ditto. Return value should be u32. +struct FS_QENTRY *get_qentry (struct fs_dev *dev, struct queue *q) Should be static +void submit_qentry (struct fs_dev *dev, struct queue *q, struct FS_QENTRY *qe) Should be static +void submit_queue (struct fs_dev *dev, struct queue *q, +u32 cmd, u32 p1, u32 p2, u32 p3) Should be static +void submit_command (struct fs_dev *dev, struct queue *q, Should be static +void process_return_queue (struct fs_dev *dev, struct queue *q) Should be static +void process_txdone_queue (struct fs_dev *dev, struct queue *q) Should be static +void process_incoming (struct fs_dev *dev, struct queue *q) Should be static +static int fs_send (struct atm_vcc *atm_vcc, struct sk_buff *skb) [...] + vcc-last_skb = skb; I'm really leary of this... it looks to me that if we already had been in the process of sending an skb then sends after that will lose that skb (and leak the associated memory). Please reference the recent thread on the linux-atm mailing list about how to deal with TX backlog. also: + td = kmalloc (sizeof (struct FS_BPENTRY), GFP_ATOMIC); + fs_dprintk (FS_DEBUG_ALLOC, "Alloc transd: %p(%d)\n", td, sizeof (struct FS_BPENTRY)); + if (!td) { + /* Oops out of mem */ + return -ENOMEM; + } What frees the skb in this case? +void undocumented_pci_fix (struct pci_dev *pdev) Should be static +void write_phy (struct fs_dev *dev, int regnum, int val) Should be static +int init_phy (struct fs_dev *dev, struct reginit_item *reginit) Should be static +void reset_chip (struct fs_dev *dev) Should be static +void *aligned_kmalloc (int size, int flags, int alignment) Should be static +{ + void *t; + + if (alignment = 0x10) { + t = kmalloc (size, flags); + if ((unsigned int)t (alignment-1)) { + printk ("Kmalloc doesn't align things correctly! %p\n", t); + kfree (t); + return aligned_kmalloc (size, flags, alignment * 4); Uh, ok I'd prefer if you just died here - there shouldn't be any way that kmalloc is going to return something that isn't 16-byte aligned. It's wise to double check this when it's important but I wouldn't put too much work into fixing this. And calling ourselves
Re: include conventions /usr/include/linux/sys ?
H. Peter Anvin wrote: I suggested include/kernel and include/arch with include/linux and include/asm being reserved for the kernel interfaces (ioctl and their structures mostly.) That sounds good. One other refinement I would like to see is that architecture specific but always present header files don't get used directly in architecture-independant .c flies. Some common examples of this are asm/io.h and asm/uaccess.h The problem is that often there is some code that is identical between some or all of the archs. Then one of two things happen: * The code gets duplicated in many *.h files with all the bad things that comde duplication causes - especially since they're all under seperate maintainership * We have a big ugly conversion like what happened with 'asm/spinlock.h' - 'linux/spinlock.h' If we only have arch-independant *.c files only include things out of kernel/* (which may, of course, include things in arch/*) we can avoid these conversions in the future and promote code reuse between the architectures. Also we should probably consider implementing a reasonalbe hierarchy to the proposed kernel/*.h directory - for instance kernel/bus/pci.h. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel OOPS on boot
Richard B. Johnson wrote: > > And here is the broken routine: > > > > 03f4 : [...] > This is not good code. It does the following: > > o Gets a parameter off the stack and puts into eax (a pointer). > o Put the value 1 into ecx. > o Take a byte from the pointed-to location and put it into dl. > o Put the byte in dl back into memory where it came from! > o Clear ecx > o Copy ecx to eax for a return value of 0 If you look at the original routine (arch/i386/mm/init.c:test_wp_bit), you'll see that its purpose is to test the capabilities of the CPU's MMU (to differentiate the old 386's that didn't honor WP in supervisor mode), which is why it does this little dance - to provoke a page fault. Anyway, the probable fix has already been posted (i.e. don't use gcc 2.7.2.*) -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel OOPS on boot
Richard B. Johnson wrote: And here is the broken routine: 03f4 do_test_wp_bit: [...] This is not good code. It does the following: o Gets a parameter off the stack and puts into eax (a pointer). o Put the value 1 into ecx. o Take a byte from the pointed-to location and put it into dl. o Put the byte in dl back into memory where it came from! o Clear ecx o Copy ecx to eax for a return value of 0 If you look at the original routine (arch/i386/mm/init.c:test_wp_bit), you'll see that its purpose is to test the capabilities of the CPU's MMU (to differentiate the old 386's that didn't honor WP in supervisor mode), which is why it does this little dance - to provoke a page fault. Anyway, the probable fix has already been posted (i.e. don't use gcc 2.7.2.*) -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > > * it doesn't add extra syscalls > > Sure it does. > > What do you think ioctl's are? As I explained a few lines down from where you stopped quoting (and probably stopped reading) the ioctl() use is just an artifact of Solaris's icky implementation. It could and should use read(). Not an extra syscall. > Sure, you can have multiple "poll lists" by opening /dev/poll multiple > times, and you seem to see this as a great feature. I'm telling you that > multiple event queues are a horrible disaster, and it's a perfect example > of what an engineer with the "more is better" mindset comes up with. It has nothing to do with how exciting of a feature it is. It's a matter "hey, if you implement it with /dev/poll, you get multiple queues FOR FREE". Those sorts of things are an indication that you've got the implementation right. The fact that in your scheme multiple queues would be extra work is just residue from the fact that you've got the idea wrong. And how do you propose dealing with multi-threaded programs. Obviously for SMP you'd want multiple threads to be able to grab events, so I guess the "one great queue" will be shared among all the threads. But, suppose I want to dedicate a thread to coping with a specific set of events. H... guess you'll need a CLONE_EVENTQUEUE flag to support that. With /dev/poll none of this foolishness is needed - you just open tow of them and let the threads do what they want. > Multiple event queues are bad, because it completely breaks the notion of > even-driven programming. How do you want to listen to them all? You can't. Explained in detail in my mail. > You can only listen to one event queue at a time - unless you create some > kind of "hierarchy" of event queues, where you have one event queue that > you wait on that contains the _other_ event queues, so that you can tell > which one has anything pending.. No, event queues of event queues are messy in implementation. There's nothing fundamentally wrong with them, but you need to make sure the graph stays acyclic which is just an implementation hassle. Not enough gain for too much code (unless someone has a beautiful algorithm) However, it's perfectly reasonable to use select() or poll() on two event queues. As I explained in my email, this is needed if you're already using a library that does its own event queue based on poll or select. > Basically, tastes differ. Everything you seem to see as an advantage of > /dev/poll, I see as a horrible kludge. Well, everything I see indicates that /dev/poll would be the easier of the two interfaces to actually implement. All the added flexibility is just gravy. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > Here's a suggested "good" interface that would certainly be easy to > implement, and very easy to use, with none of the scalability issues that > many interfaces have. I think everyone should take a timeout and look at Solaris 8's /dev/poll interface. This discussion is reinventing the wheel, the lever, and the inclined plane. http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123 I think this is a lot cleaner than your approach: * it doesn't add extra syscalls * it handles multiple event queues, and does so without ugliness. all the per-queue state is held in the /dev/poll's "struct file" instance * in your method you have a per-process queue - but under what clone() conditions is it shared among siblings? Here the user has a choice - they can share the open("/dev/poll") file descriptor or not using any of the existing means. Granted, they also would probably want to arrange to share the fd's being polled in order to make this useful. * No new fields in task_struct A few simple improvements can be made to the Sun model, though: * The fact that the fd of /dev/poll can't be used for poll(2) or select(2) is ugly. Sure, you probably don't want an open instance of /dev/poll being added to another /dev/poll, but being able to call "select" on them would be really useful: 1. Imagine a server that has to process connections from both high-priority and low-priority clients - and that requests from the high-priority ones always take precedence. With this interface you could easily have two open instances of /dev/poll and then call select(2) on them. This ability just falls out naturally from the interface. 2. Some libraries are internally driven by select(2) loops (I think Xlib is like this, IIRC) If you have a lot of fd's you want to watch, this means you must take the hit of calling select(2) on all of them. If you could just pass in a fd for /dev/poll, problem solved. * I think the fact that you add events via write(2) but retrieve them via ioctl(2) is an ugly asymmetry. From what I can see, the entire motivation for using ioctl as opposed to read(2) is to allow the user to specify a timeout. If you could use poll(2)/select(2) on /dev/poll this issue would be moot (see above) * It would be nice if the interface were flexible enough to report items _other_ than "activity on fd" in the future. Maybe SYSV IPC? itimers? directory update notification? It seems that every time UNIX adds a mechanism of waiting for events, we spoil it by not making it flexible enough to wait on everything you might want. Lets not repeat the mistake with a new interface. * The "struct pollfd" and "struct dvpoll" should also include a 64-bit opaque quantity supplied by userland which is returned with each event on that fd. This would save the program from having to look up which connection context goes with each fd - the kernel could just give you the pointer to the structure. Not having this capability isn't a burden for programs dealing with a small number of fd's (since they can just have a lookup table) but if you potentially have 1's of connections it may be undesirable to allocate an array for them all. The linux patch of /dev/poll implements mmap'ing of the in-kenrel poll table... I don't think this is a good idea. First, the user just wants to be able to add events and dequeue them - both linear operations. Second, why should the kernel be forced to maintain a fixed-size linear list of events we're looking for... this seems mindlessly inefficient. Why not just pull a "struct pollfd" out of slab each time a new event is listened for? My unresolved concerns: * Is this substantially better than the already existing rtsig+poll solution? Enough to make implementing it worth while? * How do we quickly do the "struct file" -> "struct pollfd" conversion each time an event happens? Especially if there are multiple /dev/poll instances open in the current process. Probably each "struct file" would need a pointer to the instance of /dev/poll which would have some b-tree variant (or maybe a hash table). The limitation would be that a single fd couldn't be polled for events by two different /dev/poll instances, even for different events. This is probably good for sanity's sake anyway. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: Here's a suggested "good" interface that would certainly be easy to implement, and very easy to use, with none of the scalability issues that many interfaces have. I think everyone should take a timeout and look at Solaris 8's /dev/poll interface. This discussion is reinventing the wheel, the lever, and the inclined plane. http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123 I think this is a lot cleaner than your approach: * it doesn't add extra syscalls * it handles multiple event queues, and does so without ugliness. all the per-queue state is held in the /dev/poll's "struct file" instance * in your method you have a per-process queue - but under what clone() conditions is it shared among siblings? Here the user has a choice - they can share the open("/dev/poll") file descriptor or not using any of the existing means. Granted, they also would probably want to arrange to share the fd's being polled in order to make this useful. * No new fields in task_struct A few simple improvements can be made to the Sun model, though: * The fact that the fd of /dev/poll can't be used for poll(2) or select(2) is ugly. Sure, you probably don't want an open instance of /dev/poll being added to another /dev/poll, but being able to call "select" on them would be really useful: 1. Imagine a server that has to process connections from both high-priority and low-priority clients - and that requests from the high-priority ones always take precedence. With this interface you could easily have two open instances of /dev/poll and then call select(2) on them. This ability just falls out naturally from the interface. 2. Some libraries are internally driven by select(2) loops (I think Xlib is like this, IIRC) If you have a lot of fd's you want to watch, this means you must take the hit of calling select(2) on all of them. If you could just pass in a fd for /dev/poll, problem solved. * I think the fact that you add events via write(2) but retrieve them via ioctl(2) is an ugly asymmetry. From what I can see, the entire motivation for using ioctl as opposed to read(2) is to allow the user to specify a timeout. If you could use poll(2)/select(2) on /dev/poll this issue would be moot (see above) * It would be nice if the interface were flexible enough to report items _other_ than "activity on fd" in the future. Maybe SYSV IPC? itimers? directory update notification? It seems that every time UNIX adds a mechanism of waiting for events, we spoil it by not making it flexible enough to wait on everything you might want. Lets not repeat the mistake with a new interface. * The "struct pollfd" and "struct dvpoll" should also include a 64-bit opaque quantity supplied by userland which is returned with each event on that fd. This would save the program from having to look up which connection context goes with each fd - the kernel could just give you the pointer to the structure. Not having this capability isn't a burden for programs dealing with a small number of fd's (since they can just have a lookup table) but if you potentially have 1's of connections it may be undesirable to allocate an array for them all. The linux patch of /dev/poll implements mmap'ing of the in-kenrel poll table... I don't think this is a good idea. First, the user just wants to be able to add events and dequeue them - both linear operations. Second, why should the kernel be forced to maintain a fixed-size linear list of events we're looking for... this seems mindlessly inefficient. Why not just pull a "struct pollfd" out of slab each time a new event is listened for? My unresolved concerns: * Is this substantially better than the already existing rtsig+poll solution? Enough to make implementing it worth while? * How do we quickly do the "struct file" - "struct pollfd" conversion each time an event happens? Especially if there are multiple /dev/poll instances open in the current process. Probably each "struct file" would need a pointer to the instance of /dev/poll which would have some b-tree variant (or maybe a hash table). The limitation would be that a single fd couldn't be polled for events by two different /dev/poll instances, even for different events. This is probably good for sanity's sake anyway. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: * it doesn't add extra syscalls Sure it does. What do you think ioctl's are? As I explained a few lines down from where you stopped quoting (and probably stopped reading) the ioctl() use is just an artifact of Solaris's icky implementation. It could and should use read(). Not an extra syscall. Sure, you can have multiple "poll lists" by opening /dev/poll multiple times, and you seem to see this as a great feature. I'm telling you that multiple event queues are a horrible disaster, and it's a perfect example of what an engineer with the "more is better" mindset comes up with. It has nothing to do with how exciting of a feature it is. It's a matter "hey, if you implement it with /dev/poll, you get multiple queues FOR FREE". Those sorts of things are an indication that you've got the implementation right. The fact that in your scheme multiple queues would be extra work is just residue from the fact that you've got the idea wrong. And how do you propose dealing with multi-threaded programs. Obviously for SMP you'd want multiple threads to be able to grab events, so I guess the "one great queue" will be shared among all the threads. But, suppose I want to dedicate a thread to coping with a specific set of events. H... guess you'll need a CLONE_EVENTQUEUE flag to support that. With /dev/poll none of this foolishness is needed - you just open tow of them and let the threads do what they want. Multiple event queues are bad, because it completely breaks the notion of even-driven programming. How do you want to listen to them all? You can't. Explained in detail in my mail. You can only listen to one event queue at a time - unless you create some kind of "hierarchy" of event queues, where you have one event queue that you wait on that contains the _other_ event queues, so that you can tell which one has anything pending.. No, event queues of event queues are messy in implementation. There's nothing fundamentally wrong with them, but you need to make sure the graph stays acyclic which is just an implementation hassle. Not enough gain for too much code (unless someone has a beautiful algorithm) However, it's perfectly reasonable to use select() or poll() on two event queues. As I explained in my email, this is needed if you're already using a library that does its own event queue based on poll or select. Basically, tastes differ. Everything you seem to see as an advantage of /dev/poll, I see as a horrible kludge. Well, everything I see indicates that /dev/poll would be the easier of the two interfaces to actually implement. All the added flexibility is just gravy. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-testx fr0kedness?
Jason Slagle wrote: > It worked fine under 2.3.x and as a matter of fact worked fine under > 2.4.0-test1-4 I believe. > > I don't buy a hardware explination when I've been running this setup for > well over a year and only just now have problems :) It's because we've seen it a hundred zillion times before... weird problems with overclocked systems that the owners claim "work fine under any other version, so it must be a kernel bug". You could well be on the edge of where overclocking breaks down. There's some sequence of instructions and/or memory access in the kernel that is causing your particular CPU to croak every billionth time through. Maybe previous kernel compiles didn't emit those exact instructions and you lucked out. Anyway, turn off overclokcing and try to reproduce. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-testx fr0kedness?
Jason Slagle wrote: It worked fine under 2.3.x and as a matter of fact worked fine under 2.4.0-test1-4 I believe. I don't buy a hardware explination when I've been running this setup for well over a year and only just now have problems :) It's because we've seen it a hundred zillion times before... weird problems with overclocked systems that the owners claim "work fine under any other version, so it must be a kernel bug". You could well be on the edge of where overclocking breaks down. There's some sequence of instructions and/or memory access in the kernel that is causing your particular CPU to croak every billionth time through. Maybe previous kernel compiles didn't emit those exact instructions and you lucked out. Anyway, turn off overclokcing and try to reproduce. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: atm_devs protection.
Werner Almesberger wrote: > Mitchell Blank Jr wrote: > > Yeah, a lot of the add/remove device ATM code (and, IMO, even the vcc > > open/close) code is pretty suspect. > > That's actually a bit of an understatement: Well, I was trying to be polite to the original author ;-) -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: atm_devs protection.
Rogier Wolff wrote: > We're trying to make the module refcounting 'secure' against > concurrent SMP unloads. > > For example in net/atm/resources.c: Yeah, a lot of the add/remove device ATM code (and, IMO, even the vcc open/close) code is pretty suspect. If you want to look through and liberally sprinkle lock_kernel() everywhere, that wouldn't really be a bad idea. A lot of this code hasn't changed much since the 1.3 days, so I don't know if anyone ever gave a good look at its SMP friendlyness. At one point I cleaned up a few problems, but I'n not sure I got them all. I think most of the races that are easy to trigger are gone (at least in the common code, can't be sure about all the drivers). As far as module_unload, isn't that protected by lock_kernel? What am I missing? -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: atm_devs protection.
Werner Almesberger wrote: Mitchell Blank Jr wrote: Yeah, a lot of the add/remove device ATM code (and, IMO, even the vcc open/close) code is pretty suspect. That's actually a bit of an understatement: Well, I was trying to be polite to the original author ;-) -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Updated 2.4 TODO List
[EMAIL PROTECTED] wrote: > 4. Boot Time Failures [...] > * IBM Thinkpad 390 won't boot since 2.3.11 (See Decklin Foster for >more info) I _highly_ suspect that this is not a 2.4 bug but is instead user error. I've seen it several times. On all kernel versions prior to 2.3.11 if you compiled with the CPU set to a newer model then you had (for instance, of you selected "Pentium Pro" and you actually had a 486) you would get a kernel that worked fine albiet not optimally. On kernels 2.3.11 you will have an unbootable kernel. Decklin - please try to reproduce this with CONFIG_M386. Also you might try looking in /proc/cpuinfo and make sure that laptop has the CPU you think it has. The reason I know which kernel version this problem appeared in is because I spent an entire evening chasing this same "kernel bug" a couple months ago. Turned out the machine I could have _sworn_ was a Celeron was a K6 :-) -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Updated 2.4 TODO List
[EMAIL PROTECTED] wrote: 4. Boot Time Failures [...] * IBM Thinkpad 390 won't boot since 2.3.11 (See Decklin Foster for more info) I _highly_ suspect that this is not a 2.4 bug but is instead user error. I've seen it several times. On all kernel versions prior to 2.3.11 if you compiled with the CPU set to a newer model then you had (for instance, of you selected "Pentium Pro" and you actually had a 486) you would get a kernel that worked fine albiet not optimally. On kernels 2.3.11 you will have an unbootable kernel. Decklin - please try to reproduce this with CONFIG_M386. Also you might try looking in /proc/cpuinfo and make sure that laptop has the CPU you think it has. The reason I know which kernel version this problem appeared in is because I spent an entire evening chasing this same "kernel bug" a couple months ago. Turned out the machine I could have _sworn_ was a Celeron was a K6 :-) -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0test - uid_hash_find()
Dan Aloni wrote: > > I've been touring around the kernel sources when I stumbled > across the uid_hash_find() function (kernel/user.c): > > static inline struct user_struct *uid_hash_find(unsigned short uid, unsigned int >hashent) Is it just me, or should that be "uid_t" not "unsigned short"? Now THAT looks like a bug. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Calling current() from interrupt context
[EMAIL PROTECTED] wrote: > > Yes, on architectures that use current->processor that is an exception > > to the rule. After all, you know for sure that you're still on the > > same CPU as the task currently running. > > This makes sense. And I wish cpu architects would put a cpu-id > register somewhere > so that we could have fast computation of cpu-id on smp machines. Some do (look at asm-sparc/smp.h). However, linux does everything possible to make dereferencing "current" as fast as possible, so actually just looping it up there is pretty damn efficient. We need to keep track of the last cpu a processor ran on anyway to do cpu-affinity heuristics in the scheduler. So this implementation is quite optimal, actually. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Calling current() from interrupt context
[EMAIL PROTECTED] wrote: > Looking at the code, I don't see any places where "current" is not valid. > Got some examples? It's not that its invalid, it just doesn't make much sense. It points to whatever task happened to be running when the interrupt happened. So any attempt to access it is 99% likely to be a bug. > BTW: there is an implicit reference to "current" in smp_processor_id. Yes, on architectures that use current->processor that is an exception to the rule. After all, you know for sure that you're still on the same CPU as the task currently running. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Calling current() from interrupt context
[EMAIL PROTECTED] wrote: Looking at the code, I don't see any places where "current" is not valid. Got some examples? It's not that its invalid, it just doesn't make much sense. It points to whatever task happened to be running when the interrupt happened. So any attempt to access it is 99% likely to be a bug. BTW: there is an implicit reference to "current" in smp_processor_id. Yes, on architectures that use current-processor that is an exception to the rule. After all, you know for sure that you're still on the same CPU as the task currently running. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Calling current() from interrupt context
[EMAIL PROTECTED] wrote: Yes, on architectures that use current-processor that is an exception to the rule. After all, you know for sure that you're still on the same CPU as the task currently running. This makes sense. And I wish cpu architects would put a cpu-id register somewhere so that we could have fast computation of cpu-id on smp machines. Some do (look at asm-sparc/smp.h). However, linux does everything possible to make dereferencing "current" as fast as possible, so actually just looping it up there is pretty damn efficient. We need to keep track of the last cpu a processor ran on anyway to do cpu-affinity heuristics in the scheduler. So this implementation is quite optimal, actually. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0test - uid_hash_find()
Dan Aloni wrote: I've been touring around the kernel sources when I stumbled across the uid_hash_find() function (kernel/user.c): static inline struct user_struct *uid_hash_find(unsigned short uid, unsigned int hashent) Is it just me, or should that be "uid_t" not "unsigned short"? Now THAT looks like a bug. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
James Sutherland wrote: > In terms of latency, I'd suggest we aim to keep the device in use all the > time we have outstanding requests: every time the device is ready to > accept a request, we feed it the "next" one in the queue; until it is free > again, requests pile up in the queue, being sorted, merged etc. > > This should perform very well under light loads - requests just get passed > to the device as fast as they can be handled - and under very heavy loads > - we have a large queue in play, with plenty of scope for reordering, > merges etc. The "large queue" goes against the whole point of this exercise - that is that if there are many items in the "queue" being sorted then unlucky requests can end up waiting a long time to get serviced. You want to have some reasonable break-off point where you just decide to make an elevator swipe in order to avoid starving those requests. Keep in mind that I'm using the word "queue" in a confusing manner here - we're talking about how much we'll try to elevator-sort in a go, not the actual queue of all requests. I've been calling this a queue because from a performance standpoint that's what it is, so I think it helps to think of the problem that way. Oh yeah one other thing about huge elevator queues to think about (although its not too pertinant these days). A friend of mine had a job 15 years or so ago to working on a commercial UNIX flavor. They were having some severe I/O performance problems under high loads. The problem was that they were trying to sort and merge all the pending requests. When I/O was heavy this ended up chewing all the CPU time and the kernel wouldn't actually be able to keep up sending requests to the disk :-) However, keep in mind that this was int the bad-old-days when "hard drive" meant a rack-mount Eagle and the elevator was expected to consider things like head positioning and rotational speed when doing a sort... -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal: Linux Kernel Patch Management System
Alexander Viro wrote: > BTW, any bug reports starting with "kernel is x.y.z + FOO42069 + K314 + > " will be cheerfully flushed down the toilet here, > no matter what system of dependencies is going to be in place. Yes, for the stuff discussed on lkml patch dependencies should be pretty minimal. However, if I were discussing something on linux-m68k it would be common to say "kernel is 2.5.18 + m68k-native-patch-2.5.18 + mac68k-patch-2.5.18" -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal: Linux Kernel Patch Management System
Alan Cox wrote: > Humans will generally get a weird report from sending Linus a message wonder > what all this field stuff is and mail someone else instead. If they're able to create a patch, hopefully they'd be able to fill in a simple email template (and I've seen some pretty dim folks manage to register domains with InterNIC, so email templates aren't that hard :-) We could even have a simple web page where you check a few boxes and fill in "URL to patch" and hit submit. > The thing that needs the most thinking about is how to handle patches that > don't have all this id crap and stuff on them. Like any tracking system, the suceess or failure of its rollout depends completely on whether Linus et al will be steadfast enough to refuse to look at any patch that hasn't gone through the system. It's just like when you're converting a helpdesk to use a ticket tracking system. You need to teach the tech people that when they run into someone in the hallway that says "hey, my printer is broken, come fix it" they they need to reply "you must be mistaken - if your printer were broken you would have filed a trouble ticket... since I have no trouble ticket your printer must be fine". People catch on quick. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Alan Cox wrote: > > Yes, but "how hard is it reasonable for the kernel to try" is based on > > both items. A good first order approximation is number of requests. > > I must strongly disagree with that claim. A request could be 512 bytes or > 128K. Yeah, as sct pointed out this gets thorny. For a modern harddrive this probably doesn't matter (since sequential I/O is SO fast compared to seek times) but for other devices its an issue. > > ...where the user sets a number exlpicitly for what performance they > > want. Again, if we're going to make the user set this latency > > No they do not. The parameters are defined by the bandwidth and measured > behaviour. Hmmm... well if someone wants to propose an algorithm to self-tune the "queue depth in milliseconds" number then maybe we'll get somewhere. You'd need to do some sort of moving averages of both requests/sec and sectors/sec that come out of the elevator and use those as feedback to adjust the queue-depth-value. I'm not sure if this is advisable, but at least it sounds feasable. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal: Linux Kernel Patch Management System
Daniel Quinlan wrote: > "Version" - the base kernel version. For example, "2.4.0-test8-pre1". > The web page will list valid version strings. Ideally this should be overridable for patches marked experimental, since they might be to some modified kernel. Of course you wouldn't do an test for applyability :-) >Linus wants the body of patches to be in text format and not >MIME-encoded or uuencoded. Well, since it'll be machine-parsed anyway, can't we just fix that up? >Good patches are sent to the linux-kernel mailing list which is >also where additional discussion about these patches can take >place. ...and the patch-id should go in the Subject line i.e. Subject: [PATCHID: KP3004] fix for video card catching fire Then the web interface to the patch database could also have the archived discussion available right there Also there should be some flag to turn off the "automatic posting" feature for the truly TRULY trivial patches ("i found a spelling error in a comment") that just don't merit discussion > - Linus should reject most out-of-band patches if this is to work. Yes, but you should be able to select people other than Linus that they want the patch sent to (and probably different mailing lists you want the patch discussed on). That way this can scale to the various maintainers. For instance if I haved a great new networking feature I could have the system send it to davem (assuming that he signs up for it - I'm just using him as an example) and discussion of it on netdev. David decides that he likes the patch, and he sends a (GPG/PGP signed) message to the server indicating this. Now the patch goes into Linus's queue (and lkml) with a note indicating that davem has blessed it. Of course, other developers with public keys on file could also tag it so that if a patch touches several different subsystems it can get "signed off on" by the involved maintainers before Linus has to make a call. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal: Linux Kernel Patch Management System
Daniel Quinlan wrote: "Version" - the base kernel version. For example, "2.4.0-test8-pre1". The web page will list valid version strings. Ideally this should be overridable for patches marked experimental, since they might be to some modified kernel. Of course you wouldn't do an test for applyability :-) Linus wants the body of patches to be in text format and not MIME-encoded or uuencoded. Well, since it'll be machine-parsed anyway, can't we just fix that up? Good patches are sent to the linux-kernel mailing list which is also where additional discussion about these patches can take place. ...and the patch-id should go in the Subject line i.e. Subject: [PATCHID: KP3004] fix for video card catching fire Then the web interface to the patch database could also have the archived discussion available right there Also there should be some flag to turn off the "automatic posting" feature for the truly TRULY trivial patches ("i found a spelling error in a comment") that just don't merit discussion - Linus should reject most out-of-band patches if this is to work. Yes, but you should be able to select people other than Linus that they want the patch sent to (and probably different mailing lists you want the patch discussed on). That way this can scale to the various maintainers. For instance if I haved a great new networking feature I could have the system send it to davem (assuming that he signs up for it - I'm just using him as an example) and discussion of it on netdev. David decides that he likes the patch, and he sends a (GPG/PGP signed) message to the server indicating this. Now the patch goes into Linus's queue (and lkml) with a note indicating that davem has blessed it. Of course, other developers with public keys on file could also tag it so that if a patch touches several different subsystems it can get "signed off on" by the involved maintainers before Linus has to make a call. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal: Linux Kernel Patch Management System
Alan Cox wrote: Humans will generally get a weird report from sending Linus a message wonder what all this field stuff is and mail someone else instead. If they're able to create a patch, hopefully they'd be able to fill in a simple email template (and I've seen some pretty dim folks manage to register domains with InterNIC, so email templates aren't that hard :-) We could even have a simple web page where you check a few boxes and fill in "URL to patch" and hit submit. The thing that needs the most thinking about is how to handle patches that don't have all this id crap and stuff on them. Like any tracking system, the suceess or failure of its rollout depends completely on whether Linus et al will be steadfast enough to refuse to look at any patch that hasn't gone through the system. It's just like when you're converting a helpdesk to use a ticket tracking system. You need to teach the tech people that when they run into someone in the hallway that says "hey, my printer is broken, come fix it" they they need to reply "you must be mistaken - if your printer were broken you would have filed a trouble ticket... since I have no trouble ticket your printer must be fine". People catch on quick. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal: Linux Kernel Patch Management System
Alexander Viro wrote: BTW, any bug reports starting with "kernel is x.y.z + FOO42069 + K314 + long list of patches" will be cheerfully flushed down the toilet here, no matter what system of dependencies is going to be in place. Yes, for the stuff discussed on lkml patch dependencies should be pretty minimal. However, if I were discussing something on linux-m68k it would be common to say "kernel is 2.5.18 + m68k-native-patch-2.5.18 + mac68k-patch-2.5.18" -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
James Sutherland wrote: In terms of latency, I'd suggest we aim to keep the device in use all the time we have outstanding requests: every time the device is ready to accept a request, we feed it the "next" one in the queue; until it is free again, requests pile up in the queue, being sorted, merged etc. This should perform very well under light loads - requests just get passed to the device as fast as they can be handled - and under very heavy loads - we have a large queue in play, with plenty of scope for reordering, merges etc. The "large queue" goes against the whole point of this exercise - that is that if there are many items in the "queue" being sorted then unlucky requests can end up waiting a long time to get serviced. You want to have some reasonable break-off point where you just decide to make an elevator swipe in order to avoid starving those requests. Keep in mind that I'm using the word "queue" in a confusing manner here - we're talking about how much we'll try to elevator-sort in a go, not the actual queue of all requests. I've been calling this a queue because from a performance standpoint that's what it is, so I think it helps to think of the problem that way. Oh yeah one other thing about huge elevator queues to think about (although its not too pertinant these days). A friend of mine had a job 15 years or so ago to working on a commercial UNIX flavor. They were having some severe I/O performance problems under high loads. The problem was that they were trying to sort and merge all the pending requests. When I/O was heavy this ended up chewing all the CPU time and the kernel wouldn't actually be able to keep up sending requests to the disk :-) However, keep in mind that this was int the bad-old-days when "hard drive" meant a rack-mount Eagle and the elevator was expected to consider things like head positioning and rotational speed when doing a sort... -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Alan Cox wrote: > > time, but remember that there are two things measured in time here: > > A. The time for the whole queue of requests to run (this is what Rik is > > proposing using to throttle) > > B. The time an average request takes to process. > > Your perceived latency is based entirely on A. Yes, but "how hard is it reasonable for the kernel to try" is based on both items. A good first order approximation is number of requests. > > If we limit on the depth of queue we're (to some level of approximation) > > making our decision based on A/B. It's still a magic constant, but at > > I dont suggest you do queue limiting on that basis. I suggest you do order > limiting based on time slots It's still a queue - the queue of things we're going to take on this elevator swipe, right? And the problem is one of keeping a sane watermark on this queue - not too many requests to destroy latency but enough to let the elevator do some good. > > Well, actually just about any communications protocol worth its salt > > uses some sort of windowing throttle based on the amount of data > > Im talking about flow control/traffic shaping ...where the user sets a number exlpicitly for what performance they want. Again, if we're going to make the user set this latency variable for each of their devices, then doing it based on time will work great. > > There's too many orders of magnatude difference between even just SCSI > > disks (10 yr old drive? 16-way RAID? Solid state?) to make > > supplying any sort of default with the kernel impractical. The end > > The same argument is equally valid for the current scheme, and I think you'll > find equally bogus There will always need to be tunables - and it's fine to say "if you've got oddball hardware and/or workload and/or requirements then you should twiddle this knob". But it seems to me that the current scheme works well for a pretty big range of devices. If you do the setting based on time, I think it'll be a lot more sensitive since there's nothing that will scale based on the speed of the device. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Alan Cox wrote: > > Now, I see people trying to introduce the concept of elapsed time into > > that fix, which smells strongly of hack. How will this hack be cobbled > > Actually my brain says that elapsed time based scheduling is the right thing > to do. No, Andrea is right here. The argument that everyone is using ("Our target - latency - is measured in time") is utterly bogus. Yes, it's measured in time, but remember that there are two things measured in time here: A. The time for the whole queue of requests to run (this is what Rik is proposing using to throttle) B. The time an average request takes to process. If we limit on the depth of queue we're (to some level of approximation) making our decision based on A/B. It's still a magic constant, but at least it's scaled to take into account the speed of the drive. And underneath, it's still based on time. > It certainly works for networks Well, actually just about any communications protocol worth its salt uses some sort of windowing throttle based on the amount of data outstanding, not the length of time it's been in the queue. Which is why TCP works well over both GigE and 28.8. [*] Now substitute "big fiberchannel RAID" for GigE and "360K floppy" for 28.8 and you've got the same problem. * -- Yes, for optimal TCP over big WAN pipes you may want to use a larger buffer size, but that's a matter of the bandwidth delay product, which isn't relavent for talking about storage If we move to a "length of queue in time" as Rik suggests then we're going to have to MAKE the user set it manually for each device. There's too many orders of magnatude difference between even just SCSI disks (10 yr old drive? 16-way RAID? Solid state?) to make supplying any sort of default with the kernel impractical. The end result might be a bit better behaved, but only just slightly. If people absolutely need this behavior for some reason, the current algorithm should stay as the default. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Alan Cox wrote: Now, I see people trying to introduce the concept of elapsed time into that fix, which smells strongly of hack. How will this hack be cobbled Actually my brain says that elapsed time based scheduling is the right thing to do. No, Andrea is right here. The argument that everyone is using ("Our target - latency - is measured in time") is utterly bogus. Yes, it's measured in time, but remember that there are two things measured in time here: A. The time for the whole queue of requests to run (this is what Rik is proposing using to throttle) B. The time an average request takes to process. If we limit on the depth of queue we're (to some level of approximation) making our decision based on A/B. It's still a magic constant, but at least it's scaled to take into account the speed of the drive. And underneath, it's still based on time. It certainly works for networks Well, actually just about any communications protocol worth its salt uses some sort of windowing throttle based on the amount of data outstanding, not the length of time it's been in the queue. Which is why TCP works well over both GigE and 28.8. [*] Now substitute "big fiberchannel RAID" for GigE and "360K floppy" for 28.8 and you've got the same problem. * -- Yes, for optimal TCP over big WAN pipes you may want to use a larger buffer size, but that's a matter of the bandwidth delay product, which isn't relavent for talking about storage If we move to a "length of queue in time" as Rik suggests then we're going to have to MAKE the user set it manually for each device. There's too many orders of magnatude difference between even just SCSI disks (10 yr old drive? 16-way RAID? Solid state?) to make supplying any sort of default with the kernel impractical. The end result might be a bit better behaved, but only just slightly. If people absolutely need this behavior for some reason, the current algorithm should stay as the default. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Alan Cox wrote: time, but remember that there are two things measured in time here: A. The time for the whole queue of requests to run (this is what Rik is proposing using to throttle) B. The time an average request takes to process. Your perceived latency is based entirely on A. Yes, but "how hard is it reasonable for the kernel to try" is based on both items. A good first order approximation is number of requests. If we limit on the depth of queue we're (to some level of approximation) making our decision based on A/B. It's still a magic constant, but at I dont suggest you do queue limiting on that basis. I suggest you do order limiting based on time slots It's still a queue - the queue of things we're going to take on this elevator swipe, right? And the problem is one of keeping a sane watermark on this queue - not too many requests to destroy latency but enough to let the elevator do some good. Well, actually just about any communications protocol worth its salt uses some sort of windowing throttle based on the amount of data Im talking about flow control/traffic shaping ...where the user sets a number exlpicitly for what performance they want. Again, if we're going to make the user set this latency variable for each of their devices, then doing it based on time will work great. There's too many orders of magnatude difference between even just SCSI disks (10 yr old drive? 16-way RAID? Solid state?) to make supplying any sort of default with the kernel impractical. The end The same argument is equally valid for the current scheme, and I think you'll find equally bogus There will always need to be tunables - and it's fine to say "if you've got oddball hardware and/or workload and/or requirements then you should twiddle this knob". But it seems to me that the current scheme works well for a pretty big range of devices. If you do the setting based on time, I think it'll be a lot more sensitive since there's nothing that will scale based on the speed of the device. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/