Re: [PATCH] dax/hmem: Fix refcount leak in dax_hmem_probe()
On Fri, 2 Jun 2023, Ira Weiny wrote: > Paul Cassella wrote: > > On Sat, 3 Dec 2022, Ira Weiny wrote: > > > On Sat, Dec 03, 2022 at 09:58:58AM +, Yongqiang Liu wrote: > > > > We should always call dax_region_put() whenever devm_create_dev_dax() > > > > succeed or fail to avoid refcount leak of dax_region. Move the return > > > > value check after dax_region_put(). > > > I think dax_region_put is called from dax_region_unregister() > > > automatically on > > > tear down. > > Note the reference dax_region_unregister() will be putting is the one > > devm_create_dev_dax() takes by kref_get(&dax_region->kref). I think > > dax_hmem_probe() needs to put its reference in the error case, as in the > > successful case. > Looking at this again I'm inclined to agree that something is wrong. But > I'm not sure this patch fixes it. anything. > When you say: > > > ... I think > > dax_hmem_probe() needs to put its reference in the error case, as in the > > successful case. > > ... which kref_get() is dax_hmem_probe() letting go? Isn't it letting go of the initial kref_init() reference from alloc_dax_region()? Sorry, I had gone through the code a little carelessly yesterday. Now I think that kref_init() reference is the one that dax_hmem_probe() is dropping in the success case, and which still needs to be dropped in the error case. (If so, I think the alloc_dax_region() path that returns NULL on devm_add_action_or_reset() failure, releasing the kref_get reference, will leak the kref_init reference and the region.) -- Paul Cassella > Here is what I see with the current code. > > dax_hmem_probe() > alloc_dax_region() > kref_get(&dax_region->kref) > devm_add_action_or_reset(... dax_region_unregister, ...) > ... kref_put() later... > > devm_create_dev_dax() > ...may return error... > kref_get() > [dev_dax_release() set to call kref_put() later] > ...may return error... > > if not error > dax_region_put() => kref_put() > > I think this is an extra unneeded put??? > > Dan I see this pattern repeated in cxl and pmem. Is the intent to remove > the need for dax_region_unregister() to be called when the platform device > unwinds because the platform device is not intended to own the dax_region > after success? If so it seems like the device managed call should be > removed too. Not just calling dax_region_put()? :-/ > > But wouldn't that cause an issue with the sysfs entries created? > > > > > Consider, devm_create_dev_dax() has error paths that return without > > involving dax_region_unregister(), prior to kref_get() and device_add(). > > dax_hmem_probe() is clearly responsible for freeing the region in those > > cases. > > No the devm_add_action_or_reset(... dax_region_unregister, ...) will cause > dax_region_unregister() to release the reference when the platform device > unwinds. > > devm_create_dev_dax() configures a reference release through the > dev_dax->type release. So when the dev_dax device is put the dax_region > will be released through dev_dax_release()->dax_region_put(). > > > > > > > dax_hmem_probe() drops its own reference in the successful case because > > (per the comment) "child dev_dax instances now own the lifetime of the > > dax_region". That ownership is the reference that the error-case > > dax_region_unregister() is dropping. > > No dax_region_unregister() is not just an error case flow. > > > > > dax_hmem_probe()'s initial reference > > also needs to be dropped in the error case, as in the successful case. > > > > I don't follow this. Doesn't this now result in an invalid reference > release in dax_region_unregister() when the platform device is unwound? > Furthermore, that reference is required I think for the sysfs entries. > > > > > > > Fixes: c01044cc8191 ("ACPI: HMAT: refactor hmat_register_target_device > > > > to hmem_register_device") > > > > > > I'm also not sure how this patch is related to this fix. > > > > > > Ira > > > > > > > Signed-off-by: Yongqiang Liu > > > > --- > > > > drivers/dax/hmem/hmem.c | 5 ++--- > > > > 1 file changed, 2 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c > > > > index 1bf040dbc834..
Re: [PATCH] dax/hmem: Fix refcount leak in dax_hmem_probe()
On Sat, 3 Dec 2022, Ira Weiny wrote: > On Sat, Dec 03, 2022 at 09:58:58AM +, Yongqiang Liu wrote: > > We should always call dax_region_put() whenever devm_create_dev_dax() > > succeed or fail to avoid refcount leak of dax_region. Move the return > > value check after dax_region_put(). > I think dax_region_put is called from dax_region_unregister() automatically on > tear down. Hi Ira, Note the reference dax_region_unregister() will be putting is the one devm_create_dev_dax() takes by kref_get(&dax_region->kref). I think dax_hmem_probe() needs to put its reference in the error case, as in the successful case. Consider, devm_create_dev_dax() has error paths that return without involving dax_region_unregister(), prior to kref_get() and device_add(). dax_hmem_probe() is clearly responsible for freeing the region in those cases. dax_hmem_probe() drops its own reference in the successful case because (per the comment) "child dev_dax instances now own the lifetime of the dax_region". That ownership is the reference that the error-case dax_region_unregister() is dropping. dax_hmem_probe()'s initial reference also needs to be dropped in the error case, as in the successful case. > > Fixes: c01044cc8191 ("ACPI: HMAT: refactor hmat_register_target_device to > > hmem_register_device") > > I'm also not sure how this patch is related to this fix. > > Ira > > > Signed-off-by: Yongqiang Liu > > --- > > drivers/dax/hmem/hmem.c | 5 ++--- > > 1 file changed, 2 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c > > index 1bf040dbc834..09f5cd7b6c8e 100644 > > --- a/drivers/dax/hmem/hmem.c > > +++ b/drivers/dax/hmem/hmem.c > > @@ -36,12 +36,11 @@ static int dax_hmem_probe(struct platform_device *pdev) > > .size = region_idle ? 0 : resource_size(res), > > }; > > dev_dax = devm_create_dev_dax(&data); > > - if (IS_ERR(dev_dax)) > > - return PTR_ERR(dev_dax); > > > > /* child dev_dax instances now own the lifetime of the dax_region */ This comment should probably be updated now. :) -- Paul Cassella > > dax_region_put(dax_region); > > - return 0; > > + > > + return IS_ERR(dev_dax) ? PTR_ERR(dev_dax) : 0; > > } > > > > static int dax_hmem_remove(struct platform_device *pdev) > > -- > > 2.25.1 > > > > >
Re: Hangs under 2.4.2-ac{18,19,24} that do not happen under -ac12.
On Wed, 28 Mar 2001, Paul Cassella wrote: > Hangs under 2.4.2-ac{18,19,24} that do not happen under -ac12. I've been running -ac27 for over 5 days, and it's been fine, so this seems to have been fixed. -- Paul Cassella - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Hangs under 2.4.2-ac{18,19,24} that do not happen under -ac12.
On Thu, 29 Mar 2001, Alan Cox wrote: > Was anything between 12 and 18 stable ? I didn't actually try them; I jumped right from 12 to 18, and when that and 19 died, I went back to 12. But a quick look suggests that the entire patch I'd applied to 12 and got a hang with was in 13, including the pm.c change. I also haven't tried anything after 24; is it likely to have been fixed? -- Paul Cassella - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Hangs under 2.4.2-ac{18,19,24} that do not happen under -ac12.
Earlier today, I wrote > and no sysrq keys other than B worked; I didn't hear disk activity > after S, and the disks weren't unmounted. Nothing made it to the Of course, when I rebooted this time (after SysRQ S,U,B), all the filesystems were clean. Nothing in the logs this time either though. > When I get home and reboot (following this most recent hang :( ), I'll > put the diff, .config, and more stuff from /proc at > http://manetheren.eigenray.com/~fortytwo/crash-12-18.2 This is now there. -- Paul Cassella - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Hangs under 2.4.2-ac{18,19,24} that do not happen under -ac12.
1 (autoclean) [visor usbserial uhci] from Debian's /var/log/ksymoops [7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem) Under plain -ac12: manetheren:/var/log/ksymoops# cat /proc/ioports -001f : dma1 0020-003f : pic1 0040-005f : timer 0060-006f : keyboard 0070-007f : rtc 0080-008f : dma page reg 00a0-00bf : pic2 00c0-00df : dma2 00f0-00ff : fpu 0170-0177 : ide1 01f0-01f7 : ide0 0376-0376 : ide1 0378-037a : parport0 037b-037f : parport0 03c0-03df : vga+ 03f6-03f6 : ide0 03f8-03ff : serial(set) 0cf8-0cff : PCI conf1 d000-dfff : PCI Bus #01 d800-d8ff : Lite-On Communications Inc LNE100TX d800-d8ff : eth0 df00-df3f : Ensoniq ES1371 [AudioPCI-97] df00-df3f : es1371 ef80-ef9f : Intel Corporation 82801AA USB ef80-ef9f : usb-uhci efa0-efaf : Intel Corporation 82801AA SMBus ffa0-ffaf : Intel Corporation 82801AA IDE ffa0-ffa7 : ide0 ffa8-ffaf : ide1 manetheren:/var/log/ksymoops# cat /proc/iomem -0009fbff : System RAM 0009fc00-0009 : reserved 000a-000b : Video RAM area 000c-000c7fff : Video ROM 000f-000f : System ROM 0010-07eb : System RAM 0010-001cd2c5 : Kernel code 001cd2c6-0020c2ff : Kernel data 07ec-07ef7fff : ACPI Tables 07ef8000-07ef : ACPI Non-volatile Storage f6a0-f6af : PCI Bus #01 f800-fbff : Intel Corporation 82810-DC100 CGC [Chipset Graphics Controller] ff80-ff8f : PCI Bus #01 ff8ffc00-ff8ffcff : Lite-On Communications Inc LNE100TX ff8ffc00-ff8ffcff : eth0 ffa8-ffaf : Intel Corporation 82810-DC100 CGC [Chipset Graphics Controller] ffb8-ffbf : reserved fff0- : reserved [7.5.] PCI information ('lspci -vvv' as root) manetheren:/var/log/ksymoops# lspci -vvv 00:00.0 Host bridge: Intel Corporation 82810-DC100 GMCH [Graphics Memory Controller Hub] (rev 02) Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- TAbort- SERR- Reset- FastB2B- 00:1f.0 ISA bridge: Intel Corporation 82801AA ISA Bridge (LPC) (rev 02) Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- i_size = offset; if (inode->i_op && inode->i_op->truncate) + { + /* This doesnt scale but it is meant to be a 2.4 invariant */ + lock_kernel(); inode->i_op->truncate(inode); + unlock_kernel(); + } return 0; out: return -EFBIG; A few lines earlier in this function, inode->i_op->truncate() is called without lock_kernel(). Should it also have a lock_kernel(), or is it not needed there? -- Paul Cassella - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-ac3 write() to tcp socket returning errno of -3 (ESRCH:"No such process")
On Tue, 9 Jan 2001, Paul Cassella wrote: > and mss_now seems to be less than skb->len when the printk happens. My > copy of K&R is at work; could that comparison be being done unsigned > because of skb->len? I wouldn't think so, but the alternative seems > somewhat worse... That'll teach me to post about integral promotions ... > + printk(KERN_ERR "%s:%d:%s: err is unexpectedly %d.\n", file, line, >func, ret); ... and hand-edit patches before breakfast. I'm not familiar enough with the tcp code to know if this patch (against -ac6) is a solution, band-aid, or, in fact, wrong, but I've run with it (on -ac3) and haven't seen the errors for over twelve hours, which is three times longer than it had been able to go without it coming up. --- tcp.c.orig Thu Jan 11 08:54:50 2001 +++ tcp.c Thu Jan 11 08:56:42 2001 @@ -954,7 +954,7 @@ */ skb = sk->write_queue.prev; if (tp->send_head && - (mss_now - skb->len) > 0) { + (signed int)(mss_now - skb->len) > 0) { copy = skb->len; if (skb_tailroom(skb) > 0) { int last_byte_was_odd = (copy % 4); Or would this be better? + (unsigned int)mss_now > skb->len) { Or making mss_now unsigned in the first place? -- Paul Cassella - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-ac3 write() to tcp socket returning errno of -3 (ESRCH:"No such process")
csum_and_copy_from_user( from, skb_put(skb, copy), copy, skb->csum, &err); + CHECK_TCP_RET(); } /* * FIXME: the *_user functions should @@ -1042,6 +1068,7 @@ } if (signal_pending(current)) { err = sock_intr_errno(timeo); + CHECK_TCP_RET(); goto do_interrupted; } timeo = wait_for_tcp_memory(sk, timeo); @@ -1078,8 +1105,11 @@ skb->csum = csum_and_copy_from_user(from, skb_put(skb, copy), copy, 0, &err); - if (err) - goto do_fault; + if (err) { + if(copied) check_tcp_ret(copied, __FILE__, __LINE__, +__FUNCTION__); + else CHECK_TCP_RET(); + goto do_fault; + } from += copy; copied += copy; @@ -1092,6 +1122,7 @@ } } err = copied; + CHECK_TCP_RET(); out: __tcp_push_pending_frames(sk, tp, mss_now, tp->nonagle); out_unlock: @@ -1100,14 +1131,20 @@ return err; do_sock_err: - if (copied) + if (copied) { err = copied; - else + CHECK_TCP_RET(); + + } else { err = sock_error(sk); + CHECK_TCP_RET(); + } goto out; do_shutdown: - if (copied) + if (copied) { err = copied; + CHECK_TCP_RET(); + } else { if (!(flags&MSG_NOSIGNAL)) send_sig(SIGPIPE, current, 0); @@ -1115,14 +1152,18 @@ } goto out; do_interrupted: - if (copied) + if (copied) { err = copied; + CHECK_TCP_RET(); + } goto out_unlock; do_fault: __kfree_skb(skb); do_fault2: - if (copied) + if (copied) { err = copied; + CHECK_TCP_RET(); + } else err = -EFAULT; goto out; @@ -1283,6 +1324,8 @@ remove_wait_queue(sk->sleep, &wait); __set_current_state(TASK_RUNNING); + if(timeo < 0) + printk(KERN_ERR "tcp_data_wait: timeo == %ld.\n", timeo); return timeo; } @@ -2026,8 +2069,10 @@ if (sk->state != TCP_LISTEN) break; err = sock_intr_errno(timeo); - if (signal_pending(current)) + if (signal_pending(current)) { + CHECK_TCP_RET(); break; + } err = -EAGAIN; if (!timeo) break; -- Paul Cassella - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-ac3 write() to tcp socket returning errno of -3 (ESRCH:"No such process")
On Mon, 8 Jan 2001, Andi Kleen wrote: > On Sun, Jan 07, 2001 at 11:55:28PM -0600, Paul Cassella wrote: > > write() returns -1 and sets errno non-sensically. 2.4.0{,-ac[23]} > Would it be possible to provide a compiling test case that shows these > errors ? The CVS version (perhaps even version 0.12) of gtk-gnutella should do it, though it's not exactly what I'm running. After the crash I posted about, I've been running it for another three and a half hours under similar load, and it hasn't crashed again yet; nor have I seen any other unexpected ret's (besides those noted below) get logged. http://gtk-gnutella.sourceforge.net/cvs/ Though I don't think gtk-gnutella is special. It doesn't do anything programmatically unusual with sockets, it just has a bunch of connections to a bunch of different machines, which are probably running a bunch of different os's, etc. And it aborts when write() returns unexpected values. I've appended the actual kernel diffs that I'm using; I'd just pasted them from an xterm before. I probably should have added ECONNREFUSED and ERESTARTSYS to the list. > Also over what interface do you run it? eth0 (tulip): 01:08.0 Ethernet controller: Lite-On Communications Inc LNE100TX (rev 21) In answer to David's questions, I don't think I'm doing anything out of the ordinary. I'm running over DSL with the above card on an external DSL router. No netfilter, no tunneling, just IP. CONFIG_PACKET=m CONFIG_PACKET_MMAP=y CONFIG_NETLINK=y CONFIG_RTNETLINK=y CONFIG_NETLINK_DEV=m # CONFIG_NETFILTER is not set # CONFIG_FILTER is not set CONFIG_UNIX=y CONFIG_INET=y # CONFIG_IP_MULTICAST is not set # CONFIG_IP_ADVANCED_ROUTER is not set # CONFIG_IP_PNP is not set # CONFIG_NET_IPIP is not set # CONFIG_NET_IPGRE is not set # CONFIG_ARPD is not set CONFIG_INET_ECN=y CONFIG_SYN_COOKIES=y CONFIG_IPV6=m # CONFIG_IPV6_EUI64 is not set # CONFIG_KHTTPD is not set # CONFIG_ATM is not set # CONFIG_IPX is not set # CONFIG_ATALK is not set # CONFIG_DECNET is not set # CONFIG_BRIDGE is not set # CONFIG_X25 is not set # CONFIG_LAPB is not set # CONFIG_LLC is not set # CONFIG_NET_DIVERT is not set # CONFIG_ECONET is not set # CONFIG_WAN_ROUTER is not set # CONFIG_NET_FASTROUTE is not set # CONFIG_NET_HW_FLOWCONTROL is not set CONFIG_DUMMY=m CONFIG_TULIP=y I have these set, but I haven't used them: CONFIG_PPP=m # CONFIG_PPP_MULTILINK is not set CONFIG_PPP_ASYNC=m # CONFIG_PPP_SYNC_TTY is not set CONFIG_PPP_DEFLATE=m CONFIG_PPP_BSDCOMP=m eth0 Link encap:Ethernet HWaddr 00:A0:CC:3E:E6:63 inet addr:64.81.146.215 Bcast:64.81.146.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:674842 errors:0 dropped:0 overruns:0 frame:0 TX packets:791977 errors:0 dropped:0 overruns:0 carrier:0 collisions:490 txqueuelen:100 Interrupt:11 Base address:0xd800 loLink encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 UP LOOPBACK RUNNING MTU:3856 Metric:1 RX packets:1138 errors:0 dropped:0 overruns:0 frame:0 TX packets:1138 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 Destination Gateway Genmask Flags Metric RefUse Iface 64.81.146.0 0.0.0.0 255.255.255.0 U 0 00 eth0 0.0.0.0 64.81.146.1 0.0.0.0 UG0 00 eth0 Is there a list somewhere of network options I should report (such as filtering and tunneling)? Should one of linux-kernel or linux-net be pruned from the Cc: list? -- Paul Cassella --- fs/read_write.c~Sat Nov 11 18:07:38 2000 +++ fs/read_write.c Sun Jan 7 14:00:25 2001 @@ -146,6 +146,8 @@ ssize_t ret; struct file * file; +extern struct file_operations socket_file_ops; + ret = -EBADF; file = fget(fd); if (file) { @@ -165,6 +167,18 @@ DN_MODIFY); fput(file); } + if(ret < 0 && file && file->f_op == &socket_file_ops ) { + switch(-ret) { + case EAGAIN: case EBADF: case EPIPE: case ENOSPC: case EIO: case +ECONNRESET: + case EINTR: case ETIMEDOUT: case EFAULT: case EINVAL: case EMSGSIZE: +case ENOMEM: + case ENOBUFS: case ENOTCONN: + break; + + default: + printk(KERN_ERR "sys_write: ret is unexpectedly %d.\n", ret); + } + } + return ret; } --- net/socket.c~ Mon Jan 8 01:26:55 2001 +++ net/socket.cSun Jan 7 13:58:55 2001 @@ -111,7 +111,7 @@ * in the operation structures but are done directly via the socketcall() multiplexor. */ -static struct file_operations socket_file_ops = { +struct file_operations socket_file_ops = { llseek:
Re: 2.4.0-ac3 write() to tcp socket returning errno of -3 (ESRCH:"No such process")
On Sun, 7 Jan 2001, David S. Miller wrote: >Date: Sun, 7 Jan 2001 23:55:28 -0600 (CST) >From: Paul Cassella <[EMAIL PROTECTED]> > >[1.] One line summary of the problem: > >write() returns -1 and sets errno non-sensically. 2.4.0{,-ac[23]} > > What you describe I can only say is "impossible". Right. That's why the code g_error()'s there. :) > None of them can occur via TCP socket writes (only netlink socket > operations or socket control calls). I didn't imagine that a TCP write could return any of these errors, which is why I guessed that something's returning a value it thinks is positive, but turns out to be negative in this case, for example, "return a - b;" where due to, maybe, (but I guess not,) the peer shrunk window thing, a < b. > Therefore I suspect you are perhaps getting rather some form of memory > corruption or similar, really, please search the networking code for > ESRCH value usage, you will see. I believe that the tcp_sendmsg() path never tries to return -ESRCH, but I don't see how userland memory corruption could cause the app to (1) open such a socket instead of a TCP socket, and (2) get ESRCH or ENOEXEC instead of EPERM or whatever a non-root process would get when it tried such a thing. I have seen no signs of bad memory on this machine, and of a fairly small sample set, one (I was wrong when I said several) person is seeing the same thing, with 2.4.0-test13-pre4, and I didn't see it with 2.4.0-test11-pre2. The change I made to linux/fs/read_write.c:sys_write() checks file->f_op == &socket_file_ops and then finds ret == -3; presumably this is the same 3 that my app gets. Would it be more helpful if I were to check something like socki_lookup(file->f_dentry->f_inode)->ops == tcp_prot instead? Or do the check in tcp_sendmsg()? -- Paul Cassella - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
2.4.0-ac3 write() to tcp socket returning errno of -3 (ESRCH: "Nosuch process")
[1.] One line summary of the problem: write() returns -1 and sets errno non-sensically. 2.4.0{,-ac[23]} [2.] Full description of the problem/report: write() sometimes returns -1 and sets errno to one of 1 (EPERM: "Operation not permitted") 3 (ESRCH: "No such process") 8 (ENOEXEC: "Exec format error") 1 seems to be the most common. I have verified that at least the ESRCH case is caused by the kernel. With the following patch to sys_write(), --- fs/read_write.c~ Sat Nov 11 18:07:38 2000 +++ fs/read_write.cSun Jan 7 14:00:25 2001 @@ -146,6 +146,8 @@ ssize_t ret; struct file * file; +extern struct file_operations socket_file_ops; + ret = -EBADF; file = fget(fd); if (file) { @@ -165,6 +167,18 @@ DN_MODIFY); fput(file); } + if(ret < 0 && file && file->f_op == &socket_file_ops ) { + switch(-ret) { + case EAGAIN: case EBADF: case EPIPE: case ENOSPC: case EIO: case +ECONNRESET: + case EINTR: case ETIMEDOUT: case EFAULT: case EINVAL: case EMSGSIZE: +case ENOMEM: + case ENOBUFS: case ENOTCONN: + break; + + default: + printk(KERN_ERR "sys_write: ret is unexpectedly %d.\n", ret); + } + } + return ret; } and socket_file_ops made non-static, I eventually got this logged: Jan 7 22:15:29 localhost kernel: sys_write: ret is unexpectedly -3. And this user code: r = write(fd_that_is_a_tcp_socket, ...); if (r > 0) { ... } else if(r == 0) { ... } else if (errno == EAGAIN || errno == EINTR) { ... } else if (errno == EPIPE || errno == ENOSPC || errno == EIO || errno == ECONNRESET || errno == ETIMEDOUT) { ... } else { g_error("node_write: write failed with unexpected errno: %d (%s)\n", errno, g_strerror(errno)); } at the same time produced ** ERROR **: node_write: write failed with unexpected errno: 3 (No such process) aborting... This socket is O_NONBLOCK'd, as well as SO_KEEPALIVE'd and SO_REUSEADDR'd, and was an outgoing connection. TCP ECN is on. Syncookies are compiled in but turned off. This app has previously failed with the same thing for errno's 1 and 8, but without the kernel change to verify that those values are actually coming from the kernel. That had happened on 2.4.0, 2.4.0-ac2, and 2.4.0-ac3. I didn't notice it on 2.4.0-test11-pre2, which was the last kernel I'd been running. But the problem's not particularly deterministic, and I may not have been running it long enough. At least two other people are seeing the same thing on 2.4 kernels. My guess is that something mistakenly thinks it's returning a positive number. [3.] Keywords (i.e., modules, networking, kernel): networking [4.] Kernel version (from /proc/version): Linux version 2.4.0-ac3 (fortytwo@manetheren) (gcc version 2.95.2 2220 (Debian GNU/Linux)) #4 Sun Jan 7 16:18:26 CST 2001 The machine and kernel are UP. [6.] A small shell script or example program which triggers the problem (if possible) This happens after the app runs usually for several hours with lots of outgoing and incoming TCP connections to lots of different hosts. [7.1.] Software (add the output of the ver_linux script here) Linux manetheren 2.4.0-ac3 #4 Sun Jan 7 16:18:26 CST 2001 i686 unknown Kernel modules 2.3.23 Gnu C 2.95.2 Gnu Make 3.79.1 Binutils 2.10.1.0.2 Linux C Library> libc.2.2 Dynamic linker ldd (GNU libc) 2.2 Procps 2.0.6 Mount 2.10q Net-tools 2.05 Console-tools 0.2.3 Sh-utils 2.0.11 Modules Loaded parport_pc lp parport rtc usbcore [7.2.] Processor information (from /proc/cpuinfo): [7.3.] Module information (from /proc/modules): [7.5.] PCI information ('lspci -vvv' as root) [7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem) These doesn't seem particularly relevant, so I'm putting them, as well as .config, at http://manetheren.eigenray.com/~fortytwo/bad_write_info Don't hesitate to ask me for more info or to try a patch or something. [7.7.] Other information that might be relevant to the problem (please look in /proc and include all information that you think to be relevant): In 2.4.0 and -ac2, I was getting the reset_xmit_timer() messages that -ac3 fixed. I'm also getting TCP peer shrinks window messages, but had none this boot before this error. -- Paul Cassella - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Semaphores used for daemon wakeup
On Fri, 22 Dec 2000, Daniel Phillips wrote: > But isn't this actually a simple situation? How about: I had only adapted that example because it had already been posted showing one way to do it, and so provided something to compare the sv approach to. > dmabuf_alloc(...) > { > while (1) { > spin_lock(&dmabuf_lock); > attempt to grab a free buffer; > spin_unlock(&dmabuf_lock); > if (success) >return; > down(&dmabuf_wait); > } > } > dmabuf_free(...) > { > spin_lock(&dmabuf_lock); > free up buffer; > spin_unlock(&dmabuf_lock); > up(&dmabuf_wait); > } This does things a little differently than the way the original did it. I thought the original implied that dmabuf_free() might free up multiple buffers. There's no indication in the comments that this is the case, but the original, by using vall_sema(), wakes up all dmabuf_alloc()'s that had gone to sleep. The example wasn't meant to be an ideal use of sv's, but merely as an example of how they could be used to achieve the same behavior as the code that was posted. -- Paul Cassella [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Semaphores used for daemon wakeup
The mechanism being developed here seems a lot like synchronization variables (aka condition variables), which are a part of the "monitor" synchronization construct. There is a simple implementation of them in the xfs patch. I've been working on a more general version in order to aid porting some other stuff, which I have appended to this post. I had been holding off on posting about it since I didn't have any code that used it ready, and probably wouldn't until 2.5 anyway, but due to this thread, I think bringing it up now might be helpful. Daniel Phillips wrote: > Tim Wright wrote: > > p_sema_v_lock(&sema, priority, &lock); /* atomically release the lock AND */ > > /* go to sleep on the semaphore */ This in particular looks like sv_wait(), which atomically releases a lock and goes to sleep. The style is somewhat different, as a sync variable (sv) is not really a lock, but is still something that a process can block on. A process goes to sleep by calling sv_wait(sv), and is woken up by another process calling sv_signal(sv) (wake one) or sv_broadcast(sv) (wake all). If there is no process sleeping in sv_wait, signals and broadcasts have no effect; they do not affect any process which subsequently calls sv_wait(). Each sync variable is associated with another lock, which provides mutual exclusion guarantees. This lock must be held to call sv_wait(), and is dropped atomically with the process going to sleep. This lock must also be held to call sv_signal() or sv_broadcast(). Currently, this lock can be a spinlock or a semaphore; other types of locks could be added if necessary. The sync variable version of the dmabuf code snippet (assuming the dmabuf_mutex is never acquired from an interrupt) would look like this: dmabuf_init(...); { ... spin_lock_init(&dmabuf_spin); sv_init(&dmabuf_sv, &dmabuf_spin, SV_MON_SPIN); ... } dmabuf_alloc(...) { ... while (1) { spin_lock(&dmabuf_spin); attempt to grab a free buffer; if (success){ spin_unlock(&dmabuf_spin); return; } else { sv_wait(&dmabuf_sv); } } } dmabuf_free(...) { ... spin_lock(&dmabuf_spin); free up buffer; sv_broadcast(&dmabuf_sv); spin_unlock(&dmabuf_spin); } If dmabuf_free() could be called from an interrupt, this would be modified by passing the SV_INTS flag to sv_init(), using spin_lock_irq() in dmabuf_alloc, and spin_lock_irqsave/restore in dmabuf_free(). > > As you can see, the spinlocks ensure no races, and the key is the atomicity > > of p_sema_v_lock(). No-one can race in and sleep on dmabuf_wait, because > > they have to hold dmabuf_mutex to do so. Exactly the same mechanism would > > work for the bdflush problem. The same protections are in place here, as the two methods are basically the same. > described. The main difference between this situation and bdflush is > that dmabuf_free isn't really waiting on dmabuf_alloc to fullfill a > condition (other than to get out of its exclusion region) while bdflush > can have n waiters. This could be done with two sv's. After all, there are two conditions: "someone needs bdflush to run", and "bdflush is done". > int atomic_read_and_clear(atomic_t *p) > { > int n = atomic_read(p); > atomic_sub(p, n); > return n; > } I don't think this will work; consider two callers doing the atomic_read() at the same time, or someone else doing an atomic_dec() after the atomic_read(). > the more involved wake_up_process(). What's clear is, they are all > plenty fast enough for this application, and what I'm really trying for > is readability. I think sv's are pretty readable and clear, but I'd like to find out what other people think. If anyone finds this useful or finds any bugs, or has any questions or suggestions, please let me know. I'll be reading the list, but I'm going on vacation tomorrow, so I'd appreciate a Cc: so I have a better chance of answering before then. Thanks. -- Paul Cassella [EMAIL PROTECTED] And now, the code. include/linux/sv.h: /* This is the version I'm using with a test8-pre1 kernel, so it uses the old TASK_EXCLUSIVE semantics; it should be trivial to make it work with new kernels. I haven't yet done so, since we're going to be using the test8-pre1 kernel for a few more weeks yet. In the interests of brevity, I've taken out most of the debugging code, some typecasting, and some other things like the wr