Re: Race to power off harming SATA SSDs
On Mon 2017-05-08 16:40:11, David Woodhouse wrote:
> On Mon, 2017-05-08 at 13:50 +0200, Boris Brezillon wrote:
> > On Mon, 08 May 2017 11:13:10 +0100
> > David Woodhouse wrote:
> > > On Mon, 2017-05-08 at 11:09 +0200, Hans de Goede wrote:
> > > > You're forgetting that the SSD itself (this thread is about SSDs)
> > > > also has a major software component which is doing housekeeping
> > > > all the time, so even if the main CPU gets reset the SSD's
> > > > controller may still happily be erasing blocks.
> > >
> > > We're not really talking about SSDs at all any more; we're talking
> > > about real flash with real maintainable software.
> >
> > It's probably a good sign that this new discussion should take place in
> > a different thread :-).
>
> Well, maybe. But it was a silly thread in the first place. SATA SSDs
> aren't *expected* to be reliable.

Citation needed?

I'm pretty sure SATA SSDs are expected to be reliable, up to the maximum
amount of gigabytes written (specified by the manufacturer), as long as
you don't cut power without warning.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: Race to power off harming SATA SSDs
On Mon 2017-05-08 13:43:03, Tejun Heo wrote:
> Hello,
>
> On Mon, May 08, 2017 at 06:43:22PM +0200, Pavel Machek wrote:
> > What I was trying to point out was that storage people try to treat
> > SSDs as HDDs... and SSDs are very different. Harddrives mostly survive
> > powerfails (with emergency parking), while it is very, very difficult
> > to make SSD survive random powerfail, and we have to make sure we
> > always powerdown SSDs "cleanly".
>
> We do.
>
> The issue raised is that some SSDs still increment the unexpected
> power loss count even after clean shutdown sequence and that the
> kernel should wait for some secs before powering off.
>
> We can do that for select devices but I want something more than "this
> SMART counter is getting incremented" before doing that.

Well... the SMART counter tells us that the device was not shut down
correctly. Do we have reason to believe that it is _not_ telling us
the truth? It is more than one device.

SSDs die when you power them off without warning:
http://lkcl.net/reports/ssd_analysis.html

What kind of data would you like to see? "I have been using Linux and
my SSD died"? We have had such reports. "I killed 10 SSDs in a week,
then I added a one-second delay, and this SSD survived 6 months"?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: Race to power off harming SATA SSDs
On Mon 2017-05-08 13:50:05, Boris Brezillon wrote:
> On Mon, 08 May 2017 11:13:10 +0100
> David Woodhouse wrote:
> > On Mon, 2017-05-08 at 11:09 +0200, Hans de Goede wrote:
> > > You're forgetting that the SSD itself (this thread is about SSDs)
> > > also has a major software component which is doing housekeeping
> > > all the time, so even if the main CPU gets reset the SSD's
> > > controller may still happily be erasing blocks.
> >
> > We're not really talking about SSDs at all any more; we're talking
> > about real flash with real maintainable software.
>
> It's probably a good sign that this new discussion should take place in
> a different thread :-).

Well, you are right.. and I'm responsible.

What I was trying to point out was that storage people try to treat
SSDs as HDDs... and SSDs are very different. Hard drives mostly survive
power failures (with emergency parking), while it is very, very
difficult to make an SSD survive a random power failure, and we have to
make sure we always power down SSDs "cleanly".

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: Race to power off harming SATA SSDs
Hi!

> > 'clean marker' is a good idea... empty pages have plenty of space.
>
> Well... you lose that space permanently. Although I suppose you could
> do things differently and erase a block immediately prior to using it.
> But in that case why ever write the cleanmarker? Just maintain a set of
> blocks that you *will* erase and re-use.

Yes, but erase is slow so that would hurt performance...?

> > How do you handle the issue during regular write? Always ignore last
> > successfully written block?
>
> Log nodes have a CRC. If you get interrupted during a write, that CRC
> should fail.

Umm. That is not what the "unstable bits" issue is about, right?

If you are interrupted during a write, you can get into a state where
readback will be correct on the next boot (CRC, ECC ok), but then the
bits will go back a few hours after that. You can't rely on checksums
to detect that.. because the bits will have the right values -- for a
while.

> > Do you handle "paired pages" problem on MLC?
>
> No. It would theoretically be possible, by not considering a write to
> the first page "committed" until the second page of the pair is also
> written. Essentially, it's not far off expanding the existing 'wbuf'
> which we use to gather writes into full pages for NAND, to cover the
> *whole* of the set of pages which are affected by MLC.
>
> But we mostly consider JFFS2 to be obsolete these days, in favour of
> UBI/UBIFS or other approaches.

Yes, I guess MLC NAND chips are mostly too big for jffs2.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
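[Editorial sketch] The "log nodes have a CRC" argument can be illustrated with a small userspace model. This is a minimal sketch under stated assumptions, not JFFS2 code: struct demo_node, demo_node_seal(), demo_node_valid() and demo_torn_write_is_rejected() are hypothetical names, and the CRC shown is the common reflected CRC-32, which may differ in detail from what JFFS2 computes. It shows only the case David describes, where a write that was cut short leaves 0xFF bytes behind and fails validation. It does not address the "unstable bits" case raised in the reply, where the data reads back correctly at first.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); illustrative only. */
static uint32_t demo_crc32(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Hypothetical log node: payload plus the CRC computed when it was sealed. */
struct demo_node {
	uint32_t len;
	uint32_t crc;
	uint8_t data[32];
};

/* Fill the node and record the CRC; caller guarantees len <= 32. */
static void demo_node_seal(struct demo_node *n, const void *payload,
			   uint32_t len)
{
	memset(n->data, 0xFF, sizeof(n->data));
	memcpy(n->data, payload, len);
	n->len = len;
	n->crc = demo_crc32(n->data, len);
}

/* A node is accepted only if its stored CRC matches its payload. */
static int demo_node_valid(const struct demo_node *n)
{
	return n->len <= sizeof(n->data) &&
	       n->crc == demo_crc32(n->data, n->len);
}

/* Simulate a write cut off near the end: the tail still reads as 0xFF. */
static int demo_torn_write_is_rejected(void)
{
	struct demo_node n;

	demo_node_seal(&n, "hello, flash log", 16);
	if (!demo_node_valid(&n))
		return 0;
	memset(n.data + 12, 0xFF, 4);	/* last 4 payload bytes never landed */
	return !demo_node_valid(&n);
}
```

A scanner built this way would simply skip any node for which demo_node_valid() returns 0. The limitation discussed in the thread remains: a checksum can only reject data that already reads back wrong.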
Re: Race to power off harming SATA SSDs
On Mon 2017-05-08 10:34:08, David Woodhouse wrote:
> On Mon, 2017-05-08 at 11:28 +0200, Pavel Machek wrote:
> >
> > Are you sure you have it right in JFFS2? Do you journal block erases?
> > Apparently, that was pretty much non-issue on older flashes.
>
> It isn't necessary in JFFS2. It is a *purely* log-structured file
> system (which is why it doesn't scale well past the 1GiB or so that we
> made it handle for OLPC).
>
> So we don't erase a block until all its contents are obsolete. And if
> we fail to complete the erase... well the contents are either going to
> fail a CRC check, or... still be obsoleted by later entries elsewhere.
>
> And even if it *looks* like an erase has completed and the block is all
> 0xFF, we erase it again and write a 'clean marker' to it to indicate
> that the erase was completed successfully. Because otherwise it can't
> be trusted.

Aha, nice, so it looks like ubifs is a step back here.

'clean marker' is a good idea... empty pages have plenty of space.

How do you handle the issue during regular write? Always ignore last
successfully written block?

Do you handle "paired pages" problem on MLC?

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
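[Editorial sketch] The clean-marker rule quoted above ("even if it *looks* like an erase has completed and the block is all 0xFF, we erase it again") can be written down as a small classification function. This is a minimal sketch, assuming a hypothetical 4-byte marker at the start of the block; demo_marker, demo_classify() and the enum are made-up names and do not reflect the real JFFS2 on-flash format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum demo_block_state {
	DEMO_BLOCK_CLEAN,	/* marker present, rest erased: safe to write */
	DEMO_BLOCK_NEEDS_ERASE,	/* looks erased but unproven: erase again */
	DEMO_BLOCK_HAS_DATA	/* contains nodes: scan and CRC-check them */
};

/* Hypothetical 4-byte marker; not the real JFFS2 cleanmarker node. */
static const uint8_t demo_marker[4] = { 'C', 'L', 'N', 'M' };

/* Return 1 if every byte in the range reads as erased flash (0xFF). */
static int demo_all_ff(const uint8_t *p, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (p[i] != 0xFF)
			return 0;
	return 1;
}

static enum demo_block_state demo_classify(const uint8_t *blk, size_t len)
{
	if (len < sizeof(demo_marker))
		return DEMO_BLOCK_NEEDS_ERASE;
	/* All 0xFF without a marker: the erase may have been interrupted. */
	if (demo_all_ff(blk, len))
		return DEMO_BLOCK_NEEDS_ERASE;
	if (memcmp(blk, demo_marker, sizeof(demo_marker)) == 0 &&
	    demo_all_ff(blk + sizeof(demo_marker), len - sizeof(demo_marker)))
		return DEMO_BLOCK_CLEAN;
	return DEMO_BLOCK_HAS_DATA;
}
```

The point of the design is that "all 0xFF" alone is never taken as proof of a completed erase; only a marker written after the erase finished is.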
Re: Race to power off harming SATA SSDs
On Mon 2017-05-08 08:21:34, David Woodhouse wrote:
> On Sun, 2017-05-07 at 22:40 +0200, Pavel Machek wrote:
> > > > NOTE: unclean SSD power-offs are dangerous and may brick the device in
> > > > the worst case, or otherwise harm it (reduce longevity, damage flash
> > > > blocks). It is also not impossible to get data corruption.
> > >
> > > I get that the incrementing counters might not be pretty but I'm a bit
> > > skeptical about this being an actual issue. Because if that were
> > > true, the device would be bricking itself from any sort of power
> > > losses be that an actual power loss, battery rundown or hard power off
> > > after crash.
> >
> > And that's exactly what users see. If you do enough power fails on a
> > SSD, you usually brick it, some die sooner than others. There were some
> > test results published, some are here
> > http://lkcl.net/reports/ssd_analysis.html, I believe I have seen some
> > others too.
> >
> > It is very hard for a NAND to work reliably in face of power
> > failures. In fact, not even Linux MTD + UBIFS works well in that
> > regard. See
> > http://www.linux-mtd.infradead.org/faq/ubi.html. (Unfortunately, it's
> > down now?!). If we can't get it right, do you believe SSD manufacturers
> > do?
> >
> > [Issue is, if you powerdown during erase, you get "weakly erased"
> > page, which will contain expected 0xff's, but you'll get bitflips
> > there quickly. Similar issue exists for writes. It is solvable in
> > software, just hard and slow... and we don't do it.]
>
> It's not that hard. We certainly do it in JFFS2. I was fairly sure that
> it was also part of the design considerations for UBI; it really ought
> to be right there too. I'm less sure about UBIFS but I would have
> expected it to be OK.

Are you sure you have it right in JFFS2? Do you journal block erases?
Apparently, that was pretty much non-issue on older flashes.

https://web-beta.archive.org/web/20160923094716/http://www.linux-mtd.infradead.org:80/doc/ubifs.html#L_unstable_bits

> SSDs however are often crap; power fail those at your peril. And of
> course there's nothing you can do when they do fail, whereas we accept
> patches for things which are implemented in Linux.

Agreed. If the SSD indicates unexpected powerdown, it is a problem and
we need to fix it.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: Race to power off harming SATA SSDs
Hi!

> > However, *IN PRACTICE*, SATA STANDBY IMMEDIATE command completion
> > [often?] only indicates that the device is now switching to the target
> > power management state, not that it has reached the target state. Any
> > further device status inquires would return that it is in STANDBY mode,
> > even if it is still entering that state.
> >
> > The kernel then continues the shutdown path while the SSD is still
> > preparing itself to be powered off, and it becomes a race. When the
> > kernel + firmware wins, platform power is cut before the SSD has
> > finished (i.e. the SSD is subject to an unclean power-off).
>
> At that point, the device is fully flushed and in terms of data
> integrity should be fine with losing power at any point anyway.

Actually, no, that is not how it works. "Fully flushed" is one thing,
surviving power loss is different. Explanation below.

> > NOTE: unclean SSD power-offs are dangerous and may brick the device in
> > the worst case, or otherwise harm it (reduce longevity, damage flash
> > blocks). It is also not impossible to get data corruption.
>
> I get that the incrementing counters might not be pretty but I'm a bit
> skeptical about this being an actual issue. Because if that were
> true, the device would be bricking itself from any sort of power
> losses be that an actual power loss, battery rundown or hard power off
> after crash.

And that's exactly what users see. If you do enough power fails on an
SSD, you usually brick it; some die sooner than others. There were some
test results published, some are here
http://lkcl.net/reports/ssd_analysis.html, I believe I have seen some
others too.

It is very hard for a NAND to work reliably in the face of power
failures. In fact, not even Linux MTD + UBIFS works well in that
regard. See
http://www.linux-mtd.infradead.org/faq/ubi.html. (Unfortunately, it's
down now?!). If we can't get it right, do you believe SSD manufacturers
do?

[Issue is, if you power down during erase, you get a "weakly erased"
page, which will contain the expected 0xff's, but you'll get bitflips
there quickly. A similar issue exists for writes. It is solvable in
software, just hard and slow... and we don't do it.]

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH v5 4/8] char: rpmb: provide a user space interface
On Sun 2016-09-04 11:35:33, Winkler, Tomas wrote: > > > On Thu, Sep 01, 2016 at 08:05:26PM +, Winkler, Tomas wrote: > > > > > > > > > > > On Sun, Aug 07, 2016 at 09:44:03AM +, Winkler, Tomas wrote: > > > > > > > > > > > > On Mon 2016-07-18 23:27:49, Tomas Winkler wrote: > > > > > > > The user space API is achieved via two synchronous IOCTL. > > > > > > > > > > > > IOCTLs? > > > > > > > > > > Will fix > > > > > > > > > > > > Simplified one, RPMB_IOC_REQ_CMD, were read result cycles is > > > > > > performed > > > > > > > by the framework on behalf the user and second, > > > > > > > RPMB_IOC_SEQ_CMD > > > > > > where > > > > > > > the whole RPMB sequence including RESULT_READ is supplied by > > > > > > > the > > > > caller. > > > > > > > The latter is intended for easier adjusting of the > > > > > > > applications that use MMC_IOC_MULTI_CMD ioctl. > > > > > > > > > > > > Why " "? > > > > > Not sure I there is enough clue in your question. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Tomas Winkler> > > > > > > > > > > > > + > > > > > > > +static long rpmb_ioctl(struct file *fp, unsigned int cmd, > > > > > > > +unsigned long arg) { > > > > > > > + return __rpmb_ioctl(fp, cmd, (void __user *)arg); } > > > > > > > + > > > > > > > +#ifdef CONFIG_COMPAT > > > > > > > +static long rpmb_compat_ioctl(struct file *fp, unsigned int cmd, > > > > > > > + unsigned long arg) > > > > > > > +{ > > > > > > > + return __rpmb_ioctl(fp, cmd, compat_ptr(arg)); > > > > > > > +} > > > > > > > +#endif /* CONFIG_COMPAT */ > > > > > > > > > > > > Description of the ioctl is missing, > > > > > Will add. > > > > > > > > > > and it should certainly be designed in a way > > > > > > that it does not need compat support. > > > > > > > > > > The compat_ioctl handler just casts the compat_ptr, I believe this > > > > > should be done unless the ioctl is globaly registered in > > > > > fs/compat_ioctl.c, but I might be wrong. 
> > > > > > > > You shouldn't need a compat ioctl for anything new that is added, > > > > unless your api is really messed up. Please test to be sure, and > > > > not use a compat ioctl at all, it isn't that hard to do. > > > > > > compat_ioctl is called anyhow when CONFIG_COMPAT is set, there is no > > > way around it, or I'm missing something? Actually there is no more > > > than that for the COMPAT support in this code. > > > > If you don't provide a compat_ioctl() all should be fine, right? > > No, this doesn't work the driver has to provide compat_ioctl > > You would expect something like > if (!f.file->f_op->compat_ioctl) { > error = f_op->f.file->f_op->unlocked_ioctl((f.file, > cmd, compat_ptr(arg)) > } > But there is no such code under fs/compat_ioctl.c > > The translation has to implemented by the device driver or registered > directly in fs/compat_ioct.c in do_ioctl_trans or ioctl_pointer[] > > If compat_ioct is not provided the application is receiving > : ioctl failure -1: Inappropriate ioctl for device Care to submit a patch? We should not really have to include compat_ioctl support if it is already compatible... Or maybe better provide empty function drivers can fill in when compatible...? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
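[Editorial sketch] The review comment that a new ioctl "should certainly be designed in a way that it does not need compat support" usually comes down to the argument layout: use only fixed-width fields and carry user pointers as 64-bit integers, so that 32-bit and 64-bit callers pass byte-identical structures. The following is a minimal userspace sketch of that idea, assuming a hypothetical struct demo_ioc_cmd. It is not the layout from the RPMB patch, and the helper names are made up. It covers the layout question only; as the thread points out, the driver at the time still had to supply a compat_ioctl entry for 32-bit callers to reach it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request layout; NOT the structure from the RPMB patch. */
struct demo_ioc_cmd {
	uint32_t flags;		/* command flags */
	uint32_t nframes;	/* number of frames in the buffer */
	uint64_t frames_ptr;	/* user pointer carried as a 64-bit integer */
};

/*
 * Two 32-bit fields place the 64-bit field at offset 8 whether uint64_t
 * is 4- or 8-byte aligned, so no ABI inserts padding here.
 */
_Static_assert(sizeof(struct demo_ioc_cmd) == 16,
	       "layout must not depend on the word size");
_Static_assert(offsetof(struct demo_ioc_cmd, frames_ptr) == 8,
	       "64-bit field must be naturally aligned");

/* Caller side: widen a native pointer into the fixed-size field. */
static inline uint64_t demo_ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Receiver side: narrow the field back to a native pointer. */
static inline void *demo_u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```

With a layout like this the handler never needs to translate the structure itself, which is the "already compatible" case the last paragraph of the message above refers to.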
Re: Regression - SATA disks behind USB ones on v4.8-rc1, breaking boot. [Re: Who reordered my disks (probably v4.8-rc1 problem)]
Hi!

On Sun 2016-08-14 18:17:39, Tom Yan wrote:
> On 14 August 2016 at 18:07, Tom Yan <tom.t...@gmail.com> wrote:
> > On 14 August 2016 at 18:01, Pavel Machek <pa...@ucw.cz> wrote:
> >>
> >> Since SATA support was merged, certainly since v2.4, and from way
> >> before /dev/disk/by-id existed.
> >
> > I have no idea how "SATA before USB" had been done in the past (if it
> > was ever a thing in the kernel), but that has not been the case since
> > at least v3.0 AFAIR.

It is the case in v4.6. We had the hda->sda change for SATA drives a
long time ago; it has been stable since then.

> > No, but you can always use root=PARTUUID=, that's built into the
> > kernel. (root=UUID= requires udev or so though).
>
> Silly me. root=UUID= has nothing to do with udev, but `blkid` in
> util-linux. At least that's how it's done in Arch/mkinitcpio.

I'd rather not mess with initrd, and initrd was not required in the
past.

kernel-parameters.txt only mentions UUID= in connection with resume. Is
the documentation correct?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH v5 4/8] char: rpmb: provide a user space interface
On Mon 2016-07-18 23:27:49, Tomas Winkler wrote:
> The user space API is achieved via two synchronous IOCTL.

IOCTLs?

> Simplified one, RPMB_IOC_REQ_CMD, were read result cycles is performed
> by the framework on behalf the user and second, RPMB_IOC_SEQ_CMD where
> the whole RPMB sequence including RESULT_READ is supplied by the caller.
> The latter is intended for easier adjusting of the applications that
> use MMC_IOC_MULTI_CMD ioctl.

Why " "?

> Signed-off-by: Tomas Winkler

> +
> +static long rpmb_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
> +{
> +	return __rpmb_ioctl(fp, cmd, (void __user *)arg);
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long rpmb_compat_ioctl(struct file *fp, unsigned int cmd,
> +			      unsigned long arg)
> +{
> +	return __rpmb_ioctl(fp, cmd, compat_ptr(arg));
> +}
> +#endif /* CONFIG_COMPAT */

Description of the ioctl is missing, and it should certainly be
designed in a way that it does not need compat support.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH v5 0/8] Replay Protected Memory Block (RPMB) subsystem
Hi!

> Few storage technologies such is EMMC, UFS, and NVMe support RPMB
> hardware partition with common protocol and frame layout.
> The RPMB partition cannot be accessed via standard block layer, but by a
> set of specific commands: WRITE, READ, GET_WRITE_COUNTER, and
> PROGRAM_KEY.
> Such a partition provides authenticated and replay protected access,
> hence suitable as a secure storage.

...and that is suitable for locking devices away from their owners, as
the Nokia N9 (aka brick, because Microsoft turned off support servers)
taught me recently.

So I have to ask -- what are non-evil uses for this?

There were "secure extensions" mentioned before, but my understanding
is that it currently has severe limitations making it unsuitable for
the mainline kernel. (IOW you can't even test the functionality if you
are not Intel).

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH 2/8] char: rpmb: add sysfs-class ABI documentation
On Sun 2016-04-03 12:42:46, Tomas Winkler wrote:
> Signed-off-by: Tomas Winkler
> ---
>  Documentation/ABI/testing/sysfs-class-rpmb | 15 +++
>  MAINTAINERS                                |  1 +
>  2 files changed, 16 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-class-rpmb
>
> diff --git a/Documentation/ABI/testing/sysfs-class-rpmb b/Documentation/ABI/testing/sysfs-class-rpmb
> new file mode 100644
> index ..62d1959bf19e
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-class-rpmb
> @@ -0,0 +1,15 @@
> +What:		/sys/class/rpmb/
> +Date:		Mar 2016
> +KernelVersion:	TBD
> +Contact:	Tomas Winkler
> +Description:
> +		The rpmb/ class sub-directory belongs to RPMB device class
> +
> +
> +What:		/sys/class/rpmb/rpmbN/
> +Date:		Mar 2016
> +KernelVersion:	TBD
> +Contact:	Tomas Winkler
> +Description:
> +		The /sys/class/rpmb/rpmbN directory is created for
> +		each RPMB registered device

Umm. Can we get some better documentation? This is useless. An
interested reader would also like to know what "RPMB" is, and why he
should be interested...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: sequential I/O on SSD disk varies from 20 to 300 MBytes/s every week
Hi!

> During the testing period of about 5 months I have concluded:
>
> 1) There are 3 identical Fujitsu RX200 S6 test servers which all show
>    the same problem, but I also reproduced it on some Sun Fire and Dell
>    server.
> 2) The problem happens with both HW RAID (MegaRAID SAS 2108) and when
>    disks were directly on integrated SATA card.
> 3) The problem happens with different Kernel versions (tried 3.14,
>    3.16, 3.18)
> 4) The problem happens with newest FW/BIOS versions and on older
>    version
> 5) I have checked/replaced the cabling.
> 6) It is not a caching issue (controller/disk caches were off during
>    testing, but even putting them on had minor impact on the results)
> 7) The problem happens with both 2.5" SATA (12 x HGST Travelstar 1TB,
>    3 x WD Black 750G), and SSD disks (3 x Samsung Pro 840)
> 8) I have NOT been able to reproduce it on Windows - the speeds have
>    been good for all disks at all times.
> 9) Changing the disks (eg. taking currently slow disk and putting it to
>    another server) has mixed results - it usually triggers some change
>    of speed (slow becomes fast or vice-versa) but not always.
>
> The only thing that somewhat correlates with the change of speed is the
> environment: the IO speed of disks is generally better when testing in
> the office vs if that exact same server is in the server room. It might
> just been luck, however.
>
> I did not find correlation with the uptime, restarts, change of
> temperature, etc, so I assumed it might be the vibrations/rotations for
> SATA disks, but now that I have reproduced it with expensive SSD disks
> as well, I am out of ideas.

That's strange. Vibrations? But not for SSDs. Does hwmon say anything
interesting? Anything in smart?

> Only 20Mbytes/s on SSD must be wrong, right? (Especially if week earlier
> or week later it is ~300MBytes/s).

Yes. Can you try the disks in different mainboard (but keep software
version?) Are there any other performance problems?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM
On Mon 2014-09-22 13:23:54, Dmitry Torokhov wrote:
> On Monday, September 22, 2014 09:49:06 PM Pavel Machek wrote:
> > On Thu 2014-09-11 13:23:54, Dmitry Torokhov wrote:
> > > On Thu, Sep 11, 2014 at 12:59:25PM -0700, James Bottomley wrote:
> > > > Yes, but we mostly do this anyway. SCSI for instance does
> > > > asynchronous scanning of attached devices (once the cards are
> > > > probed)
> > >
> > > What would it do if card was a bit slow to probe?
> > >
> > > > but has a sync point for ordering.
> > >
> > > Quite often we do not really care about ordering of devices. I mean,
> > > does it matter if your mouse is discovered before your keyboard or
> > > after?
> >
> > Actually yes, I suspect it does. I do evtest /dev/input/eventX by
> > hand, occasionally. It would be annoying if they moved between
> > reboots.
>
> I am sorry but you will have to cope with such annoyances. It's not like
> we fail to boot the box here.
>
> The systems are now mostly hot-pluggable and userland is supposed to
> handle it, and it does, at least for input devices. If you want stable
> naming use udev facilities to rename devices as needed or add needed
> symlinks (by-id, etc.).

Well, it would be nice if udev was not mandatory.

Do the sync points for ordering actually cost us something?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM
On Thu 2014-09-11 13:23:54, Dmitry Torokhov wrote:
> On Thu, Sep 11, 2014 at 12:59:25PM -0700, James Bottomley wrote:
> > On Tue, 2014-09-09 at 16:01 -0700, Dmitry Torokhov wrote:
> > > On Tuesday, September 09, 2014 03:46:23 PM James Bottomley wrote:
> > > > On Wed, 2014-09-10 at 07:41 +0900, Tejun Heo wrote:
> > > > > The thing is that we have to have dynamic mechanism to listen for
> > > > > device attachments no matter what and such mechanism has been in
> > > > > place for a long time at this point. The synchronous wait simply
> > > > > doesn't serve any purpose anymore and kinda gets in the way in
> > > > > that it makes it a possibly extremely slow process to tell
> > > > > whether loading of a module succeeded or not because the wait for
> > > > > the initial round of probe is piggybacked.
> > > >
> > > > OK, so we just fire and forget in userland ... why bother inventing
> > > > an elaborate new infrastructure in the kernel to do exactly what
> > > > modprobe mod would do?
> > >
> > > Just so we do not forget: we also want the no-modules case to also be
> > > able to probe asynchronously so that a slow device does not stall
> > > kernel booting.
> >
> > Yes, but we mostly do this anyway. SCSI for instance does asynchronous
> > scanning of attached devices (once the cards are probed)
>
> What would it do if card was a bit slow to probe?
>
> > but has a sync point for ordering.
>
> Quite often we do not really care about ordering of devices. I mean,
> does it matter if your mouse is discovered before your keyboard or
> after?

Actually yes, I suspect it does. I do evtest /dev/input/eventX by hand,
occasionally. It would be annoying if they moved between reboots.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH 1/15] block copy: initial XCOPY offload support
On Tue 2014-07-15 15:34:47, Mikulas Patocka wrote:
> This is Martin Petersen's xcopy patch
> (https://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=xcopy&id=0bdeed274e16b3038a851552188512071974eea8)
> with some bug fixes, ported to the current kernel.
>
> This patch makes it possible to use the SCSI XCOPY command.
>
> We create a bio that has REQ_COPY flag in bi_rw and a bi_copy structure
> that defines the source device. The target device is defined in the
> bi_bdev and bi_iter.bi_sector.
>
> There is a new BLKCOPY ioctl that makes it possible to use XCOPY from
> userspace. The ioctl argument is a pointer to an array of four uint64_t
> values.

But it is there only for block devices, right? Is there plan to enable
tools such as /bin/cp to use XCOPY?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH] enclosure: add support for enclosure services
On Wed 2008-02-13 09:45:02, Kristen Carlson Accardi wrote:
> On Tue, 12 Feb 2008 13:28:15 -0600
> James Bottomley [EMAIL PROTECTED] wrote:
> > On Tue, 2008-02-12 at 11:07 -0800, Kristen Carlson Accardi wrote:
> > > I understand what you are trying to do - I guess I just doubt the
> > > value you've added by doing this. I think that there's going to be
> > > so much customization that system vendors will want to add, that
> > > they are going to wind up adding a custom library regardless, so
> > > standardising those few things won't buy us anything.
> >
> > It depends ... if you actually have a use for the customisations, yes.
> > If you just want the basics of who (what's in the enclosure), what
> > (activity) and where (locate) then I think it solves your problem
> > almost entirely.
> >
> > So, entirely as a straw horse, tell me what else your enclosures
> > provide that I haven't listed in the four points. The SES standards
> > too provide a huge range of things that no-one ever seems to implement
> > (temperature, power, fan speeds etc).
> >
> > I think the users of enclosures fall into these categories
> >
> > 85% just want to know where their device actually is (i.e. that sdc is
> >     in enclosure slot 5)
> > 50% like watching the activity lights
> > 30% want to be able to have a visual locate function
> > 20% want a visual failure indication (the other 80% rely on some OS
> >     notification instead)
> >
> > When you add up the overlapping needs, you get about 90% of people
> > happy with the basics that the enclosure services provide. Could
> > there be more ... sure; should there be more ... I don't think so ...
> > that's what value add the user libraries can provide.
> >
> > James
>
> I don't think I'm arguing whether or not your solution may work, what I
> am arguing is really a more philosophical point. Not can we do it this
> way, but should we do it this way.

I am of the opinion that hw abstraction is still the kernel's job.
That's why we have LEDs exported in sysfs... let vendors have their
libraries, but let's put the 'everyone does these' stuff in the kernel.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: SCSI power management for AHCI
On Wed 2008-01-16 10:21:35, Alan Stern wrote:
> On Tue, 15 Jan 2008, Pavel Machek wrote:
> > Hi!
> >
> > This is my first attempt at ahci autosuspend. It is _very_ hacky at
> > this moment, I'll seriously need to clean it up. But it seems to work
> > here.
>
> How does this interact with Link Power Management? Should there be a
> stronger connection between the two?

Link Power Management seems to be hw stuff, and I do not think much
linking is necessary.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
SCSI power management for AHCI
Hi!

This is my first attempt at ahci autosuspend. It is _very_ hacky at
this moment, I'll seriously need to clean it up. But it seems to work
here.

It includes Alan Stern's patches. I guess I could/should produce
separate version.

								Pavel

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 54f38c2..58c558c 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -32,6 +32,7 @@
  *
  */
 
+#define DEBUG
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/pci.h>
@@ -259,8 +260,8 @@ static void ahci_fill_cmd_slot(struct ah
 			       u32 opts);
 #ifdef CONFIG_PM
 static int ahci_port_suspend(struct ata_port *ap, pm_message_t mesg);
-static int ahci_pci_device_suspend(struct pci_dev *pdev, pm_message_t mesg);
-static int ahci_pci_device_resume(struct pci_dev *pdev);
+int ahci_pci_device_suspend(struct pci_dev *pdev, pm_message_t mesg);
+int ahci_pci_device_resume(struct pci_dev *pdev);
 #endif
 
 static struct class_device_attribute *ahci_shost_attrs[] = {
@@ -268,6 +269,37 @@ static struct class_device_attribute *ah
 	NULL
 };
 
+struct pci_dev *my_pdev;
+int ahci_runtime_suspend(struct pci_dev *pdev);
+int ahci_runtime_resume(struct pci_dev *pdev);
+int autosuspend_enabled;
+
+/* The host and its devices are all idle so we can autosuspend */
+static int autosuspend(struct Scsi_Host *host)
+{
+	if (my_pdev && autosuspend_enabled) {
+		printk("ahci: should autosuspend\n");
+		ahci_runtime_suspend(my_pdev);
+		return 0;
+	}
+	printk("ahci: autosuspend disabled\n");
+	return -EINVAL;
+}
+
+/* The host needs to be autoresumed */
+static int autoresume(struct Scsi_Host *host)
+{
+	if (my_pdev && autosuspend_enabled) {
+		printk("ahci: should autoresume\n");
+		ahci_runtime_resume(my_pdev);
+		return 0;
+	}
+	printk("ahci: autoresume disabled\n");
+	return -EINVAL;
+}
+
+
+
 static struct scsi_host_template ahci_sht = {
 	.module			= THIS_MODULE,
 	.name			= DRV_NAME,
@@ -286,6 +318,8 @@ static struct scsi_host_template ahci_sh
 	.slave_destroy		= ata_scsi_slave_destroy,
 	.bios_param		= ata_std_bios_param,
 	.shost_attrs		= ahci_shost_attrs,
+	.autosuspend		= autosuspend,
+	.autoresume		= autoresume,
 };
 
 static const struct ata_port_operations ahci_ops = {
@@ -1810,6 +1844,10 @@ static void ahci_thaw(struct ata_port *a
 
 static void ahci_error_handler(struct ata_port *ap)
 {
+	struct ata_host *host = ap->host;
+	int rc;
+	extern int slept;
+
 	if (!(ap->pflags & ATA_PFLAG_FROZEN)) {
 		/* restart engine */
 		ahci_stop_engine(ap);
@@ -1916,7 +1954,7 @@ static int ahci_port_suspend(struct ata_
 	return rc;
 }
 
-static int ahci_pci_device_suspend(struct pci_dev *pdev, pm_message_t mesg)
+int ahci_pci_device_suspend(struct pci_dev *pdev, pm_message_t mesg)
 {
 	struct ata_host *host = dev_get_drvdata(&pdev->dev);
 	void __iomem *mmio = host->iomap[AHCI_PCI_BAR];
@@ -1936,7 +1974,8 @@ static int ahci_pci_device_suspend(struc
 	return ata_pci_device_suspend(pdev, mesg);
 }
 
-static int ahci_pci_device_resume(struct pci_dev *pdev)
+
+int ahci_pci_device_resume(struct pci_dev *pdev)
 {
 	struct ata_host *host = dev_get_drvdata(&pdev->dev);
 	int rc;
@@ -1945,7 +1984,7 @@ static int ahci_pci_device_resume(struct
 	if (rc)
 		return rc;
 
-	if (pdev->dev.power.power_state.event == PM_EVENT_SUSPEND) {
+	if (1) {
 		rc = ahci_reset_controller(host);
 		if (rc)
 			return rc;
@@ -1957,6 +1996,55 @@ static int ahci_pci_device_resume(struct
 
 	return 0;
 }
+
+int ahci_runtime_suspend(struct pci_dev *pdev)
+{
+	struct ata_host *host = dev_get_drvdata(&pdev->dev);
+	int i;
+
+	printk("ahci_runtime_suspend...\n");
+
+#if 0
+	for (i = 0; i < host->n_ports; i++) {
+		struct ata_port *ap = host->ports[i];
+
+		if (ata_port_is_dummy(ap))
+			continue;
+
+		printk("suspending ata port %d...\n", i);
+		ahci_port_suspend(ap, PMSG_SUSPEND);
+		printk("done suspending port %d...\n", i);
+	}
+#endif
+
+	ata_pci_device_suspend(my_pdev, PMSG_SUSPEND);
+}
+
+int ahci_runtime_resume(struct pci_dev *pdev)
+{
+	struct ata_host *host = dev_get_drvdata(&pdev->dev);
+	int i;
+
+	printk("ahci_runtime_resume\n");
+	ata_pci_device_resume(my_pdev);
+
+#if 0
+
+	for (i = 0; i < host->n_ports; i++) {
+		struct ata_port *ap = host->ports[i];
+
+		if (ata_port_is_dummy(ap))
+			continue;
+
+		printk("resume ata port %d...\n", i);
+
Re: [linux-pm] Re: [RFC] Implementation of SCSI dynamic power management
When all the devices under a host are suspended, the LLD is informed (via a new autosuspend method in the host template) so that it can

That is most certainly a mistake.

Why?

Is there a good reason not to extend suspend() to take an extra argument for the reason it is called?

In fact suspend methods already do take an argument specifying the reason they were called. It wouldn't be hard to add a couple of extra PM_EVENT_* values for manual suspend and autosuspend. The problem is that resume methods don't take a corresponding argument.

Well, you could store the value in struct device or something.

There's another problem: if autosuspend is != NULL, you know the device supports autosuspend. If you call the existing suspend(PMSG_AUTOSUSPEND), and the driver does not support it, it will crash and burn.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
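The two designs argued about in this message can be compared in a small user-space sketch. It is not kernel code: every `demo_*` / `DEMO_*` name is hypothetical and invented for illustration, and the "old driver" merely powers the host off where a real unaware driver might do something worse. It only shows why a NULL check on an optional method lets the core know in advance whether autosuspend is supported, while a new event value passed to an existing suspend() does not.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; illustration only. */
enum demo_event {
	DEMO_EVENT_SUSPEND = 1,
	DEMO_EVENT_AUTOSUSPEND = 2,	/* the proposed extra PM_EVENT_* value */
};

enum demo_state { DEMO_RUNNING, DEMO_LINK_IDLE, DEMO_POWERED_OFF };

struct demo_host;

struct demo_host_template {
	int (*suspend)(struct demo_host *host, enum demo_event event);
	int (*autosuspend)(struct demo_host *host);	/* optional, may be NULL */
};

struct demo_host {
	const struct demo_host_template *sht;
	enum demo_state state;
};

/* Design A: separate optional method; NULL means "not supported". */
int demo_autosuspend_via_method(struct demo_host *host)
{
	if (host->sht->autosuspend == NULL)
		return -1;	/* driver is never called */
	return host->sht->autosuspend(host);
}

/* Design B: reuse suspend() and pass a new event value. */
int demo_autosuspend_via_event(struct demo_host *host)
{
	return host->sht->suspend(host, DEMO_EVENT_AUTOSUSPEND);
}

/* A driver written before autosuspend existed: it ignores the event. */
static int demo_old_suspend(struct demo_host *host, enum demo_event event)
{
	(void)event;
	host->state = DEMO_POWERED_OFF;
	return 0;
}

/* A driver that knows about autosuspend. */
static int demo_new_suspend(struct demo_host *host, enum demo_event event)
{
	host->state = (event == DEMO_EVENT_AUTOSUSPEND) ? DEMO_LINK_IDLE
							: DEMO_POWERED_OFF;
	return 0;
}

static int demo_new_autosuspend(struct demo_host *host)
{
	host->state = DEMO_LINK_IDLE;
	return 0;
}

const struct demo_host_template demo_old_driver = {
	.suspend	= demo_old_suspend,
	.autosuspend	= NULL,
};

const struct demo_host_template demo_new_driver = {
	.suspend	= demo_new_suspend,
	.autosuspend	= demo_new_autosuspend,
};
```

With `demo_old_driver`, design A returns an error and leaves the host running, whereas design B silently performs a full power-off; with `demo_new_driver` both paths end in the lighter idle state.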
Small cleanups for scsi_host.h
Small cleanups in scsi_host.h. Few #defines make me wonder if their
description is still up to date..?

Signed-off-by: Pavel Machek [EMAIL PROTECTED]

---
commit f37d85c2619d02ca383962e588417b9eacae366d
tree 4f2b2e93717c2f434e91ba32115a3801212e5e3d
parent 9918d25acff9540d86ccd93f6e8e536f1cb0a281
author Pavel [EMAIL PROTECTED] Mon, 17 Dec 2007 15:32:49 +0100
committer Pavel [EMAIL PROTECTED] Mon, 17 Dec 2007 15:32:49 +0100

 include/scsi/scsi_host.h | 46 ++
 1 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 0fd4746..69e5a4a 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -283,39 +283,45 @@ #endif
 	 * If the host wants to be called before the scan starts, but
 	 * after the midlayer has set up ready for the scan, it can fill
 	 * in this function.
+	 *
+	 * Status: OPTIONAL
 	 */
 	void (* scan_start)(struct Scsi_Host *);
 
 	/*
-	 * fill in this function to allow the queue depth of this host
-	 * to be changeable (on a per device basis). returns either
+	 * Fill in this function to allow the queue depth of this host
+	 * to be changeable (on a per device basis). Returns either
 	 * the current queue depth setting (may be different from what
 	 * was passed in) or an error. An error should only be
 	 * returned if the requested depth is legal but the driver was
 	 * unable to set it. If the requested depth is illegal, the
 	 * driver should set and return the closest legal queue depth.
 	 *
+	 * Status: OPTIONAL
 	 */
 	int (* change_queue_depth)(struct scsi_device *, int);
 
 	/*
-	 * fill in this function to allow the changing of tag types
+	 * Fill in this function to allow the changing of tag types
 	 * (this also allows the enabling/disabling of tag command
 	 * queueing). An error should only be returned if something
 	 * went wrong in the driver while trying to set the tag type.
 	 * If the driver doesn't support the requested tag type, then
 	 * it should set the closest type it does support without
 	 * returning an error. Returns the actual tag type set.
+	 *
+	 * Status: OPTIONAL
 	 */
 	int (* change_queue_type)(struct scsi_device *, int);
 
 	/*
-	 * This function determines the bios parameters for a given
+	 * This function determines the BIOS parameters for a given
 	 * harddisk. These tend to be numbers that are made up by
 	 * the host adapter. Parameters:
 	 * size, device, list (heads, sectors, cylinders)
 	 *
-	 * Status: OPTIONAL */
+	 * Status: OPTIONAL
+	 */
 	int (* bios_param)(struct scsi_device *, struct block_device *,
 			sector_t, int []);
 
@@ -354,7 +360,7 @@ #endif
 
 	/*
 	 * This determines if we will use a non-interrupt driven
-	 * or an interrupt driven scheme, It is set to the maximum number
+	 * or an interrupt driven scheme. It is set to the maximum number
 	 * of simultaneous commands a given host adapter will accept.
 	 */
 	int can_queue;
@@ -375,12 +381,12 @@ #endif
 	unsigned short sg_tablesize;
 
 	/*
-	 * If the host adapter has limitations beside segment count
+	 * Set this if the host adapter has limitations beside segment count.
 	 */
 	unsigned short max_sectors;
 
 	/*
-	 * dma scatter gather segment boundary limit. a segment crossing this
+	 * DMA scatter gather segment boundary limit. A segment crossing this
 	 * boundary will be split in two.
 	 */
 	unsigned long dma_boundary;
@@ -389,7 +395,7 @@ #endif
 	 * This specifies machine infinity for host templates which don't
 	 * limit the transfer size. Note this limit represents an absolute
 	 * maximum, and may be over the transfer limits allowed for
-	 * individual devices (e.g. 256 for SCSI-1)
+	 * individual devices (e.g. 256 for SCSI-1).
 	 */
 #define SCSI_DEFAULT_MAX_SECTORS	1024
 
@@ -416,12 +422,12 @@ #define SCSI_DEFAULT_MAX_SECTORS 1024
 	unsigned supported_mode:2;
 
 	/*
-	 * true if this host adapter uses unchecked DMA onto an ISA bus.
+	 * True if this host adapter uses unchecked DMA onto an ISA bus.
 	 */
 	unsigned unchecked_isa_dma:1;
 
 	/*
-	 * true if this host adapter can make good use of clustering.
+	 * True if this host adapter can make good use of clustering.
 	 * I originally thought that if the tablesize was large that it
 	 * was a waste of CPU cycles to prepare a cluster list, but
 	 * it works out that the Buslogic is faster if you use a smaller
@@ -431,7 +437,7 @@ #define SCSI_DEFAULT_MAX_SECTORS1024
Re: OOM killer gripe (was Re: What still uses the block layer?)
Hi!

Would an oom-kill-someone-now sysrq be of help, I wonder?

sysrq-f, IIRC.

*shrug* It might. I was letting it run, hoping it would complete itself, when it locked solid. (The keyboard LEDs weren't flashing, so I don't _think_ it panicked. I was in X so I wouldn't have seen a message...)

(To be honest, I can never remember how to trigger sysrq on a laptop keyboard. Presumably X won't intercept it the way it does alt-f1 and ctrl-alt-del...)

sysrq works even in X, and should be pressable on today's laptop keyboards...

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: OOM killer gripe (was Re: What still uses the block layer?)
Hi!

I suppose I should just configure suspending to a file instead of a swap partition, but I've just historically trusted suspend/resume to a swap partition much more than to a file. Or maybe I should hack in a sysctl to prevent any swapping even though the swap partition is configured (so only suspend/resume will use it).

swapon -a; swsusp; swapoff -a?

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: kdump detection in SCSI drivers
Hi!

How do we know when little memory is available?

Kernel already scales its hash tables according to total RAM available, perhaps you can use a similar mechanism?

Other suggestion which came about was to parse the kernel command line and look for elfcorehdr=. Is this ok? Is kernel command line visible to the SCSI drivers?

Kernel command line probably is visible, but I'd recommend against doing that.

								Pavel

Cc: Linux-scsi@vger.kernel.org; [EMAIL PROTECTED]
Subject: Re: kdump detection in SCSI drivers

Hi,

Is there a standard way for drivers (RAID) to detect if the current kernel is running in kdump mode? We would like to adjust driver behavior dynamically when kdump is active by scaling down resources.

Perhaps you should be automatically using little resources when little memory is available, or something? With upcoming kjump patches, it is more interesting than kdump vs. no kdump.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
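The suggestion to size driver resources from the amount of RAM, rather than from a "kdump or not" flag, could look roughly like the sketch below. It is a user-space illustration under stated assumptions: the function name, the limits and the "one command per MiB" rule are all hypothetical, and the thread does not say which scaling rule a real driver should use.

```c
#include <stdint.h>

/* Hypothetical limits for a RAID driver's outstanding-command pool. */
#define DEMO_MIN_CMDS	16u
#define DEMO_MAX_CMDS	1024u

/*
 * Assumed policy: allow one outstanding command per MiB of total RAM,
 * clamped to [DEMO_MIN_CMDS, DEMO_MAX_CMDS]. A kernel booted with very
 * little memory (as a capture kernel typically is) then gets a small
 * pool without the driver having to know why memory is scarce.
 */
unsigned int demo_scale_max_cmds(uint64_t total_ram_bytes)
{
	uint64_t cmds = total_ram_bytes >> 20;	/* bytes -> MiB */

	if (cmds < DEMO_MIN_CMDS)
		return DEMO_MIN_CMDS;
	if (cmds > DEMO_MAX_CMDS)
		return DEMO_MAX_CMDS;
	return (unsigned int)cmds;
}
```

The same function serves a normal boot, a kdump kernel and any other low-memory configuration, which is the point of preferring it over parsing the command line.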
Re: kdump detection in SCSI drivers
Hi,

Is there a standard way for drivers (RAID) to detect if the current kernel is running in kdump mode? We would like to adjust driver behavior dynamically when kdump is active by scaling down resources.

Perhaps you should be automatically using little resources when little memory is available, or something? With upcoming kjump patches, it is more interesting than kdump vs. no kdump.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [patch 2/4] Expose Power Management Policy option to users
Hi!

This patch will modify the scsi subsystem to allow users to set a power management policy for the link.

The scsi subsystem will create a new sysfs file for each host in /sys/class/scsi_host called link_power_management_policy. This file can have 3 possible values:

Value             Meaning
----------------  ------------------------------------------------------
min_power         User wishes the link to conserve power as much as
                  possible, even at the cost of some performance
max_performance   User wants priority to be on performance, not power
                  savings
medium_power      User wants power savings, with less performance cost
                  than min_power (but less power savings as well).

Has that anything to do with HIPM vs. DIPM?

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
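The three values listed for link_power_management_policy map naturally onto a small string-to-enum lookup. The sketch below is a plain C illustration, not the code from the patch under review: the enum, the function name and the handling of a trailing newline are assumptions made for the example; only the three policy strings come from the patch description.

```c
#include <string.h>

/* Hypothetical policy enum; only the three names come from the patch text. */
enum demo_link_pm_policy {
	DEMO_POLICY_INVALID = -1,
	DEMO_POLICY_MIN_POWER,
	DEMO_POLICY_MAX_PERFORMANCE,
	DEMO_POLICY_MEDIUM_POWER,
};

/*
 * Parse a string as it might be written to the sysfs file. A single
 * trailing newline (as produced by "echo") is tolerated; anything that
 * is not exactly one of the three names is rejected.
 */
enum demo_link_pm_policy demo_parse_policy(const char *buf)
{
	static const struct {
		const char *name;
		enum demo_link_pm_policy policy;
	} table[] = {
		{ "min_power",		DEMO_POLICY_MIN_POWER },
		{ "max_performance",	DEMO_POLICY_MAX_PERFORMANCE },
		{ "medium_power",	DEMO_POLICY_MEDIUM_POWER },
	};
	size_t len = strcspn(buf, "\n");	/* length without the newline */
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (strlen(table[i].name) == len &&
		    strncmp(buf, table[i].name, len) == 0)
			return table[i].policy;
	}
	return DEMO_POLICY_INVALID;
}
```

Rejecting unknown strings instead of guessing keeps a typo such as "min" from silently selecting a policy.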
Re: [PATCH 1/4] scsi: megaraid_sas -- add hibernation support
Hi!

The megaraid_sas driver doesn't support the hibernation, the suspend/resume routine implemented to support the hibernation.

Signed-off-by: Bo Yang [EMAIL PROTECTED]

I'm glad to see the first scsi driver to have hibernation support :-). Apart from whitespace, the patch looks ok to me.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [patch 0/3] AHCI Link Power Management
Hi!

I'm not sure about this. We need better PM framework to support powersaving in other controllers and some ahcis don't save much when only link power management is used,

do you have data to support this?

Yeah, it was some Lenovo notebook. Pavel is more familiar with the hardware. Pavel, what was the notebook which didn't save much power with standard SATA power save but needed port to be completely turned off?

Thinkpad x60. Same one Kristen probably used while developing the patch :-).

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [patch 0/3] AHCI Link Power Management
Hi!

Yeah, it was some Lenovo notebook. Pavel is more familiar with the hardware. Pavel, what was the notebook which didn't save much power with standard SATA power save but needed port to be completely turned off?

Pavel, if you have time, could you measure this with Kristen's patch?

Kristen has same machine as me, and I have seen similar '1W' saving with previous version of the patch. I'd trust her results.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [patch 0/3] AHCI Link Power Management
Hi!

I'm not sure about this. We need better PM framework to support powersaving in other controllers and some ahcis don't save much when only link power management is used,

do you have data to support this?

Yeah, it was some Lenovo notebook. Pavel is more familiar with the hardware. Pavel, what was the notebook which didn't save much power with standard SATA power save but needed port to be completely turned off?

Uhuh, now I understand why Arjan wanted me to test. But I have same hw as Kristen, so I assume there must have been something wrong with the old tests. Sorry for confusion.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [GIT PATCH] scsi updates for 2.6.12-rc2
Hi!

This is a small set of bugfixes for 2.6.12-rc2 ... you asked me to try git, so I did (I actually updated my bk backport script simply to export from a BK tree to a git tree). For the time being, I plan to keep the scsi changes in BK, but I'll export them for you to try merging

The patch (against kernel-test.git) is here

rsync://www.parisc-linux.org/~jejb/scsi-rc-fixes-2.6.git

Can you du -s on it? Just curious. I started rsync on it, but because it is not standard gzip files, it is difficult to see anything interesting...

Okay, so du -s is:

[EMAIL PROTECTED]:~# du -sh /tmp/delme.git/
109M	/tmp/delme.git/

Not as bad as I expected, but still quite a lot of data for few changes.

								Pavel
-- 
Boycott Kodak -- for their patent abuse against Java.