Re: SSD Optimizations
>>>>> "Gordan" == Gordan Bobic writes:

Gordan> I fully agree that it's important for wear leveling on flash
Gordan> media, but from the security point of view, I think TRIM would
Gordan> be a useful feature on all storage media. If the erased blocks
Gordan> were trimmed it would provide a potentially useful feature of
Gordan> securely erasing the sectors that are no longer used. It would
Gordan> be useful and much more transparent than the secure erase
Gordan> features that only operate on the entire disk. Just MHO.

Except there are no guarantees that TRIM does anything, even if the
drive claims to support it.

There are a couple of IDENTIFY DEVICE knobs that indicate whether the
drive deterministically returns data after a TRIM, and whether the
resulting data is zeroes. We query these values and report them to the
filesystem. However, testing revealed several devices that reported the
right thing but which did in fact return the old data afterwards.

--
Martin K. Petersen
Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
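On the Linux side, the queried values surface as attributes on the block queue in sysfs. A minimal sketch of inspecting them, assuming the attribute names exported by recent kernels (`discard_granularity`, `discard_max_bytes`, `discard_zeroes_data`); it is shown against a synthetic directory so it runs without a real device:

```shell
# Sketch: report a block queue's discard capabilities from sysfs.
# The file names match what recent Linux kernels export under
# /sys/block/<dev>/queue/; exact semantics are kernel-version dependent.
discard_caps() {
    queue="$1"   # e.g. /sys/block/sda/queue, or a test directory
    for f in discard_granularity discard_max_bytes discard_zeroes_data; do
        if [ -r "$queue/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$queue/$f")"
        fi
    done
}

# Demo against a synthetic queue directory with assumed values:
demo=$(mktemp -d)
echo 512        > "$demo/discard_granularity"
echo 2147450880 > "$demo/discard_max_bytes"
echo 0          > "$demo/discard_zeroes_data"
discard_caps "$demo"
rm -rf "$demo"
```

Note that, per the message above, a drive reporting "deterministic, zeroes" through these knobs is no guarantee of actual behavior.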
Re: SSD Optimizations
>>>>> "Gordan" == Gordan Bobic writes:

Gordan> SD == SSD with an SD interface.

No, not really. It is true that you could conceivably fit a
sophisticated controller in an SD card form factor. But the fact is
that takes up space which could otherwise be used for flash. There may
also be power consumption/heat dissipation concerns.

Most SD card controllers have very, very simple wear leveling that in
most cases relies on the filesystem being FAT. These cards are aimed at
cameras, MP3 players, etc. after all. And consequently it's trivial to
wear out an SD card by writing a block over and over.

The same is kind of true for Compact Flash. There are two types of
cards; I prefer to think of them as camera grade and industrial. Camera
grade CF is really no different from SD cards or any other consumer
flash form factor. Industrial CF cards have controllers with
sophisticated wear leveling. Usually this is not quite as clever as a
"big" SSD, but it is close enough that you can treat the device as a
real disk drive. I.e. it has multiple channels working in parallel,
unlike the consumer devices.

As a result of the smarter controller logic and the bigger bank of
spare flash, industrial cards are much smaller in capacity, typically
in the 1-4 GB range. But they are in many cases indistinguishable from
a real SSD in terms of performance and reliability.

Gordan> You can make an educated guess. For starters, given that visible
Gordan> sector sizes are not equal to FS block sizes, it means that FS
Gordan> blocks can straddle erase block boundaries without the flash
Gordan> controller, no matter how fancy, being able to determine this.
Gordan> Thus, at the very least, aligning FS structures so that they do
Gordan> not straddle erase block boundaries is useful in ALL cases.
Gordan> Thinking otherwise is just sticking your head in the sand
Gordan> because you cannot be bothered to think.

There are no means of telling what the erase block size is. None. We
have no idea. The vendors won't talk; it's part of their IP.

Also, there is no point in being hung up on the whole erase block
thing. Only crappy SSDs use block mapping where that matters. These
devices will die a horrible death soon enough. Good SSDs use a
technique akin to logging filesystems, in which the erase block size
and all the other physical characteristics don't matter (from a host
perspective).

--
Martin K. Petersen
Oracle Linux Engineering
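The straddling argument in the quoted text is easy to sketch numerically. The erase block size used below is purely hypothetical, since, as noted above, there is no way to query it:

```shell
# Sketch: does a filesystem block starting at byte offset OFF with size
# BSZ straddle an erase-block boundary of (hypothetical) size EBS?
straddles() {
    off=$1; bsz=$2; ebs=$3
    start_eb=$(( off / ebs ))
    end_eb=$(( (off + bsz - 1) / ebs ))
    if [ "$start_eb" -ne "$end_eb" ]; then echo yes; else echo no; fi
}

# A 4 KiB block ending just past a hypothetical 512 KiB erase boundary:
straddles $(( 512 * 1024 - 2048 )) 4096 $(( 512 * 1024 ))   # crosses -> yes
# The same block aligned at offset zero:
straddles 0 4096 $(( 512 * 1024 ))                          # contained -> no
```

The point of the reply stands, though: without knowing EBS, the host cannot actually compute this for a real drive.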
Re: A couple of questions
>>>>> "Paul" == Paul Millar writes:

Paul> My concern is that, if the server-software doesn't push the
Paul> client-provided checksum then the FS checksum (plus T-10 DIF/DIX)
Paul> would not provide a rigorous assurance that the bytes are the
Paul> same. Without this assurance, corruption could still occur; for
Paul> example, within the server's memory.

For DIX we allow integrity metadata conversion. Once the data is
received, the server generates appropriate IMD for the next layer. Then
the server verifies that the original IMD matches the data buffer. That
way there's no window of error. But obviously the ideal case is where
the same IMD can be passed throughout the stack without conversion.

I'm not sure what you use for file service? I believe NFSv4 allows for
checksums to be passed along. I have not looked at them closely yet,
though.

--
Martin K. Petersen
Oracle Linux Engineering
Re: A couple of questions
>>>>> "Paul" == Paul Millar writes:

Paul> Please correct me if I'm wrong here, but T10 DIF/DIX refers only
Paul> to data integrity protection from the OS's FS-level down to the
Paul> block device: a userland application doesn't know that it is
Paul> writing into a FS that is utilising DIX with a DIF-enabled storage
Paul> system.

My point was that it is possible to have different protection types in
play (and thus different checksums) as long as you overlap the
protection envelopes. At the expense of having to calculate checksums
multiple times, of course.

Paul> Unfortunately, any such solution would be btrfs-specific, since (I
Paul> believe) no one has standardised how to extend T10 into userspace.

Not yet, but we're working on a generic interface that would allow the
protection information to be attached. This is not going to be tied to
just T10 DIF. The current Linux block layer integrity framework handles
different types of protection information.

Paul> I believe NFS currently doesn't support checksums (as per v4.1).
Paul> Looking into more detail, Alok Aggarwal gave a talk at the 2006
Paul> Connectathon about this. Alok's slides have a nice diagram (slide
Paul> 11) showing the kind of end-to-end integrity I'm after. The issue
Paul> is how to achieve the assurance between "NFS Server" and "Local
Paul> FS" on the right.

Paul> For NFS, I believe there aren't any plans for introducing checksum
Paul> support in v4.2. Perhaps it'll appear with later minor versions of
Paul> the standard.

I haven't looked into this for a long time. Last time I talked to the
NFS folks they seemed to think it would be possible to bridge the two
methods.

--
Martin K. Petersen
Oracle Linux Engineering
Re: Future Linux filesystems
>>>>> "Joe" == Joe Peterson <[EMAIL PROTECTED]> writes:

Joe> You don't mention what I believe is the *key* issue (and I don't
Joe> think the author did either, but I skimmed his article): data
Joe> integrity. I'm not talking about blatant failures or known need
Joe> for an fsck, but rather silent corruption.

We're very concerned about data integrity. With btrfs everything is
checksummed at the logical level. This allows you to detect data
corruption, repair bad blocks using redundant, good copies, perform
data scrubbing, etc.

A related, but orthogonal, data integrity measure is the T10 DIF
infrastructure that I am working on. DIF enables protection at the
sector level and includes stuff like a data checksum and a locality
check which ensures that the sector ends up in the right place on disk.
If there is a mismatch, the I/O will be rejected by either the HBA or
the storage device. That allows us to catch a lot of the corruption
scenarios where we accidentally write bad stuff to disk.

Right now the DIF checksum is added at the block layer level. Work is
in progress to move it up into the filesystems and from there into user
space. Eventually we'd like to be able to generate the checksum in the
application and pass it along the I/O path all the way out to the
physical disk.

--
Martin K. Petersen
Oracle Linux Engineering
Re: ssd optimised mode
>>>>> "Dmitri" == Dmitri Nikulin writes:

Dmitri> That's excellent, but until consumer-level drives have the same
Dmitri> feature, the fact remains that consumer SSDs are a net loss in
Dmitri> reliability compared to consumer rotating disks,

My SSD testing has not been very promising in the data integrity
department. I've got a couple of drives here which end up corrupting
stuff every time they are reset or lose power. "Stuff" in this case is
on the order of megabytes, not a few sectors like on spinning media.

With disk drives the risk of garbling unrelated files is there but
relatively small. On SSDs it's much higher because of the big blocking
and the high latency erase/rewrite cycle. In several cases I've lost
system binaries that obviously weren't being written when I crashed the
system. In one case I even lost all of /sbin.

Dmitri> I'm just curious if there's anything that can be done in a
Dmitri> filesystem to minimise the damage of a lost eraseblock.

The problem is that we have no way of knowing what's inside each erase
block. We don't even know how big the erase block is.

--
Martin K. Petersen
Oracle Linux Engineering
Re: ssd optimised mode
>>>>> "Claudio" == Claudio Martins writes:

Claudio> What brand of SSDs are we talking about?

I tested two different products that both exhibited poor behavior
during frequent resets/power cycles. I can't really disclose details,
but rest assured I'm working with the vendors in question to get this
fixed.

Just last Friday I received a couple of updated drives that appear to
work correctly. But I haven't tested them extensively yet.

Claudio> What filesystems did you experiment with?

Some tests were done with ext3, but most tests were done using my own
integrity checking tooling and/or dt. In the filesystem case it was
trivial to map out the affected files in debugfs and correlate those to
a logical region on the disk.

--
Martin K. Petersen
Oracle Linux Engineering
Re: ssd optimised mode
>>>>> "Dmitri" == Dmitri Nikulin writes:

Dmitri> Already SanDisk are offering a proprietary "Extreme FFS"
Dmitri> (perhaps even based on Unix FFS) for Windows Vista only.

Extreme FFS is SanDisk's next-generation FTL/firmware. It's not a
filesystem that plugs into the OS.

--
Martin K. Petersen
Oracle Linux Engineering
Re: ssd optimised mode
>>>>> "Dmitri" == Dmitri Nikulin writes:

Dmitri> If that's the case, why is it marketed for Windows Vista only,
Dmitri> and referring to filesystem features like marking unused blocks?
Dmitri> Surely if it was at the device level it would be OS-neutral, and
Dmitri> marketed as such.

The article you posted references some benchmarketing numbers involving
Vista. That does not imply it's a Windows-only product.

http://en.wikipedia.org/wiki/ExtremeFFS

--
Martin K. Petersen
Oracle Linux Engineering
Re: [PATCH 02/13] scsi/osd: don't save block errors into req_results
Christoph,

> We will only have sense data if the command exectured and got a SCSI
> result, so this is pointless.

"executed"

Reviewed-by: Martin K. Petersen

--
Martin K. Petersen
Oracle Linux Engineering
Re: [PATCH 02/35] block: add REQ_OP definitions and bi_op/op fields
>>>>> "Mike" == mchristi writes:

+enum req_op {
+	REQ_OP_READ,
+	REQ_OP_WRITE		= REQ_WRITE,
+	REQ_OP_DISCARD		= REQ_DISCARD,
+	REQ_OP_WRITE_SAME	= REQ_WRITE_SAME,
+};

I have been irked by the REQ_ prefix in bios since the flags were
consolidated a while back. When I attempted to fix the READ/WRITE mess
I used a BLK_ prefix as a result.

Anyway. Just bikeshedding...

--
Martin K. Petersen
Oracle Linux Engineering
Re: [PATCH 00/35 v2] separate operations from flags in the bio/request structs
>>>>> "Mike" == mchristi writes:

Mike> The following patches begin to clean up the request->cmd_flags
Mike> and bio->bi_rw mess. We currently use cmd_flags to specify the
Mike> operation, attributes and state of the request. For bi_rw we use
Mike> it for similar info and also the priority, but then also have
Mike> another bi_flags field for state. At some point, we abused them so
Mike> much we just made cmd_flags 64 bits, so we could add more.

Mike> The following patches separate the operation (read, write,
Mike> discard, flush, etc.) from cmd_flags/bi_rw.

Mike> This patchset was made against linux-next from today Jan 5 2016.
Mike> (git tag next-20160105).

Very nice work. Thanks for doing this! I think it's a much needed
cleanup.

I focused mainly on the core block, discard, write same and sd.c pieces
and everything looks sensible to me.

I wonder what the best approach is to move a patch set with this many
stakeholders forward? Set a "speak now or forever hold your peace"
review deadline?

--
Martin K. Petersen
Oracle Linux Engineering
Re: [dm-devel] [PATCH] block: add a bi_error field to struct bio
>>>>> "Christoph" == Christoph Hellwig writes:

Christoph> The first one has the drawback of only communicating a single
Christoph> possible error (-EIO), and the second one has the drawback of
Christoph> not being persistent when bios are queued up, and are not
Christoph> passed along from child to parent bio in the ever more
Christoph> popular chaining scenario.

Christoph> So add a new bi_error field to store an errno value directly
Christoph> in struct bio and remove the existing mechanisms to clean all
Christoph> this up.

Having the error status separate from the bio has been a major
headache. I am entirely in favor of this patch.

It was a big chunk of changes to read through, but I did not spot any
obvious problems or polarity reversals. It would be nice to get the
respective fs/md/target driver folks to check their portions, though.

Reviewed-by: Martin K. Petersen

--
Martin K. Petersen
Oracle Linux Engineering
Re: [LSF/MM TOPIC] More async operations for file systems - async discard?
Keith,

> With respect to fs block sizes, one thing making discards suck is that
> many high capacity SSDs' physical page sizes are larger than the fs
> block size, and a sub-page discard is worse than doing nothing.

That ties into the whole zeroing-as-a-side-effect thing.

The devices really need to distinguish between discard-as-a-hint, where
the device is free to ignore anything that's not a whole multiple of
whatever the internal granularity is, and the WRITE ZEROES use case,
where the end result needs to be deterministic.

--
Martin K. Petersen
Oracle Linux Engineering
Re: [LSF/MM TOPIC] More async operations for file systems - async discard?
Jeff,

> We've always been told "don't worry about what the internal block size
> is, that only matters to the FTL." That's obviously not true, but
> when devices only report a 512 byte granularity, we believe them and
> will issue discard for the smallest size that makes sense for the file
> system regardless of whether it makes sense (internally) for the SSD.
> That means 4k for pretty much anything except btrfs metadata nodes,
> which are 16k.

The devices are free to report a bigger discard granularity. We already
support and honor that (for SCSI, anyway). It's completely orthogonal
to the reported logical block size, although it obviously needs to be a
multiple of it.

The real problem is that vendors have zero interest in optimizing for
discard. They are so confident in their FTL and overprovisioning that
they don't view it as an important feature. At all.

Consequently, many of the modern devices that claim to support discard
to make us software folks happy (or to satisfy purchase order
requirements) complete the commands without doing anything at all.
We're simply wasting queue slots.

Personally, I think discard is dead on anything but the cheapest
devices. And on those it is probably going to be
performance-prohibitive to use it in any other way than a weekly
fstrim.

--
Martin K. Petersen
Oracle Linux Engineering
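The granularity honoring described above can be sketched as simple arithmetic: round a requested discard range inward to the reported granularity and drop anything smaller. This mirrors the behavior in spirit, not any specific kernel code:

```shell
# Sketch: align a discard request (byte start/length) to the device's
# reported discard granularity. Sub-granularity head/tail spans are
# trimmed off, and a request smaller than the granularity is dropped --
# echoing the point that such discards are at best a no-op.
align_discard() {
    start=$1; len=$2; gran=$3
    astart=$(( (start + gran - 1) / gran * gran ))   # round start up
    aend=$(( (start + len) / gran * gran ))          # round end down
    if [ "$aend" -gt "$astart" ]; then
        echo "$astart $(( aend - astart ))"
    else
        echo "drop"
    fi
}

align_discard 4096 1048576 65536    # 1 MiB request, 64 KiB granularity
align_discard 4096 8192 65536       # smaller than granularity -> drop
```

With a 512-byte reported granularity, nothing gets dropped, which is exactly why the 4k-everywhere behavior Jeff describes falls out.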
Re: [LSF/MM TOPIC] More async operations for file systems - async discard?
Roman,

>> Consequently, many of the modern devices that claim to support
>> discard to make us software folks happy (or to satisfy purchase
>> order requirements) complete the commands without doing anything at
>> all. We're simply wasting queue slots.
>
> Any example of such devices? Let alone "many"? Where you would issue
> a full-device blkdiscard, but then just read back old data.

I obviously can't mention names or go into implementation details. But
there are many drives out there that return old data. And that's
perfectly within spec.

At least some of the pain in the industry in this department can be
attributed to us Linux folks and RAID device vendors. We all wanted
deterministic zeroes on completion of DSM TRIM, UNMAP, or DEALLOCATE.
The device vendors weren't happy about that and we ended up with weasel
language in the specs. This led to the current libata whitelist mess
for SATA SSDs and ongoing vendor implementation confusion in SCSI and
NVMe devices.

On the Linux side the problem was that we originally used discard for
two distinct purposes: clearing block ranges and deallocating block
ranges. We cleaned that up a while back and now have BLKZEROOUT and
BLKDISCARD. Those operations get translated to different commands
depending on the device. We also cleaned up several of the
inconsistencies in the SCSI and NVMe specs to facilitate making this
distinction possible in the kernel.

In the meantime the SSD vendors made great strides in refining their
flash management. To the point where pretty much all enterprise device
vendors will ask you not to issue discards. The benefits simply do not
outweigh the costs.

If you have special workloads where write amplification is a major
concern it may still be advantageous to do the discards and reduce WA
and prolong drive life. However, these workloads are increasingly
moving away from the classic LBA read/write model. Open Channel
originally targeted this space. Right now work is underway on Zoned
Namespaces and Key-Value command sets in NVMe. These curated
application workload protocols are fundamental departures from the
traditional way of accessing storage. And my postulate is that where
tail latency and drive lifetime management are important, those new
command sets offer much better bang for the buck. And they make the
notion of discard completely moot. That's why I don't think it's going
to be terribly important in the long term.

This leaves consumer devices and enterprise devices using the
traditional LBA I/O model.

For consumer devices I still think fstrim is a good compromise. Lack of
queuing for DSM hurt us for a long time. And when it was finally added
to the ATA command set, many device vendors got their implementations
wrong. So it sucked for a lot longer than it should have. And of course
FTL implementations differ.

For enterprise devices we're still in the situation where vendors
generally prefer for us not to use discard. I would love for the
DEALLOCATE/WRITE ZEROES mess to be sorted out in their FTLs, but I have
fairly low confidence that it's going to happen. Case in point: despite
a lot of leverage and purchasing power, the cloud industry has not been
terribly successful in compelling the drive manufacturers to make
DEALLOCATE perform well for typical application workloads. So I'm not
holding my breath...

--
Martin K. Petersen
Oracle Linux Engineering
Re: Is it possible for the ext4/btrfs file system to pass some context related info to low level block driver?
>>>>> "Yunpeng" == Gao, Yunpeng writes:

Yunpeng> So, my question is, is there any plan or discussion on
Yunpeng> supporting this feature (passing data type info to low level
Yunpeng> block device driver) in file system development? Especially
Yunpeng> for ext4/btrfs, since now they are very hot in Linux? Thanks.

Yes, I have been working on some changes that allow us to tag bios and
pass the information out to storage. These patches have been on the
back burner for a while due to other commitments. But I'll dig them out
and post them later. We just discussed them a couple of weeks ago at
the Linux Storage Workshop.

In the meantime: can you point me to the relevant eMMC stuff so I can
see how many tiers or classes we have to work with there?

--
Martin K. Petersen
Oracle Linux Engineering
Re: discard synchronous on most SSDs?
>>>>> "Marc" == Marc MERLIN writes:

Marc,

Marc> So I have SATA 3.1, that's great news, it means I can keep using
Marc> discard without worrying about performance and hangs

The fact that the drive reports compliance with a certain version of
SATA does not in any way imply that it implements all commands defined
in that specification.

The location where queued TRIM support is reported is somewhat unusual.
And last I looked hdparm -I had no infrastructure in place to report
stuff contained in log pages.

The kernel does look in the right place to determine whether to issue
the queued or unqueued variant. But the information isn't exported to
userland. So right now I'm afraid we don't have a good way for a user
to determine whether a device supports queued trims or not.

I guess we could consider either adding an ATA-specific "I don't suck"
flag in sysfs, adding the missing code to hdparm, or both...

--
Martin K. Petersen
Oracle Linux Engineering
Re: discard synchronous on most SSDs?
>>>>> "Chris" == Chris Samuel writes:

Chris> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)

Chris> Of course that's what the drive is reporting it supports, I'm not
Chris> sure whether that's the result of what has been negotiated
Chris> between the controller and drive or purely what the drive
Chris> supports.

It's just what the drive reports. Often drives will implement features
before they are ratified in the spec, and thus before they can claim
compliance with a specific version.

--
Martin K. Petersen
Oracle Linux Engineering
Re: discard synchronous on most SSDs?
>>>>> "Chris" == Chris Samuel writes:

Chris> It looks like drives that do support it can be detected with the
Chris> kernel helper function ata_fpdma_dsm_supported() defined in
Chris> include/linux/libata.h.

Chris> I wonder if it would be possible to use that knowledge to extend
Chris> smartctl's --identify functionality to report this?

Queued trim support is indicated in a log page and not the identify
information. However, we can get to the information we want using
smartctl's ability to look at log pages.

I don't have a single drive from any vendor in the lab that supports
queued trim, not even a prototype. I went out and bought an 840 EVO
this morning because the general lazyweb opinion seemed to indicate
that this drive supports queued trim. Well, it doesn't. At least not in
the 120GB version:

# smartctl -l gplog,0x13 /dev/sda
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.14.0-rc6+] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

General Purpose Log 0x13 does not exist (override with '-T permissive' option)

If there's a drive with a working queued trim implementation out there,
I'd like to know about it...

--
Martin K. Petersen
Oracle Linux Engineering
Re: Using BTRFS on SSD now ?
>>>>> "Duncan" == Duncan <1i5t5.dun...@cox.net> writes:

Duncan> OTOH, certain high-performance hardware goes beyond the current
Duncan> standard and does a queued trim, without forcing a flush of the
Duncan> queue in the process. But this hardware tends to be rather rare
Duncan> and expensive,

Queued trim has started to appear in consumer SSDs. However, since
we're the only OS that supports it, the feature has gotten off to a
bumpy start. We tried to enable it on a drive model that passed testing
here but we had to revert to unqueued when bug reports started rolling
in this week.

Duncan> (FWIW, in new enough versions of smartctl, smartctl -i will have
Duncan> a "SATA Version is:" line, but even my newer Corsair Neutrons
Duncan> report only SATA 2.5, so obviously they don't support queued
Duncan> trim by the standard, tho it's still possible they implement it
Duncan> beyond the standard, I simply don't know.)

The reported SATA version is in no way indicative of whether a drive
supports queued trim. The capability flag was put in a highly unusual
place in the protocol.

I posted a patch that makes this information available in sysfs a while
back. However, the patch is currently being reworked to support a debug
override... Until then, the following command will give you the answer:

# smartctl -l gplog,0x13 /dev/sda | grep :
000: 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ||
     ^^          ^^

These two 01 fields indicate that the drive supports queued trim.

--
Martin K. Petersen
Oracle Linux Engineering
Re: Using BTRFS on SSD now ?
>>>>> "Pavel" == Pavel Volkov writes:

Pavel> Is mine a dinosaur drive? [...]

Pavel> General Purpose Log 0x13 does not exist (override with '-T
Pavel> permissive' option)

The only flags defined in general purpose log page 0x13 are the queued
trim ones. So there is no compelling reason for a drive vendor to
implement that page unless the drive actually supports queued trim. And
consequently it's perfectly normal for that page to be absent.

--
Martin K. Petersen
Oracle Linux Engineering
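Given captured `smartctl -l gplog,0x13` output in the format shown earlier in the thread, checking the two flag bytes can be sketched like this (the byte offsets, 0 and 4 of the log page, are assumed from that example; this is not a substitute for proper ATA log parsing):

```shell
# Sketch: read smartctl gplog output on stdin and report whether the
# two queued-trim flag bytes (offsets 0 and 4 on the "000:" line, i.e.
# awk fields $2 and $6) are both set to 01.
has_queued_trim() {
    awk '/^000:/ { s = ($2 == "01" && $6 == "01") ? "yes" : "no"; print s }'
}

# Demo against the hex dump shown in the earlier message:
printf '000: 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ||\n' |
    has_queued_trim
```

A drive where the page is absent (as in Pavel's case) produces no "000:" line at all, and the function prints nothing, which per the reply above simply means no queued trim.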
Re: btrfs on whole disk (no partitions)
>>>>> "Chris" == Chris Murphy writes:

Chris> Does anyone know if blktrace will intercept the actual SCSI
Chris> commands sent to the drive? Or is there a better utility to use
Chris> for this? When I use it unfiltered, I'm not seeing SCSI write
Chris> commands at all.

# echo scsi:scsi_dispatch_cmd_start > /sys/kernel/debug/tracing/set_event
# echo 1 > /sys/kernel/debug/tracing/tracing_on

[do stuff]

# cat /sys/kernel/debug/tracing/trace

--
Martin K. Petersen
Oracle Linux Engineering
Re: btrfs on whole disk (no partitions)
>>>>> "Duncan" == Duncan <1i5t5.dun...@cox.net> writes:

Duncan> Tho as you point out elsewhere, levels under the filesystem
Duncan> layer may split the btrfs 4096 byte block size into 512 byte
Duncan> logical sector sizes if appropriate, but that has nothing to do
Duncan> with btrfs except that it operates on top of that.

The notion of "splitting into a different block size" is a bit
confusing. The filesystem submits an N-byte I/O. Whether the logical
block size is 512 or 4096 doesn't really matter; we're still
transferring N bytes of data. The only thing the logical block size
really affects is how we calculate the LBA and block counts in the
command we send to the device.

If N is not a multiple of the device's logical block size we'll simply
reject the I/O. If we receive an I/O that is misaligned or not a
multiple of the physical block size we let the drive do RMW. So there
isn't any "splitting" going on.

An I/O may be split if MD or DM is involved and the request straddles a
stripe chunk boundary. Because Linux generally does all I/O in terms of
4K pages, sub-page size splits are rare. Pretty much all the other
cases that would force us to split an I/O (typically controller DMA
constraints) operate on a page boundary.

To avoid the drive being forced to do RMW on the head and tail of a
misaligned I/O it is imperative that filesystems are aligned to the
physical block size of the underlying device. As has been pointed out,
the partitioning utilities generally make sure that's the case. If
there are no partitions then you're by definition aligned, unless the
drive has the infamous Windows XP jumper installed.

Anyway, the short answer is that Linux will pretty much always do I/O
in multiples of the system page size regardless of the logical block
size of the underlying device. There are a few exceptions to this, such
as direct I/O, legacy filesystems using bufferheads, and raw block
device access.

--
Martin K. Petersen
Oracle Linux Engineering
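The head/tail RMW point above can be sketched with a little arithmetic. This just counts misaligned boundaries; note that a sub-block I/O contained entirely within one physical block would really be a single RMW even though both of its ends are misaligned:

```shell
# Sketch: how many partial physical blocks does an I/O of LEN bytes at
# byte offset OFF touch on a drive with physical block size PBS? Each
# partial block is one the drive must read-modify-write.
rmw_blocks() {
    off=$1; len=$2; pbs=$3
    n=0
    [ $(( off % pbs )) -ne 0 ]           && n=$(( n + 1 ))  # misaligned head
    [ $(( (off + len) % pbs )) -ne 0 ]   && n=$(( n + 1 ))  # misaligned tail
    echo $n
}

rmw_blocks 0 4096 4096     # aligned 4K write on a 4K-physical drive
rmw_blocks 512 4096 4096   # XP-jumper-style misalignment: head and tail RMW
```

This is why aligning the filesystem (or partition) start to the physical block size makes the head/tail terms vanish for page-multiple I/O.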
Re: mkfs.btrfs vs fstrim on an SD Card (not SSD)
>>>>> "Chris" == Chris Murphy writes:

Chris> Since the SD Card spec references a completely different command
Chris> than the ATA spec (TRIM), I don't think either one of these are
Chris> TRIM, even if functionally equivalent. Instead the SD Card
Chris> ERASE_* commands are probably being used,

Indeed. Discard is our generic block layer abstraction that gets
translated into whichever command is appropriate for the device in
question (ACS DSM TRIM, SBC WRITE SAME/UNMAP, etc.).

Chris> but I can't confirm this because writes to /dev/mmcblk0 aren't
Chris> showing up with:
Chris>
Chris> echo scsi:scsi_dispatch_cmd_start > /sys/kernel/debug/tracing/set_event
Chris> echo 1 > /sys/kernel/debug/tracing/tracing_on
Chris> cat /sys/kernel/debug/tracing/trace_pipe

MMC doesn't go through SCSI like ATA does.

--
Martin K. Petersen
Oracle Linux Engineering
Re: [PATCH] btrfs-progs: mkfs: allow not to trim a device
>>>>> "David" == David Sterba writes:

>> Just curious: What's the use case for this?

David> http://digitalvampire.org/blog/index.php/2012/03/16/you-can-never-be-too-rich-or-too-thin/

Well, a cap of 1MB per UNMAP command is absolute crazy talk. We
currently do 2GB per command on ATA and 2TB per command on most SCSI
targets.

I don't disagree that it may make sense to have a disable-discard
option for mkfs. But Roland really needs to get his VPD reporting
fixed. I suspect what the array meant to communicate was a 1MB discard
granularity, not a 1MB per command limit... a common mistake.

--
Martin K. Petersen
Oracle Linux Engineering
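The arithmetic behind calling the 1MB cap crazy talk is worth spelling out: the number of commands needed to discard a whole device at each per-command limit. The 1TB device size below is a hypothetical example, not from the thread:

```shell
# Sketch: how many UNMAP/TRIM commands are needed to discard an entire
# device of DEV_BYTES, given a per-command byte limit.
cmds_needed() {
    dev_bytes=$1; per_cmd=$2
    echo $(( (dev_bytes + per_cmd - 1) / per_cmd ))   # ceiling division
}

TB=$(( 1024 * 1024 * 1024 * 1024 ))
cmds_needed "$TB" $(( 1024 * 1024 ))            # 1MB cap: over a million commands
cmds_needed "$TB" $(( 2 * 1024 * 1024 * 1024 )) # 2GB cap: a few hundred
```

At the 2TB-per-command limit of most SCSI targets, a single command covers the whole device, which is why a 1MB cap is three orders of magnitude off from a plausible per-command limit but entirely plausible as a granularity.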