Re: SSD Optimizations

2010-03-11 Thread Martin K. Petersen
>>>>> "Gordan" == Gordan Bobic  writes:

Gordan> I fully agree that it's important for wear leveling on flash
Gordan> media, but from the security point of view, I think TRIM would
Gordan> be a useful feature on all storage media. If the erased blocks
Gordan> were trimmed it would provide a potentially useful feature of
Gordan> securely erasing the sectors that are no longer used. It would
Gordan> be useful and much more transparent than the secure erase
Gordan> features that only operate on the entire disk. Just MHO.

Except there are no guarantees that TRIM does anything, even if the
drive claims to support it.

There are a couple of IDENTIFY DEVICE knobs that indicate whether the
drive deterministically returns data after a TRIM, and whether the
resulting data is zeroes.  We query these values and report them to the
filesystem.
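
For illustration, here is roughly how this looks from userland on a
TRIM-capable drive (device name and output are examples, not from any
particular product):

# hdparm -I /dev/sda | grep -i TRIM
   *    Data Set Management TRIM supported (limit 8 blocks)
   *    Deterministic read ZEROs after TRIM
# cat /sys/block/sda/queue/discard_zeroes_data
1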

However, testing revealed several devices that reported the right thing
but which did in fact return the old data afterwards.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: SSD Optimizations

2010-03-11 Thread Martin K. Petersen
>>>>> "Gordan" == Gordan Bobic  writes:

Gordan> SD == SSD with an SD interface.

No, not really.

It is true that you could conceivably fit a sophisticated controller
into an SD card form factor.  But the fact is that such a controller
takes up space which could otherwise be used for flash.  There may also
be power consumption and heat dissipation concerns.

Most SD card controllers have very, very simple wear leveling that in
most cases relies on the filesystem being FAT.  These cards are aimed at
cameras, MP3 players, etc. after all.  Consequently, it's trivial to
wear out an SD card by writing the same block over and over.

The same is kind of true for Compact Flash.  There are two types of
cards, I prefer to think of them as camera grade and industrial.  Camera
grade CF is really no different from SD cards or any other consumer
flash form factor.

Industrial CF cards have controllers with sophisticated wear leveling.
Usually this is not quite as clever as that of a "big" SSD, but it is
close enough that you can treat the device as a real disk drive.  For
instance, it has multiple channels working in parallel, unlike the
consumer devices.

As a result of the smarter controller logic and the bigger bank of spare
flash, industrial cards are much smaller in capacity, typically in the
1-4 GB range.  But they are in many cases indistinguishable from a real
SSD in terms of performance and reliability.


Gordan> You can make an educated guess. For starters given that visible
Gordan> sector sizes are not equal to FS block sizes, it means that FS
Gordan> block sizes can straddle erase block boundaries without the
Gordan> flash controller, no matter how fancy, being able to determine
Gordan> this. Thus, at the very least, aligning FS structures so that
Gordan> they do not straddle erase block boundaries is useful in ALL
Gordan> cases. Thinking otherwise is just sticking your head in the sand
Gordan> because you cannot be bothered to think.

There are no means of telling what the erase block size is.  None.  We
have no idea.  The vendors won't talk.  It's part of their IP.

Also, there is no point in getting hung up on the whole erase block
thing.  Only crappy SSDs use block mapping, where that matters.  These
devices will die a horrible death soon enough.  Good SSDs use a
technique akin to logging filesystems, in which the erase block size and
all other physical characteristics don't matter (from a host
perspective).

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: A couple of questions

2010-06-01 Thread Martin K. Petersen
>>>>> "Paul" == Paul Millar  writes:

Paul> My concern is that, if the server-software doesn't push the
Paul> client-provided checksum then the FS checksum (plus T-10 DIF/DIX)
Paul> would not provide a rigorous assurance that the bytes are the
Paul> same.  Without this assurance, corruption could still occur; for
Paul> example, within the server's memory.

For DIX we allow integrity metadata conversion.  Once the data is
received, the server generates appropriate IMD for the next layer.  Then
the server verifies that the original IMD matches the data buffer.  That
way there's no window of error.  But obviously the ideal case is where
the same IMD can be passed throughout the stack without conversion.

I'm not sure what you use for file service.  I believe NFSv4 allows for
checksums to be passed along, although I have not looked at them closely
yet.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: A couple of questions

2010-06-03 Thread Martin K. Petersen
>>>>> "Paul" == Paul Millar  writes:

Paul> Please correct me if I'm wrong here, but T10 DIF/DIX refers only
Paul> to data integrity protection from the OS's FS-level down to the
Paul> block device: a userland application doesn't know that it is
Paul> writing into a FS that is utilising DIX with a DIF-enabled storage
Paul> system.

My point was that it is possible to have different protection types in
play (and thus different checksums) as long as you overlap the
protection envelopes.  At the expense of having to calculate checksums
multiple times, of course.


Paul> Unfortunately, any such solution would be btrfs-specific, since (I
Paul> believe) no one has standardised how to extend T10 into userspace.

Not yet, but we're working on a generic interface that would allow the
protection information to be attached.  This is not going to be tied to
just T10 DIF.  The current Linux block layer integrity infrastructure
already handles different types of protection information.
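
As a rough sketch of what is already there: for a disk whose HBA and
firmware do DIF, the integrity profile the block layer selected is
visible in sysfs (assuming such a disk at /dev/sda; output is an
example):

# cat /sys/block/sda/integrity/format
T10-DIF-TYPE1-CRC
# cat /sys/block/sda/integrity/read_verify
1
# cat /sys/block/sda/integrity/write_generate
1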


Paul> I believe NFS currently doesn't support checksums (as per v4.1).
Paul> Looking into more detail, Alok Aggarwal gave a talk at 2006
Paul> connectathon about this.  Alok's slides have a nice diagram (slide
Paul> 11) showing the kind of end-to-end integrity I'm after.  The issue
Paul> is how to achieve the assurance between "NFS Server" and "Local
Paul> FS" on the right.

Paul> For NFS, I believe there aren't any plans for introducing checksum
Paul> support for v4.2.  Perhaps it'll appear with the later minor
Paul> versions of the standard.

I haven't looked into this for a long time.  Last time I talked to the
NFS folks they seemed to think it would be possible to bridge the two
methods.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: Future Linux filesystems

2008-06-03 Thread Martin K. Petersen
>>>>> "Joe" == Joe Peterson <[EMAIL PROTECTED]> writes:

Joe> You don't mention what I believe is the *key* issue (and I don't
Joe> think the author did either, but I skimmed his article): data
Joe> integrity.  I'm not talking about blatant failures or known need
Joe> for an fsck, but rather silent corruption.

We're very concerned about data integrity.  With btrfs everything is
checksummed at the logical level.  This allows you to detect data
corruption, repair bad blocks using redundant, good copies, perform
data scrubbing, etc.

A related, but orthogonal data integrity measure is the T10 DIF
infrastructure that I am working on.  DIF enables protection at the
sector level and includes stuff like a data checksum and a locality
check which ensures that the sector ends up in the right place on disk.

If there is a mismatch the I/O will be rejected by either the HBA or the
storage device.  That allows us to catch a lot of the corruption
scenarios where we accidentally write bad stuff to disk.

Right now the DIF checksum is added at the block layer level.  Work is
in progress to move it up into the filesystems and from there into
user space.  Eventually we'd like to be able to generate the checksum
in the application and pass it along the I/O path all the way out to
the physical disk.

-- 
Martin K. Petersen  Oracle Linux Engineering



Re: ssd optimised mode

2009-02-23 Thread Martin K. Petersen
>>>>> "Dmitri" == Dmitri Nikulin  writes:

Dmitri> That's excellent, but until consumer-level drives have the same
Dmitri> feature, the fact remains that consumer SSDs are a net loss in
Dmitri> reliability compared to consumer rotating disks,

My SSD testing has not been very promising in the data integrity
department.  I've got a couple of drives here which end up corrupting
stuff every time they are reset or lose power.

"Stuff" in this case is on the order of megabytes, not a few sectors
like on spinning media.  With disk drives the risk of garbling unrelated
files is there but relatively small.  On SSDs it's much higher because
of the big blocking and the high latency erase/rewrite cycle.  In
several cases I've lost system binaries that obviously weren't being
written when I crashed the system.  In one case I even lost all of
/sbin.


Dmitri> I'm just curious if there's anything that can be done in a
Dmitri> filesystem to minimise the damage of a lost eraseblock.

The problem is that we have no way of knowing what's inside each erase
block.  We don't even know how big the erase block is.

-- 
Martin K. Petersen  Oracle Linux Engineering



Re: ssd optimised mode

2009-02-23 Thread Martin K. Petersen
>>>>> "Claudio" == Claudio Martins  writes:

Claudio> What brand of SSDs are we talking about?

I tested two different products that both exhibited poor behavior during
frequent resets/power cycles.  I can't really disclose details but rest
assured I'm working with the vendors in question to get this fixed.
Just last Friday I received a couple of updated drives that appear to
work correctly.  But I haven't tested them extensively yet.


Claudio> What filesystems did you experiment with?

Some tests were done with ext3, but most used my own integrity checking
tooling and/or dt.

In the filesystem case it was trivial to map out the affected files in
debugfs and correlate those to a logical region on the disk.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: ssd optimised mode

2009-02-23 Thread Martin K. Petersen
>>>>> "Dmitri" == Dmitri Nikulin  writes:

Dmitri> Already SanDisk are offering a proprietary "Extreme FFS"
Dmitri> (perhaps even based on Unix FFS) for Windows Vista only.

Extreme FFS is SanDisk's next generation FTL/firmware.  It's not a
filesystem that plugs into the OS.

-- 
Martin K. Petersen  Oracle Linux Engineering



Re: ssd optimised mode

2009-02-23 Thread Martin K. Petersen
>>>>> "Dmitri" == Dmitri Nikulin  writes:

Dmitri> If that's the case, why is it marketed for Windows Vista only,
Dmitri> and referring to filesystem features like marking unused blocks?
Dmitri> Surely if it was at the device level it would be OS-neutral, and
Dmitri> marketed as such.

The article you posted references some benchmarketing numbers involving
Vista.  That does not imply it's a Windows-only product.

    http://en.wikipedia.org/wiki/ExtremeFFS

-- 
Martin K. Petersen  Oracle Linux Engineering



Re: [PATCH 02/13] scsi/osd: don't save block errors into req_results

2017-05-26 Thread Martin K. Petersen

Christoph,

> We will only have sense data if the command exectured and got a SCSI
> result, so this is pointless.

"executed"

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 02/35] block: add REQ_OP definitions and bi_op/op fields

2016-01-06 Thread Martin K. Petersen
>>>>> "Mike" == mchristi   writes:

+enum req_op {
+        REQ_OP_READ,
+        REQ_OP_WRITE      = REQ_WRITE,
+        REQ_OP_DISCARD    = REQ_DISCARD,
+        REQ_OP_WRITE_SAME = REQ_WRITE_SAME,
+};
+

I have been irked by the REQ_ prefix in bios since the flags were
consolidated a while back. When I attempted to fix the READ/WRITE mess I
used a BLK_ prefix as a result.

Anyway. Just bikeshedding...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 00/35 v2] separate operations from flags in the bio/request structs

2016-01-06 Thread Martin K. Petersen
>>>>> "Mike" == mchristi   writes:

Mike> The following patches begin to clean up the request->cmd_flags
Mike> and bio->bi_rw mess.  We currently use cmd_flags to specify the
Mike> operation, attributes and state of the request.  For bi_rw we use
Mike> it for similar info and also the priority, but then also have
Mike> another bi_flags field for state.  At some point, we abused them
Mike> so much we just made cmd_flags 64 bits, so we could add more.

Mike> The following patches separate the operation (read, write,
Mike> discard, flush, etc.) from cmd_flags/bi_rw.

Mike> This patchset was made against linux-next from today Jan 5 2016.
Mike> (git tag next-20160105).

Very nice work. Thanks for doing this!

I think it's a much needed cleanup. I focused mainly on the core block,
discard, write same and sd.c pieces and everything looks sensible to me.

I wonder what the best approach is to move a patch set with this many
stakeholders forward? Set a "speak now or forever hold your peace"
review deadline?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [dm-devel] [PATCH] block: add a bi_error field to struct bio

2015-06-04 Thread Martin K. Petersen
>>>>> "Christoph" == Christoph Hellwig  writes:

Christoph> The first one has the drawback of only communicating a single
Christoph> possible error (-EIO), and the second one has the drawback of
Christoph> not being persistent when bios are queued up, and are not
Christoph> passed along from child to parent bio in the ever more
Christoph> popular chaining scenario.

Christoph> So add a new bi_error field to store an errno value directly
Christoph> in struct bio and remove the existing mechanisms to clean all
Christoph> this up.

Having the error status separate from the bio has been a major headache.
I am entirely in favor of this patch.

It was a big chunk of changes to read through but I did not spot any
obvious problems or polarity reversals. It would be nice to get the
respective fs/md/target driver folks to check their portions, though.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

2019-02-21 Thread Martin K. Petersen


Keith,

> With respect to fs block sizes, one thing making discards suck is that
> many high capacity SSDs' physical page sizes are larger than the fs
> block size, and a sub-page discard is worse than doing nothing.

That ties into the whole zeroing as a side-effect thing.

The devices really need to distinguish between discard-as-a-hint where
it is free to ignore anything that's not a whole multiple of whatever
the internal granularity is, and the WRITE ZEROES use case where the end
result needs to be deterministic.
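
On recent kernels the two cases are at least plumbed separately; a quick
way to see what a device advertises for each (device name and values are
examples):

# cat /sys/block/sda/queue/discard_max_bytes
2147450880
# cat /sys/block/sda/queue/write_zeroes_max_bytes
33550336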

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

2019-02-21 Thread Martin K. Petersen


Jeff,

> We've always been told "don't worry about what the internal block size
> is, that only matters to the FTL."  That's obviously not true, but
> when devices only report a 512 byte granularity, we believe them and
> will issue discard for the smallest size that makes sense for the file
> system regardless of whether it makes sense (internally) for the SSD.
> That means 4k for pretty much anything except btrfs metadata nodes,
> which are 16k.

The devices are free to report a bigger discard granularity.  We already
support and honor that (for SCSI, anyway).  It's completely orthogonal to
the reported logical block size, although it obviously needs to be a
multiple of it.
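
Both values are exported in sysfs, e.g. (illustrative numbers for a
hypothetical drive):

# cat /sys/block/sda/queue/logical_block_size
512
# cat /sys/block/sda/queue/discard_granularity
4096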

The real problem is that vendors have zero interest in optimizing for
discard. They are so confident in their FTL and overprovisioning that
they don't view it as an important feature. At all.

Consequently, many of the modern devices that claim to support discard
to make us software folks happy (or to satisfy purchase order
requirements) complete the commands without doing anything at all.
We're simply wasting queue slots.

Personally, I think discard is dead on anything but the cheapest
devices.  And on those it is probably going to be
performance-prohibitive to use it in any other way than a weekly fstrim.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

2019-02-22 Thread Martin K. Petersen


Roman,

>> Consequently, many of the modern devices that claim to support
>> discard to make us software folks happy (or to satisfy purchase
>> order requirements) complete the commands without doing anything at
>> all.  We're simply wasting queue slots.
>
> Any example of such devices? Let alone "many"? Where you would issue a
> full-device blkdiscard, but then just read back old data.

I obviously can't mention names or go into implementation details. But
there are many drives out there that return old data. And that's
perfectly within spec.

At least some of the pain in the industry in this department can be
attributed to us Linux folks and RAID device vendors. We all wanted
deterministic zeroes on completion of DSM TRIM, UNMAP, or DEALLOCATE.
The device vendors weren't happy about that and we ended up with weasel
language in the specs. This led to the current libata whitelist mess
for SATA SSDs and ongoing vendor implementation confusion in SCSI and
NVMe devices.

On the Linux side the problem was that we originally used discard for
two distinct purposes: Clearing block ranges and deallocating block
ranges. We cleaned that up a while back and now have BLKZEROOUT and
BLKDISCARD. Those operations get translated to different operations
depending on the device. We also cleaned up several of the
inconsistencies in the SCSI and NVMe specs to facilitate making this
distinction possible in the kernel.
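
From userland the distinction is visible in util-linux's blkdiscard,
which can issue either operation. A sketch, with an example device name;
note that both commands destroy data:

# blkdiscard /dev/sdX       (BLKDISCARD: deallocate, contents undefined)
# blkdiscard -z /dev/sdX    (BLKZEROOUT: deterministic zeroes)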

In the meantime the SSD vendors made great strides in refining their
flash management. To the point where pretty much all enterprise device
vendors will ask you not to issue discards. The benefits simply do not
outweigh the costs.

If you have special workloads where write amplification is a major
concern it may still be advantageous to do the discards and reduce WA
and prolong drive life. However, these workloads are increasingly moving
away from the classic LBA read/write model. Open Channel originally
targeted this space. Right now work is underway on Zoned Namespaces and
Key-Value command sets in NVMe.

These curated application workload protocols are fundamental departures
from the traditional way of accessing storage. And my postulate is that
where tail latency and drive lifetime management are important, those new
command sets offer much better bang for the buck. And they make the
notion of discard completely moot. That's why I don't think it's going
to be terribly important in the long term.

This leaves consumer devices and enterprise devices using the
traditional LBA I/O model.

For consumer devices I still think fstrim is a good compromise. Lack of
queuing for DSM hurt us for a long time. And when it was finally added
to the ATA command set, many device vendors got their implementations
wrong. So it sucked for a lot longer than it should have. And of course
FTL implementations differ.
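
For reference, the batched approach boils down to something like the
following (mount point and output are illustrative; many systemd-based
distros ship a weekly fstrim.timer to the same effect):

# fstrim -v /
/: 2 GiB (2147483648 bytes) trimmed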

For enterprise devices we're still in the situation where vendors
generally prefer for us not to use discard. I would love for the
DEALLOCATE/WRITE ZEROES mess to be sorted out in their FTLs, but I have
fairly low confidence that it's going to happen. Case in point: Despite
a lot of leverage and purchasing power, the cloud industry has not been
terribly successful in compelling the drive manufacturers to make
DEALLOCATE perform well for typical application workloads. So I'm not
holding my breath...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: Is it possible for the ext4/btrfs file system to pass some context related info to low level block driver?

2011-05-03 Thread Martin K. Petersen
>>>>> "Yunpeng" == Gao, Yunpeng  writes:

Yunpeng> So, my question is, is there any plan or discussion on
Yunpeng> supporting this feature (passing data type info to low level
Yunpeng> block device driver) on file system developments? Especially
Yunpeng> for ext4/btrfs, since now they are very hot in Linux? Thanks.

Yes, I have been working on some changes that allow us to tag bios and
pass the information out to storage. These patches have been on the back
burner for a while due to other commitments. But I'll dig them out and
post them later. We just discussed them a couple of weeks ago at the
Linux Storage Workshop.

In the meantime: Can you point me to the relevant eMMC stuff so I can
see how many tiers or classes we have to work with there?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: discard synchronous on most SSDs?

2014-03-14 Thread Martin K. Petersen
>>>>> "Marc" == Marc MERLIN  writes:

Marc,

Marc> So I have Sata 3.1, that's great news, it means I can keep using
Marc> discard without worrying about performance and hangs

The fact that the drive reports compliance with a certain version of
SATA does not in any way imply that it implements all commands defined
in that specification.

The location where queued TRIM support is reported is somewhat unusual.
And last I looked hdparm -I had no infrastructure in place to report
stuff contained in log pages.

The kernel does look in the right place to determine whether to issue
the queued or unqueued variant.  But the information isn't exported to
userland.

So right now I'm afraid we don't have a good way for a user to determine
whether a device supports queued trims or not.

I guess we could consider adding an ATA-specific "I don't suck" flag in
sysfs, adding the missing code to hdparm, or both...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: discard synchronous on most SSDs?

2014-03-16 Thread Martin K. Petersen
>>>>> "Chris" == Chris Samuel  writes:

Chris> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)

Chris> Of course that's what the drive is reporting it supports, I'm not
Chris> sure whether that's the result of what has been negotiated
Chris> between the controller and drive or purely what the drive
Chris> supports.

It's just what the drive reports.  Often drives will implement features
before they are ratified in the spec and thus before they can claim
compliance with a specific version.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: discard synchronous on most SSDs?

2014-03-16 Thread Martin K. Petersen
>>>>> "Chris" == Chris Samuel  writes:

Chris> It looks like drives that do support it can be detected with the
Chris> kernel helper function ata_fpdma_dsm_supported() defined in
Chris> include/linux/libata.h.

Chris> I wonder if it would be possible to use that knowledge to extend
Chris> the smartctl's --identify functionality to report this?

Queued trim support is indicated in a log page and not the identify
information. However, we can get to the information we want using
smartctl's ability to look at log pages.

I don't have a single drive from any vendor in the lab that supports
queued trim, not even a prototype. I went out and bought an 840 EVO this
morning because the general lazyweb opinion seemed to indicate that this
drive supports queued trim. Well, it doesn't. At least not in the 120GB
version:

# smartctl -l gplog,0x13 /dev/sda
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.14.0-rc6+] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

General Purpose Log 0x13 does not exist (override with '-T permissive' option)

If there's a drive with a working queued trim implementation out there,
I'd like to know about it...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: Using BTRFS on SSD now ?

2014-06-05 Thread Martin K. Petersen
>>>>> "Duncan" == Duncan  <1i5t5.dun...@cox.net> writes:

Duncan> OTOH, certain high-performance hardware goes beyond the current
Duncan> standard and does a queued trim, without forcing a flush of the
Duncan> queue in the process.  But this hardware tends to be rather rare
Duncan> and expensive,

Queued trim has started to appear in consumer SSDs.  However, since
we're the only OS that supports it, the feature has gotten off to a
bumpy start.  We tried to enable it on a drive model that passed testing
here, but we had to revert to unqueued when bug reports started rolling
in this week.

Duncan> (FWIW, in new enough versions of smartctl, smartctl -i will have
Duncan> a "SATA Version is:" line, but even my newer Corsair Neutrons
Duncan> report only SATA 2.5, so obviously they don't support queued
Duncan> trim by the standard, tho it's still possible they implement it
Duncan> beyond the standard, I simply don't know.)

The reported SATA version is in no way indicative of whether a drive
supports queued trim. The capability flag was put in a highly unusual
place in the protocol. I posted a patch that makes this information
in the protocol. I posted a patch that makes this information available
in sysfs a while back. However, the patch is currently being reworked to
support a debug override...

Until then, the following command will give you the answer:

# smartctl -l gplog,0x13 /dev/sda | grep :
000: 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ||
     ^^          ^^
These two 01 fields indicate that the drive supports queued trim.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: Using BTRFS on SSD now ?

2014-06-08 Thread Martin K. Petersen
>>>>> "Pavel" == Pavel Volkov  writes:

Pavel> Is mine a dinosaur drive?
[...]
Pavel> General Purpose Log 0x13 does not exist (override with '-T
Pavel> permissive' option)

The only flags defined in general purpose log page 0x13 are the queued
trim ones. So there is no compelling reason for a drive vendor to
implement that page unless the drive actually supports queued trim. And
consequently it's perfectly normal for that page to be absent.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: btrfs on whole disk (no partitions)

2014-06-23 Thread Martin K. Petersen
>>>>> "Chris" == Chris Murphy  writes:

Chris> Does anyone know if blktrace will intercept the actual SCSI
Chris> commands sent to the drive? Or is there a better utility to use
Chris> for this? When I use it unfiltered, I'm not seeing SCSI write
Chris> commands at all.

# echo scsi:scsi_dispatch_cmd_start > /sys/kernel/debug/tracing/set_event 
# echo 1 > /sys/kernel/debug/tracing/tracing_on
[do stuff]
# cat /sys/kernel/debug/tracing/trace

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: btrfs on whole disk (no partitions)

2014-06-23 Thread Martin K. Petersen
>>>>> "Duncan" == Duncan  <1i5t5.dun...@cox.net> writes:

Duncan> Tho as you point out elsewhere, levels under the filesystem
Duncan> layer may split the btrfs 4096 byte block size into 512 byte
Duncan> logical sector sizes if appropriate, but that has nothing to do
Duncan> with btrfs except that it operates on top of that.

The notion of "splitting into a different block size" is a bit
confusing. The filesystem submits an N-byte I/O. Whether the logical
block size is 512 or 4096 doesn't really matter. We're still
transferring N bytes of data. The only thing the logical block size
really affects is how we calculate the LBA and block counts in the
command we send to the device. If N is not a multiple of the device's
logical block size we'll simply reject the I/O. If we receive an I/O
that is misaligned or not a multiple of the physical block size we let
the drive do RMW. So there isn't any "splitting" going on.

An I/O may be split if MD or DM is involved and the request straddles a
stripe chunk boundary. Because Linux generally does all I/O in terms of
4K pages, sub-page size splits are rare. Pretty much all the other cases
that would force us to split an I/O (typically controller DMA
constraints) operate on a page boundary.

To avoid the drive being forced to do RMW on the head and tail of a
misaligned I/O it is imperative that the filesystems are aligned to the
physical block size of the underlying device. As has been pointed out
the partitioning utilities generally make sure that's the case. If there
are no partitions then you're by definition aligned unless the drive has
the infamous Windows XP jumper installed.
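
Checking this is cheap. For a hypothetical 512e drive (device name and
output are examples):

# cat /sys/block/sda/queue/logical_block_size
512
# cat /sys/block/sda/queue/physical_block_size
4096
# parted /dev/sda align-check optimal 1
1 aligned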

Anyway. The short answer is that Linux will pretty much always do I/O in
multiples of the system page size regardless of the logical block size
of the underlying device. There are a few exceptions to this such as
direct I/O, legacy filesystems using bufferheads and raw block device
access.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: mkfs.btrfs vs fstrim on an SD Card (not SSD)

2014-08-22 Thread Martin K. Petersen
>>>>> "Chris" == Chris Murphy  writes:

Chris> Since the SD Card spec references a completely different command
Chris> than the ATA spec (TRIM), I don't think either one of these are
Chris> TRIM, even if functionally equivalent. Instead the SD Card
Chris> ERASE_* commands are probably being used,

Indeed. Discard is our generic block layer abstraction that gets
translated into whichever command is appropriate for the device in
question (ACS DSM TRIM, SBC WRITE SAME/UNMAP, etc.).
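
The per-device result of that translation can be inspected with lsblk
(output is illustrative):

# lsblk -D /dev/mmcblk0
NAME    DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
mmcblk0        0        4M       4G         0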

Chris> but I can't confirm this because writes to /dev/mmcblk0 aren't
Chris> showing up with:

Chris> echo scsi:scsi_dispatch_cmd_start > /sys/kernel/debug/tracing/set_event
Chris> echo 1 > /sys/kernel/debug/tracing/tracing_on
Chris> cat /sys/kernel/debug/tracing/trace_pipe

MMC doesn't go through SCSI like ATA does. 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] btrfs-progs: mkfs: allow not to trim a device

2012-03-20 Thread Martin K. Petersen
>>>>> "David" == David Sterba  writes:

>> Just curious: What's the use case for this?

David> http://digitalvampire.org/blog/index.php/2012/03/16/you-can-never-be-too-rich-or-too-thin/

Well, a cap of 1MB per UNMAP command is absolute crazy talk. We
currently do 2GB per command on ATA and 2TB per command on most SCSI
targets.
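
For a SCSI device those limits come from the Block Limits VPD page,
which sg3_utils can dump. Abridged, with an example device and
illustrative values:

# sg_vpd --page=bl /dev/sdb
Block limits VPD page (SBC):
  Maximum unmap LBA count: 4194304
  Maximum unmap block descriptor count: 1
  Optimal unmap granularity: 2048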

I don't disagree that it may make sense to have a disable discard option
to mkfs. But Roland really needs to get his VPD reporting fixed. I
suspect what the array meant to communicate was a 1MB discard
granularity, not a 1MB per command limit...a common mistake.

-- 
Martin K. Petersen  Oracle Linux Engineering