Re: [PATCH 00/20] drop useless LIST_HEAD

2018-12-28 Thread Darrick J. Wong
On Thu, Dec 27, 2018 at 04:40:55PM +0300, Dan Carpenter wrote:
> On Tue, Dec 25, 2018 at 11:12:20PM +0100, Tom Psyborg wrote:
> > there was discussion about this just some days ago. CC 4-5 lists is
> > more than enough
> > 
> 
> I don't know who you were discussing this with...
> 
> You should CC the 0th patch to all the mailinglists.  That much is a
> clear rule.
> 
> For the rest, Julia's position is the more conservative one.  I was in
> a conversation in RL and they were like, "CC everyone for all the
> patches".  It depends on the context, of course.  If the patches are
> dependent on each other then you *have* to CC everyone for everything.

Agreed.  Ms. Lawall, sending "Cover letter + all relevant XFS patches"
(as you did) was exactly the right thing for us xfs types. :)

For that matter, we would rather receive more patches than necessary
through linux-xfs (send the entire series if you are unsure) than go
wanting for more context.

--D

> If we really have other clear rules, then it should be encoded into
> get_maintainer.pl so that it's automatic.
> 
> My other question is why do the linux-arm-ker...@lists.infradead.org
> people feel like they need to be CC'd about every driver???  I always
> remove them from the CC list unless it's an arch/arm issue.
> 
> regards,
> dan carpenter
> 
> PS:  Please, no more top posting.
> 


Re: [LSF/MM TOPIC] Patch Submission process and Handling Internal Conflict

2018-01-24 Thread Darrick J. Wong
On Wed, Jan 24, 2018 at 01:36:00PM -0800, James Bottomley wrote:
> On Wed, 2018-01-24 at 11:20 -0800, Mike Kravetz wrote:
> > On 01/24/2018 11:05 AM, James Bottomley wrote:
> > > 
> > > > I've got two community style topics, which should probably be
> > > > discussed in the plenary
> > > 
> > > 1. Patch Submission Process
> > > 
> > > Today we don't have a uniform patch submission process across
> > > Storage, Filesystems and MM.  The question is should we (or at
> > > least should we adhere to some minimal standards).  The standard
> > > we've been trying to hold to in SCSI is one review per accepted
> > > non-trivial patch.  For us, it's useful because it encourages
> > > driver writers to review each other's patches rather than just
> > > posting and then complaining their patch hasn't gone in.  I can
> > > certainly think of a couple of bugs I've had to chase in mm where
> > > the underlying patches would have benefited from review, so I'd
> > > > like to discuss making the one review per non-trivial patch our base
> > > minimum standard across the whole of LSF/MM; it would certainly
> > > serve to improve our Reviewed-by statistics.
> > 
> > Well, the mm track at least has some discussion of this last year:
> > https://lwn.net/Articles/718212/
> 
> The pushback in your session was mandating reviews would mean slowing
> patch acceptance or possibly causing the dropping of patches that
> couldn't get reviewed.  Michal did say that XFS didn't have the
> problem, however there not being XFS people in the room, discussion
> stopped there.

I actually /was/ lurking in the session, but a year later I have more
thoughts:

Now that I've been maintainer for more than a year I feel more confident
in actually talking about our review processes, though I can only speak
about my own experiences and hope the other xfs developers chime in if
they choose.

In xfs we are fortunate enough that most of the codebase is at least
one software layer up from the raw hardware, which means that anybody
can build xfs with all kconfig options enabled and use it to try to
create all possible metadata structures, which means that the ability to
review a given patch and try it out isn't restricted to the subset of
people with a particular hardware device.  This means that there aren't
any patches that cannot be reviewed, which is not something I'm so sure
of for the mm layer.

Requiring review on the vast majority of non-maintainer patches that
go into xfs (and xfsprogs) doesn't have the effect of increasing the
time to upstream acceptance, since the fact that it was committed at all
implies that the maintainer probably looked at it.

The dangerous part of course is when the maintainer commits non-trivial
code without a review -- did they look at it, or just commit whatever
made the symptoms go away?  So that's argument #1 for creating a group
norm that yes, everyone should be involved in review on a semi regular
basis.  Certainly if they're also *submitting* patches.

Argument #2 is that encouraging review of everything most likely reduces
the overall time it takes for a feature to mature because that means
that at least one of the regular participants in the group has taken
the time to read and understand how the patches mesh with the existing
systems and will ask questions when they see ill-fitting pieces.  It
definitely reduces code churn from not having to walk back bad patches
and rushed microcode updates.  That said, I've no data to back up this
assertion, merely my observations of the past decade.

My third argument is that the most time consuming part of
maintainership isn't gluing patches onto a git tree and running tests,
it's reviewing the patches.  It's a big help to know that other people
who are more familiar with various subcomponents of xfs review patches
regularly, so I don't feel as much pressure to know all things at all
times, and I worry less about blind spots because we work as a group of
people who don't see every xfs component in exactly the same way.

(Granted it helps that Dave Chinner is a fountain of historical context
indexing...)

That said, I also get really itchy to commit my own patches at times,
especially things that look like trivial one-liners.  However, I find
that nothing in xfs is simple, and moreover the reviewers are
knowledgeable enough that even trivial patches can get reviewed quickly.

For bigger things like new features or large refactorings, there's a
strong need for updating documentation like the disk format
specification, developing a test plan, and integrating new tests into
xfstests.  That's where review is most useful, because it is the
submitter's opportunity to increase everyone's knowledge levels.  It is
also the reviewers' chance to anticipate design problems when it is
easy/cheap to fix them, and for everyone to build confidence about the
code that's going in.

The challenge for everyone, then, is to get together to decide on a
reasonable target for the amount and the l

Re: [trivial PATCH] treewide: Align function definition open/close braces

2017-12-18 Thread Darrick J. Wong
0644
> --- a/drivers/message/fusion/mptsas.c
> +++ b/drivers/message/fusion/mptsas.c
> @@ -2968,7 +2968,7 @@ mptsas_exp_repmanufacture_info(MPT_ADAPTER *ioc,
>   mutex_unlock(&ioc->sas_mgmt.mutex);
>  out:
>   return ret;
> - }
> +}
>  
>  static void
>  mptsas_parse_device_info(struct sas_identify *identify,
> diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
> index 3dd973475125..0ea141ece19e 100644
> --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
> +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
> @@ -603,7 +603,7 @@ static struct uni_table_desc *nx_get_table_desc(const u8 *unirom, int section)
>  
>  static int
>  netxen_nic_validate_header(struct netxen_adapter *adapter)
> - {
> +{
>   const u8 *unirom = adapter->fw->data;
>   struct uni_table_desc *directory = (struct uni_table_desc *) &unirom[0];
>   u32 fw_file_size = adapter->fw->size;
> diff --git a/drivers/net/wireless/ath/ath9k/xmit.c b/drivers/net/wireless/ath/ath9k/xmit.c
> index bd438062a6db..baedc7186b10 100644
> --- a/drivers/net/wireless/ath/ath9k/xmit.c
> +++ b/drivers/net/wireless/ath/ath9k/xmit.c
> @@ -196,7 +196,7 @@ ath_tid_pull(struct ath_atx_tid *tid)
>   }
>  
>   return skb;
> - }
> +}
>  
>  static struct sk_buff *ath_tid_dequeue(struct ath_atx_tid *tid)
>  {
> diff --git a/drivers/platform/x86/eeepc-laptop.c b/drivers/platform/x86/eeepc-laptop.c
> index 5a681962899c..4c38904a8a32 100644
> --- a/drivers/platform/x86/eeepc-laptop.c
> +++ b/drivers/platform/x86/eeepc-laptop.c
> @@ -492,7 +492,7 @@ static void eeepc_platform_exit(struct eeepc_laptop *eeepc)
>   * potentially bad time, such as a timer interrupt.
>   */
>  static void tpd_led_update(struct work_struct *work)
> - {
> +{
>   struct eeepc_laptop *eeepc;
>  
>   eeepc = container_of(work, struct eeepc_laptop, tpd_led_work);
> diff --git a/drivers/rtc/rtc-ab-b5ze-s3.c b/drivers/rtc/rtc-ab-b5ze-s3.c
> index a319bf1e49de..ef5c16dfabfa 100644
> --- a/drivers/rtc/rtc-ab-b5ze-s3.c
> +++ b/drivers/rtc/rtc-ab-b5ze-s3.c
> @@ -648,7 +648,7 @@ static int abb5zes3_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alarm)
>   ret);
>  
>   return ret;
> - }
> +}
>  
>  /* Enable or disable battery low irq generation */
>  static inline int _abb5zes3_rtc_battery_low_irq_enable(struct regmap *regmap,
> diff --git a/drivers/scsi/dpt_i2o.c b/drivers/scsi/dpt_i2o.c
> index fd172b0890d3..a00d822e3142 100644
> --- a/drivers/scsi/dpt_i2o.c
> +++ b/drivers/scsi/dpt_i2o.c
> @@ -3524,7 +3524,7 @@ static int adpt_i2o_systab_send(adpt_hba* pHba)
>  #endif
>  
>   return ret; 
> - }
> +}
>  
>  
>  
> /*
> diff --git a/drivers/scsi/sym53c8xx_2/sym_glue.c b/drivers/scsi/sym53c8xx_2/sym_glue.c
> index 791a2182de53..7320d5fe4cbc 100644
> --- a/drivers/scsi/sym53c8xx_2/sym_glue.c
> +++ b/drivers/scsi/sym53c8xx_2/sym_glue.c
> @@ -1393,7 +1393,7 @@ static struct Scsi_Host *sym_attach(struct scsi_host_template *tpnt, int unit,
>   scsi_host_put(shost);
>  
>   return NULL;
> - }
> +}
>  
>  
>  /*
> diff --git a/fs/locks.c b/fs/locks.c
> index 21b4dfa289ee..d2399d001afe 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -559,7 +559,7 @@ static const struct lock_manager_operations lease_manager_ops = {
>   * Initialize a lease, use the default lock manager operations
>   */
>  static int lease_init(struct file *filp, long type, struct file_lock *fl)
> - {
> +{
>   if (assign_type(fl, type) != 0)
>   return -EINVAL;
>  
> diff --git a/fs/ocfs2/stack_user.c b/fs/ocfs2/stack_user.c
> index dae9eb7c441e..d2fb97b173da 100644
> --- a/fs/ocfs2/stack_user.c
> +++ b/fs/ocfs2/stack_user.c
> @@ -398,7 +398,7 @@ static int ocfs2_control_do_setnode_msg(struct file *file,
>  
>  static int ocfs2_control_do_setversion_msg(struct file *file,
>  struct ocfs2_control_message_setv *msg)
> - {
> +{
>   long major, minor;
>   char *ptr = NULL;
>   struct ocfs2_control_private *p = file->private_data;
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 0da80019a917..217108f765d5 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2401,7 +2401,7 @@ static bool
>  xfs_agf_verify(
>   struct xfs_mount *mp,
>   struct xfs_buf  *bp)
> - {
> +{
>   struct xfs_agf  *agf = XFS_BUF_TO_AGF(bp);
>  

Re: [PATCH 0/3] Improve block device testing coverage

2017-03-31 Thread Darrick J. Wong
On Fri, Mar 31, 2017 at 03:11:28PM +, Bart Van Assche wrote:
> On Fri, 2017-03-31 at 13:02 +0300, Dmitry Monakhov wrote:
> > Another good example may be a bug with dirty page cache after blkdiscard
> > https://lkml.org/lkml/2017/3/22/789 . This simple bug results in a crappy
> > fsimage if mkfs relies on discard_zeroes_data behaviour.
> > So IMHO basic blkdev test coverage is important for filesystem testing, i.e.
> > important for xfstests.
> 
> Mixing up filesystem tests and block layer / block driver tests in the same
> directory is completely wrong. Block driver developers will be primarily
> interested in the block tests and may want to skip the filesystem tests.
> Filesystem developers will probably run the block tests only once and will
> likely run the filesystem tests repeatedly. Mixing up different kinds of
> tests in the same directory makes it unnecessarily hard to run block and
> filesystem tests separately.

During LSF I had started to wonder if we should just create a new
FSTYP=blockdev fs type with a no-op mkfs & mount.  "_require_fs generic"
could be taught to ignore FSTYP=blockdev; blockdev tests that should
work on all block devices can stay in tests/generic, and blockdev tests
that require specific features or complicated setup can go in
tests/blockdev.

The benefit (for the fs developers, anyway) of having complex block
device setup code helper functions in common/ is that then we can also
start writing tests to see how the fs reacts with more complex storage
setups.  We already have some of that for dm_{thin,flakey,delay,error}.

That way we keep the tests together and make it easy to run them (when
applicable) as part of regular fs testing, and avoid the situation where
bdevtests and xfstests slowly drift apart in terms of behaviors and
command line switches.

The downside ofc is the potential for bloat. :)

(The blockdev fallocate tests fit the fs/block split awkwardly --
they call what is nominally a fs feature on something that isn't itself
a filesystem...)

 Just my 5c.

--D

> 
> Bart.
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BLKZEROOUT not zeroing md dev on VMDK

2016-05-26 Thread Darrick J. Wong
On Wed, May 18, 2016 at 11:39:30PM +0100, Sitsofe Wheeler wrote:
> Hi,
> 
> With Ubuntu's 4.4.0-22-generic kernel and a Fedora 23
> 4.6.0-1.vanilla.knurd.1.fc23.x86_64 kernel I've found that the
> BLKZEROOUT syscall can malfunction and not zero data.
> 
> When BLKZEROOUT is issued to an MD device atop a PVSCSI controller
> supplied VMDK from ESXi 6.0 the call returns immediately and with a zero
> return code. Unfortunately, inspecting the data on the MD device shows
> that it has not been zeroed and is in fact untouched. The easiest way to
> see this behaviour is to boot the VM, create an mdadm device atop
> /dev/sd?, scribble some non-zero value on the disk and then use
> blkdiscard --zeroout /dev/md??? . If you then inspect the MD disk (e.g.
> with hexdump) you will still see the old data and using POSIX_FADV_DONTNEED
> on the MD device doesn't change the outcome.
> 
> The only clue I've seen is that
> /sys/block/sd?/queue/write_same_max_bytes starts out being 33553920 but
> after a WRITE SAME is issued it becomes 0. If the MD device is created
> after write_same_max_bytes has become 0 on the backing disk then
> BLKZEROOUT seems to work correctly.

It's possible that the pvscsi device advertised WRITE SAME, but if the device
sends back ILLEGAL REQUEST then the SCSI disk driver will set
write_same_max_bytes=0.  Subsequent BLKZEROOUT attempts will then issue writes
of zeroes to the drive.
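
A quick way to check whether a given stack is affected is to issue the
ioctl and read the range straight back.  What follows is only a userspace
sketch under assumptions: buf_is_zero() and zeroout_and_verify() are
made-up helper names, the range is assumed to be 512-byte aligned, and fd
should be opened with O_DIRECT if the read-back has to bypass the page
cache.  Only BLKZEROOUT itself comes from <linux/fs.h>.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKZEROOUT */

/* Return 1 if every byte in buf[0..len) is zero, else 0. */
static int buf_is_zero(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i])
			return 0;
	return 1;
}

/*
 * Hypothetical helper: zero [start, start + len) on the block device
 * behind fd, then read the range back and check it.  Returns 0 if the
 * range reads back as zeroes, 1 if stale data is still visible, and -1
 * on any ioctl, allocation or read failure.  The buffer is 4096-byte
 * aligned so that it also works on an O_DIRECT descriptor.
 */
static int zeroout_and_verify(int fd, uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };
	unsigned char *buf;
	void *mem = NULL;
	uint64_t done = 0;
	int ret = 0;

	if (ioctl(fd, BLKZEROOUT, range) < 0)
		return -1;

	if (posix_memalign(&mem, 4096, 4096)) {
		errno = ENOMEM;
		return -1;
	}
	buf = mem;

	while (done < len && ret == 0) {
		size_t want = len - done < 4096 ? (size_t)(len - done) : 4096;
		ssize_t got = pread(fd, buf, want, (off_t)(start + done));

		if (got <= 0)
			ret = -1;	/* read error or unexpected EOF */
		else if (!buf_is_zero(buf, (size_t)got))
			ret = 1;	/* old data survived the zeroout */
		else
			done += (uint64_t)got;
	}
	free(buf);
	return ret;
}
```

A return value of 1 right after a "successful" BLKZEROOUT would match the
behaviour reported above.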

--D

> 
> -- 
> Sitsofe | http://sucs.org/~sits/
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lsf] LSF/MM Schedule and improving discard support

2016-04-13 Thread Darrick J. Wong
On Wed, Apr 13, 2016 at 09:51:04AM -0700, James Bottomley wrote:
> On Wed, 2016-04-13 at 09:29 -0700, Bart Van Assche wrote:
> > On 04/13/2016 09:21 AM, Martin K. Petersen wrote:
> > > From a filesystem/ioctl perspective, BLKDISCARD is a hint. We should
> > > not be rounding off or aligning anything.
> > 
> > Hello Martin,
> > 
> > > Today if a BLKDISCARD ioctl passes a non-aligned start and/or end
> > > sector to the kernel then the block layer will submit invalid
> > > (non-aligned) REQ_DISCARD requests to the block driver the ioctl
> > > applies to. This is not acceptable. Does the above mean that you are
> > > proposing to fail such BLKDISCARD ioctls with an error code?
> 
> The answer would be of course not.  discard is a hint so malformed
> discard gets ignored by the device and success is returned because you
> can't oblige devices to obey hints (that's why they're called hints).

Agree.  For blockdev FALLOC_FL_PUNCH_HOLE I think we can simply check for
logical block size ("lbs") alignment and then pass the request to the
device with the understanding that it can do as it pleases.  We asked the
device to try to deallocate blocks, and perhaps it cannot.
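
The lbs check itself is tiny.  Here is a sketch of what it could look
like; punch_hole_check_lbs() is a hypothetical name, lbs is assumed to be
a power of two, and returning -EINVAL for a misaligned request is an
assumption rather than a settled decision.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Accept a punch-hole request on a block device only if both offset and
 * length are multiples of the logical block size.  Returns 0 if the
 * request may be passed down to the device, -EINVAL otherwise.
 */
static int punch_hole_check_lbs(uint64_t offset, uint64_t len, uint32_t lbs)
{
	uint64_t mask;

	/* The mask trick below only works for powers of two. */
	if (lbs == 0 || (lbs & (lbs - 1)))
		return -EINVAL;

	mask = (uint64_t)lbs - 1;
	if ((offset | len) & mask)
		return -EINVAL;

	return 0;
}
```

Nothing here rounds or trims the range; once the check passes, the device
is free to deallocate as much or as little as it likes.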

Just to be clear, this only applies to zeroing discard; the "discard and who
knows what you can now read back" thing that nobody likes has been temporarily
wired up to FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE. :)

> However, the problem of needing a mandatory discard for scrubbing
> blocks is part of the fallocate discussion, I think.

The third fallocate mode (FALLOC_FL_ZERO_RANGE) doesn't fit with the phrase
"mandatory discard for scrubbing blocks", though if one removed "discard" from
that phrase then it would.  The only thing that ZERO_RANGE guarantees is that
subsequent reads return zeroes.  XFS punches the entire range and reallocates
it with unwritten extents; ext4 fills the holes in the range with unwritten
extents and converts real extents to unwritten.  Both also write zeroes to any
part of the range that doesn't align to an FS block.

Yes, I think there are several questions to resolve here for mandatory zeroing
with FALLOC_FL_ZERO_RANGE (summarizing the issues I've come up with so far):

a) Should blockdev fallocate accept byte-granular offset/length arguments, even
if it has to use the page cache to write zeroes to the device?  This is what
file fallocate does today.

b) If blockdev fallocate does impose alignment requirements, should it return
EINVAL to a request that isn't aligned to the logical block size?

c) If a device really really prefers that its requests are aligned to
min_io_size (which can be much larger than the logical block size), should it
reject requests that aren't aligned to min_io?  Or perhaps it should take care
of the alignment problems on its own somehow?

For allocate mode (the thing Mike Snitzer brought up in another thread
yesterday), the alignment problems are much easier because we're allowed to
round the start down and the end up to fit whatever alignment we require.
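
For the allocate case, "round the start down and the end up" can be
sketched as below.  struct byte_range and round_out() are illustrative
names only, the range is treated as half-open [start, end), and align is
assumed to be a power of two such as the logical block size or
min_io_size.

```c
#include <stdint.h>

/* Half-open byte range [start, end). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Grow [start, end) outward to the nearest multiples of align.
 * Assumes align is a power of two and that end + align - 1 does not
 * overflow a uint64_t.
 */
static struct byte_range round_out(uint64_t start, uint64_t end,
				   uint64_t align)
{
	struct byte_range r;
	uint64_t mask = align - 1;

	r.start = start & ~mask;		/* round down */
	r.end = (end + mask) & ~mask;		/* round up */
	return r;
}
```

The same trick would not be safe for the zeroing modes, because growing
the range would zero bytes the caller never named.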

Should we promote this to a storage track session at LSF next week?

--D

> 
> James
> 
> 


Please submit specific discussion proposals for the File & Storage miniconf at LPC2015

2015-06-04 Thread Darrick J. Wong
Hi folks,

Well, we made it!  As of yesterday, the File & Storage systems microconf has
been approved for Plumbers!  If you're interested in attending, I highly
recommend that you register[0] immediately, as the earlybird deadline is
tomorrow, June 5th.

We have a solid list of discussion ideas on the wiki page[1], and three hours
in which to conduct those discussions!  If you are interested in leading one of
the three hourlong sessions, it is now time to submit[2] a specific proposal
for consideration.  People selected to be session leaders can have their
registrations changed to the speaker package even after registering.

Proposals needn't be strictly limited to the fourteen bullet points on the wiki
page.  I will try to have the three key discussions lined up by the end of the
month, so please send in proposals!

--Darrick

[0] 
https://www.regonline.com/register/login.aspx?eventID=1623891&MethodId=0&EventsessionId=
[1] http://wiki.linuxplumbersconf.org/2015:file_and_storage_systems
[2] https://linuxplumbersconf.org/2015/ocw/events/LPC2015/proposals


Re: [dm-devel] Proposal for annotating _unstable_ pages

2015-05-22 Thread Darrick J. Wong
On Thu, May 21, 2015 at 09:21:12PM +0200, Jan Kara wrote:
> On Thu 21-05-15 11:09:55, Kent Overstreet wrote:
> > On Thu, May 21, 2015 at 06:54:53PM +0200, Jan Kara wrote:
> > > On Wed 20-05-15 18:04:40, Kent Overstreet wrote:
> > > > > > Yeah.  I never figured out a sane way to migrate pages and keep
> > > > > > everything else happy.  Daniel Phillips is having a go at page
> > > > > > forking for tux3; let's see if the questions about that get
> > > > > > resolved.
> > > > 
> > > > That would be great, we need something.
> > > > 
> > > > > I'd also be really curious what btrfs is doing today - is it just
> > > > > bouncing everything internally, or did they come up with something
> > > > > more clever?
> > > 
> > > Btrfs is just waiting for IO to complete.
> > > 
> > > > > > > Also, there's probably always going to be situations where we're
> > > > > > > reading or writing to pages user space can stomp on (dio) - IMO
> > > > > > > we need to add a bio flag to annotate this - "if you need this to
> > > > > > > be stable you have to bounce it".  Otherwise either
> > > > > > > filesystems/block drivers are going to be stuck bouncing
> > > > > > > everything, or it'll just (continue to be) buggy.
> > > > > 
> > > > > > Well, for now there's BIO_SNAP_STABLE that forces the block layer
> > > > > > to bounce it, but right now ext3 is the last user of it, and
> > > > > > afaict btrfs is the only other FS that takes care of stable pages
> > > > > > on its own.
> > > > 
> > > > > I have no idea what BIO_SNAP_STABLE was supposed to be for, but I
> > > > > don't see how it's useful for anything sane.
> > > 
> > > > It's for the case where lower layer requests it needs stable pages but
> > > > upper layer isn't able to provide them (as is the case of ext3). Then
> > > > block layer bounces the data for the caller.
> > > 
> > > > > But that's the complete opposite of the problem stable pages are
> > > > > supposed to solve: stable pages are for when the _lower_ layer (be
> > > > > it filesystem, bcache, md, lvm) needs the memory being either read
> > > > > to or written from (both, it's not just writes) to not be diddled
> > > > > over while the IO is in flight.
> > > > > 
> > > > > Now, a point that I think has been missed is that stable pages are
> > > > > _not_ a complete solution, at least for consumers in the block
> > > > > layer.
> > > > > 
> > > > > The situation today is that if I'm in the block layer, and I get
> > > > > handed a read or write bio, I _don't know_ if it's from something
> > > > > that's going to diddle over those pages or not. So if I require
> > > > > stable pages - be it for data checksumming or for other things -
> > > > > I've just got to bounce the bio myself.
> > > > > 
> > > > > And then the really annoying thing is that if you've got stacked
> > > > > things that all need stable pages (maybe btrfs on top of bcache on
> > > > > top of md) - they _all_ have to assume the pages aren't going to be
> > > > > stable, so if they need them they _all_ have to bounce - even though
> > > > > once the first layer bounced the bio that made it stable for
> > > > > everything underneath it.
> > > 
> > > The current design is that if you need stable pages for your device, set
> > > bdi capability BDI_CAP_STABLE_WRITES, fs then takes care of not scribbling
> > > over your page while it is under writeback or uses BIO_SNAP_STABLE if it
> > > cannot.
> > 
> > > But if I need stable pages, I still have to bounce because that _does not_
> > > guarantee stable pages, it only gives me stable pages for some of the IOs
> > > and in the lower layers you can't tell which is which.
> > 
> > Do you see the problem? What good is BDI_CAP_STABLE_WRITES if it's not a
> > guarantee and I can't tell if I need to bounce or not?
>   So fix the upper layers to make it a guarantee? You mentioned direct IO
> needs fixing. Anything else?

Back when I was writing the stable pages patches, I observed that some of the
filesystems didn't hold the pages containing their own metadata stable during
writeback on a stable-writes device.  The journalling filesystems were fine
because they had various means to take care of that.

ISTR ext2 and vfat were the biggest culprits, but both maintainers rejected
the patches to fix that behavior.  This might no longer be the case; those
patches were so long ago I can't find them in Google.

--D

> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR
> 
> --
> dm-devel mailing list
> dm-de...@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel


LPC2015: File and Storage Systems uconf

2015-04-03 Thread Darrick J. Wong
Hi everyone,

Linux Plumbers is coming up in just four months!  I would like for there to be
a file & storage miniconf at this year's LPC, so I've started assembling a plan
for what we might discuss.  As a starting point, I've filled the planning page
with the topics that didn't achieve any sort of resolution at LSF/MM:

http://wiki.linuxplumbersconf.org/2015:file_and_storage_systems

There are undoubtedly things that I missed in my initial list, and it would be
very helpful to figure out who's going.

If you'd like to visit Seattle in mid-August (I promise it probably won't be
raining!) and/or have a topic that you'd like to talk about that I missed,
I'd appreciate it if you wrote it into the wiki page.

Thanks,

--Darrick


[PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-01-21 Thread Darrick J. Wong
Create a new ioctl to expose the block layer's newfound ability to
issue either a zeroing discard, a WRITE SAME with a zero page, or a
regular write with the zero page.  This BLKZEROOUT2 ioctl takes
{start, length, flags} as parameters.  So far, the only flag available
is to enable the zeroing discard part -- without it, the call invokes
the old BLKZEROOUT behavior.  start and length have the same meaning
as in BLKZEROOUT.

Furthermore, because BLKZEROOUT2 issues commands directly to the
storage device, we must invalidate the page cache (as a regular
O_DIRECT write would do) to avoid returning stale cache contents at a
later time.
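
For anyone who wants to poke at the interface from userspace, the
argument checks can be mirrored as below.  This is a sketch only: the
struct and flag are copied from the uapi hunk of this patch and are not
in any released header, and blkzeroout2_check() is a hypothetical helper
that exists purely for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Copied from the uapi hunk of this patch; not in a released header. */
struct blkzeroout2 {
	uint64_t start;
	uint64_t length;
	uint32_t flags;
};
#define BLKZEROOUT2_DISCARD_OK	1

/*
 * Userspace mirror of the checks in blk_ioctl_zeroout() as posted:
 * unknown flags, start or length not a multiple of 512, and ranges that
 * end at or beyond the device size all get -EINVAL.  Like the patch,
 * this computes end = start + length - 1 without guarding against
 * wraparound.
 */
static int blkzeroout2_check(const struct blkzeroout2 *p, uint64_t dev_bytes)
{
	uint64_t end = p->start + p->length - 1;

	if (p->flags & ~(uint32_t)BLKZEROOUT2_DISCARD_OK)
		return -EINVAL;
	if (p->start & 511)
		return -EINVAL;
	if (p->length & 511)
		return -EINVAL;
	if (end >= dev_bytes)
		return -EINVAL;
	return 0;
}
```

Passing flags == 0 is meant to behave like the old BLKZEROOUT; setting
BLKZEROOUT2_DISCARD_OK additionally allows the zeroing discard.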

Depends on "block: Add discard flag to blkdev_issue_zeroout() function".

Signed-off-by: Darrick J. Wong 
---
 block/ioctl.c   |   45 ++---
 include/uapi/linux/fs.h |7 +++
 2 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 7d8befd..ff623d5 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -186,19 +186,39 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 }
 
 static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start,
-			     uint64_t len)
+			     uint64_t len, uint32_t flags)
 {
+	int ret;
+	struct address_space *mapping;
+	uint64_t end = start + len - 1;
+
+	if (flags & ~BLKZEROOUT2_DISCARD_OK)
+		return -EINVAL;
 	if (start & 511)
 		return -EINVAL;
 	if (len & 511)
 		return -EINVAL;
-	start >>= 9;
-	len >>= 9;
-
-	if (start + len > (i_size_read(bdev->bd_inode) >> 9))
+	if (end >= i_size_read(bdev->bd_inode))
 		return -EINVAL;
 
-	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
+	/* Invalidate the page cache, including dirty pages */
+	mapping = bdev->bd_inode->i_mapping;
+	truncate_inode_pages_range(mapping, start, end);
+
+	ret = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+				   flags & BLKZEROOUT2_DISCARD_OK);
+	if (ret)
+		goto out;
+
+	/*
+	 * Invalidate again; if someone wandered in and dirtied a page,
+	 * the caller will be given -EBUSY.
+	 */
+	ret = invalidate_inode_pages2_range(mapping,
+					    start >> PAGE_CACHE_SHIFT,
+					    end >> PAGE_CACHE_SHIFT);
+out:
+	return ret;
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
@@ -326,7 +346,18 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 		if (copy_from_user(range, (void __user *)arg, sizeof(range)))
 			return -EFAULT;
 
-		return blk_ioctl_zeroout(bdev, range[0], range[1]);
+		return blk_ioctl_zeroout(bdev, range[0], range[1], 0);
+	}
+	case BLKZEROOUT2: {
+		struct blkzeroout2 p;
+
+		if (!(mode & FMODE_WRITE))
+			return -EBADF;
+
+		if (copy_from_user(&p, (void __user *)arg, sizeof(p)))
+			return -EFAULT;
+
+		return blk_ioctl_zeroout(bdev, p.start, p.length, p.flags);
 	}
 
 	case HDIO_GETGEO: {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..54d24ea 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -150,6 +150,13 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+struct blkzeroout2 {
+   __u64 start;
+   __u64 length;
+   __u32 flags;
+};
+#define BLKZEROOUT2_DISCARD_OK 1
+#define BLKZEROOUT2 _IOR(0x12, 127, struct blkzeroout2)
 
 #define BMAP_IOCTL 1   /* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */


[PATCH] uas: disable UAS on Apricorn SATA dongles

2014-12-11 Thread Darrick J. Wong
The Apricorn SATA dongle will occasionally return "USBSUSBSUSB" in
response to SCSI commands when running in UAS mode.  Therefore,
disable UAS mode on this dongle.

Signed-off-by: Darrick J. Wong 
---
 drivers/usb/storage/unusual_uas.h |   10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/usb/storage/unusual_uas.h b/drivers/usb/storage/unusual_uas.h
index 18a283d..3530cb0 100644
--- a/drivers/usb/storage/unusual_uas.h
+++ b/drivers/usb/storage/unusual_uas.h
@@ -40,6 +40,16 @@
  * and don't forget to CC: the USB development list 
  */
 
+/*
+ * Apricorn USB3 dongle sometimes returns "USBSUSBSUSBS" in response to SCSI
+ * commands in UAS mode.  Observed with the 1.28 firmware; are there others?
+ */
+UNUSUAL_DEV(0x0984, 0x0301, 0x0128, 0x0128,
+   "Apricorn",
+   "",
+   USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+   US_FL_IGNORE_UAS),
+
 /* https://bugzilla.kernel.org/show_bug.cgi?id=79511 */
 UNUSUAL_DEV(0x0bc2, 0x2312, 0x, 0x,
"Seagate",


Re: UAS crash with Apricorn USB3 SATA bridge

2014-12-11 Thread Darrick J. Wong
On Wed, Dec 10, 2014 at 05:41:54PM -0800, Darrick J. Wong wrote:
> On Wed, Dec 10, 2014 at 02:29:29AM -0800, Darrick J. Wong wrote:
> > On Wed, Dec 10, 2014 at 02:15:14AM -0800, Darrick J. Wong wrote:
> > > On Wed, Dec 10, 2014 at 01:04:58AM -0800, Darrick J. Wong wrote:
> > > > On Wed, Dec 10, 2014 at 09:19:04AM +0100, Hans de Goede wrote:
> > > > > Hi,
> > > > > 
> > > > > On 09-12-14 20:31, Darrick J. Wong wrote:
> > > > > >Hi,
> > > > > >
> > > > > >I have an Apricorn USB 3 disk dongle thing that claims to support 
> > > > > >UAS.
> > > > > >However, the kernel crashes when I plug it in[1].
> > > > > 
> > > > > > Yes there are some known issues with uas error handling which are
> > > > > > fixed in 3.18, can you try with a 3.18 kernel please ?
> > > > 
> > > > > The crash pic was from 3.18.0, blk_mq disabled.  I'll work on getting
> > > > > a fuller dmesg output.  Looking at the code, it looks like we end up
> > > > > in queue_bulk_sg_tx() with a sg list that is shorter than num_sgs, so
> > > > > we fall off the end.
> 
> Well, there are (at least) two issues going on here.  The first is that the
> SCSI layer passes us zero-length READ10 commands, which is causing this crash.
> Zero length means the sglist is empty, so the usb host has nothing to map, and
> hence urb->num_mapped_sgs == 0 and the loop goes boom.  I don't know what it
> means to send a bulk URB with no buffers, so...
> 
> ...then I took a tour of how SCSI LLDDs deal with zero-length read/write
> commands.  mpt2sas attaches a junk sg and pushes the command out.  libata
> detects zero-length READ/WRITE SCSI commands and completes the scsi command
> without ever touching hardware.  I wasn't able to get any of my parallel SCSI
> disks to boot, so I could not try that.
> 
> The other problem is when I plug in a different disk (same mfg/model), READ
> CAPACITY 16 intermittently returns the string "USBSUSBSUSBS", which of course
> is garbage.  The kernel then tries to use these values; fortunately, it 
> rejects
> a sector size of 1431519827 ("USBS") and sets the size to zero.

It turns out that this dongle will return "USBSUSBSUSB" to just about
*any* command, such as READ10.  In fact, that's the root cause of the
crash.  The partition code issues a 4k read to the disk (looking for
partition tables).  The dongle returns "USBSUSBSUSB" (13 bytes) which
causes the bio to be advanced by 13 bytes because the URB's
actual_length is stuffed into the SCSI resid(ual length) field.  The
block layer code now wants to read 4083 bytes starting at byte 13,
which results in 3584 bytes being read ... to somewhere.  This leaves
499 bytes in the bio, which is rounded down to 0 sectors, and thus we
crash on a zero-length READ10 when we try to read the remaining piece
and there's no sg to land the data.  Worse yet, if you somehow patch
all *that* up, now the reader sees USBSUSBSUSB when the bio completes.
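
The byte counts in the paragraph above can be replayed with a few lines of
C.  This is only a sketch of the arithmetic, assuming 512-byte sectors and
round-down conversion; whole_sectors() and bytes_left() are names made up
here, not block layer functions.

```c
#include <assert.h>

#define SECTOR_SHIFT	9	/* 512-byte sectors */

/* Round a byte count down to whole sectors. */
unsigned int whole_sectors(unsigned int bytes)
{
	return bytes >> SECTOR_SHIFT;
}

/* Bytes still outstanding after 'done' bytes of a 'total'-byte request. */
unsigned int bytes_left(unsigned int total, unsigned int done)
{
	assert(done <= total);
	return total - done;
}
```

Starting from a 4096-byte read with 13 bytes "completed", 4083 bytes are
left; that is 7 whole sectors (3584 bytes), and the 499 bytes remaining
after that round down to 0 sectors, which is the zero-length READ10.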

Let's disable UAS on this thing entirely.  (Well, you /could/ hack it
to detect USBSUSBSUSB and fail the SCSI command entirely, but... meh.)

Though we should shortcut a zero-length read to avoid crashing the
kernel, since sg_raw can issue such commands.
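
As a rough illustration of what such a shortcut would test for (this is
not the patch, just a userspace sketch): the TRANSFER LENGTH of a READ(10)
CDB lives in bytes 7-8, big-endian, and a value of zero means no blocks
are to be transferred.  read10_is_noop() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define READ_10_OPCODE	0x28

/* TRANSFER LENGTH of a READ(10) CDB: bytes 7-8, big-endian, in blocks. */
uint16_t read10_transfer_length(const uint8_t cdb[10])
{
	assert(cdb[0] == READ_10_OPCODE);
	return (uint16_t)((cdb[7] << 8) | cdb[8]);
}

/* Hypothetical guard: true when the command moves no data, so it could
 * be completed without ever building a data URB. */
bool read10_is_noop(const uint8_t cdb[10])
{
	return read10_transfer_length(cdb) == 0;
}
```

Where exactly such a check would sit in uas, and what status it should
complete the command with, is left open here.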

Patches soon,

--D

> So, I can code up a couple of patches -- one to teach UAS how to deal with 
> zero
> length read and writes; and a second patch to set US_FL_IGNORE_UAS on Apricorn
> bridges.  I tried setting US_FL_NO_READ_CAPACITY_16, but for whatever reason
> sd.c was still trying RC16.
> 
> --D
> 
> > > > 
> > > > (Alas it's now 1am here, so I'm going to bed. :/ )
> > > 
> > > Eh, nuts to sleeping.  dmesg produces this:
> > > 
> > > [  231.128074] usbcore: registered new interface driver usb-storage
> > > [  231.133822] usbcore: registered new interface driver uas
> > > [  252.121353] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
> > > [  252.136927] scsi host6: uas
> > > [  252.141679] scsi 6:0:0:0: Direct-Access Apricorn  
> > > 0128 PQ: 0 ANSI: 6
> > > [  252.145433] sd 6:0:0:0: Attached scsi generic sg2 type 0
> > > [  252.145525] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 
> > > GB/149 GiB)
> > > [  252.145527] sd 6:0:0:0: [sdc] 4096-byte physical blocks
> > > [  252.145891] sd 6:0:0:0: [sdc] Write Protect is off
> > > [  252.145973] sd 6:0:0:0: [sdc] No Caching mode page found
> > > [  252.145975] s

Re: UAS crash with Apricorn USB3 SATA bridge

2014-12-10 Thread Darrick J. Wong
On Wed, Dec 10, 2014 at 02:29:29AM -0800, Darrick J. Wong wrote:
> On Wed, Dec 10, 2014 at 02:15:14AM -0800, Darrick J. Wong wrote:
> > On Wed, Dec 10, 2014 at 01:04:58AM -0800, Darrick J. Wong wrote:
> > > On Wed, Dec 10, 2014 at 09:19:04AM +0100, Hans de Goede wrote:
> > > > Hi,
> > > > 
> > > > On 09-12-14 20:31, Darrick J. Wong wrote:
> > > > >Hi,
> > > > >
> > > > >I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
> > > > >However, the kernel crashes when I plug it in[1].
> > > > 
> > > > Yes there are some known issues with uas error handling which are fixed
> > > > in 3.18, can you try with a 3.18 kernel please ?
> > > 
> > > The crash pic was from 3.18.0, blk_mq disabled.  I'll work on getting a 
> > > fuller
> > > dmesg output.  Looking at the code, it looks like we end up in
> > > queue_bulk_sg_tx() with a sg list that is shorter than num_sgs, so we 
> > > fall off
> > > the end.

Well, there are (at least) two issues going on here.  The first is that the
SCSI layer passes us zero-length READ10 commands, which is causing this crash.
Zero length means the sglist is empty, so the usb host has nothing to map, and
hence urb->num_mapped_sgs == 0 and the loop goes boom.  I don't know what it
means to send a bulk URB with no buffers, so...

...then I took a tour of how SCSI LLDDs deal with zero-length read/write
commands.  mpt2sas attaches a junk sg and pushes the command out.  libata
detects zero-length READ/WRITE SCSI commands and completes the scsi command
without ever touching hardware.  I wasn't able to get any of my parallel SCSI
disks to boot, so I could not try that.

The other problem is when I plug in a different disk (same mfg/model), READ
CAPACITY 16 intermittently returns the string "USBSUSBSUSBS", which of course
is garbage.  The kernel then tries to use these values; fortunately, it rejects
a sector size of 1431519827 ("USBS") and sets the size to zero.
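
For the curious, the odd-looking sector size is just the ASCII bytes
"USBS" read as a big-endian 32-bit number.  In READ CAPACITY (16) data,
bytes 0-7 hold the last LBA and bytes 8-11 the logical block length, so a
reply of "USBSUSBSUSBS" decodes as below.  The helper names are made up
for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Read a big-endian 32-bit value from a byte buffer. */
uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Logical block length field of READ CAPACITY (16) data: bytes 8-11. */
uint32_t rc16_block_length(const unsigned char *data, unsigned int len)
{
	assert(len >= 12);
	return get_be32(data + 8);
}
```

Feeding it the 12 bytes "USBSUSBSUSBS" yields 0x55534253, which is
1431519827 in decimal.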

So, I can code up a couple of patches -- one to teach UAS how to deal with zero
length read and writes; and a second patch to set US_FL_IGNORE_UAS on Apricorn
bridges.  I tried setting US_FL_NO_READ_CAPACITY_16, but for whatever reason
sd.c was still trying RC16.

--D

> > > 
> > > (Alas it's now 1am here, so I'm going to bed. :/ )
> > 
> > Eh, nuts to sleeping.  dmesg produces this:
> > 
> > [  231.128074] usbcore: registered new interface driver usb-storage
> > [  231.133822] usbcore: registered new interface driver uas
> > [  252.121353] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
> > [  252.136927] scsi host6: uas
> > [  252.141679] scsi 6:0:0:0: Direct-Access Apricorn  
> > 0128 PQ: 0 ANSI: 6
> > [  252.145433] sd 6:0:0:0: Attached scsi generic sg2 type 0
> > [  252.145525] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 
> > GB/149 GiB)
> > [  252.145527] sd 6:0:0:0: [sdc] 4096-byte physical blocks
> > [  252.145891] sd 6:0:0:0: [sdc] Write Protect is off
> > [  252.145973] sd 6:0:0:0: [sdc] No Caching mode page found
> > [  252.145975] sd 6:0:0:0: [sdc] Assuming drive cache: write through
> 
> Huh.  4096-byte physical blocks??  That drive is /not/ a 4k sector drive.
> Here's what the kernel said when I plugged in the other ("Plugable" brand) UAS
> bridge[1]:
> 
> [   32.466870] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
> [   32.498996] usbcore: registered new interface driver usb-storage
> [   37.660963] scsi host6: uas
> [   37.661193] usbcore: registered new interface driver uas
> [   37.661292] queue_bulk_sg_tx: num=1 sg=880447764500 addr=45af41000 
> len=0 pagelink=ea00116bd042
> [   37.661550] queue_bulk_sg_tx: num=1 sg=8804483fb600 addr=45af41000 
> len=0 pagelink=ea00116bd042
> [   37.661744] scsi 6:0:0:0: Direct-Access Plugable USB3-SATA-UASP1  0
> PQ: 0 ANSI: 6
> [   37.661865] queue_bulk_sg_tx: num=1 sg=8804483fba00 addr=45af41000 
> len=0 pagelink=ea00116bd042
> [   37.662053] queue_bulk_sg_tx: num=1 sg=8804483fba00 addr=45af41000 
> len=0 pagelink=ea00116bd042
> [   37.662294] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45af41000 
> len=0 pagelink=ea00116bd042
> [   37.662488] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b6ab000 
> len=0 pagelink=ea00116daac2
> [   37.663041] sd 6:0:0:0: Attached scsi generic sg2 type 0
> [   37.663138] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=44897c000 
> len=0 pagelink=ea0011225f02
> [   37.664420] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 
> GB/149 Gi

Re: UAS crash with Apricorn USB3 SATA bridge

2014-12-10 Thread Darrick J. Wong
On Wed, Dec 10, 2014 at 02:15:14AM -0800, Darrick J. Wong wrote:
> On Wed, Dec 10, 2014 at 01:04:58AM -0800, Darrick J. Wong wrote:
> > On Wed, Dec 10, 2014 at 09:19:04AM +0100, Hans de Goede wrote:
> > > Hi,
> > > 
> > > On 09-12-14 20:31, Darrick J. Wong wrote:
> > > >Hi,
> > > >
> > > >I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
> > > >However, the kernel crashes when I plug it in[1].
> > > 
> > > Yes there are some known issues with uas error handling which are fixed
> > > in 3.18, can you try with a 3.18 kernel please ?
> > 
> > The crash pic was from 3.18.0, blk_mq disabled.  I'll work on getting a 
> > fuller
> > dmesg output.  Looking at the code, it looks like we end up in
> > queue_bulk_sg_tx() with a sg list that is shorter than num_sgs, so we fall 
> > off
> > the end.
> > 
> > (Alas it's now 1am here, so I'm going to bed. :/ )
> 
> Eh, nuts to sleeping.  dmesg produces this:
> 
> [  231.128074] usbcore: registered new interface driver usb-storage
> [  231.133822] usbcore: registered new interface driver uas
> [  252.121353] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
> [  252.136927] scsi host6: uas
> [  252.141679] scsi 6:0:0:0: Direct-Access Apricorn  0128 
> PQ: 0 ANSI: 6
> [  252.145433] sd 6:0:0:0: Attached scsi generic sg2 type 0
> [  252.145525] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 
> GB/149 GiB)
> [  252.145527] sd 6:0:0:0: [sdc] 4096-byte physical blocks
> [  252.145891] sd 6:0:0:0: [sdc] Write Protect is off
> [  252.145973] sd 6:0:0:0: [sdc] No Caching mode page found
> [  252.145975] sd 6:0:0:0: [sdc] Assuming drive cache: write through

Huh.  4096-byte physical blocks??  That drive is /not/ a 4k sector drive.
Here's what the kernel said when I plugged in the other ("Plugable" brand) UAS
bridge[1]:

[   32.466870] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
[   32.498996] usbcore: registered new interface driver usb-storage
[   37.660963] scsi host6: uas
[   37.661193] usbcore: registered new interface driver uas
[   37.661292] queue_bulk_sg_tx: num=1 sg=880447764500 addr=45af41000 len=0 pagelink=ea00116bd042
[   37.661550] queue_bulk_sg_tx: num=1 sg=8804483fb600 addr=45af41000 len=0 pagelink=ea00116bd042
[   37.661744] scsi 6:0:0:0: Direct-Access Plugable USB3-SATA-UASP1  0 PQ: 0 ANSI: 6
[   37.661865] queue_bulk_sg_tx: num=1 sg=8804483fba00 addr=45af41000 len=0 pagelink=ea00116bd042
[   37.662053] queue_bulk_sg_tx: num=1 sg=8804483fba00 addr=45af41000 len=0 pagelink=ea00116bd042
[   37.662294] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45af41000 len=0 pagelink=ea00116bd042
[   37.662488] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b6ab000 len=0 pagelink=ea00116daac2
[   37.663041] sd 6:0:0:0: Attached scsi generic sg2 type 0
[   37.663138] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=44897c000 len=0 pagelink=ea0011225f02
[   37.664420] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 GiB)
[   37.664599] queue_bulk_sg_tx: num=1 sg=880447764400 addr=45b5c len=0 pagelink=ea00116d7002
[   37.664833] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b5c len=0 pagelink=ea00116d7002
[   37.665022] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b5c len=0 pagelink=ea00116d7002
[   37.665255] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b5c len=0 pagelink=ea00116d7002
[   37.665421] sd 6:0:0:0: [sdc] Write Protect is off
[   37.665532] queue_bulk_sg_tx: num=1 sg=88045b9e0a00 addr=45b5c len=0 pagelink=ea00116d7002
[   37.665735] queue_bulk_sg_tx: num=1 sg=88045b9e0a00 addr=45b5c len=0 pagelink=ea00116d7002
[   37.665877] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   37.666003] queue_bulk_sg_tx: num=1 sg=88045b9e1700 addr=4587a8e00 len=0 pagelink=ea001161ea02
[   37.666293] queue_bulk_sg_tx: num=1 sg=88045b9e1700 addr=45b5c len=0 pagelink=ea00116d7002
[   37.670190] queue_bulk_sg_tx: num=1 sg=88045b9e1600 addr=44897c000 len=0 pagelink=ea0011225f02
[   37.676364] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 pagelink=ea00115da482
[   37.681800] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 pagelink=ea00115da482
[   37.687125] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 pagelink=ea00115da482
[   37.692335] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 pagelink=ea00115da482
[   37.697451] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 pagelink=ea00115da482
[   37.702429] queue_bulk_sg_

Re: UAS crash with Apricorn USB3 SATA bridge

2014-12-10 Thread Darrick J. Wong
On Wed, Dec 10, 2014 at 01:04:58AM -0800, Darrick J. Wong wrote:
> On Wed, Dec 10, 2014 at 09:19:04AM +0100, Hans de Goede wrote:
> > Hi,
> > 
> > On 09-12-14 20:31, Darrick J. Wong wrote:
> > >Hi,
> > >
> > >I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
> > >However, the kernel crashes when I plug it in[1].
> > 
> > Yes there are some known issues with uas error handling which are fixed
> > in 3.18, can you try with a 3.18 kernel please ?
> 
> The crash pic was from 3.18.0, blk_mq disabled.  I'll work on getting a fuller
> dmesg output.  Looking at the code, it looks like we end up in
> queue_bulk_sg_tx() with a sg list that is shorter than num_sgs, so we fall off
> the end.
> 
> (Alas it's now 1am here, so I'm going to bed. :/ )

Eh, nuts to sleeping.  dmesg produces this:

[  231.128074] usbcore: registered new interface driver usb-storage
[  231.133822] usbcore: registered new interface driver uas
[  252.121353] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
[  252.136927] scsi host6: uas
[  252.141679] scsi 6:0:0:0: Direct-Access Apricorn  0128 PQ: 0 ANSI: 6
[  252.145433] sd 6:0:0:0: Attached scsi generic sg2 type 0
[  252.145525] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 GiB)
[  252.145527] sd 6:0:0:0: [sdc] 4096-byte physical blocks
[  252.145891] sd 6:0:0:0: [sdc] Write Protect is off
[  252.145973] sd 6:0:0:0: [sdc] No Caching mode page found
[  252.145975] sd 6:0:0:0: [sdc] Assuming drive cache: write through
[  252.171739] queue_bulk_sg_tx: num=4294967295 sg=8804584e0b00 addr=      (null) len=0 pagelink=116b8882
[  252.173706] queue_bulk_sg_tx: num=4294967295 sg=  (null), ABORT
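
A note on the num=4294967295 in the log above: that is 2^32 - 1, which is
what a 32-bit count looks like through %u after it has been decremented
once from zero.  A small sketch, with a made-up helper name and without
assuming whether the driver's variable is signed or unsigned:

```c
#include <stdint.h>

/* What a 32-bit count prints as with %u after one decrement.  The
 * subtraction is done in unsigned arithmetic so it wraps modulo 2^32
 * instead of overflowing a signed int. */
uint32_t count_after_decrement(int32_t num_sgs)
{
	return (uint32_t)num_sgs - 1u;
}
```

Calling it with 0 gives 4294967295, and with 1 gives 0.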


I wrote in a printk to spit out num_sgs and some of the sg data right before
the sg_next() call.  Looks like num_sgs is originally zero?  I then patched
the code to break early if num_sgs == 0:

	/* Calculate length for next transfer --
	 * Are we done queueing all the TRBs for this sg entry?
	 */
	this_sg_len -= trb_buff_len;
	printk(KERN_ERR "%s: num=%u sg=%p addr=%lx len=%u pagelink=%lx\n",
	       __func__, num_sgs, sg, addr, this_sg_len, sg->page_link);
	if (this_sg_len == 0) {
		if (num_sgs == 0) {
			printk(KERN_ERR "%s: breaking early, no sgs??\n", __func__);
			break;
		}
		--num_sgs;
		if (num_sgs == 0)
			break;
		sg = sg_next(sg);
		addr = (u64) sg_dma_address(sg);
		this_sg_len = sg_dma_len(sg);

This produced this log[1] which I've excerpted here:

[   96.944791] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
[   96.972881] usbcore: registered new interface driver usb-storage
[  128.315902] scsi host6: uas
[  128.318605] usbcore: registered new interface driver uas
[  128.318691] queue_bulk_sg_tx: num=1 sg=88044650ed00 addr=446958000 len=0 pagelink=ea00111a5602
[  128.318960] queue_bulk_sg_tx: num=1 sg=880457a03300 addr=446958000 len=0 pagelink=ea00111a5602
[  128.321144] scsi 6:0:0:0: Direct-Access Apricorn  0128 PQ: 0 ANSI: 6
[  128.321165] queue_bulk_sg_tx: num=1 sg=880457a03300 addr=45cbb1000 len=0 pagelink=ea001172ec42
[  128.323714] queue_bulk_sg_tx: num=1 sg=880457a02100 addr=447738000 len=0 pagelink=ea00111dce02
[  128.326233] queue_bulk_sg_tx: num=1 sg=880457a02600 addr=45a4c8000 len=0 pagelink=ea0011693202
[  128.329157] sd 6:0:0:0: Attached scsi generic sg2 type 0
[  128.331328] queue_bulk_sg_tx: num=1 sg=88045795ce00 addr=456ad7000 len=0 pagelink=ea00115ab5c2
[  128.331428] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 GiB)
[  128.331431] sd 6:0:0:0: [sdc] 4096-byte physical blocks
[  128.331448] queue_bulk_sg_tx: num=1 sg=880457a02100 addr=456ad7000 len=0 pagelink=ea00115ab5c2
[  128.333772] queue_bulk_sg_tx: num=1 sg=880457a03300 addr=44649e000 len=0 pagelink=ea0011192782
[  128.336191] queue_bulk_sg_tx: num=1 sg=880457a02700 addr=45683b000 len=0 pagelink=ea00115a0ec2
[  128.338561] queue_bulk_sg_tx: num=1 sg=880457a02600 addr=37355000 len=0 pagelink=eadcd542
[  128.340979] queue_bulk_sg_tx: num=1 sg=880457a02c00 addr=8a8e3000 len=0 pagelink=ea00022a38c2
[  128.343246] sd 6:0:0:0: [sdc] Write Protect is off
[  128.343263] queue_bulk_sg_tx: num=1 sg=880457a02400 addr=8a8e2000 len=0 pagelink=ea00022a3882
[  128.345461] sd 6:0:0:0: [sdc] No Caching mode page found
[  128.345463] sd 6:0:0:0: [sdc] Assuming drive cache: write through
[  128.345475] queue_bulk_sg_tx: num=1 sg=880457a02000 addr=45ba6ba00 len=0 pagelink=ea00116e9ac2
[  128.347752] queue_bulk_sg_tx: num=1 sg=880457a02

Re: UAS crash with Apricorn USB3 SATA bridge

2014-12-10 Thread Darrick J. Wong
On Wed, Dec 10, 2014 at 09:19:04AM +0100, Hans de Goede wrote:
> Hi,
> 
> On 09-12-14 20:31, Darrick J. Wong wrote:
> >Hi,
> >
> >I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
> >However, the kernel crashes when I plug it in[1].
> 
> Yes there are some known issues with uas error handling which are fixed
> in 3.18, can you try with a 3.18 kernel please ?

The crash pic was from 3.18.0, blk_mq disabled.  I'll work on getting a fuller
dmesg output.  Looking at the code, it looks like we end up in
queue_bulk_sg_tx() with a sg list that is shorter than num_sgs, so we fall off
the end.

(Alas it's now 1am here, so I'm going to bed. :/ )

--D

> 
> Note that the device will likely still not work, but it should no
> longer crash things. When running 3.18 please collect the output of
> "dmesg" after plugging in the drive and send that to me, then we'll see
> if we can get it to work from there.
> 
> Regards,
> 
> Hans


UAS crash with Apricorn USB3 SATA bridge

2014-12-09 Thread Darrick J. Wong
Hi,

I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
However, the kernel crashes when I plug it in[1].

I'm not sure what this is caused by, but I also have an ASMedia 2105 SATA
bridge that works with UAS just fine.  Not sure if the Apricorn thing is simply
broken, or if this is a bug in UAS.

I've attached the lsusb -v output[2] if that'll help.  I can try to poke around
with the source code if there's time.

--D

[1] 
https://lh6.googleusercontent.com/-oiOwZmkROQk/VIdNGPTWFDI/C3w/bEw6fSmZpkc/s0-U-I/IMG_0167.JPG

[2] Bus 002 Device 004: ID 0984:0301 Apricorn
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               3.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0         9
  idVendor           0x0984 Apricorn
  idProduct          0x0301
  bcdDevice            1.28
  iManufacturer           1 Apricorn
  iProduct                2
  iSerial                 3 303930363130464232323031
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength          121
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xc0
      Self Powered
    MaxPower                2mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk-Only
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x8b  EP 11 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x0a  EP 10 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       1
      bNumEndpoints           4
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     98
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x08  EP 8 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               0
        Command pipe (0x01)
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x89  EP 9 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
        MaxStreams             32
        Status pipe (0x02)
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x0a  EP 10 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
        MaxStreams             32
        Data-out pipe (0x04)
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x8b  EP 11 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
        MaxStreams             32
        Data-in pipe (0x03)
Binary Object Store Descriptor:
  bLength                 5
  bDescriptorType        15
  wTotalLength           22
  bNumDeviceCaps          2
  USB 2.0 Extension Device Capability:
    bLength                 7
    bDescriptorType        16

[PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2014-11-17 Thread Darrick J. Wong
Create a new ioctl to expose the block layer's newfound ability to
issue either a zeroing discard, a WRITE SAME with a zero page, or a
regular write with the zero page.  This BLKZEROOUT2 ioctl takes
{start, length, flags} as parameters.  So far, the only flag available
is to enable the zeroing discard part -- without it, the call invokes
the old BLKZEROOUT behavior.  start and length have the same meaning
as in BLKZEROOUT.

Furthermore, because BLKZEROOUT2 issues commands directly to the
storage device, we must invalidate the page cache (as a regular
O_DIRECT write would do) to avoid returning stale cache contents at a
later time.

This patch depends on mkp's earlier patch "block: Introduce
blkdev_issue_zeroout_discard() function".

Signed-off-by: Darrick J. Wong 
---
 block/ioctl.c   |   45 ++---
 include/uapi/linux/fs.h |7 +++
 2 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 7d8befd..ff623d5 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -186,19 +186,39 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 }
 
 static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start,
-uint64_t len)
+uint64_t len, uint32_t flags)
 {
+   int ret;
+   struct address_space *mapping;
+   uint64_t end = start + len - 1;
+
+   if (flags & ~BLKZEROOUT2_DISCARD_OK)
+   return -EINVAL;
if (start & 511)
return -EINVAL;
if (len & 511)
return -EINVAL;
-   start >>= 9;
-   len >>= 9;
-
-   if (start + len > (i_size_read(bdev->bd_inode) >> 9))
+   if (end >= i_size_read(bdev->bd_inode))
return -EINVAL;
 
-   return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
+   /* Invalidate the page cache, including dirty pages */
+   mapping = bdev->bd_inode->i_mapping;
+   truncate_inode_pages_range(mapping, start, end);
+
+   ret = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+  flags & BLKZEROOUT2_DISCARD_OK);
+   if (ret)
+   goto out;
+
+   /*
+* Invalidate again; if someone wandered in and dirtied a page,
+* the caller will be given -EBUSY.
+*/
+   ret = invalidate_inode_pages2_range(mapping,
+   start >> PAGE_CACHE_SHIFT,
+   end >> PAGE_CACHE_SHIFT);
+out:
+   return ret;
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
@@ -326,7 +346,18 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
if (copy_from_user(range, (void __user *)arg, sizeof(range)))
return -EFAULT;
 
-   return blk_ioctl_zeroout(bdev, range[0], range[1]);
+   return blk_ioctl_zeroout(bdev, range[0], range[1], 0);
+   }
+   case BLKZEROOUT2: {
+   struct blkzeroout2 p;
+
+   if (!(mode & FMODE_WRITE))
+   return -EBADF;
+
+   if (copy_from_user(&p, (void __user *)arg, sizeof(p)))
+   return -EFAULT;
+
+   return blk_ioctl_zeroout(bdev, p.start, p.length, p.flags);
}
 
case HDIO_GETGEO: {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..54d24ea 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -150,6 +150,13 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+struct blkzeroout2 {
+   __u64 start;
+   __u64 length;
+   __u32 flags;
+};
+#define BLKZEROOUT2_DISCARD_OK 1
+#define BLKZEROOUT2 _IOR(0x12, 127, struct blkzeroout2)
 
 #define BMAP_IOCTL 1   /* obsolete - kept for compatibility */
 #define FIBMAP _IO(0x00,1)  /* bmap access */
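
For reference, the argument checks that blk_ioctl_zeroout() performs in the
patch above can be replayed in userspace.  This is only a sketch under
stated assumptions: zeroout2_check_args() is a made-up name, and dev_size
stands in for i_size_read(bdev->bd_inode).

```c
#include <errno.h>
#include <stdint.h>

#define BLKZEROOUT2_DISCARD_OK	1

/* Userspace replay of the checks, in the same order as the patch. */
int zeroout2_check_args(uint64_t start, uint64_t len, uint32_t flags,
			uint64_t dev_size)
{
	uint64_t end = start + len - 1;	/* last byte of the range */

	if (flags & ~(uint32_t)BLKZEROOUT2_DISCARD_OK)
		return -EINVAL;		/* unknown flag bits */
	if (start & 511)
		return -EINVAL;		/* start not 512-byte aligned */
	if (len & 511)
		return -EINVAL;		/* length not 512-byte aligned */
	if (end >= dev_size)
		return -EINVAL;		/* range runs past the device */
	return 0;
}
```

With a 1 MiB device, for example, a 512-byte range starting at 1048064 is
accepted while a 1024-byte range at the same offset is rejected.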


Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function

2014-11-17 Thread Darrick J. Wong
On Fri, Nov 14, 2014 at 03:22:05PM -0500, Martin K. Petersen wrote:
> > "Martin" == Martin K Petersen  writes:
> 
> Martin> What would you prefer as the default for the ext4 use case? To
> Martin> allocate or to discard?
> 
> I didn't get a preference for whether sb_issue_zeroout() should discard
> or allocate.

In the discussions I've had on the ext4 list, we seem to be leaning towards
discard and falling back to allocate if necessary.
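
To make the fallback order concrete, here is a toy model of the decision
as the quoted patch below describes it: discard if asked for and
guaranteed to read back zeroes, else WRITE SAME, else plain writes.
struct fake_bdev and its fields are invented for this sketch and do not
correspond to real block layer structures.

```c
#include <stdbool.h>

enum zero_method { ZERO_BY_DISCARD, ZERO_BY_WRITE_SAME, ZERO_BY_WRITE };

/* Hypothetical device model; field names are made up for illustration. */
struct fake_bdev {
	bool discard_zeroes_data;	/* discard is guaranteed to read back zeroes */
	bool discard_fails;		/* simulate a failing DISCARD request */
	bool write_same_ok;		/* device supports WRITE SAME */
	bool write_same_fails;		/* simulate a failing WRITE SAME request */
};

/* Return the method that ends up zeroing the range, in fallback order. */
enum zero_method zeroout_method(const struct fake_bdev *bdev, bool discard)
{
	if (discard && bdev->discard_zeroes_data && !bdev->discard_fails)
		return ZERO_BY_DISCARD;
	if (bdev->write_same_ok && !bdev->write_same_fails)
		return ZERO_BY_WRITE_SAME;
	return ZERO_BY_WRITE;	/* plain zero-page writes are the last resort */
}
```

Only the first outcome leaves the blocks unprovisioned; the other two
allocate them.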

--D

> 
> But here's an updated patch 3...
> 
> commit eb23c9e71e08b7f467cbc36990a1a01a94a7b959
> Author: Martin K. Petersen 
> Date:   Thu Nov 6 14:36:05 2014 -0500
> 
> block: Add discard flag to blkdev_issue_zeroout() function
> 
> blkdev_issue_zeroout() will zero a given block range. This is done by
> way of explicit writing, thus provisioning or allocating the blocks on
> disk.
> 
> There are use cases where the desired behavior is to zero the blocks but
> unprovision them if possible. The blocks must deterministically contain
> zeroes when they are subsequently read back.
> 
> This patch adds a flag to blkdev_issue_zeroout() that provides this
> variant. If the discard flag is set and a block device guarantees
> discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
> the device does not support discard_zeroes_data or if the discard
> request fails we will fall back to first REQ_WRITE_SAME and then a
> regular REQ_WRITE.
> 
> Also update the callers of blkdev_issue_zeroout() to reflect the new flag
> and make sb_issue_zeroout() prefer the discard approach.
> 
> Signed-off-by: Martin K. Petersen 
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 8411be3c19d3..715e948f58a4 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -283,23 +283,45 @@ static int __blkdev_issue_zeroout(struct block_device 
> *bdev, sector_t sector,
>   * @sector:  start sector
>   * @nr_sects:number of sectors to write
>   * @gfp_mask:memory allocation flags (for bio_alloc)
> + * @discard: whether to discard the block range
>   *
>   * Description:
> - *  Generate and issue number of bios with zerofiled pages.
> +
> + *  Zero-fill a block range.  If the discard flag is set and the block
> + *  device guarantees that subsequent READ operations to the block range
> + *  in question will return zeroes, the blocks will be discarded. Should
> + *  the discard request fail, if the discard flag is not set, or if
> + *  discard_zeroes_data is not supported, this function will resort to
> + *  zeroing the blocks manually, thus provisioning (allocating,
> + *  anchoring) them. If the block device supports the WRITE SAME command
> + *  blkdev_issue_zeroout() will use it to optimize the process of
> + *  clearing the block range. Otherwise the zeroing will be performed
> + *  using regular WRITE calls.
>   */
>  
>  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> -  sector_t nr_sects, gfp_t gfp_mask)
> +  sector_t nr_sects, gfp_t gfp_mask, bool discard)
>  {
> + struct request_queue *q = bdev_get_queue(bdev);
> + unsigned char bdn[BDEVNAME_SIZE];
> +
> + if (discard && blk_queue_discard(q) && q->limits.discard_zeroes_data) {
> +
> + if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0))
> + return 0;
> +
> + bdevname(bdev, bdn);
> + pr_warn("%s: DISCARD failed. Manually zeroing.\n", bdn);
> + }
> +
>   if (bdev_write_same(bdev)) {
> - unsigned char bdn[BDEVNAME_SIZE];
>  
>   if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
>ZERO_PAGE(0)))
>   return 0;
>  
>   bdevname(bdev, bdn);
> - pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
> + pr_warn("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
>   }
>  
>   return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 6c7bf903742f..7d8befde2aca 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -198,7 +198,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, 
> uint64_t start,
>   if (start + len > (i_size_read(bdev->bd_inode) >> 9))
>   return -EINVAL;
>  
> - return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL);
> + return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
>  }
>  
>  static int put_ushort(unsigned long arg, unsigned short val)
> diff --git a/drivers/block/drbd/drbd_receiver.c 
> b/drivers/block/drbd/drbd_receiver.c
> index 6960fb064731..ee5b9611c51c 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -1388,7 +1388,7 @@ int drbd_submit_peer_request(struct drbd_device *device,
>   list_add_tail(&peer_req->w.list, &device->active_ee);
>

Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function

2014-11-10 Thread Darrick J. Wong
On Fri, Nov 07, 2014 at 12:08:14AM -0500, Martin K. Petersen wrote:
> blkdev_issue_zeroout() will zero a given block range on disk. This is
> done by way of either WRITE SAME or regular WRITE. I.e. the blocks on
> disk will be written and thus provisioned.
> 
> There are use cases where the desired behavior is to zero the blocks but
> unprovision them if possible. The blocks must deterministically contain
> zeroes when they are subsequently read back.
> 
> This patch introduces a blkdev_issue_zeroout_discard() call that
> provides this functionality. If a block device guarantees
> discard_zeroes_data the new function will use discard to clear the block
> range. If the device does not support discard_zeroes_data or if the
> discard request fails we will fall back to blkdev_issue_zeroout() to
> ensure predictable results.

Can this be plumbed into a BLK* ioctl too?  I'll write a patch, if this is ok
with everyone:

struct blkzeroout_t {
__u64 start;
__u64 end;
__u32 flags;
};
#define BLKZEROOUT_DISCARD_OK   1

#define BLKZEROOUT_V2   _IOR(0x12, 127, struct blkzeroout_t)

...and make it zap the page cache per earlier discussion.  This seems to be a
good fit with what we've been discussing for mke2fs.
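
On reusing number 127: the ioctl request code also encodes a direction and
an argument size, so an _IOR() definition with the same type and number as
the existing _IO(0x12,127) still produces a different value.  A small
Linux-only userspace sketch; struct blkzeroout_arg and the OLD_/NEW_ macro
names are stand-ins chosen for this example, using stdint types instead of
__u64/__u32.

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Illustrative copy of the proposed argument struct. */
struct blkzeroout_arg {
	uint64_t start;
	uint64_t end;
	uint32_t flags;
};

/* Existing ioctl: no direction and no size encoded. */
#define OLD_BLKZEROOUT	_IO(0x12, 127)
/* Proposed ioctl: same type and number, but direction and size differ. */
#define NEW_BLKZEROOUT	_IOR(0x12, 127, struct blkzeroout_arg)

/* Nonzero when the two request codes cannot be confused. */
int zeroout_codes_are_distinct(void)
{
	return OLD_BLKZEROOUT != NEW_BLKZEROOUT;
}
```

Note that the third argument of _IOR() is the type itself; the macro
applies sizeof to it internally.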

--D

> 
> Signed-off-by: Martin K. Petersen 
> ---
>  block/blk-lib.c| 44 ++--
>  include/linux/blkdev.h |  2 ++
>  2 files changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 8411be3c19d3..2ffec6a01c71 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -278,14 +278,18 @@ static int __blkdev_issue_zeroout(struct block_device 
> *bdev, sector_t sector,
>  }
>  
>  /**
> - * blkdev_issue_zeroout - zero-fill a block range
> + * blkdev_issue_zeroout - zero-fill and provision a block range
>   * @bdev:blockdev to write
>   * @sector:  start sector
>   * @nr_sects:number of sectors to write
>   * @gfp_mask:memory allocation flags (for bio_alloc)
>   *
>   * Description:
> - *  Generate and issue number of bios with zerofiled pages.
> + *  Zero-fill a block range. The blocks will be provisioned
> + *  (allocated/anchored) and are guaranteed to return zeroes when read
> + *  back. This function will attempt to use WRITE SAME to optimize the
> + *  process if the block device supports it. Otherwise it will fall back
> + *  to zeroing the blocks using regular WRITE calls.
>   */
>  
>  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> @@ -305,3 +309,39 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>   return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
>  }
>  EXPORT_SYMBOL(blkdev_issue_zeroout);
> +
> +/**
> + * blkdev_issue_zeroout_discard - zero-fill and attempt to discard block range
> + * @bdev:blockdev to write
> + * @sector:  start sector
> + * @nr_sects:number of sectors to write
> + * @gfp_mask:memory allocation flags (for bio_alloc)
> + *
> + * Description:
> + *  Zero-fill a block range. In contrast to blkdev_issue_zeroout() this
> + *  function will attempt to deprovision (deallocate/discard) the blocks
> + *  in question. It will only do so if the underlying device guarantees
> + *  that subsequent READ operations to the block range in question will
> + *  return zeroes. If the device does not provide hard guarantees or if
> + *  the DISCARD attempt should fail the block range will be explicitly
> + *  zeroed using blkdev_issue_zeroout().
> + */
> +
> +int blkdev_issue_zeroout_discard(struct block_device *bdev, sector_t sector,
> +  sector_t nr_sects, gfp_t gfp_mask)
> +{
> + struct request_queue *q = bdev_get_queue(bdev);
> +
> + if (blk_queue_discard(q) && q->limits.discard_zeroes_data) {
> + unsigned char bdn[BDEVNAME_SIZE];
> +
> + if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0))
> + return 0;
> +
> + bdevname(bdev, bdn);
> + pr_err("%s: DISCARD failed. Manually zeroing.\n", bdn);
> + }
> +
> + return blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
> +}
> +EXPORT_SYMBOL(blkdev_issue_zeroout_discard);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index aac0f9ea952a..078b6e5f488a 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1164,6 +1164,8 @@ extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
>   sector_t nr_sects, gfp_t gfp_mask, struct page *page);
>  extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>   sector_t nr_sects, gfp_t gfp_mask);
> +extern int blkdev_issue_zeroout_discard(struct block_device *bdev,
> + sector_t sector, sector_t nr_sects, gfp_t gfp_mask);
>  static inline int sb_issue_discard(struct super_block *sb, sector_t block,
>   sector_t n

Re: [PATCH 2/6] io: define an interface for IO extensions

2014-04-02 Thread Darrick J. Wong
On Wed, Apr 02, 2014 at 03:53:33PM -0700, Zach Brown wrote:
> > > I'd just remove this generic teardown callback path entirely.  If
> > > there's PI state hanging off the iocb tear it down during iocb teardown.
> > 
> > Hmm, I thought aio_complete /was/ iocb teardown time.
> 
> Well, usually :).  If you build up before aio_run_iocb() then you need
> to teardown in kiocb_free(), which is also called by aio_complete().

Oh, yeah.  I handle that by tearing down the extensions if stuff fails, though
I don't remember if that was in this version of the patchset.

> > > (Isn't there some allocate-and-copy-from-userspace helper now? But..)
> > 
> >  Is there?  I didn't find one when I looked, but it wasn't an exhaustive
> > search.
> 
> I could have sworn that I saw something.. ah, right, memdup_user().

Noted. :)

> > > I don't like the redundancy of the implicit size requirement by a
> > > field's flag being set being duplicated by the explicit size argument.
> > > What does that give us, exactly?
> > 
> > Either another sanity check or another way to screw up, depending on how you
> > look at it.  I'd been considering shortening the size field to u32 and adding a
> > magic number field, but I wonder if that's really necessary.  Seems like it
> > shouldn't be -- if userland screws up, it's not hard to kill the process.
> > (Or segv it, or...)
> 
> I don't think I'd bother.  The bits should be enough and are already
> necessary to have explicit indicators of fields being set.



> > > Fields in the iocb  As each of these are initialized I'd just
> > > test the presence bits and __get_user() the userspace arguments
> > > directly, or copy_from_user() something slightly more complicated on to
> > > the stack.
> > >
> > > That gets rid of us having to care about the size at all.  It stops us
> > > from allocating a kernel copy and pinning it for the duration of the IO.
> > > We'd just be sampling the present userspace arguments as we initialize
> > > the iocb during submission.
> > 
> > I like this idea.  For the PI extension, nothing particularly error-prone
> > happens in teardown, which allows the flexibility to copy_from_user any
> > arguments required, and to copy_to_user any setup errors that happen.  I can
> > get rid of a lot of allocate-and-copy nonsense, as you point out.
> > 
> > Ok, I'll migrate my patches towards this strategy, and let's see how much code
> > goes away. :)
> 
> Cool :).
> 
> > I've also noticed a bug where if you make one of these PI-extended calls on a
> > file living on a filesystem, it'll extend the io request's range to be
> > filesystem block-aligned, which causes all kinds of havoc with the user
> > provided PI buffers, since they now need to be extended to fit the added
> > blocks.  Alternately, one could require PI IOs to be fs-block aligned when
> > dealing with regular files. 
> 
> I think, like O_DIRECT, it just has to be aligned or fail :(.

Heh.  O_DIRECT is a hilarious maze of twisty unobvious requirements.  Yuck.

#define O_IMNAIVEENOUGHTOTHINKIKNOWWHATTHISDOES O_DIRECT

--D
> 
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/6] aio/dio: enable PI passthrough

2014-04-02 Thread Darrick J. Wong
On Wed, Apr 02, 2014 at 03:33:11PM -0700, Zach Brown wrote:
> > One thing I'm not sure about: What's the largest IO (in terms of # of blocks,
> > not # of struct iovecs) that I can throw at the kernel?
> 
> Yeah, dunno.  I'd guess big :).  I'd hope that the PI code already has a
> way to clamp the size of bios if there's a limit to the size of PI data
> that can be managed downstream?

I guess if we restricted the size of the PI buffer to a page's worth of
pointers to struct page, that limits us to 128M on x64 with DIF and 512b
sectors.  That's not really a whole lot; I suppose one could (ab)use vmalloc.
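
Spelling out where the 128M figure comes from, as a back-of-the-envelope sketch: max_pi_io_bytes() is a made-up helper, and the 4 KiB page and 8-byte pointer sizes are assumptions for x86-64. The 8 bytes of PI per sector is the standard T10 DIF tuple size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest data IO whose PI buffer can be described by a single page
 * full of struct page pointers.
 */
static uint64_t max_pi_io_bytes(uint64_t page_size, uint64_t ptr_size,
				uint64_t tuple_size, uint64_t sector_size)
{
	assert(ptr_size != 0 && tuple_size != 0);

	uint64_t pi_pages = page_size / ptr_size;  /* 4096 / 8    = 512 PI pages     */
	uint64_t pi_bytes = pi_pages * page_size;  /* 512 * 4096  = 2 MiB of PI data */
	uint64_t sectors = pi_bytes / tuple_size;  /* 2 MiB / 8   = 262144 sectors   */

	return sectors * sector_size;              /* 262144 * 512 = 128 MiB         */
}
```

With the same assumptions but 4096-byte sectors the limit works out to 1 GiB, since each 8-byte tuple then covers eight times as much data.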

Yes, blk-integrity clamps the size of the bio to fit the downstream device's
maximum integrity sg size.  See max_integrity_segments for details, or the
mostly-undocumented sg_prot_tablesize sysfs attribute that reveals it.

I don't know what a practical limit is; scsi_debug sets it to 65536.

--D
> 
> - z


Re: [PATCH 2/6] io: define an interface for IO extensions

2014-04-02 Thread Darrick J. Wong
On Wed, Apr 02, 2014 at 12:49:47PM -0700, Zach Brown wrote:
> > @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
> > struct io_event *ev_page, *event;
> > unsigned long   flags;
> > unsigned tail, pos;
> > +   int ret;
> > +
> > +   ret = io_teardown_extensions(iocb);
> > +   if (ret) {
> > +   if (!res)
> > +   res = ret;
> > +   else if (!res2)
> > +   res2 = ret;
> > +   else
> > +   pr_err("error %d tearing down aio extensions\n", ret);
> > +   }
> 
> This ends up trying to copy the kernel's io_extension copy back to
> userspace from interrupts, which obviously won't fly.
> 
> And to what end?  So that maybe someone can later add an 'extension'
> that can fill in some field that's then copied to userspace?  But by
> copying the entire argument struct back?
> 
> Let's not get ahead of ourselves.  If they're going to try and give
> userspace some feedback after IO completion they're going to have to try
> a lot harder because they don't have access to the submitting task
> context anymore.  They'd have to pin some reference to a feedback
> mechanism in the in-flight io.  I think we'd want that explicit in the
> iocb, not hiding off on the other side of this extension interface.

I think we'd want to find an extension that really needs this.  PI doesn't.
We can skate by without supporting the teardown errors case for now.

> I'd just remove this generic teardown callback path entirely.  If
> there's PI state hanging off the iocb tear it down during iocb teardown.

Hmm, I thought aio_complete /was/ iocb teardown time.

> > +struct io_extension_type {
> > +   unsigned int type;
> > +   unsigned int extension_struct_size;
> > +   int (*setup_fn)(struct kiocb *, int is_write);
> > +   int (*destroy_fn)(struct kiocb *);
> > +};
> 
> I'd also get rid of all of this.  More below.
> 
> > +static int io_setup_extensions(struct kiocb *req, int is_write,
> > +  struct io_extension __user *ioext)
> > +{
> > +   struct io_extension_type *iet;
> > +   __u64 sz, has;
> > +   int ret;
> > +
> > +   /* Check size of buffer */
> > +   if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz))))
> > +   return -EFAULT;
> > +   if (sz > PAGE_SIZE ||
> > +   sz > sizeof(struct io_extension) ||
> > +   sz < IO_EXT_SIZE(ie_has))
> > +   return -EINVAL;
> > +
> > +   /* Check that the buffer's big enough */
> > +   if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has))))
> > +   return -EFAULT;
> > +   ret = io_check_bufsize(has, sz);
> > +   if (ret)
> > +   return ret;
> > +
> > +   /* Copy from userland */
> > +   req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO);
> > +   if (!req->ki_ioext)
> > +   return -ENOMEM;
> > +
> > +   req->ki_ioext->ke_user = ioext;
> > +   if (unlikely(copy_from_user(&req->ki_ioext->ke_kern, ioext, sz))) {
> > +   ret = -EFAULT;
> > +   goto out;
> > +   }
> 
> (Isn't there some allocate-and-copy-from-userspace helper now? But..)

 Is there?  I didn't find one when I looked, but it wasn't an exhaustive
search.

> I don't like the redundancy of the implicit size requirement by a
> field's flag being set being duplicated by the explicit size argument.
> What does that give us, exactly?

Either another sanity check or another way to screw up, depending on how you
look at it.  I'd been considering shortening the size field to u32 and adding a
magic number field, but I wonder if that's really necessary.  Seems like it
shouldn't be -- if userland screws up, it's not hard to kill the process.
(Or segv it, or...)

> Our notion of the total size only seems to matter if we're copying
> the entire struct from userspace and I don't think we need to do that.
> 
> For each argument, we're translating it into some kernel equivalent,
> right?

Yes.

> Fields in the iocb  As each of these are initialized I'd just
> test the presence bits and __get_user() the userspace arguments
> directly, or copy_from_user() something slightly more complicated on to
> the stack.
>
> That gets rid of us having to care about the size at all.  It stops us
> from allocating a kernel copy and pinning it for the duration of the IO.
> We'd just be sampling the present userspace arguments as we initialize
> the iocb during submission.

I like this idea.  For the PI extension, nothing particularly error-prone
happens in teardown, which allows the flexibility to copy_from_user any
arguments required, and to copy_to_user any setup errors that happen.  I can
get rid of a lot of allocate-and-copy nonsense, as you point out.

Ok, I'll migrate my patches towards this strategy, and let's see how much code
goes away. :)
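
Roughly what the sample-each-field approach could look like, modeled in plain userspace C so it can be compiled and poked at. This is a sketch only: the field names ie_has, ie_pi_buf, ie_pi_buflen and ie_pi_flags come from the patchset, but the struct layout, the IO_EXT_PI bit, struct pi_args and sample_field() (a stand-in for __get_user()/copy_from_user()) are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IO_EXT_PI (1ULL << 0)	/* hypothetical presence bit */

/* Field names from the patchset; this layout is assumed. */
struct io_extension {
	uint64_t ie_has;
	uint64_t ie_pi_buf;
	uint32_t ie_pi_buflen;
	uint32_t ie_pi_flags;
};

/* Kernel-side view: only the values we actually need, kept on the stack. */
struct pi_args {
	uint64_t buf;
	uint32_t buflen;
	uint32_t flags;
	int present;
};

/* Stand-in for __get_user(): copy one field, 0 on success. */
static int sample_field(void *dst, const void *user_src, size_t len)
{
	if (user_src == NULL)
		return -EFAULT;
	memcpy(dst, user_src, len);
	return 0;
}

/* Test the presence bit, then sample only the fields it covers. */
static int sample_pi_args(const struct io_extension *user, struct pi_args *out)
{
	uint64_t has;

	assert(out != NULL);
	memset(out, 0, sizeof(*out));

	if (user == NULL)
		return -EFAULT;
	if (sample_field(&has, &user->ie_has, sizeof(has)))
		return -EFAULT;
	if (!(has & IO_EXT_PI))
		return 0;	/* extension not requested, nothing to copy */

	if (sample_field(&out->buf, &user->ie_pi_buf, sizeof(out->buf)) ||
	    sample_field(&out->buflen, &user->ie_pi_buflen, sizeof(out->buflen)) ||
	    sample_field(&out->flags, &user->ie_pi_flags, sizeof(out->flags)))
		return -EFAULT;

	out->present = 1;
	return 0;
}
```

Note there is no size field and no heap allocation anywhere: the presence bit alone says which fields are meaningful, and nothing stays pinned after submission.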

I've also noticed a bug where if you make one of these PI-extended calls on a
file living on a filesystem, it'll extend the io request's range to be
filesystem block-aligned, which causes all kinds 

Re: [PATCH 2/6] io: define an interface for IO extensions

2014-04-02 Thread Darrick J. Wong
On Wed, Apr 02, 2014 at 03:22:20PM -0400, Jeff Moyer wrote:
> "Darrick J. Wong"  writes:
> 
> > Define a generic interface to allow userspace to attach metadata to an
> > IO operation.  This interface will be used initially to implement
> > protection information (PI) pass through, though it ought to be usable
> > by anyone else desiring to extend the IO interface.  It should not be
> > difficult to modify the non-AIO calls to use this mechanism.
> 
> My main issue with this patch is determining what exactly gets returned
> to userspace when there is an issue in the teardown_extensions path.
> It looks like you'll get the first error propagated from
> io_teardown_extensions, others are ignored.  Then, in aio_complete, if
> there was no error with the I/O, then you'll get the teardown error
> reported in event->res, otherwise you'll get it in event->res2.  So,
> what are the valid errors returned by the teardown routine for
> extensions?  How is the userspace app supposed to determine where the
> error came from, the I/O or a failure in the extension teardown?

There's also the question of which extension spat out the error.  One solution
would be to augment struct io_extension with all the error fields that we want
(an extension can declare its own if needed) as we do now, and if errors happen
during setup, we can just copy_to_user them back.  If nothing else fails with
the IO setup, the setup routine can return -EINVAL, and userspace can look for
updated error fields in the struct.

Unfortunately for the teardown error case you'd have to pin the whole page in
memory for the duration of the IO just to have it around.  For now this isn't a
problem because teardown can't fail anyway.

> I think it may make sense to only use res2 for reporting io extension
> teardown failures.  Any new code that will use extensions can certainly
> be written to check both res and res2, and this method would prevent the
> ambiguity I mentioned.

Hmm, doesn't look like anyone actually uses res2 except for USB gadgets.

It's tempting just to shove the first ioextension error code that comes along
into res2 and abort the whole thing, and let userspace guess where the res2
code came from.  I think there's an additional problem with stuffing return
codes: in the case of synchronous IO syscalls, we'd have to deal with how to
cram error codes from (potentially) multiple sources into the single return
value, while not giving userspace any help as to where the code came from.

Now that I've written all that out, I don't like this idea so I'll drop it. :)
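
For the record, the ambiguity Jeff describes is easy to reproduce with a small userspace model of the quoted aio_complete() hunk. This is a sketch: struct model_event and stuff_teardown_error() are made-up names, and the return value (1 when the error has nowhere to go) stands in for the pr_err() in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_event {
	long res;
	long res2;
};

/*
 * Mirrors the quoted hunk: a teardown error goes into res if res is zero,
 * else into res2 if that is zero, else it is dropped.
 * Returns 1 if the error could not be reported, 0 otherwise.
 */
static int stuff_teardown_error(struct model_event *ev, int ret)
{
	assert(ev != NULL);

	if (!ret)
		return 0;
	if (!ev->res)
		ev->res = ret;
	else if (!ev->res2)
		ev->res2 = ret;
	else
		return 1;
	return 0;
}
```

Feed it {res = 0, res2 = 0} with a teardown error of -EIO, then {res = -EIO, res2 = 0} with no teardown error: both end up as res = -EIO, res2 = 0, so userspace cannot tell a failed IO from a failed teardown.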

> Finally, I know this is an RFC, but please add some man-page changes to
> your patch set, and CC linux-man.  Michael Kerrisk typically has
> valuable advice on new APIs.

I'll do that the next time I rev the patches.  Thank you for the suggestion.

--D
> 
> Cheers,
> Jeff
> 
> >
> > Signed-off-by: Darrick J. Wong 
> > ---
> >  fs/aio.c |  180 
> > +-
> >  include/linux/aio.h  |7 ++
> >  include/uapi/linux/aio_abi.h |   15 +++-
> >  3 files changed, 197 insertions(+), 5 deletions(-)
> >
> >
> > diff --git a/fs/aio.c b/fs/aio.c
> > index 062a5f6..0c40bdc 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -158,6 +158,11 @@ static struct vfsmount *aio_mnt;
> >  static const struct file_operations aio_ring_fops;
> >  static const struct address_space_operations aio_ctx_aops;
> >  
> > +static int io_teardown_extensions(struct kiocb *req);
> > +static int io_setup_extensions(struct kiocb *req, int is_write,
> > +  struct io_extension __user *ioext);
> > +static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req);
> > +
> >  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
> >  {
> > struct qstr this = QSTR_INIT("[aio]", 5);
> > @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
> > struct io_event *ev_page, *event;
> > unsigned long   flags;
> > unsigned tail, pos;
> > +   int ret;
> > +
> > +   ret = io_teardown_extensions(iocb);
> > +   if (ret) {
> > +   if (!res)
> > +   res = ret;
> > +   else if (!res2)
> > +   res2 = ret;
> > +   else
> > +   pr_err("error %d tearing down aio extensions\n", ret);
> > +   }
> >  
> > /*
> >  * Special case handling for sync iocbs:
> > @@ -1350,15 +1366,167 @@ rw_common:
> > return 0;
> >

Re: [PATCH 3/6] aio/dio: enable PI passthrough

2014-04-02 Thread Darrick J. Wong
On Wed, Apr 02, 2014 at 01:01:33PM -0700, Zach Brown wrote:
> > +static int setup_pi_ext(struct kiocb *req, int is_write)
> > +{
> > +   struct file *file = req->ki_filp;
> > +   struct io_extension *ext = &req->ki_ioext->ke_kern;
> > +   void *p;
> > +   unsigned long start, end;
> > +   int retval;
> > +
> > +   if (!(file->f_flags & O_DIRECT)) {
> > +   pr_debug("EINVAL: can't use PI without O_DIRECT.\n");
> > +   return -EINVAL;
> > +   }
> > +
> > +   BUG_ON(req->ki_ioext->ke_pi_iter.pi_userpages);
> > +
> > +   end = (((unsigned long)ext->ie_pi_buf) + ext->ie_pi_buflen +
> > +   PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +   start = ((unsigned long)ext->ie_pi_buf) >> PAGE_SHIFT;
> > +   req->ki_ioext->ke_pi_iter.pi_offset = offset_in_page(ext->ie_pi_buf);
> > +   req->ki_ioext->ke_pi_iter.pi_len = ext->ie_pi_buflen;
> > +   req->ki_ioext->ke_pi_iter.pi_nrpages = end - start;
> > +   p = kzalloc(req->ki_ioext->ke_pi_iter.pi_nrpages *
> > +   sizeof(struct page *),
> > +   GFP_NOIO);
> 
> Can userspace give us bad data and get us to generate insane allocation
> attempt warnings?

Easily.  One of the bits I have to work on for the PI part is figuring out how
to check with the PI provider that the arguments (the iovec and the pi buffer)
actually make any sense, in terms of length and alignment requirements (PI
tuples can't cross pages).  I think it's as simple as adding a bio_integrity
ops call, and then calling down to it from the kiocb level.
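
A sketch of the kind of sanity check meant here, written as plain C so the arithmetic can be tested outside the kernel. check_pi_buffer() and its parameters are hypothetical, not an existing bio_integrity op. The alignment rule is deliberately conservative: it requires the tuple size to divide the page size and the buffer's in-page offset, which is sufficient (not strictly necessary) to keep tuples from straddling a page. The page count matches the start/end computation in the quoted setup_pi_ext().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Validate a user PI buffer before pinning it.  On success, *nr_pages is
 * the number of pages the buffer touches.
 */
static int check_pi_buffer(uintptr_t buf, size_t buflen, size_t tuple_size,
			   size_t nr_sectors, size_t page_size,
			   size_t *nr_pages)
{
	uintptr_t first, last;

	assert(nr_pages != NULL);

	if (tuple_size == 0 || page_size == 0 || buflen == 0)
		return -EINVAL;

	/* Length: one tuple per sector, guarding the multiplication. */
	if (nr_sectors > SIZE_MAX / tuple_size ||
	    buflen < nr_sectors * tuple_size)
		return -EINVAL;

	/* The address range must not wrap around. */
	if (buflen - 1 > UINTPTR_MAX - buf)
		return -EINVAL;

	/* Conservative rule so that no tuple crosses a page boundary. */
	if (page_size % tuple_size != 0 ||
	    (buf % page_size) % tuple_size != 0)
		return -EINVAL;

	/* Round the start down and the end up to page boundaries. */
	first = buf / page_size;
	last = (buf + (buflen - 1)) / page_size;
	*nr_pages = (size_t)(last - first + 1);
	return 0;
}
```

Rejecting bad lengths up front also bounds the size of the page-pointer array, which addresses the insane-allocation concern before kzalloc() is ever called.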

One thing I'm not sure about: What's the largest IO (in terms of # of blocks,
not # of struct iovecs) that I can throw at the kernel?

> > +   if (p == NULL) {
> > +   pr_err("%s: no room for page array?\n", __func__);
> > +   return -ENOMEM;
> > +   }
> > +   req->ki_ioext->ke_pi_iter.pi_userpages = p;
> > +
> > +   retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf,
> > +req->ki_ioext->ke_pi_iter.pi_nrpages,
> > +is_write,
> 
> Isn't this is_write backwards?  If it's a write syscall then the PI
> pages are going to be read from.

Yes, I think so.  Good catch!

--D
> 
> - z


Re: [PATCH 1/6] fs/bio-integrity: remove duplicate code

2014-04-02 Thread Darrick J. Wong
On Wed, Apr 02, 2014 at 12:17:58PM -0700, Zach Brown wrote:
> > +static int bio_integrity_generate_verify(struct bio *bio, int operate)
> >  {
> 
> > +   if (operate)
> > +   sector = bio->bi_iter.bi_sector;
> > +   else
> > +   sector = bio->bi_integrity->bip_iter.bi_sector;
> 
> > +   if (operate) {
> > +   bi->generate_fn(&bix);
> > +   } else {
> > +   ret = bi->verify_fn(&bix);
> > +   if (ret) {
> > +   kunmap_atomic(kaddr);
> > +   return ret;
> > +   }
> > +   }
> 
> I was glad to see this replaced with explicit sector and func arguments
> in later refactoring in the 6/ patch.
> 
> But I don't think the function pointer casts in that 6/ patch are wise
> (Or even safe all the time, given crazy function pointer trampolines?
> Is that still a thing?).  I'd have made a single walk_fn type that
> returns and have the non-returning iterators just return 0.

Noted.  I cleaned all that crap out just yesterday, so now there's only one
walk function and some context data that gets passed to the iterator function.
Much less horrifying.
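
In spirit, the cleaned-up shape is something like the following userspace sketch: one callback type that always returns int, with callbacks that cannot fail simply returning 0, so no function pointer casts are needed. All names here (walk_fn, walk_ctx, walk_sectors, generate_cb, verify_cb) are made up for illustration and are not the ones in the reworked patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Context handed to every callback invocation. */
struct walk_ctx {
	unsigned int sectors_seen;
	int fail_at;		/* sector that fails verification, -1 for none */
};

/* A single callback type: everything returns int. */
typedef int (*walk_fn)(void *ctx, unsigned int sector);

/* Walk a sector range, stopping at the first callback error. */
static int walk_sectors(unsigned int first, unsigned int count,
			walk_fn fn, void *ctx)
{
	assert(fn != NULL);

	for (unsigned int i = 0; i < count; i++) {
		int ret = fn(ctx, first + i);

		if (ret)
			return ret;
	}
	return 0;
}

/* "Generate"-style callback: cannot fail, so it always returns 0. */
static int generate_cb(void *ctx, unsigned int sector)
{
	struct walk_ctx *c = ctx;

	(void)sector;
	c->sectors_seen++;
	return 0;
}

/* "Verify"-style callback: may report an error for a bad sector. */
static int verify_cb(void *ctx, unsigned int sector)
{
	struct walk_ctx *c = ctx;

	if ((int)sector == c->fail_at)
		return -EIO;
	c->sectors_seen++;
	return 0;
}
```

Walking four sectors from 10 with generate_cb visits all four; with verify_cb and fail_at = 12 the walk stops after two sectors and returns -EIO.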

(I really only included this patch so that I'd have less rebasing work when
3.15-rc1 comes out.)

--D
> 
> - z


[PATCH 6/6] blk-integrity: refactor various routines

2014-03-24 Thread Darrick J. Wong
Refactor blk-integrity.c to avoid duplicating similar functions, and
remove all users of pi_buf, since it's really only there to handle the
(common) case where the kernel auto-generates all the PI data.

Signed-off-by: Darrick J. Wong 
---
 fs/bio-integrity.c  |  120 +--
 include/linux/bio.h |2 -
 2 files changed, 49 insertions(+), 73 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 381ee38..3ff1572 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -97,8 +97,7 @@ void bio_integrity_free(struct bio *bio)
struct bio_integrity_payload *bip = bio->bi_integrity;
struct bio_set *bs = bio->bi_pool;
 
-   if (bip->bip_owns_buf)
-   kfree(bip->bip_buf);
+   kfree(bip->bip_buf);
 
if (bs) {
if (bip->bip_slab != BIO_POOL_NONE)
@@ -239,9 +238,11 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 {
struct bio_integrity_payload *bip = bio->bi_integrity;
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-   unsigned int nr_sectors;
-
-   BUG_ON(bip->bip_buf == NULL);
+   unsigned int nr_sectors, tag_offset, sectors;
+   void *prot_buf;
+   unsigned int prot_offset, prot_len;
+   struct bio_vec *iv;
+   void (*tag_fn)(void *buf, void *tag_buf, unsigned int);
 
if (bi->tag_size == 0)
return -1;
@@ -255,10 +256,30 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
return -1;
}
 
-   if (set)
-   bi->set_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
-   else
-   bi->get_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
+   iv = bip->bip_vec;
+   prot_offset = iv->bv_offset;
+   prot_len = iv->bv_len;
+   prot_buf = kmap_atomic(iv->bv_page);
+   tag_fn = set ? bi->set_tag_fn : bi->get_tag_fn;
+   tag_offset = 0;
+
+   while (nr_sectors) {
+   if (prot_len < bi->tuple_size) {
+   kunmap_atomic(prot_buf);
+   iv++;
+   BUG_ON(iv >= bip->bip_vec + bip->bip_vcnt);
+   prot_offset = iv->bv_offset;
+   prot_len = iv->bv_len;
+   prot_buf = kmap_atomic(iv->bv_page);
+   }
+   sectors = min(prot_len / bi->tuple_size, nr_sectors);
+   tag_fn(prot_buf + prot_offset, tag_buf + tag_offset, sectors);
+   nr_sectors -= sectors;
+   tag_offset += sectors * bi->tuple_size;
+   prot_offset += sectors * bi->tuple_size;
+   prot_len -= sectors * bi->tuple_size;
+   }
+   kunmap_atomic(prot_buf);
 
return 0;
 }
@@ -300,28 +321,24 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 }
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
-/**
- * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
- * @bio:   bio to generate/verify integrity metadata for
- */
-int bio_integrity_update_user_buffer(struct bio *bio)
+typedef int (walk_buf_fn)(struct blk_integrity_exchg *bi, int flags);
+
+static int bio_integrity_walk_bufs(struct bio *bio, sector_t sector,
+  walk_buf_fn *mod_fn)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
struct blk_integrity_exchg bix;
struct bio_vec bv;
struct bvec_iter iter;
-   sector_t sector;
unsigned int sectors, total, ret;
void *prot_buf;
unsigned int prot_offset, prot_len, bv_offset, bv_len;
struct bio_vec *iv;
struct bio_integrity_payload *bip = bio->bi_integrity;
 
-   if (!bi->mod_user_buf_fn)
+   if (!mod_fn)
return 0;
 
-   sector = bio->bi_iter.bi_sector;
-
total = ret = 0;
bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
bix.sector_size = bi->sector_size;
@@ -351,7 +368,7 @@ int bio_integrity_update_user_buffer(struct bio *bio)
bix.prot_buf = prot_buf + prot_offset;
bix.sector = sector;
 
-   ret = bi->mod_user_buf_fn(&bix, bip->bip_user_flags);
+   ret = mod_fn(&bix, bip->bip_user_flags);
if (ret) {
if (ret == -ENOTTY)
ret = 0;
@@ -374,59 +391,19 @@ int bio_integrity_update_user_buffer(struct bio *bio)
kunmap_atomic(prot_buf);
return ret;
 }
-EXPORT_SYMBOL_GPL(bio_integrity_update_user_buffer);
 
 /**
- * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
  * 

[PATCH 5/6] PI IO extension: advertise possible userspace flags

2014-03-24 Thread Darrick J. Wong
Expose possible userland flags to the new PI IO extension so that
userspace can discover what flags exist.

Signed-off-by: Darrick J. Wong 
---
 Documentation/ABI/testing/sysfs-block  |   14 ++
 Documentation/block/data-integrity.txt |   22 +
 block/blk-integrity.c  |   33 
 drivers/scsi/sd_dif.c  |   11 +++
 include/linux/blkdev.h |7 +++
 5 files changed, 87 insertions(+)


diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 279da08..989cb80 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -53,6 +53,20 @@ Description:
512 bytes of data.
 
 
+What:  /sys/block/<disk>/integrity/tuple_size
+Date:  March 2014
+Contact:   Darrick J. Wong 
+Description:
+   Size in bytes of the integrity data buffer for each logical
+   block.
+
+What:  /sys/block/<disk>/integrity/write_user_flags
+Date:  March 2014
+Contact:   Darrick J. Wong 
+Description:
+   Provides a list of flags that userspace can pass to the kernel
+   when supplying integrity data for a write IO.
+
 What:  /sys/block/<disk>/integrity/write_generate
 Date:  June 2008
 Contact:   Martin K. Petersen 
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index b72a54f..e33d4a7 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -341,7 +341,29 @@ will require extra work due to the application tag.
   specific to the blk_integrity provider) arrange for pre-processing
   of the user buffer prior to issuing the IO.
 
+  'user_write_flags' points to an array of struct blk_integrity_flag,
+  which maps mod_user_buf_fn flags to a description of what they do.
+
   See 6.2 for a description of get_tag_fn and set_tag_fn.
 
+5.5 PASSING INTEGRITY DATA FROM USERSPACE
+
+The "IO extension" interface has been expanded to provide
+userspace programs with the ability to provide PI data with a WRITE,
+or to receive PI data with a READ.  The fields ie_pi_buf,
+ie_pi_buflen, and ie_pi_flags should contain a pointer to the PI
+buffer, the length of the PI buffer, and any flags that should be
+passed to the PI provider.
+
+This buffer must contain PI tuples.  Tuples must NOT split a page
+boundary.  Valid flag values can be found in
+/sys/block/*/integrity/write_user_flags.  The tuple size can be found
+in /sys/block/*/integrity/tuple_size.
+
+In general, the flags allow the user program to ask the in-kernel
+integrity provider to fill in some parts of the tuples.  For example,
+the T10 DIF provider can fill in the reference tag (sector number) so
+that userspace can choose not to care about the reference tag.
+
 --
 2007-12-24 Martin K. Petersen 
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 1cb1eb2..557d28e 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -307,6 +307,26 @@ static ssize_t integrity_write_show(struct blk_integrity *bi, char *page)
return sprintf(page, "%d\n", (bi->flags & INTEGRITY_FLAG_WRITE) != 0);
 }
 
+static ssize_t integrity_write_flags_show(struct blk_integrity *bi, char *page)
+{
+   struct blk_integrity_flag *flag = bi->user_write_flags;
+   char *p = page;
+   ssize_t ret = 0;
+
+   while (flag->value) {
+   ret += scnprintf(p, PAGE_SIZE - ret, "0x%x: %s\n",
+   flag->value, flag->descr);
+   p = page + ret;
+   flag++;
+   }
+   return ret;
+}
+
+static ssize_t integrity_tuple_size_show(struct blk_integrity *bi, char *page)
+{
+   return sprintf(page, "%d\n", bi->tuple_size);
+}
+
 static struct integrity_sysfs_entry integrity_format_entry = {
.attr = { .name = "format", .mode = S_IRUGO },
.show = integrity_format_show,
@@ -329,11 +349,23 @@ static struct integrity_sysfs_entry integrity_write_entry = {
.store = integrity_write_store,
 };
 
+static struct integrity_sysfs_entry integrity_write_flags_entry = {
+   .attr = { .name = "write_user_flags", .mode = S_IRUGO },
+   .show = integrity_write_flags_show,
+};
+
+static struct integrity_sysfs_entry integrity_tuple_size_entry = {
+   .attr = { .name = "tuple_size", .mode = S_IRUGO },
+   .show = integrity_tuple_size_show,
+};
+
 static struct attribute *integrity_attrs[] = {
&integrity_format_entry.attr,
&integrity_tag_size_entry.attr,
&integrity_read_entry.attr,
&integrity_write_entry.attr,
+   &integrity_write_flag

[PATCH 1/6] fs/bio-integrity: remove duplicate code

2014-03-24 Thread Darrick J. Wong
From: Gu Zheng

Most of the code in bio_integrity_verify() and bio_integrity_generate()
is the same, so introduce a helper function,
bio_integrity_generate_verify(), to remove the duplicate code.

Signed-off-by: Gu Zheng 
---
 fs/bio-integrity.c |   83 +++-
 1 file changed, 37 insertions(+), 46 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 4f70f38..413312f 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -301,25 +301,26 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
 /**
- * bio_integrity_generate - Generate integrity metadata for a bio
- * @bio:   bio to generate integrity metadata for
- *
- * Description: Generates integrity metadata for a bio by calling the
- * block device's generation callback function.  The bio must have a
- * bip attached with enough room to accommodate the generated
- * integrity metadata.
+ * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * @bio:   bio to generate/verify integrity metadata for
+ * @operate:   operate number, 1 for generate, 0 for verify
  */
-static void bio_integrity_generate(struct bio *bio)
+static int bio_integrity_generate_verify(struct bio *bio, int operate)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
struct blk_integrity_exchg bix;
struct bio_vec bv;
struct bvec_iter iter;
-   sector_t sector = bio->bi_iter.bi_sector;
-   unsigned int sectors, total;
+   sector_t sector;
+   unsigned int sectors, total, ret;
void *prot_buf = bio->bi_integrity->bip_buf;
 
-   total = 0;
+   if (operate)
+   sector = bio->bi_iter.bi_sector;
+   else
+   sector = bio->bi_integrity->bip_iter.bi_sector;
+
+   total = ret = 0;
bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
bix.sector_size = bi->sector_size;
 
@@ -330,7 +331,15 @@ static void bio_integrity_generate(struct bio *bio)
bix.prot_buf = prot_buf;
bix.sector = sector;
 
-   bi->generate_fn(&bix);
+   if (operate) {
+   bi->generate_fn(&bix);
+   } else {
+   ret = bi->verify_fn(&bix);
+   if (ret) {
+   kunmap_atomic(kaddr);
+   return ret;
+   }
+   }
 
sectors = bv.bv_len / bi->sector_size;
sector += sectors;
@@ -340,6 +349,21 @@ static void bio_integrity_generate(struct bio *bio)
 
kunmap_atomic(kaddr);
}
+   return ret;
+}
+
+/**
+ * bio_integrity_generate - Generate integrity metadata for a bio
+ * @bio:   bio to generate integrity metadata for
+ *
+ * Description: Generates integrity metadata for a bio by calling the
+ * block device's generation callback function.  The bio must have a
+ * bip attached with enough room to accommodate the generated
+ * integrity metadata.
+ */
+static void bio_integrity_generate(struct bio *bio)
+{
+   bio_integrity_generate_verify(bio, 1);
 }
 
 static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
@@ -454,40 +478,7 @@ EXPORT_SYMBOL(bio_integrity_prep);
  */
 static int bio_integrity_verify(struct bio *bio)
 {
-   struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-   struct blk_integrity_exchg bix;
-   struct bio_vec *bv;
-   sector_t sector = bio->bi_integrity->bip_iter.bi_sector;
-   unsigned int sectors, ret = 0;
-   void *prot_buf = bio->bi_integrity->bip_buf;
-   int i;
-
-   bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
-   bix.sector_size = bi->sector_size;
-
-   bio_for_each_segment_all(bv, bio, i) {
-   void *kaddr = kmap_atomic(bv->bv_page);
-
-   bix.data_buf = kaddr + bv->bv_offset;
-   bix.data_size = bv->bv_len;
-   bix.prot_buf = prot_buf;
-   bix.sector = sector;
-
-   ret = bi->verify_fn(&bix);
-
-   if (ret) {
-   kunmap_atomic(kaddr);
-   return ret;
-   }
-
-   sectors = bv->bv_len / bi->sector_size;
-   sector += sectors;
-   prot_buf += sectors * bi->tuple_size;
-
-   kunmap_atomic(kaddr);
-   }
-
-   return ret;
+   return bio_integrity_generate_verify(bio, 0);
 }
 
 /**

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6] aio/dio: enable PI passthrough

2014-03-24 Thread Darrick J. Wong
Provide an IO extension handler that attaches PI data from the io
extension structure to a kiocb, then teach directio how to attach the
pages representing the PI buffer directly to a bio.

Signed-off-by: Darrick J. Wong 
---
 Documentation/block/data-integrity.txt |   11 
 fs/aio.c   |   62 +
 fs/bio-integrity.c |   94 +++-
 fs/direct-io.c |   70 +++-
 include/linux/aio.h|   10 +++
 include/linux/bio.h|   15 +
 include/uapi/linux/aio_abi.h   |6 ++
 mm/filemap.c   |6 ++
 8 files changed, 259 insertions(+), 15 deletions(-)


diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index 2d735b0a..1d1f070 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -282,6 +282,17 @@ will require extra work due to the application tag.
   It is up to the receiver to process them and verify data
   integrity upon completion.
 
+int bio_integrity_prep_buffer(struct bio *bio, int rw,
+ struct bio_integrity_prep_iter *pi);
+
+  This function should be called before submit_bio; its purpose is to
+  attach an arbitrary array of struct page * containing integrity data
+  to an existing bio.  Primarily this is intended for AIO/DIO to be
+  able to attach a userspace buffer to a bio.
+
+  The bio_integrity_prep_iter should contain the page offset and buffer
+  length of the PI buffer, the number of pages, and the actual array of
+  pages, as returned by get_user_pages.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
 METADATA
diff --git a/fs/aio.c b/fs/aio.c
index 0c40bdc..3f932c3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1379,7 +1379,69 @@ struct io_extension_type {
int (*destroy_fn)(struct kiocb *);
 };
 
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int destroy_pi_ext(struct kiocb *req)
+{
+   unsigned int i;
+
+   if (req->ki_ioext->ke_pi_iter.pi_userpages == NULL)
+   return 0;
+
+   for (i = 0; i < req->ki_ioext->ke_pi_iter.pi_nrpages; i++)
+   page_cache_release(req->ki_ioext->ke_pi_iter.pi_userpages[i]);
+   kfree(req->ki_ioext->ke_pi_iter.pi_userpages);
+   req->ki_ioext->ke_pi_iter.pi_userpages = NULL;
+
+   return 0;
+}
+
+static int setup_pi_ext(struct kiocb *req, int is_write)
+{
+   struct file *file = req->ki_filp;
+   struct io_extension *ext = &req->ki_ioext->ke_kern;
+   void *p;
+   unsigned long start, end;
+   int retval;
+
+   if (!(file->f_flags & O_DIRECT)) {
+   pr_debug("EINVAL: can't use PI without O_DIRECT.\n");
+   return -EINVAL;
+   }
+
+   BUG_ON(req->ki_ioext->ke_pi_iter.pi_userpages);
+
+   end = (((unsigned long)ext->ie_pi_buf) + ext->ie_pi_buflen +
+   PAGE_SIZE - 1) >> PAGE_SHIFT;
+   start = ((unsigned long)ext->ie_pi_buf) >> PAGE_SHIFT;
+   req->ki_ioext->ke_pi_iter.pi_offset = offset_in_page(ext->ie_pi_buf);
+   req->ki_ioext->ke_pi_iter.pi_len = ext->ie_pi_buflen;
+   req->ki_ioext->ke_pi_iter.pi_nrpages = end - start;
+   p = kzalloc(req->ki_ioext->ke_pi_iter.pi_nrpages *
+   sizeof(struct page *),
+   GFP_NOIO);
+   if (p == NULL) {
+   pr_err("%s: no room for page array?\n", __func__);
+   return -ENOMEM;
+   }
+   req->ki_ioext->ke_pi_iter.pi_userpages = p;
+
+   retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf,
+req->ki_ioext->ke_pi_iter.pi_nrpages,
+is_write,
+req->ki_ioext->ke_pi_iter.pi_userpages);
+   if (retval != req->ki_ioext->ke_pi_iter.pi_nrpages) {
+   pr_err("%s: couldn't map pages?\n", __func__);
+   req->ki_ioext->ke_pi_iter.pi_nrpages = retval;
+   return -ENOMEM;
+   }
+   req->ki_flags |= KIOCB_DIO_ONLY;
+
+   return 0;
+}
+#endif
+
 static struct io_extension_type extensions[] = {
+   {IO_EXT_PI, IO_EXT_SIZE(ie_pi_ret), setup_pi_ext, destroy_pi_ext},
{IO_EXT_INVALID, 0, NULL, NULL},
 };
 
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 413312f..3df9aeb 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -138,7 +138,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
struct bio_vec *iv;
 
if (bip->bip_vcnt >= bip_integrity_vecs(bip)) {
-   printk(KERN_ERR "%s: bip_vec full\n", _

[PATCH 4/6] PI IO extension: allow user to ask kernel to fill in parts of the protection info

2014-03-24 Thread Darrick J. Wong
Since userspace can now pass PI buffers through to the block integrity
provider, provide a means for userspace to specify a flags argument
with the PI buffer.  The initial user for this will be sd_dif, which
will enable user programs to ask the kernel to fill in whichever
fields they don't want to provide.  This is intended, for example, to
satisfy programs that really only care to provide an app tag.

Signed-off-by: Darrick J. Wong 
---
 Documentation/block/data-integrity.txt |   11 
 block/blk-integrity.c  |1 
 drivers/scsi/sd_dif.c  |   76 ++
 fs/aio.c   |3 +
 fs/bio-integrity.c |   80 
 fs/direct-io.c |1 
 include/linux/bio.h|3 +
 include/linux/blkdev.h |2 +
 include/uapi/linux/aio_abi.h   |1 
 9 files changed, 162 insertions(+), 16 deletions(-)


diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index 1d1f070..b72a54f 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -292,7 +292,10 @@ will require extra work due to the application tag.
 
   The bio_integrity_prep_iter should contain the page offset and buffer
   length of the PI buffer, the number of pages, and the actual array of
-  pages, as returned by get_user_pages.
+  pages, as returned by get_user_pages.  The user_flags argument should
+  contain whatever flag values were passed in by userspace; the values
+  of the flags are specific to the block integrity provider, and are
+  passed to the mod_user_buf_fn handler.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
 METADATA
@@ -332,6 +335,12 @@ will require extra work due to the application tag.
   are available per hardware sector.  For DIF this is either 2 or
   0 depending on the value of the Control Mode Page ATO bit.
 
+  'mod_user_buf_fn' updates the appropriate integrity metadata for
+  a WRITE operation.  This function is called when userspace passes
+  in a PI buffer along with file data; the flags argument (which is
+  specific to the blk_integrity provider) arranges for pre-processing
+  of the user buffer prior to issuing the IO.
+
   See 6.2 for a description of get_tag_fn and set_tag_fn.
 
 --
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 7fbab84..1cb1eb2 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -421,6 +421,7 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
bi->set_tag_fn = template->set_tag_fn;
bi->get_tag_fn = template->get_tag_fn;
bi->tag_size = template->tag_size;
+   bi->mod_user_buf_fn = template->mod_user_buf_fn;
} else
bi->name = bi_unsupported_name;
 
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index a7a691d..74182c9 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -53,31 +53,58 @@ static __u16 sd_dif_ip_fn(void *data, unsigned int len)
  * Type 1 and Type 2 protection use the same format: 16 bit guard tag,
  * 16 bit app tag, 32 bit reference tag.
  */
-static void sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn)
+#define GENERATE_GUARD (1)
+#define GENERATE_REF   (2)
+#define GENERATE_APP   (4)
+#define GENERATE_ALL   (7)
+static int sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn,
+int flags)
 {
void *buf = bix->data_buf;
struct sd_dif_tuple *sdt = bix->prot_buf;
sector_t sector = bix->sector;
unsigned int i;
 
+   if (flags & ~GENERATE_ALL)
+   return -EINVAL;
+   if (!flags)
+   return -ENOTTY;
+
for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
-   sdt->guard_tag = fn(buf, bix->sector_size);
-   sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
-   sdt->app_tag = 0;
+   if (flags & GENERATE_GUARD)
+   sdt->guard_tag = fn(buf, bix->sector_size);
+   if (flags & GENERATE_REF)
+   sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
+   if (flags & GENERATE_APP)
+   sdt->app_tag = 0;
 
buf += bix->sector_size;
sector++;
}
+
+   return 0;
 }
 
 static void sd_dif_type1_generate_crc(struct blk_integrity_exchg *bix)
 {
-   sd_dif_type1_generate(bix, sd_dif_crc_fn);
+   sd_dif_type1_generate(bix, sd_dif_crc_fn, GENERATE_ALL);
 }
 
 static vo

[PATCH 2/6] io: define an interface for IO extensions

2014-03-24 Thread Darrick J. Wong
Define a generic interface to allow userspace to attach metadata to an
IO operation.  This interface will be used initially to implement
protection information (PI) pass through, though it ought to be usable
by anyone else desiring to extend the IO interface.  It should not be
difficult to modify the non-AIO calls to use this mechanism.

Signed-off-by: Darrick J. Wong 
---
 fs/aio.c |  180 +-
 include/linux/aio.h  |7 ++
 include/uapi/linux/aio_abi.h |   15 +++-
 3 files changed, 197 insertions(+), 5 deletions(-)


diff --git a/fs/aio.c b/fs/aio.c
index 062a5f6..0c40bdc 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -158,6 +158,11 @@ static struct vfsmount *aio_mnt;
 static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
+static int io_teardown_extensions(struct kiocb *req);
+static int io_setup_extensions(struct kiocb *req, int is_write,
+  struct io_extension __user *ioext);
+static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req);
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
struct qstr this = QSTR_INIT("[aio]", 5);
@@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
struct io_event *ev_page, *event;
unsigned long   flags;
unsigned tail, pos;
+   int ret;
+
+   ret = io_teardown_extensions(iocb);
+   if (ret) {
+   if (!res)
+   res = ret;
+   else if (!res2)
+   res2 = ret;
+   else
+   pr_err("error %d tearing down aio extensions\n", ret);
+   }
 
/*
 * Special case handling for sync iocbs:
@@ -1350,15 +1366,167 @@ rw_common:
return 0;
 }
 
+/* IO extension code */
+#define REQUIRED_STRUCTURE_SIZE(type, member)  \
+   (offsetof(type, member) + sizeof(((type *)NULL)->member))
+#define IO_EXT_SIZE(member) \
+   REQUIRED_STRUCTURE_SIZE(struct io_extension, member)
+
+struct io_extension_type {
+   unsigned int type;
+   unsigned int extension_struct_size;
+   int (*setup_fn)(struct kiocb *, int is_write);
+   int (*destroy_fn)(struct kiocb *);
+};
+
+static struct io_extension_type extensions[] = {
+   {IO_EXT_INVALID, 0, NULL, NULL},
+};
+
+static int is_write_iocb(struct iocb *iocb)
+{
+   switch (iocb->aio_lio_opcode) {
+   case IOCB_CMD_PWRITE:
+   case IOCB_CMD_PWRITEV:
+   return 1;
+   default:
+   return 0;
+   }
+}
+
+static int io_teardown_extensions(struct kiocb *req)
+{
+   struct io_extension_type *iet;
+   int ret, ret2;
+
+   if (req->ki_ioext == NULL)
+   return 0;
+
+   /* Shut down all the extensions */
+   ret = 0;
+   for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
+   if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
+   continue;
+   ret2 = iet->destroy_fn(req);
+   if (ret2 && !ret)
+   ret = ret2;
+   }
+
+   /* Copy out return values */
+   if (unlikely(copy_to_user(req->ki_ioext->ke_user,
+ &req->ki_ioext->ke_kern,
+ sizeof(struct io_extension)))) {
+   if (!ret)
+   ret = -EFAULT;
+   }
+
+   kfree(req->ki_ioext);
+   req->ki_ioext = NULL;
+   return ret;
+}
+
+static int io_check_bufsize(__u64 has, __u64 size)
+{
+   struct io_extension_type *iet;
+   __u64 all_flags = 0;
+
+   for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
+   all_flags |= iet->type;
+   if (!(has & iet->type))
+   continue;
+   if (iet->extension_struct_size > size)
+   return -EINVAL;
+   }
+
+   if (has & ~all_flags)
+   return -EINVAL;
+
+   return 0;
+}
+
+static int io_setup_extensions(struct kiocb *req, int is_write,
+  struct io_extension __user *ioext)
+{
+   struct io_extension_type *iet;
+   __u64 sz, has;
+   int ret;
+
+   /* Check size of buffer */
+   if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz))))
+   return -EFAULT;
+   if (sz > PAGE_SIZE ||
+   sz > sizeof(struct io_extension) ||
+   sz < IO_EXT_SIZE(ie_has))
+   return -EINVAL;
+
+   /* Check that the buffer's big enough */
+   if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has))))
+   return -EFAULT;
+   ret = io_check_bufsize(has, sz);
+   if (ret)
+   return ret;
+
+   /* Copy from userland */
+   req->ki_

[RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO

2014-03-24 Thread Darrick J. Wong
This RFC provides a rough implementation of a mechanism to allow
userspace to attach protection information (e.g. T10 DIF) data to a
disk write and to receive the information alongside a disk read.
There's a new "IO extension" interface wherein we define a structure
(per zab's comments on the v2 series) io_extension that points to the
PI data buffer.  These patches are against 3.14-rc7.

NOTE: As far as I know this works, but this is just a refresh of last
week's patchset to start the discussion at LSF, which was moved up to
today.  I've not done rigorous testing, hence the 'donotmerge'.

The first patch is a little bit of code refactoring, as sent in by Gu
Zheng.  It seems to be queued up for 3.15, so I figured I might as well
start from there.

Patch #2 implements a generic IO extension interface so that we can
receive a struct io_extension from userspace containing the structure
size, a flag telling us which extensions we'd like to use (ie_has),
and (eventually) extension data.  There's a small framework for
mapping ie_has bits to actual extensions.
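
As a rough userspace illustration of that size check (a sketch only, not
the patch code): the struct layout is copied from the sample program
further down, while ext_table, check_ext_size, and the choice of
ie_pi_flags as the last field required by IO_EXT_PI are assumptions made
for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as in the sample program; fixed-width types stand in for __u64/__u32. */
struct io_extension {
	uint64_t ie_size;
	uint64_t ie_has;
	/* PI stuff */
	uint64_t ie_pi_buf;
	uint32_t ie_pi_buflen;
	uint32_t ie_pi_ret;
	uint32_t ie_pi_flags;
};

#define IO_EXT_INVALID	(0)
#define IO_EXT_PI	(1)

/* Bytes needed for the struct to contain 'member' in full. */
#define IO_EXT_SIZE(member) \
	(offsetof(struct io_extension, member) + \
	 sizeof(((struct io_extension *)NULL)->member))

/* The size/has header has to sit at the front for the check to make sense. */
static_assert(offsetof(struct io_extension, ie_has) == sizeof(uint64_t),
	      "ie_has must directly follow ie_size");

/* Hypothetical table: each ie_has bit and the struct size it requires. */
struct ext_type {
	uint64_t type;
	size_t min_size;
};

static const struct ext_type ext_table[] = {
	{ IO_EXT_PI, IO_EXT_SIZE(ie_pi_flags) },	/* assumed last PI field */
	{ IO_EXT_INVALID, 0 },
};

/* Returns 0 if 'size' covers every requested extension, -EINVAL otherwise. */
static int check_ext_size(uint64_t has, uint64_t size)
{
	const struct ext_type *t;
	uint64_t known = 0;

	for (t = ext_table; t->type != IO_EXT_INVALID; t++) {
		known |= t->type;
		if ((has & t->type) && t->min_size > size)
			return -EINVAL;
	}
	/* Bits that no table entry claims are rejected as well. */
	return (has & ~known) ? -EINVAL : 0;
}
```

A caller that sets IO_EXT_PI but passes an ie_size that stops short of
the last PI field gets -EINVAL, as does one that sets an ie_has bit the
table does not know about.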

Patch #3 provides the plumbing to get the user's buffer all the way to
the block integrity code.  Due to the way that the code deals with the
array of struct page*s that represent the PI buffer, there's an
unfortunate requirement that no PI tuple may cross a page boundary.
Given that so far DIF is only 8 or 16 bytes this isn't a problem yet.
There's also no explicit fallback for the case where the user pages
are not within a device's DMA range.  This patch hooks into the IO
extension interface.
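
To make the page-boundary restriction concrete, here is a small
userspace sketch.  It is an illustration under assumptions, not code
from the patch: pi_nr_pages repeats the page-count arithmetic of
setup_pi_ext using division instead of shifts, and pi_tuples_fit_pages
is a hypothetical helper that reports whether any tuple would straddle
two pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of pages spanned by [buf, buf + len). */
static size_t pi_nr_pages(uintptr_t buf, size_t len, size_t page_size)
{
	assert(page_size != 0);
	return (buf + len + page_size - 1) / page_size - buf / page_size;
}

/* True if every tuple_size-byte tuple lies entirely within one page. */
static bool pi_tuples_fit_pages(uintptr_t buf, size_t len,
				size_t tuple_size, size_t page_size)
{
	size_t off;

	assert(page_size != 0);
	if (tuple_size == 0 || len % tuple_size != 0)
		return false;

	for (off = 0; off < len; off += tuple_size) {
		uintptr_t first = buf + off;
		uintptr_t last = first + tuple_size - 1;

		/* First and last byte of the tuple must share a page. */
		if (first / page_size != last / page_size)
			return false;
	}
	return true;
}
```

With 8-byte tuples and 4096-byte pages, any buffer whose start address
is a multiple of 8 passes; a buffer starting at an address such as
0x1004 eventually places a tuple across a page boundary and fails.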

Patch #4 builds on the previous patch to allow userspace to send some
flags along with the PI buffer.  The integrity provider now has a
"mod_user_buf_fn" hook that enables the provider to read the userspace
flags and modify the PI buffer before submit_bio.  For now, this means
that the T10/DIF provider can be told to patch any of the reference, app,
or guard tags.  This is useful for sending PI data with an IO request
for a file on a filesystem, since the kernel can patch in the device's
LBA later.  Also it means that if you only care about, say, app tags,
you can provide those and let the kernel take care of the crc and the
LBA.  I don't know if that's anyone's requirement, but there we are.
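
The flag handling can be pictured with a userspace sketch modelled on
sd_dif_type1_generate from patch #4.  Several things here are
assumptions for illustration only: fill_tuples and sum16 are made-up
names, sum16 is a placeholder byte sum rather than the T10 CRC, and the
tags are stored in host byte order where the kernel code uses
cpu_to_be32.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATE_GUARD	(1)
#define GENERATE_REF	(2)
#define GENERATE_APP	(4)
#define GENERATE_ALL	(7)

/* 16 bit guard tag, 16 bit app tag, 32 bit reference tag. */
struct dif_tuple {
	uint16_t guard_tag;
	uint16_t app_tag;
	uint32_t ref_tag;
};

typedef uint16_t (*csum_fn)(const void *data, size_t len);

/* Placeholder checksum (plain byte sum); NOT the T10 DIF CRC. */
static uint16_t sum16(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint16_t sum = 0;

	while (len--)
		sum += *p++;
	return sum;
}

/* Fill only the tuple fields selected by 'flags'; leave the rest alone. */
static int fill_tuples(struct dif_tuple *t, const uint8_t *data,
		       size_t data_size, size_t sector_size,
		       uint64_t sector, csum_fn fn, int flags)
{
	size_t i;

	assert(sector_size != 0);
	if (flags & ~GENERATE_ALL)
		return -EINVAL;
	if (!flags)
		return -ENOTTY;

	for (i = 0; i < data_size; i += sector_size, t++) {
		if (flags & GENERATE_GUARD)
			t->guard_tag = fn(data + i, sector_size);
		if (flags & GENERATE_REF)
			t->ref_tag = (uint32_t)(sector & 0xffffffff);
		if (flags & GENERATE_APP)
			t->app_tag = 0;
		sector++;
	}
	return 0;
}
```

A program that only cares about app tags would fill those in itself and
pass GENERATE_GUARD | GENERATE_REF, so its app tags survive while the
other two fields are overwritten.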

Patch #5 provides a mechanism for integrity providers to advertise
both the per-logical-block PI buffer size and the flags that can be
passed to the mod_user_buf_fn hook.  The advertisements can be found
in sysfs, since that's where we present all the other PI details about
a device.

Patch #6 removes redundant code and modifies the tag get/set functions
to follow the other new functions and kmap/unmap the PI buffer page(s)
before messing with the PI buffers, instead of relying on pi_buf being
a valid pointer.

Eventually there will be a patch #7 that makes it so that IO
extensions can be piped through the synchronous IO calls, but it was
nowhere near ready when I sent this patchset. :(

Comments and questions are, as always, welcome.  There will be a
session about this on the second day of LSF/MM, if I'm not mistaken.
A sample program follows this message.

$ cc -o prog prog.c
$ ./prog -rw -pr -s 2048 /path/to/pi/device

--D

/*
 * Userspace DIX API test program
 * Licensed under GPLv2. Copyright 2014 Oracle.
 *
 * XXX: We don't query the kernel for this information like we should!
 */
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define GENERATE_GUARD          (1)
#define GENERATE_REF            (2)
#define GENERATE_APP            (4)
#define GENERATE_ALL            (7)

#define NR_IOS                  (1)
#define NR_IOVS                 (2)
#define NR_IOCB_EXTS            (1)

/* Stuff that should go in libaio.h */
#define IO_EXT_INVALID          (0)
#define IO_EXT_PI               (1) /* protection info attached */

#define IOCB_FLAG_EXTENSIONS    (1 << 1)

#define __FIOEXT04000

struct io_extension {
__u64 ie_size;
__u64 ie_has;

/* PI stuff */
__u64 ie_pi_buf;
__u32 ie_pi_buflen;
__u32 ie_pi_ret;
__u32 ie_pi_flags;
};

static void io_prep_extensions(struct iocb *iocb, struct io_extension *ext,
   unsigned int nr)
{
iocb->u.c.flags |= IOCB_FLAG_EXTENSIONS;
iocb->u.c.__pad3 = (long long)ext;
}

static void io_prep_extension(struct io_extension *ext)
{
memset(ext, 0, sizeof(struct io_extension));
ext->ie_size = sizeof(*ext);
}

static void io_prep_extension_pi(struct io_extension *ext, void *buf,
 unsigned int buflen, unsigned int flags)
{
ext->ie_has |= IO_EXT_PI;
ext->ie_pi_buf = (__u64)buf;
ext->ie_pi_buflen = buflen;
ext->ie_pi_flags = flags;
}
/* End stuff for libaio.h */

static void dump_buffer(char *buf, size_t len)
{
size_t off;
char *p;

for (p = buf; p < buf + len; p++) {
off = p - buf;
 

Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-23 Thread Darrick J. Wong
On Sun, Mar 23, 2014 at 03:02:44PM +0100, Jan Kara wrote:
> On Sat 22-03-14 02:43:20, Darrick J. Wong wrote:
> > On Fri, Mar 21, 2014 at 07:32:16PM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 21, 2014 at 05:29:09PM -0700, Zach Brown wrote:
> > > > I'll admit, though, that I don't really like having to fetch the 'has'
> > > > bits first to find out how large the rest of the struct is.  Maybe
> > > > that's not worth worrying about.
> > > 
> > > I'm not worrying about having to pluck 'has' out of the structure, but
> > > needing a function to tell me how big of a buffer I need for a given pile
> > > of flags seems ... icky.  But maybe the ease of modifying strace and
> > > security auditors would make it worth it?
> > 
> > How about explicitly specifying the structure size in struct some_more_args,
> > and checking that against whatever we find in .has?  Hm.  I still think
> > that's too clever for my brain to keep together for long.
> > 
> > I'm also nervous that we could be creating this monster of a structure
> > wherein some user wants to tack the first and last hints ever created onto
> > an IO, so now we have to lug this huge structure around that has space for
> > hints that we're not going to use, and most of which is zeroes.
>   Well, why does it matter that the structure would be big? Or do you
> think the memory consumption would matter?

I doubt the memory consumption will be a big deal (compared to the size of the
IOs), but I'm a little concerned about the overhead of copying a mostly-zeroes
user buffer into the kernel.  I guess it's not a big deal to copy the whole
thing now and if people complain about the overhead, switch it to let the IO
attribute controllers selectively copy_from_user later.

--D
> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR


Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-22 Thread Darrick J. Wong
On Fri, Mar 21, 2014 at 07:32:16PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 21, 2014 at 05:29:09PM -0700, Zach Brown wrote:
> > On Fri, Mar 21, 2014 at 03:54:37PM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
> > >
> > > > I'm inclined to agree with Zach on this item.  Ultimately, we need an 
> > > > extensible data structure that can be grown without completely revising 
> > > > the ABI as new parameters are added.  We need something that is either 
> > > > TLV based, or an extensible array.
> > > 
> > > Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to
> > > indicate that this struct iocb has extensions attached to it.  Then,
> > > iocb.aio_reserved2 becomes a pointer to an array of extension descriptors,
> > > and iocb.aio_reqprio becomes a u16 that tells us the array length.  The
> > > libaio.h equivalents are iocb.u.c.flags, iocb.u.c.__pad3, and
> > > iocb.aio_reqprio, respectively.
> > > 
> > > Next, let's define a conceptual structure for aio extensions:
> > > 
> > > struct iocb_extension {
> > >   void *ie_buf;
> > >   unsigned int ie_buflen;
> > >   unsigned int ie_type;
> > >   unsigned int ie_flags;
> > > };
> > > 
> > > The actual definitions can be defined in a similar fashion to the other
> > > aio structures so that the structures are padded to the same layout
> > > regardless of bitness.  As mentioned above, iocb.aio_reserved2 points to
> > > an array of these.
> > 
> > I'm firmly in the camp that doesn't want to go down this abstract road.
> > We had this conversation with Kent when he wanted to do something very
> > similar.
> 
> Could you point me to this discussion?  I'd like to read it.

Is it "[RFC, PATCH] Extensible AIO interface"?
http://lkml.iu.edu//hypermail/linux/kernel/1210.0/00651.html 

Regrettably that discussion happened right during that period where I was
pleasantly AWOL from work for a few months. :)

Will read ... tomorrow.

> > What happens if there are duplicate ie_types?  Is that universally
> > prohibited, validity left up to the types that are duplicated?
> 
> Yes.
> 
> > What if the len is not the right size?  Who checks that?
> 
> The extension driver, presumably.
> 
> >  What if the extension (they're arguments, but one thing at a time) is
> >  writable and the buf pointers overlap or is unaligned?  Is that cool, who
> >  checks it?
> 
> Each extension driver has to check the alignment.  I don't know what to do
> about buffer pointer overlap; if you want to shoot yourself in the foot that's
> fine with me.
> 
> > Who defines the acceptable set?

(This was an "I don't know", for anyone who cares.)

> 
> >  Can drivers make up their own weird types?
> 
> How do you mean?  As far as whatever's in the ie_buf, I think that depends on
> the extension.
> 
> >  How does strace print all this?  How does the security module universe
> >  declare policies that can forbid or allow these things?
> 
> I don't know.
> 
> > Personally, I think this level of dynamism is not worth the complexity.
> > 
> > Can we instead just have a nice easy struct with fixed members that only
> > grows?
> > 
> > struct some_more_args {
> > u64 has; /* = HAS_PI_VEC; */
> > u64 pi_vec_ptr;
> > u64 pi_vec_nr_segs;
> > };
> > 
> > struct some_more_args {
> > u64 has; /* = HAS_PI_VEC | HAS_MAGIC_THING */
> > u64 pi_vec_ptr;
> > u64 pi_vec_nr_segs;
> > u64 magic_thing;
> > };
> > 
> > If it only grows and has bits indicating presence then I think we're
> > good.   You only fetch the space for the bits that are indicated.  You
> > can return errors for bits you don't recognize.  You could perhaps offer
> > some way to announce the bits you recognize.
> 
>  I was gonna just -EINVAL for types we don't recognize, or which don't
> apply in this scenario.
> 
> > I'll admit, though, that I don't really like having to fetch the 'has'
> > bits first to find out how large the rest of the struct is.  Maybe
> > that's not worth worrying about.
> 
> I'm not worrying about having to pluck 'has' out of the structure, but needing
> a function to tell me how big of a buffer I need for a given pile of flags
> seems ... i

Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Darrick J. Wong
On Fri, Mar 21, 2014 at 05:29:09PM -0700, Zach Brown wrote:
> On Fri, Mar 21, 2014 at 03:54:37PM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
> >
> > > I'm inclined to agree with Zach on this item.  Ultimately, we need an 
> > > extensible data structure that can be grown without completely revising 
> > > the ABI as new parameters are added.  We need something that is either 
> > > TLV based, or an extensible array.
> > 
> > Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to indicate
> > that this struct iocb has extensions attached to it.  Then,
> > iocb.aio_reserved2 becomes a pointer to an array of extension descriptors,
> > and iocb.aio_reqprio becomes a u16 that tells us the array length.  The
> > libaio.h equivalents are iocb.u.c.flags, iocb.u.c.__pad3, and
> > iocb.aio_reqprio, respectively.
> > 
> > Next, let's define a conceptual structure for aio extensions:
> > 
> > struct iocb_extension {
> > void *ie_buf;
> > unsigned int ie_buflen;
> > unsigned int ie_type;
> > unsigned int ie_flags;
> > };
> > 
> > The actual definitions can be defined in a similar fashion to the other aio
> > structures so that the structures are padded to the same layout regardless
> > of bitness.  As mentioned above, iocb.aio_reserved2 points to an array of
> > these.
> 
> I'm firmly in the camp that doesn't want to go down this abstract road.
> We had this conversation with Kent when he wanted to do something very
> similar.

Could you point me to this discussion?  I'd like to read it.

> What happens if there are duplicate ie_types?  Is that universally
> prohibited, validity left up to the types that are duplicated?

Yes.

> What if the len is not the right size?  Who checks that?

The extension driver, presumably.

>  What if the extension (they're arguments, but one thing at a time) is
>  writable and the buf pointers overlap or is unaligned?  Is that cool, who
>  checks it?

Each extension driver has to check the alignment.  I don't know what to do
about buffer pointer overlap; if you want to shoot yourself in the foot that's
fine with me.

> Who defines the acceptable set?


>  Can drivers make up their own weird types?

How do you mean?  As far as whatever's in the ie_buf, I think that depends on
the extension.

>  How does strace print all this?  How does the security module universe
>  declare policies that can forbid or allow these things?

I don't know.

> Personally, I think this level of dynamism is not worth the complexity.
> 
> Can we instead just have a nice easy struct with fixed members that only
> grows?
> 
> struct some_more_args {
>   u64 has; /* = HAS_PI_VEC; */
>   u64 pi_vec_ptr;
>   u64 pi_vec_nr_segs;
> };
> 
> struct some_more_args {
>   u64 has; /* = HAS_PI_VEC | HAS_MAGIC_THING */
>   u64 pi_vec_ptr;
>   u64 pi_vec_nr_segs;
>   u64 magic_thing;
> };
> 
> If it only grows and has bits indicating presence then I think we're
> good.   You only fetch the space for the bits that are indicated.  You
> can return errors for bits you don't recognize.  You could perhaps offer
> some way to announce the bits you recognize.

 I was gonna just -EINVAL for types we don't recognize, or which don't
apply in this scenario.

> I'll admit, though, that I don't really like having to fetch the 'has'
> bits first to find out how large the rest of the struct is.  Maybe
> that's not worth worrying about.

I'm not worrying about having to pluck 'has' out of the structure, but needing
a function to tell me how big of a buffer I need for a given pile of flags
seems ... icky.  But maybe the ease of modifying strace and security auditors
would make it worth it?

> Thoughts?  Am I out to lunch here?

I don't have a problem adopting your design, aside from the complications of
figuring out how big struct some_more_args really is.
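
For the sake of discussion, a sizing helper for the grow-only struct
might look roughly like the sketch below.  This is only an illustration
under assumptions: the HAS_* bit values, the END_OF macro, and
some_more_args_size are invented here, and the struct is the second
variant quoted above with fixed-width types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Second variant of the struct from the discussion, fixed-width types. */
struct some_more_args {
	uint64_t has;
	uint64_t pi_vec_ptr;
	uint64_t pi_vec_nr_segs;
	uint64_t magic_thing;
};

/* Hypothetical bit assignments; the thread does not define the values. */
#define HAS_PI_VEC	(1ULL << 0)
#define HAS_MAGIC_THING	(1ULL << 1)

/* Offset of the first byte past 'member'. */
#define END_OF(member) \
	(offsetof(struct some_more_args, member) + \
	 sizeof(((struct some_more_args *)NULL)->member))

static_assert(offsetof(struct some_more_args, has) == 0,
	      "'has' must come first so it can be fetched on its own");

/*
 * Bytes to copy from userspace for a given 'has' word, or 0 if it
 * carries bits this code does not recognize.  Members only ever get
 * appended, so later checks always yield larger sizes.
 */
static size_t some_more_args_size(uint64_t has)
{
	size_t need = END_OF(has);

	if (has & ~(HAS_PI_VEC | HAS_MAGIC_THING))
		return 0;
	if (has & HAS_PI_VEC)
		need = END_OF(pi_vec_nr_segs);
	if (has & HAS_MAGIC_THING)
		need = END_OF(magic_thing);
	return need;
}
```

The kernel side would fetch 'has' first, call the helper, and then copy
only that many bytes; a zero return maps naturally onto -EINVAL.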

> > Question: Do we want to allow ie_buf to be struct iovec[]?  Can we leave
> > that to the extension designer to decide if they want to support either a
> > S-G list, one big (vaddr) buffer, or toggle flags?
> 
> No idea.  Either seems doable.  I'd aim for simpler to reduce the number
> of weird cases to handle or forbid (iovecs with a byte per page!) unless
> Martin thinks people want to vector the PI goo.

For now I'll leave it as a simple buffer until I hear otherwise.

> > I think so.  Let's see how much we can get done.
> 
> FWIW, I'm happy to chat about this in person at LSF next week.  I'll be
> around.

Me too!

--D
> 
> - z


Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Darrick J. Wong
On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
> Hi folks,
> 
> On Fri, Mar 21, 2014 at 11:23:32AM -0700, Zach Brown wrote:
> > On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> > > This RFC provides a rough implementation of a mechanism to allow
> > > userspace to attach protection information (e.g. T10 DIF) data to a
> > > disk write and to receive the information alongside a disk read.  The
> > > interface is an extension to the AIO interface: two new commands
> > > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > > arg list is interpreted to point to a buffer containing a header,
> > > followed by the PI data.
> > 
> > Instead of adding commands that indicate that the final element is a
> > magical pi buffer, why not expand the iocb?
> > 
> > In the user iocb, a bit in aio_flags could indicate that aio_reserved2
> > is a pointer to an extension of the iocb.  In that extension could be a
> > full iov *, nr_segs for PI data.
> 
> I'm inclined to agree with Zach on this item.  Ultimately, we need an 
> extensible data structure that can be grown without completely revising 
> the ABI as new parameters are added.  We need something that is either 
> TLV based, or an extensible array.

Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to indicate
that this struct iocb has extensions attached to it.  Then, iocb.aio_reserved2
becomes a pointer to an array of extension descriptors, and iocb.aio_reqprio
becomes a u16 that tells us the array length.  The libaio.h equivalents are
iocb.u.c.flags, iocb.u.c.__pad3, and iocb.aio_reqprio, respectively.

Next, let's define a conceptual structure for aio extensions:

struct iocb_extension {
	void *ie_buf;
	unsigned int ie_buflen;
	unsigned int ie_type;
	unsigned int ie_flags;
};

The actual structures can be defined in a similar fashion to the other aio
structures so that they are padded to the same layout regardless of
bitness.  As mentioned above, iocb.aio_reserved2 points to an array of these.
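
As a rough sketch of what "the same layout regardless of bitness" might look
like, the descriptor could carry the buffer address as a 64-bit integer, much
as struct iocb carries aio_buf.  Only the ie_* field names and IOCB_EXT_PI come
from this thread; the struct name, the ie_pad field, the type value, and the
find_ext() helper are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value; the thread only names IOCB_EXT_PI. */
#define IOCB_EXT_PI	1

/*
 * Hypothetical fixed-layout form of the conceptual struct.  The buffer
 * pointer is stored as a 64-bit integer so that 32-bit and 64-bit
 * userspace pass identical bytes.
 */
struct iocb_extension_fixed {
	uint64_t ie_buf;	/* user address of the extension payload */
	uint32_t ie_buflen;	/* payload length in bytes */
	uint32_t ie_type;	/* e.g. IOCB_EXT_PI */
	uint32_t ie_flags;	/* meaning depends on ie_type */
	uint32_t ie_pad;	/* explicit padding, keeps the size at 24 */
};

/* Return the first descriptor of the given type, or NULL if absent. */
static const struct iocb_extension_fixed *
find_ext(const struct iocb_extension_fixed *exts, uint16_t nr, uint32_t type)
{
	uint16_t i;

	assert(exts != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (exts[i].ie_type == type)
			return &exts[i];
	return NULL;
}
```

The 16-bit count mirrors the idea of reusing aio_reqprio as the array length.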

Question: Do we want to allow ie_buf to be struct iovec[]?  Can we leave that
to the extension designer to decide if they want to support either a S-G list,
one big (vaddr) buffer, or toggle flags?

For the PI passthrough, I'll define IOCB_EXT_PI as the first ie_type, and move
the flags argument out of the PI buffer and into ie_flags.

I could also make an IOCB_EXT_REQPRIO where ie_flags = reqprio, but since the
kernel ignores it right now, I don't see much point.

> > You'd then translate that into a bigger kernel kiocb with a specific
> > pointer to PI data rather than having to bubble the tests for this magic
> > final iovec down through the kernel.
> > 
> > +   if (iocb->ki_flags & KIOCB_USE_PI) {
> > +   nr_segs--;
> > +   pi_iov = (struct iovec *)(iov + nr_segs);
> > +   }
> > 
> > I suggest this because there's already pressure to extend the iocb.
> > Folks want io priority inputs, completion time outputs, etc.
> 
> There are already folks at other companies looking at similar extensions.  
> I think there are folks at Google who have similar requirements.

To everyone else interested in AIO extensions: I'd love to hear your ideas.

> Do you have time to put in some effort into defining these extensions?

I think so.  Let's see how much we can get done.

--D
> 
>   -ben
> 
> > It's a much cleaner way to extend the interface without an explosion of
> > command enums that are really combinations of per-io arguments that are
> > present or not.
> > 
> > And heck, on the sync rw syscall side, add variants that have a pointer
> > to this same extension struct.  There's nothing inherently aio specific
> > about having lots more per-io inputs and outputs.
> > 
> > - z
> 
> -- 
> "Thought is the essence of where you are now."


Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Darrick J. Wong
On Fri, Mar 21, 2014 at 11:23:32AM -0700, Zach Brown wrote:
> On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> > This RFC provides a rough implementation of a mechanism to allow
> > userspace to attach protection information (e.g. T10 DIF) data to a
> > disk write and to receive the information alongside a disk read.  The
> > interface is an extension to the AIO interface: two new commands
> > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > arg list is interpreted to point to a buffer containing a header,
> > followed by the PI data.
> 
> Instead of adding commands that indicate that the final element is a
> magical pi buffer, why not expand the iocb?
> 
> In the user iocb, a bit in aio_flags could indicate that aio_reserved2
> is a pointer to an extension of the iocb.  In that extension could be a
> full iov *, nr_segs for PI data.
> 
> You'd then translate that into a bigger kernel kiocb with a specific
> pointer to PI data rather than having to bubble the tests for this magic
> final iovec down through the kernel.
> 
> +   if (iocb->ki_flags & KIOCB_USE_PI) {
> +   nr_segs--;
> +   pi_iov = (struct iovec *)(iov + nr_segs);
> +   }
> 
> I suggest this because there's already pressure to extend the iocb.
> Folks want io priority inputs, completion time outputs, etc.

I'm curious about the reqprio field -- it seems like it was put there to
request some kind of IO priority change, but the kernel doesn't use it.

If aio_reserved2 becomes a (flag-guarded) pointer to an array of aio
extensions, I'd be tempted to reuse the reqprio to signal the length of the
extension array, and if anyone wants to start using reqprio, they could add it
as an extension.

(More about this in my response to Ben LaHaise.)

> It's a much cleaner way to extend the interface without an explosion of
> command enums that are really combinations of per-io arguments that are
> present or not.

Agreed.

> And heck, on the sync rw syscall side, add variants that have a pointer
> to this same extension struct.  There's nothing inherently aio specific
> about having lots more per-io inputs and outputs.

I'm curious -- what kinds of extensions do you envision for sync()?

--D
> 
> - z


Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Darrick J. Wong
On Fri, Mar 21, 2014 at 10:57:31AM -0400, Jeff Moyer wrote:
> "Darrick J. Wong"  writes:
> 
> > This RFC provides a rough implementation of a mechanism to allow
> > userspace to attach protection information (e.g. T10 DIF) data to a
> > disk write and to receive the information alongside a disk read.  The
> > interface is an extension to the AIO interface: two new commands
> > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> 
> Sorry for the shallow question, but what does that M stand for?

Hmmm... I really don't remember why I picked 'M'.  Probably because it implied
that the IO has extra 'M'etadata associated with it.

But now I see, 'VM' connotes something entirely wrong.

--D
> 
> Cheers,
> Jeff


[PATCH 3/5] aio/dio: allow user to ask kernel to fill in parts of the protection info

2014-03-20 Thread Darrick J. Wong
Since userspace can now pass PI buffers through to the block integrity
provider, provide a means for userspace to specify a flags argument
with the PI buffer.  The initial user for this will be sd_dif, which
will enable user programs to ask the kernel to fill in whichever
fields they don't want to provide.  This is intended, for example, to
satisfy programs that really only care to provide an app tag.

Signed-off-by: Darrick J. Wong 
---
 Documentation/block/data-integrity.txt |   11 
 block/blk-integrity.c  |1 
 drivers/scsi/sd_dif.c  |   76 
 fs/bio-integrity.c |   87 +++-
 fs/direct-io.c |   15 ++
 include/linux/bio.h|3 +
 include/linux/blkdev.h |2 +
 7 files changed, 178 insertions(+), 17 deletions(-)


diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index 1d1f070..b72a54f 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -292,7 +292,10 @@ will require extra work due to the application tag.
 
   The bio_integrity_prep_iter should contain the page offset and buffer
   length of the PI buffer, the number of pages, and the actual array of
-  pages, as returned by get_user_pages.
+  pages, as returned by get_user_pages.  The user_flags argument should
+  contain whatever flag values were passed in by userspace; the values
+  of the flags are specific to the block integrity provider, and are
+  passed to the mod_user_buf_fn handler.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
 METADATA
@@ -332,6 +335,12 @@ will require extra work due to the application tag.
   are available per hardware sector.  For DIF this is either 2 or
   0 depending on the value of the Control Mode Page ATO bit.
 
+  'mod_user_buf_fn' updates the appropriate integrity metadata for
+  a WRITE operation.  This function is called when userspace passes
+  in a PI buffer along with file data; the flags argument (which is
+  specific to the blk_integrity provider) arrange for pre-processing
+  of the user buffer prior to issuing the IO.
+
   See 6.2 for a description of get_tag_fn and set_tag_fn.
 
 --
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 7fbab84..1cb1eb2 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -421,6 +421,7 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
bi->set_tag_fn = template->set_tag_fn;
bi->get_tag_fn = template->get_tag_fn;
bi->tag_size = template->tag_size;
+   bi->mod_user_buf_fn = template->mod_user_buf_fn;
} else
bi->name = bi_unsupported_name;
 
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index a7a691d..74182c9 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -53,31 +53,58 @@ static __u16 sd_dif_ip_fn(void *data, unsigned int len)
  * Type 1 and Type 2 protection use the same format: 16 bit guard tag,
  * 16 bit app tag, 32 bit reference tag.
  */
-static void sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn)
+#define GENERATE_GUARD (1)
+#define GENERATE_REF   (2)
+#define GENERATE_APP   (4)
+#define GENERATE_ALL   (7)
+static int sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn,
+int flags)
 {
void *buf = bix->data_buf;
struct sd_dif_tuple *sdt = bix->prot_buf;
sector_t sector = bix->sector;
unsigned int i;
 
+   if (flags & ~GENERATE_ALL)
+   return -EINVAL;
+   if (!flags)
+   return -ENOTTY;
+
for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
-   sdt->guard_tag = fn(buf, bix->sector_size);
-   sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
-   sdt->app_tag = 0;
+   if (flags & GENERATE_GUARD)
+   sdt->guard_tag = fn(buf, bix->sector_size);
+   if (flags & GENERATE_REF)
+   sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
+   if (flags & GENERATE_APP)
+   sdt->app_tag = 0;
 
buf += bix->sector_size;
sector++;
}
+
+   return 0;
 }
 
 static void sd_dif_type1_generate_crc(struct blk_integrity_exchg *bix)
 {
-   sd_dif_type1_generate(bix, sd_dif_crc_fn);
+   sd_dif_type1_generate(bix, sd_dif_crc_fn, GENERATE_ALL);
 }
 
 static void sd_dif_type1_generate_ip(struct blk_integrity_exchg *bix)
 {
-   sd_dif_type1_generate(bix, sd_dif_ip_fn

[PATCH 4/5] aio/dio: advertise possible userspace flags

2014-03-20 Thread Darrick J. Wong
Expose possible userland flags to the new AIO/DIO PI interface so that
userspace can discover what flags exist.

Signed-off-by: Darrick J. Wong 
---
 Documentation/ABI/testing/sysfs-block  |   14 ++
 Documentation/block/data-integrity.txt |   26 +
 block/blk-integrity.c  |   33 
 drivers/scsi/sd_dif.c  |   11 +++
 include/linux/blkdev.h |7 +++
 5 files changed, 91 insertions(+)


diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 279da08..989cb80 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -53,6 +53,20 @@ Description:
512 bytes of data.
 
 
+What:  /sys/block/<disk>/integrity/tuple_size
+Date:  March 2014
+Contact:   Darrick J. Wong 
+Description:
+   Size in bytes of the integrity data buffer for each logical
+   block.
+
+What:  /sys/block/<disk>/integrity/write_user_flags
+Date:  March 2014
+Contact:   Darrick J. Wong 
+Description:
+   Provides a list of flags that userspace can pass to the kernel
+   when supplying integrity data for a write IO.
+
 What:  /sys/block/<disk>/integrity/write_generate
 Date:  June 2008
 Contact:   Martin K. Petersen 
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index b72a54f..38a83a7 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -341,7 +341,33 @@ will require extra work due to the application tag.
   specific to the blk_integrity provider) arrange for pre-processing
   of the user buffer prior to issuing the IO.
 
+  'user_write_flags' points to an array of struct blk_integrity_flag,
+  which maps mod_user_buf_fn flags to a description of what they do.
+
   See 6.2 for a description of get_tag_fn and set_tag_fn.
 
+5.5 PASSING INTEGRITY DATA FROM USERSPACE
+
+The AIO/DIO interface has been extended with a new API to provide
+userspace programs the ability to provide PI data with a WRITE, or
+to receive PI data with a READ.  There are two new AIO commands,
+IOCB_CMD_PREADVM and IOCB_CMD_PWRITEVM.  They have the same general
+struct iocb format as IOCB_CMD_PREADV and IOCB_CMD_PWRITEV, respectively.
+The final struct iovec should point to the buffer that contains the
+PI data.
+
+This buffer must be aligned to a page boundary, and it must have the
+following format: Flags are stored in a 32-bit integer.  There must
+then be padding out to the next multiple of the tuple size.  After
+that comes the tuple data.  Valid flag values can be found in
+/sys/block/*/integrity/write_user_flags.  The tuple size can be found
+in /sys/block/*/integrity/tuple_size.  Tuples must not split a page
+boundary.
+
+In general, the flags allow the user program to ask the in-kernel
+integrity provider to fill in some parts of the tuples.  For example,
+the T10 DIF provider can fill in the reference tag (sector number) so
+that userspace can choose not to care about the reference tag.
+
 --
 2007-12-24 Martin K. Petersen 
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 1cb1eb2..557d28e 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -307,6 +307,26 @@ static ssize_t integrity_write_show(struct blk_integrity *bi, char *page)
return sprintf(page, "%d\n", (bi->flags & INTEGRITY_FLAG_WRITE) != 0);
 }
 
+static ssize_t integrity_write_flags_show(struct blk_integrity *bi, char *page)
+{
+   struct blk_integrity_flag *flag = bi->user_write_flags;
+   char *p = page;
+   ssize_t ret = 0;
+
+   while (flag->value) {
+   ret += snprintf(p, PAGE_SIZE - ret, "0x%x: %s\n",
+   flag->value, flag->descr);
+   p = page + ret;
+   flag++;
+   }
+   return ret;
+}
+
+static ssize_t integrity_tuple_size_show(struct blk_integrity *bi, char *page)
+{
+   return sprintf(page, "%d\n", bi->tuple_size);
+}
+
 static struct integrity_sysfs_entry integrity_format_entry = {
.attr = { .name = "format", .mode = S_IRUGO },
.show = integrity_format_show,
@@ -329,11 +349,23 @@ static struct integrity_sysfs_entry integrity_write_entry = {
.store = integrity_write_store,
 };
 
+static struct integrity_sysfs_entry integrity_write_flags_entry = {
+   .attr = { .name = "write_user_flags", .mode = S_IRUGO },
+   .show = integrity_write_flags_show,
+};
+
+static struct integrity_sysfs_entry integrity_tuple_size_entry = {
+   .attr = { .name = "tuple_size", .mode = S_IRUGO }

[PATCH 5/5] blk-integrity: refactor various routines

2014-03-20 Thread Darrick J. Wong
Refactor blk-integrity.c to avoid duplicating similar functions, and
remove all users of pi_buf, since it's really only there to handle the
(common) case where the kernel auto-generates all the PI data.

Signed-off-by: Darrick J. Wong 
---
 fs/bio-integrity.c  |  120 +--
 include/linux/bio.h |2 -
 2 files changed, 49 insertions(+), 73 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 381ee38..3ff1572 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -97,8 +97,7 @@ void bio_integrity_free(struct bio *bio)
struct bio_integrity_payload *bip = bio->bi_integrity;
struct bio_set *bs = bio->bi_pool;
 
-   if (bip->bip_owns_buf)
-   kfree(bip->bip_buf);
+   kfree(bip->bip_buf);
 
if (bs) {
if (bip->bip_slab != BIO_POOL_NONE)
@@ -239,9 +238,11 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 {
struct bio_integrity_payload *bip = bio->bi_integrity;
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-   unsigned int nr_sectors;
-
-   BUG_ON(bip->bip_buf == NULL);
+   unsigned int nr_sectors, tag_offset, sectors;
+   void *prot_buf;
+   unsigned int prot_offset, prot_len;
+   struct bio_vec *iv;
+   void (*tag_fn)(void *buf, void *tag_buf, unsigned int);
 
if (bi->tag_size == 0)
return -1;
@@ -255,10 +256,30 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
return -1;
}
 
-   if (set)
-   bi->set_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
-   else
-   bi->get_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
+   iv = bip->bip_vec;
+   prot_offset = iv->bv_offset;
+   prot_len = iv->bv_len;
+   prot_buf = kmap_atomic(iv->bv_page);
+   tag_fn = set ? bi->set_tag_fn : bi->get_tag_fn;
+   tag_offset = 0;
+
+   while (nr_sectors) {
+   if (prot_len < bi->tuple_size) {
+   kunmap_atomic(prot_buf);
+   iv++;
+   BUG_ON(iv >= bip->bip_vec + bip->bip_vcnt);
+   prot_offset = iv->bv_offset;
+   prot_len = iv->bv_len;
+   prot_buf = kmap_atomic(iv->bv_page);
+   }
+   sectors = min(prot_len / bi->tuple_size, nr_sectors);
+   tag_fn(prot_buf + prot_offset, tag_buf + tag_offset, sectors);
+   nr_sectors -= sectors;
+   tag_offset += sectors * bi->tuple_size;
+   prot_offset += sectors * bi->tuple_size;
+   prot_len -= sectors * bi->tuple_size;
+   }
+   kunmap_atomic(prot_buf);
 
return 0;
 }
@@ -300,28 +321,24 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 }
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
-/**
- * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
- * @bio:   bio to generate/verify integrity metadata for
- */
-int bio_integrity_update_user_buffer(struct bio *bio)
+typedef int (walk_buf_fn)(struct blk_integrity_exchg *bi, int flags);
+
+static int bio_integrity_walk_bufs(struct bio *bio, sector_t sector,
+  walk_buf_fn *mod_fn)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
struct blk_integrity_exchg bix;
struct bio_vec bv;
struct bvec_iter iter;
-   sector_t sector;
unsigned int sectors, total, ret;
void *prot_buf;
unsigned int prot_offset, prot_len, bv_offset, bv_len;
struct bio_vec *iv;
struct bio_integrity_payload *bip = bio->bi_integrity;
 
-   if (!bi->mod_user_buf_fn)
+   if (!mod_fn)
return 0;
 
-   sector = bio->bi_iter.bi_sector;
-
total = ret = 0;
bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
bix.sector_size = bi->sector_size;
@@ -351,7 +368,7 @@ int bio_integrity_update_user_buffer(struct bio *bio)
bix.prot_buf = prot_buf + prot_offset;
bix.sector = sector;
 
-   ret = bi->mod_user_buf_fn(&bix, bip->bip_user_flags);
+   ret = mod_fn(&bix, bip->bip_user_flags);
if (ret) {
if (ret == -ENOTTY)
ret = 0;
@@ -374,59 +391,19 @@ int bio_integrity_update_user_buffer(struct bio *bio)
kunmap_atomic(prot_buf);
return ret;
 }
-EXPORT_SYMBOL_GPL(bio_integrity_update_user_buffer);
 
 /**
- * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
  * 

[PATCH 1/5] fs/bio-integrity: remove duplicate code

2014-03-20 Thread Darrick J. Wong
From: Gu Zheng 

Most of the code in bio_integrity_verify and bio_integrity_generate
is the same, so introduce a helper function, bio_integrity_generate_verify(),
to remove the duplication.

Signed-off-by: Gu Zheng 
---
 fs/bio-integrity.c |   83 +++-
 1 file changed, 37 insertions(+), 46 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 4f70f38..413312f 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -301,25 +301,26 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
 /**
- * bio_integrity_generate - Generate integrity metadata for a bio
- * @bio:   bio to generate integrity metadata for
- *
- * Description: Generates integrity metadata for a bio by calling the
- * block device's generation callback function.  The bio must have a
- * bip attached with enough room to accommodate the generated
- * integrity metadata.
+ * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * @bio:   bio to generate/verify integrity metadata for
+ * @operate:   operate number, 1 for generate, 0 for verify
  */
-static void bio_integrity_generate(struct bio *bio)
+static int bio_integrity_generate_verify(struct bio *bio, int operate)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
struct blk_integrity_exchg bix;
struct bio_vec bv;
struct bvec_iter iter;
-   sector_t sector = bio->bi_iter.bi_sector;
-   unsigned int sectors, total;
+   sector_t sector;
+   unsigned int sectors, total, ret;
void *prot_buf = bio->bi_integrity->bip_buf;
 
-   total = 0;
+   if (operate)
+   sector = bio->bi_iter.bi_sector;
+   else
+   sector = bio->bi_integrity->bip_iter.bi_sector;
+
+   total = ret = 0;
bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
bix.sector_size = bi->sector_size;
 
@@ -330,7 +331,15 @@ static void bio_integrity_generate(struct bio *bio)
bix.prot_buf = prot_buf;
bix.sector = sector;
 
-   bi->generate_fn(&bix);
+   if (operate) {
+   bi->generate_fn(&bix);
+   } else {
+   ret = bi->verify_fn(&bix);
+   if (ret) {
+   kunmap_atomic(kaddr);
+   return ret;
+   }
+   }
 
sectors = bv.bv_len / bi->sector_size;
sector += sectors;
@@ -340,6 +349,21 @@ static void bio_integrity_generate(struct bio *bio)
 
kunmap_atomic(kaddr);
}
+   return ret;
+}
+
+/**
+ * bio_integrity_generate - Generate integrity metadata for a bio
+ * @bio:   bio to generate integrity metadata for
+ *
+ * Description: Generates integrity metadata for a bio by calling the
+ * block device's generation callback function.  The bio must have a
+ * bip attached with enough room to accommodate the generated
+ * integrity metadata.
+ */
+static void bio_integrity_generate(struct bio *bio)
+{
+   bio_integrity_generate_verify(bio, 1);
 }
 
 static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
@@ -454,40 +478,7 @@ EXPORT_SYMBOL(bio_integrity_prep);
  */
 static int bio_integrity_verify(struct bio *bio)
 {
-   struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-   struct blk_integrity_exchg bix;
-   struct bio_vec *bv;
-   sector_t sector = bio->bi_integrity->bip_iter.bi_sector;
-   unsigned int sectors, ret = 0;
-   void *prot_buf = bio->bi_integrity->bip_buf;
-   int i;
-
-   bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
-   bix.sector_size = bi->sector_size;
-
-   bio_for_each_segment_all(bv, bio, i) {
-   void *kaddr = kmap_atomic(bv->bv_page);
-
-   bix.data_buf = kaddr + bv->bv_offset;
-   bix.data_size = bv->bv_len;
-   bix.prot_buf = prot_buf;
-   bix.sector = sector;
-
-   ret = bi->verify_fn(&bix);
-
-   if (ret) {
-   kunmap_atomic(kaddr);
-   return ret;
-   }
-
-   sectors = bv->bv_len / bi->sector_size;
-   sector += sectors;
-   prot_buf += sectors * bi->tuple_size;
-
-   kunmap_atomic(kaddr);
-   }
-
-   return ret;
+   return bio_integrity_generate_verify(bio, 0);
 }
 
 /**



[PATCH 2/5] aio/dio: enable DIX passthrough

2014-03-20 Thread Darrick J. Wong
Provide a set of new AIO commands (IOCB_CMD_P{READ,WRITE}VM) that
utilize the last iovec of the iovec array to convey protection
information to and from userspace.

Signed-off-by: Darrick J. Wong 
---
 Documentation/block/data-integrity.txt |   11 ++
 fs/aio.c   |   22 
 fs/bio-integrity.c |   93 +++
 fs/direct-io.c |  157 +---
 include/linux/aio.h|3 +
 include/linux/bio.h|   15 +++
 include/uapi/linux/aio_abi.h   |2 
 mm/filemap.c   |7 +
 8 files changed, 294 insertions(+), 16 deletions(-)


diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index 2d735b0a..1d1f070 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -282,6 +282,17 @@ will require extra work due to the application tag.
   It is up to the receiver to process them and verify data
   integrity upon completion.
 
+int bio_integrity_prep_buffer(struct bio *bio, int rw,
+ struct bio_integrity_prep_iter *pi);
+
+  This function should be called before submit_bio; its purpose is to
+  attach an arbitrary array of struct page * containing integrity data
+  to an existing bio.  Primarily this is intended for AIO/DIO to be
+  able to attach a userspace buffer to a bio.
+
+  The bio_integrity_prep_iter should contain the page offset and buffer
+  length of the PI buffer, the number of pages, and the actual array of
+  pages, as returned by get_user_pages.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
 METADATA
diff --git a/fs/aio.c b/fs/aio.c
index 062a5f6..5d425d8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1259,6 +1259,11 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
struct iovec inline_vec, *iovec = &inline_vec;
 
switch (opcode) {
+   case IOCB_CMD_PREADVM:
+   if (!(file->f_flags & O_DIRECT))
+   return -EINVAL;
+   req->ki_flags |= KIOCB_USE_PI;
+
case IOCB_CMD_PREAD:
case IOCB_CMD_PREADV:
mode= FMODE_READ;
@@ -1266,6 +1271,11 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
rw_op   = file->f_op->aio_read;
goto rw_common;
 
+   case IOCB_CMD_PWRITEVM:
+   if (!(file->f_flags & O_DIRECT))
+   return -EINVAL;
+   req->ki_flags |= KIOCB_USE_PI;
+
case IOCB_CMD_PWRITE:
case IOCB_CMD_PWRITEV:
mode= FMODE_WRITE;
@@ -1280,7 +1290,9 @@ rw_common:
return -EINVAL;
 
ret = (opcode == IOCB_CMD_PREADV ||
-  opcode == IOCB_CMD_PWRITEV)
+  opcode == IOCB_CMD_PWRITEV ||
+  opcode == IOCB_CMD_PREADVM ||
+  opcode == IOCB_CMD_PWRITEVM)
? aio_setup_vectored_rw(req, rw, buf, &nr_segs,
&iovec, compat)
: aio_setup_single_vector(req, rw, buf, &nr_segs,
@@ -1288,6 +1300,13 @@ rw_common:
if (ret)
return ret;
 
+   if ((req->ki_flags & KIOCB_USE_PI) && nr_segs < 2) {
+   pr_err("%s: not enough iovecs for PI!\n", __func__);
+   if (iovec != &inline_vec)
+   kfree(iovec);
+   return -EINVAL;
+   }
+
ret = rw_verify_area(rw, file, &req->ki_pos, req->ki_nbytes);
if (ret < 0) {
if (iovec != &inline_vec)
@@ -1407,6 +1426,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->ki_user_data = iocb->aio_data;
req->ki_pos = iocb->aio_offset;
req->ki_nbytes = iocb->aio_nbytes;
+   req->ki_flags = 0;
 
ret = aio_run_iocb(req, iocb->aio_lio_opcode,
   (char __user *)(unsigned long)iocb->aio_buf,
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 413312f..af398f0 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -138,7 +138,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
struct bio_vec *iv;
 
if (bip->bip_vcnt >= bip_integrity_vecs(bip)) {
-   printk(KERN_ERR "%s: bip_vec full\n", __func__);
+   pr_err("%s: bip_vec full\n", __func__);
return 0;
}
 
@@ -250,7 +250,7 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
DIV_ROUND_UP(le

[RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-20 Thread Darrick J. Wong
This RFC provides a rough implementation of a mechanism to allow
userspace to attach protection information (e.g. T10 DIF) data to a
disk write and to receive the information alongside a disk read.  The
interface is an extension to the AIO interface: two new commands
(IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
arg list is interpreted to point to a buffer containing a header,
followed by the PI data.  These patches are against 3.14-rc7.

The first patch is a little bit of code refactoring, as sent in by Gu
Zheng.  It seems to be queued up for 3.15, so I figured I might as well
start from there.

Patch #2 provides the plumbing to get the user's buffer all the way to
the block integrity code.  I'm not quite sure if the mechanism I took
(passing the results of get_user_pages around) actually works in all
cases (such as the user's buffer being swapped out), but it survives
a simple test.  Due to the way that the code deals with the array of
struct page*s that represent the PI buffer, there's an unfortunate
requirement that no PI tuple may cross a page boundary.  Given that
so far DIF is only 8 or 16 bytes this isn't a problem... yet.  There's
also no explicit fallback for the case where the user pages are not
within a device's DMA range.
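
To make the buffer constraints concrete, here is a small userspace sketch of
the layout arithmetic, going by the section 5.5 text that patch 4 adds to
data-integrity.txt (a 32-bit flags word, padding out to the next multiple of
the tuple size, then the tuples).  The helper names are made up for this
example, and the page-boundary check assumes the buffer starts on a page
boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the first tuple: the 32-bit flags word padded to tuple_size. */
static size_t pi_tuple_offset(size_t tuple_size)
{
	assert(tuple_size > 0);
	return ((sizeof(uint32_t) + tuple_size - 1) / tuple_size) * tuple_size;
}

/* Total PI buffer length for nr_sectors logical blocks. */
static size_t pi_buffer_len(size_t tuple_size, size_t nr_sectors)
{
	return pi_tuple_offset(tuple_size) + nr_sectors * tuple_size;
}

/* Return 1 if no tuple straddles a page boundary, 0 otherwise. */
static int pi_tuples_fit_pages(size_t tuple_size, size_t nr_sectors,
			       size_t page_size)
{
	size_t off = pi_tuple_offset(tuple_size);
	size_t i;

	for (i = 0; i < nr_sectors; i++, off += tuple_size)
		if (off / page_size != (off + tuple_size - 1) / page_size)
			return 0;
	return 1;
}
```

With 8- or 16-byte tuples and a page size that is a multiple of the tuple size,
the check always passes, which matches the "this isn't a problem... yet" remark
above; a hypothetical 12-byte tuple would fail it on a 4096-byte page.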

Patch #3 builds on the previous patch to allow userspace to send some
flags along with the PI buffer.  The integrity provider now has a
"mod_user_buf_fn" hook that enables the provider to read the userspace
flags and modify the PI buffer before submit_bio.  For now, this means
that the T10 DIF provider can be told to patch any of the reference, app,
or guard tags.  This is useful for sending PI data with an IO request
for a file on a filesystem, since the kernel can patch in the device's
LBA later.  Also it means that if you only care about, say, app tags,
you can provide those and let the kernel take care of the crc and the
LBA.  I don't know if that's anyone's requirement, but there we are.
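
A userspace sketch of the per-tuple effect of those flags, modelled loosely on
sd_dif_type1_generate() in patch 3.  The GENERATE_* values are the ones from
the patch; the dif_tuple struct and the fill_tuple() helper are stand-ins
invented for this example, and values are kept in host byte order for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATE_GUARD	(1)
#define GENERATE_REF	(2)
#define GENERATE_APP	(4)
#define GENERATE_ALL	(7)

/* Field order of a T10 DIF tuple: 16-bit guard, 16-bit app, 32-bit ref. */
struct dif_tuple {
	uint16_t guard_tag;
	uint16_t app_tag;
	uint32_t ref_tag;
};

/*
 * Overwrite only the fields selected by flags; the others keep whatever
 * the caller supplied.  The real tuple is big-endian, which this sketch
 * ignores.
 */
static int fill_tuple(struct dif_tuple *t, int flags, uint16_t crc,
		      uint64_t sector)
{
	assert(t != NULL);
	if (flags & ~GENERATE_ALL)
		return -EINVAL;		/* unknown flag bits */
	if (!flags)
		return -ENOTTY;		/* nothing requested */

	if (flags & GENERATE_GUARD)
		t->guard_tag = crc;
	if (flags & GENERATE_REF)
		t->ref_tag = (uint32_t)(sector & 0xffffffff);
	if (flags & GENERATE_APP)
		t->app_tag = 0;
	return 0;
}
```

The "only care about app tags" case would then correspond to passing
GENERATE_GUARD | GENERATE_REF and supplying app_tag yourself.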

Patch #4 provides a mechanism for integrity providers to advertise
both the per-logical-block PI buffer size and the flags that can be
passed to the mod_user_buf_fn hook.  The advertisements can be found
in sysfs, since that's where we present all the other PI details about
a device.
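
For what it's worth, a userspace consumer of that advertisement might look
something like the sketch below.  It assumes the "0x<hex>: <description>" line
format that integrity_write_flags_show() prints in patch 4; the function name
is made up, and reading the sysfs file into a buffer is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse text in the "0x<hex>: <description>\n" form and return the OR of
 * all flag values found.  Lines that do not match are skipped.
 */
static unsigned int parse_flag_mask(const char *text)
{
	unsigned int mask = 0, value;

	assert(text != NULL);
	while (*text) {
		const char *eol = strchr(text, '\n');

		if (sscanf(text, "0x%x:", &value) == 1)
			mask |= value;
		if (!eol)
			break;
		text = eol + 1;
	}
	return mask;
}
```

A program could then reject any flag bit it wants to use that is not in the
returned mask before submitting the IO.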

Patch #5 removes redundant code and modifies the tag get/set functions
to follow the other new functions and kmap/unmap the PI buffer page(s)
before messing with the PI buffers, instead of relying on pi_buf being
a valid pointer.
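
The shape of that walk, stripped of kmap and bio details, is roughly the
following.  This is only a userspace analogy of the loop in bio_integrity_tag():
struct seg stands in for a mapped bio_vec page, all names are invented for the
example, and where the kernel code hits BUG_ON when it runs out of vectors this
sketch just returns a short count.

```c
#include <assert.h>
#include <stddef.h>

struct seg {
	unsigned char *buf;	/* stands in for one mapped PI page */
	size_t len;		/* usable bytes in this segment */
};

typedef void (*run_fn)(unsigned char *tuples, size_t nr, void *arg);

/* Example callback: count the tuples seen. */
static void count_run(unsigned char *tuples, size_t nr, void *arg)
{
	(void)tuples;
	*(size_t *)arg += nr;
}

/*
 * Visit nr_tuples tuples spread over segs[], calling fn once per
 * contiguous run.  Returns the number of tuples visited.
 */
static size_t walk_tuples(const struct seg *segs, size_t nr_segs,
			  size_t tuple_size, size_t nr_tuples,
			  run_fn fn, void *arg)
{
	size_t done = 0, s;

	assert(tuple_size > 0);
	for (s = 0; s < nr_segs && done < nr_tuples; s++) {
		size_t run = segs[s].len / tuple_size;

		if (run > nr_tuples - done)
			run = nr_tuples - done;
		if (run)
			fn(segs[s].buf, run, arg);
		done += run;
	}
	return done;
}
```

Because each run is a whole number of tuples within one segment, the callback
never sees a tuple split across segments, which is the same property the
page-boundary restriction buys the kernel code.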

Comments and questions are, as always, welcome.  There will be a
session about this on the second day of LSF/MM, if I'm not mistaken.
A sample program follows this message.

$ cc -o prog prog.c
$ ./prog -rw -p r -s 2048 /path/to/pi/device

--D

/*
 * Userspace DIX API test program
 * Licensed under GPLv2. Copyright 2014 Oracle.
 *
 * XXX: We don't query the kernel for this information like we should!
 */
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define IOCB_CMD_PREADVM	(9)
#define IOCB_CMD_PWRITEVM	(10)
#define GENERATE_GUARD	(1)
#define GENERATE_REF	(2)
#define GENERATE_APP	(4)
#define GENERATE_ALL	(7)

#define NR_IOS	(1)

static void dump_buffer(char *buf, size_t len)
{
	size_t off;
	char *p;

	for (p = buf; p < buf + len; p++) {
		off = p - buf;
		if (off % 32 == 0) {
			if (p != buf)
				printf("\n");
			printf("%05zu:", off);
		}
		printf(" %02x", *p & 0xFF);
	}
	printf("\n");
}

/* Table generated using the following polynomial:
 * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
 * gt: 0x8bb7
 */
static const uint16_t t10_dif_crc_table[256] = {
0x0000, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B,
0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6,
0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6,
0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B,
0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1,
0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C,
0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C,
0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781,
0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8,
0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255,
0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925,
0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698,
0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472,
0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF,
0x25D4, 0xAE63, 0xB90D, 0x32BA, 0x97D1, 0x1C66, 0x0B08, 0x80BF,
0xCA69, 0x41DE, 0x56B0, 0xDD07, 0

Re: status of block-integrity

2014-01-06 Thread Darrick J. Wong
On Fri, Jan 03, 2014 at 03:03:42PM -0500, Martin K. Petersen wrote:
> > "Hannes" == Hannes Reinecke  writes:
> 
> Hannes> Personally, I doubt it's a good idea to kill it off, but a
> Hannes> proper (userland) API for it has been a long time missing.
> 
> Before we throw the baby out with the bath water, maybe Darrick can fill
> us in on the progress of the aio passthrough interface?

I haven't made much progress on it -- I haven't seen any earnest demand for it.

Last year Chuck Lever said that some NFS working group was looking at defining an
interface for it... has there been any progress?  It doesn't sound like there has
been.

--D
> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-07 Thread Darrick J. Wong
On Thu, Feb 07, 2013 at 02:09:17PM -0500, Martin K. Petersen wrote:
> >>>>> "Darrick" == Darrick J Wong  writes:
> 
> Darrick> and more recently I've theorized that we could add a magic
> Darrick> fcntl/ioctl to make the kernel recognize, say, the first iovec
> Darrick> of a O_DIRECT *{read,write}v call as the PI buffer, which I
> Darrick> think is similar to how DIX gets PI data to a disk.  But it's
> Darrick> not like I have any code to show for it.
> 
> I don't particularly like the "stick it in the first iovec" magic. Also,
> we need a bit more than this. A handful of knobs need to be present to
> convey how the PI should be sliced and diced. So then we get into the
> territory where the first iovec is a PI descriptor of some sort. And
> then the second entry is the PI buffer.

Hm, well if we're adding another IO_CMD_ anyway, it probably isn't that hard to
find space to stuff in an extra pointer or two to a PI descriptor + buffer.

(or a pointer to a descriptor that itself points to a buffer...)

> Darrick> I /think/ it's fairly straightforward to change the directio
> Darrick> submit code to find the userspace PI buffer and amend the block
> Darrick> integrity code to attach our own PI buffer.  
> 
> I recommend that you check out how I do this in oracleasm.

Is there a newer one than this?
https://oss.oracle.com/projects/oracleasm/files/sources/

(Nov. 2008?)

> 
> Darrick> You'd still have to let the block layer set the sector # field,
> Darrick> but afaik that won't affect the crc or the app tag.
> 
> Correct. But the right way would be to pass the ref tag seed in as part
> of the IOCB and let sd or the HBA hardware do the remapping.



--D
> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-07 Thread Darrick J. Wong
On Thu, Feb 07, 2013 at 01:40:14AM -0800, Joel Becker wrote:
> On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote:
> > 
> > On Feb 6, 2013, at 3:24 PM, "Darrick J. Wong" wrote:
> > 
> > > On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote:
> > >> Hi,
> > >> 
> > >> I'm interested in discussing how to pass protection information to and from
> > >> userspace.  Maybe Martin could be enlisted for the discussion.
> > >> 
> > >> I read that some work has already been done in this area but have not been able
> > >> to locate it.  It looks like the bio-integrity code already makes it possible
> > >> to generate the t10-dif crc in the filesystem.  It would be good to be able to
> > >> get the guard and application tags back out to backup applications such as
> > >> xfsdump.  Enabling other applications to generate their own tags in userspace
> > >> is also interesting.
> > > 
> > > This one's been on my list for a couple of years (and companies) too.  A few
> > > years ago Joel Becker had support for it in his sys_dio proposal (that hasn't
> > > gone anywhere), and more recently I've theorized that we could add a magic
> > > fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT
> > > *{read,write}v call as the PI buffer, which I think is similar to how DIX gets
> > > PI data to a disk.  But it's not like I have any code to show for it.
> > > 
> > > I /think/ it's fairly straightforward to change the directio submit code to
> > > find the userspace PI buffer and amend the block integrity code to attach our
> > > own PI buffer.  You'd still have to let the block layer set the sector # field,
> > > but afaik that won't affect the crc or the app tag.
> > > 
> > > I hear that the NFS guys want to propose some sort of protocol for transmitting
> > > PI data (across NFS), but I haven't seen anything concrete yet.
> > 
> > I'm writing a requirements document for the NFS protocol which I can 
> > discuss at LSF.  The use cases for NFS for now would be virtual disk 
> > devices (hypervisors) or direct NFS access to storage from user space.
> > 
> > Like everyone else we are waiting for a magical VFS and user space API to 
> > appear that can pass PI to and from storage.
> 
> I'm happy to chat about it.  Unfortunately, like Darrick says, sys_dio()
> coding hasn't happened.  I do think we're better off with some kind of
> explicit API than some magic state on the file.  I mean, even something
> like:
> 
>   ssize_t write_with_pi(int fd, const void *buf, size_t count,
> const void *pi, size_t pi_count);
> 
> It's not as nice as a non-historical API (eg sys_dio), but it also
> probably plays nicer with buffered I/O.

I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio
and all the other plumbing necessary to make that happen...

void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov,
   int iovcnt, long long offset, const void *pi,
   size_t pi_count);
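
As a rough illustration of what a caller would have to get right for the
pi/pi_count arguments above, here is a minimal sketch that sizes the PI buffer
for an iovec.  It assumes 8 bytes of T10 PI per logical block;
pi_bytes_needed() and PI_TUPLE_SIZE are hypothetical names for this example
and are not part of libaio or the kernel.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Assumption for this sketch: one 8-byte PI tuple per logical block. */
#define PI_TUPLE_SIZE 8

/*
 * Hypothetical helper (not a libaio function): return the number of PI
 * bytes needed to cover all data described by iov, or 0 if the total
 * length is not a whole number of logical blocks.
 */
static size_t pi_bytes_needed(const struct iovec *iov, int iovcnt,
			      size_t block_size)
{
	size_t total = 0;
	int i;

	if (block_size == 0)
		return 0;

	for (i = 0; i < iovcnt; i++)
		total += iov[i].iov_len;

	/* Partial blocks cannot carry a complete PI tuple. */
	if (total % block_size != 0)
		return 0;

	return (total / block_size) * PI_TUPLE_SIZE;
}
```

A caller would pass the result as pi_count and allocate a pi buffer of at
least that many bytes before preparing the iocb.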

--D
> 
> Joel
> 
> > 
> > > Well, I hope I'll scrape together the time to hack together a PoC before LSF...
> > > on the other hand, I ran the discussion about PI userland interfaces at LPC2011
> > > and (shamefully) haven't done anything yet.
> > > 
> > > 
> > > 
> > > --D
> > >> 
> > >> Regards,
> > >>  Ben
> > >> --
> > >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" 
> > >> in
> > >> the body of a message to majord...@vger.kernel.org
> > >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" 
> > > in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > -- 
> > Chuck Lever
> > chuck[dot]lever[at]oracle[dot]com
> > 
> > 
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> -- 
> 
> "I think it would be a good idea."  
> - Mahatma Ghandi, when asked what he thought of Western
>   civilization
> 
>   http://www.jlbec.org/
>   jl...@evilplan.org


Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-06 Thread Darrick J. Wong
On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote:
> Hi,
> 
> I'm interested in discussing how to pass protection information to and from
> userspace.  Maybe Martin could be enlisted for the discussion.
> 
> I read that some work has already been done in this area but have not been 
> able
> to locate it.  It looks like the bio-integrity code already makes it possible
> to generate the t10-dif crc in the filesystem.  It would be good to be able to
> get the guard and application tags back out to backup applications such as
> xfsdump.  Enabling other applications to generate their own tags in userspace
> is also interesting.

This one's been on my list for a couple of years (and companies) too.  A few
years ago Joel Becker had support for it in his sys_dio proposal (that hasn't
gone anywhere), and more recently I've theorized that we could add a magic
fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT
*{read,write}v call as the PI buffer, which I think is similar to how DIX gets
PI data to a disk.  But it's not like I have any code to show for it.

I /think/ it's fairly straightforward to change the directio submit code to
find the userspace PI buffer and amend the block integrity code to attach our
own PI buffer.  You'd still have to let the block layer set the sector # field,
but afaik that won't affect the crc or the app tag.
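
To make the point about the sector # field concrete, here is a minimal sketch
of a PI tuple and a bitwise CRC using the T10 DIF polynomial (0x8BB7, initial
value 0).  The struct layout and the names pi_tuple_sketch and fill_guard are
assumptions for illustration only; the tuple is kept in host byte order here
rather than in its on-disk encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC16 over buf using polynomial 0x8BB7, starting from 0. */
static uint16_t crc_t10dif_bitwise(const unsigned char *buf, size_t len)
{
	uint16_t crc = 0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (bit = 0; bit < 8; bit++) {
			if (crc & 0x8000)
				crc = (uint16_t)((crc << 1) ^ 0x8BB7);
			else
				crc = (uint16_t)(crc << 1);
		}
	}
	return crc;
}

/* Illustrative tuple: guard (CRC of the data), app tag, ref tag. */
struct pi_tuple_sketch {
	uint16_t guard_tag;
	uint16_t app_tag;
	uint32_t ref_tag;
};

/* The guard depends only on the data block, not on app_tag or ref_tag. */
static void fill_guard(struct pi_tuple_sketch *pi,
		       const unsigned char *block, size_t block_size)
{
	pi->guard_tag = crc_t10dif_bitwise(block, block_size);
}
```

Because the guard is computed over the data alone, whoever fills in ref_tag
later (the block layer, sd, or the HBA) does not invalidate a guard or app tag
that userspace supplied.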

I hear that the NFS guys want to propose some sort of protocol for transmitting
PI data (across NFS), but I haven't seen anything concrete yet.

Well, I hope I'll scrape together the time to hack together a PoC before LSF...
on the other hand, I ran the discussion about PI userland interfaces at LPC2011
and (shamefully) haven't done anything yet.



--D
> 
> Regards,
>   Ben
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] sd: Don't incorrectly "promote" DIF type0 into DIF type1 disks.

2012-11-27 Thread Darrick J. Wong
If I run the following command:
# modprobe scsi_debug dev_size_mb=64 ato=1 dix=1 dif=0

then I see the following in the dmesg log:

[   25.859145] scsi_debug: host protection DIX0

Ok, DIX0, which means "no integrity extensions at all", and no DIF support at all.
I'm not sure why you'd advertise DIX0 at all, but so far so good.

(DIX is the mechanism by which the OS sends integrity data to the disk in
whatever format DIF specifies.  You need DIF for DIX to do anything.)

[   25.860214] scsi0 : scsi_debug, version 1.82 [20100324], dev_size_mb=64, opts=0x0
[   25.863418] scsi 0:0:0:0: Direct-Access     Linux    scsi_debug       0004 PQ: 0 ANSI: 5
[   25.880079] sd 0:0:0:0: [sda] 131072 512-byte logical blocks: (67.1 MB/64.0 MiB)
[   25.884133] sd 0:0:0:0: [sda] Write Protect is off
[   25.892205] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
[   25.920344]  sda: unknown partition table
[   25.926704] sd_dif_config_host: type=0 dif=0 dix=8
[   25.931651] sd 0:0:0:0: [sda] Enabling DIX T10-DIF-TYPE1-CRC protection

Huh??  Here we are turning on DIX support as if the disk supports DIF type1.
This seems strange to me because we didn't advertise any DIF support at all.

[   25.952208] sd 0:0:0:0: [sda] Attached SCSI disk
[   25.977977] BUG: unable to handle kernel paging request at 000ffc02
[   25.980262] IP: [] resp_read.part.38+0x145/0x420 [scsi_debug]

Uhoh, that shouldn't happen.  The SCSI layer sent along what looks like a type1
read request even though the disk wasn't really prepared to handle it, and
kaboom.  I don't think this particular combination is terribly common, but we
could at least not misprogram the disk when we see it.

If the disk advertises DIF type 0, we can skip the rest of the DIF setup.
Right now, the SCSI layer "promotes" a DIF type 0 disk into a DIF type 1 disk,
which seems incorrect.

Signed-off-by: Darrick J. Wong 
---
 drivers/scsi/sd_dif.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 04998f3..ede5b7b 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -313,6 +313,10 @@ void sd_dif_config_host(struct scsi_disk *sdkp)
 	u8 type = sdkp->protection_type;
 	int dif, dix;
 
+	/* Don't promote DIF type0 into type1 support. */
+	if (type == SD_DIF_TYPE0_PROTECTION)
+		return;
+
 	dif = scsi_host_dif_capable(sdp->host, type);
 	dix = scsi_host_dix_capable(sdp->host, type);
 


Re: 'Device not ready' issue on mpt2sas since 3.1.10

2012-07-10 Thread Darrick J. Wong
On Mon, Jul 09, 2012 at 06:24:15PM -0400, Robert Trace wrote:
> On 07/09/2012 04:45 PM, Darrick J. Wong wrote:
> >
> > I suspect that /sys/devices//manage_start_stop 
> > = 0
> > for the SATA devices hanging off the SAS controller.
> 
> Yep, looks like you're right.  For my system:
> 
> # cat /sys/block/sd?/device/scsi_disk/*/manage_start_stop
> 1
> 1
> 1
> 1
> 1
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 
> Those first 5 disks are SATA disks on SATA controllers.  The last 8
> disks are SATA disks on the SAS controller.
> 
> > Setting that sysfs
> > attribute to 1 is supposed to enable the SCSI layer to send TUR when it sees
> > "LU not ready", as well as spin down the drives at suspend/poweroff time.
> 
> Setting it to 1 doesn't seem to have made any difference, however.
> 
> # cat /sys/block/sdm/device/scsi_disk/14\:0\:7\:0/manage_start_stop
> 0
> # echo 1 > /sys/block/sdm/device/scsi_disk/14\:0\:7\:0/manage_start_stop
> # cat /sys/block/sdm/device/scsi_disk/14\:0\:7\:0/manage_start_stop
> 1
> # hdparm -y /dev/sdm
> 
> /dev/sdm:
>  issuing standby command
> # hdparm -C /dev/sdm
> 
> /dev/sdm:
>  drive state is:  standby
> # dd if=/dev/sdm of=/dev/null bs=512 count=1
> dd: reading `/dev/sdm': Input/output error
> 0+0 records in
> 0+0 records out
> 0 bytes (0 B) copied, 0.00117802 s, 0.0 kB/s
> 
> ... and on the scsi logging side, I see the read(10) to the disk which
> immediately returns "Not Ready" and the I/O failure bubbles up the
> chain.  And afterwards, the disk is still asleep.
> 
> # hdparm -C /dev/sdm
> 
> /dev/sdm:
>  drive state is:  standby
> 
> Also, TURs don't appear to actually wake the disk up (should they?).
> The only thing I've found that'll wake the disk up is an explicit START
> UNIT command.

Sorry, I misspoke, manage_start_stop=1 sends START UNIT, not TUR.  Also, it
only manages spindown/up at suspend/resume time, hence the behavior you see.
The relevant source code is sd_start_stop_device() in drivers/scsi/sd.c.
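
For anyone who wants to check the attribute from a program instead of cat,
here is a minimal userspace sketch.  read_bool_attr() is a hypothetical helper
written for this example; it only assumes the file holds "0" or "1" as shown
in the output quoted above.

```c
#include <stdio.h>

/*
 * Read a sysfs-style attribute file that contains "0" or "1".
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
static int read_bool_attr(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = -1;

	if (!f)
		return -1;

	if (fscanf(f, "%d", &val) != 1 || (val != 0 && val != 1))
		val = -1;

	fclose(f);
	return val;
}
```

The path would be something like the one used earlier in this thread, e.g.
/sys/block/sdm/device/scsi_disk/14:0:7:0/manage_start_stop.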

--D
> 
> -- Rob
> 



Re: 'Device not ready' issue on mpt2sas since 3.1.10

2012-07-09 Thread Darrick J. Wong
On Mon, Jul 09, 2012 at 03:37:09PM -0400, Robert Trace wrote:
> > I did some further research regarding my problem.
> > It appears to me the fault does not lie with the mpt2sas driver (not
> > that I can definitely exclude it), but with the md implementation.
> 
> I'm actually discovering some of the same issues (LSI 9211-8i w/ SATA
> disks), but I've come to a slightly different conclusion.
> 
> I noticed that when my SATA disks are on a SATA controller and they spin
> down (or are spun down via hdparm -y), then they respond to TUR (TEST
> UNIT READY) commands with an OK.  Any I/O sent to these disks simply
> wait while the disks spin up and then complete as usual.
> 
> However, my SATA disks on the SAS controller respond to TUR with the
> sense error "Not Ready/Initializing command required".  Any I/O sent to
> these disks immediately fails.  You saw this in your logging:
> 
> > [  604.838640] sd 2:0:0:0: [sda] Device not ready
> > [  604.838645] sd 2:0:0:0: [sda]  Result: hostbyte=DID_OK
> > driverbyte=DRIVER_SENSE
> > [  604.838655] sd 2:0:0:0: [sda]  Sense Key : Not Ready [current]
> > [  604.838663] sd 2:0:0:0: [sda]  Add. Sense: Logical unit not ready,
> > initializing command required
> > [  604.838668] sd 2:0:0:0: [sda] CDB: Read(10): 28 00 00 00 08 00 00 00
> > 20 00
> > [  604.838680] end_request: I/O error, dev sda, sector 2048
> > [  604.838688] Buffer I/O error on device md127, logical block 0
> > [  604.838695] Buffer I/O error on device md127, logical block 1
> > [  604.838699] Buffer I/O error on device md127, logical block 2
> > [  604.838702] Buffer I/O error on device md127, logical block 3
> 
> Sending an explicit START UNIT command to these sleeping disks will wake
> them up and then they behave normally.  (BTW, you can issue TURs and
> START UNITs via the sg_turs and sg_start commands).
> 
> I've reproduced this behavior on the raw disks themselves, no MD layer
> involved (although the freak-out by my MD layer is what alerted me to
> this issue too... Having your entire array punted the first time you
> access it is a little scary :-).  I'm also on raw hardware and I've seen
> this behavior on kernels 3.0.33 through 3.4.4.
> 
> So, SATA disks respond differently depending on the controller they're
> on.  I don't know if this is a SCSI thing, a SAS thing or a
> firmware/driver thing for the 9211.

I suspect that /sys/devices//manage_start_stop = 0
for the SATA devices hanging off the SAS controller.  Setting that sysfs
attribute to 1 is supposed to enable the SCSI layer to send TUR when it sees
"LU not ready", as well as spin down the drives at suspend/poweroff time.

--D
> 
> Now, whether or not the MD layer should be assembling arrays from
> "failed" disks is, I think, a separate issue.
> 
> -- Rob
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



[PATCH 2/2] aic94xx: Use sas_request_addr() to provide SAS WWN if the adapter lacks one

2008-02-19 Thread Darrick J. Wong
If the aic94xx chip doesn't have a SAS address in the chip's flash memory,
make libsas get one for us.

Resend of 8 Oct 2007 patch, now based off 2.6.25-rc2 + scsi_misc.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/aic94xx/aic94xx.h      |   16 ----------------
 drivers/scsi/aic94xx/aic94xx_hwi.c  |   20 +++++++++-----------
 drivers/scsi/aic94xx/aic94xx_init.c |    2 --
 3 files changed, 9 insertions(+), 29 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx.h b/drivers/scsi/aic94xx/aic94xx.h
index 32f513b..aee235f 100644
--- a/drivers/scsi/aic94xx/aic94xx.h
+++ b/drivers/scsi/aic94xx/aic94xx.h
@@ -58,7 +58,6 @@
 
 extern struct kmem_cache *asd_dma_token_cache;
 extern struct kmem_cache *asd_ascb_cache;
-extern char sas_addr_str[2*SAS_ADDR_SIZE + 1];
 
 static inline void asd_stringify_sas_addr(char *p, const u8 *sas_addr)
 {
@@ -68,21 +67,6 @@ static inline void asd_stringify_sas_addr(char *p, const u8 *sas_addr)
 	*p = '\0';
 }
 
-static inline void asd_destringify_sas_addr(u8 *sas_addr, const char *p)
-{
-	int i;
-	for (i = 0; i < SAS_ADDR_SIZE; i++) {
-		u8 h, l;
-		if (!*p)
-			break;
-		h = isdigit(*p) ? *p-'0' : *p-'A'+10;
-		p++;
-		l = isdigit(*p) ? *p-'0' : *p-'A'+10;
-		p++;
-		sas_addr[i] = (h<<4) | l;
-	}
-}
-
 struct asd_ha_struct;
 struct asd_ascb;
 
diff --git a/drivers/scsi/aic94xx/aic94xx_hwi.c b/drivers/scsi/aic94xx/aic94xx_hwi.c
index 098b5f3..940a207 100644
--- a/drivers/scsi/aic94xx/aic94xx_hwi.c
+++ b/drivers/scsi/aic94xx/aic94xx_hwi.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "aic94xx.h"
 #include "aic94xx_reg.h"
@@ -38,16 +39,14 @@ u32 MBAR0_SWB_SIZE;
 
 /* -- Initialization -- */
 
-static void asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
+static int asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
 {
-	extern char sas_addr_str[];
-	/* If the user has specified a WWN it overrides other settings
-	 */
-	if (sas_addr_str[0] != '\0')
-		asd_destringify_sas_addr(asd_ha->hw_prof.sas_addr,
-					 sas_addr_str);
-	else if (asd_ha->hw_prof.sas_addr[0] != 0)
-		asd_stringify_sas_addr(sas_addr_str, asd_ha->hw_prof.sas_addr);
+	/* adapter came with a sas address */
+	if (asd_ha->hw_prof.sas_addr[0])
+		return 0;
+
+	return sas_request_addr(asd_ha->sas_ha.core.shost,
+				asd_ha->hw_prof.sas_addr);
 }
 
 static void asd_propagate_sas_addr(struct asd_ha_struct *asd_ha)
@@ -657,8 +656,7 @@ int asd_init_hw(struct asd_ha_struct *asd_ha)
 
 	asd_init_ctxmem(asd_ha);
 
-	asd_get_user_sas_addr(asd_ha);
-	if (!asd_ha->hw_prof.sas_addr[0]) {
+	if (asd_get_user_sas_addr(asd_ha)) {
 		asd_printk("No SAS Address provided for %s\n",
 			   pci_name(asd_ha->pcidev));
 		err = -ENODEV;
diff --git a/drivers/scsi/aic94xx/aic94xx_init.c b/drivers/scsi/aic94xx/aic94xx_init.c
index 5d761eb..1824b0b 100644
--- a/drivers/scsi/aic94xx/aic94xx_init.c
+++ b/drivers/scsi/aic94xx/aic94xx_init.c
@@ -56,8 +56,6 @@ MODULE_PARM_DESC(collector, "\n"
"\tThe aic94xx SAS LLDD supports both modes.\n"
"\tDefault: 0 (Direct Mode).\n");
 
-char sas_addr_str[2*SAS_ADDR_SIZE + 1] = "";
-
 static struct scsi_transport_template *aic94xx_transport_template;
 static int asd_scan_finished(struct Scsi_Host *, unsigned long);
 static void asd_scan_start(struct Scsi_Host *);


[PATCH 1/2] libsas: Provide a transport-level facility to request SAS addrs

2008-02-19 Thread Darrick J. Wong
Provide a facility to use the request_firmware() interface to get a SAS
address from userspace.  This can be used by SAS LLDDs that cannot
obtain the address from the host adapter.
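
For testing the address format outside the kernel, the parsing step can be
sketched in userspace as below.  This is not the code in the patch:
parse_sas_addr() and hex_nibble() are hypothetical names, and unlike the
in-kernel helper this version rejects non-hex characters and short strings.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

#define SAS_ADDR_BYTES 8	/* a SAS address is 8 bytes, 16 hex digits */

/* Convert one hex digit to its value, or return -1 if it is not hex. */
static int hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = (char)toupper((unsigned char)c);
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse the first 16 hex digits of s into out[0..7].
 * Returns 0 on success, -1 if s is too short or contains a non-hex digit.
 */
static int parse_sas_addr(uint8_t *out, const char *s)
{
	int i;

	if (strlen(s) < 2 * SAS_ADDR_BYTES)
		return -1;

	for (i = 0; i < SAS_ADDR_BYTES; i++) {
		int h = hex_nibble(s[2 * i]);
		int l = hex_nibble(s[2 * i + 1]);

		if (h < 0 || l < 0)
			return -1;
		out[i] = (uint8_t)((h << 4) | l);
	}
	return 0;
}
```

A string that passes this check is the kind of content one would expect the
userspace side to hand back for the "sas_addr" request.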

Resend of 8 Oct. 2007 patch, now based off 2.6.25-rc2 + scsi_misc.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/libsas/sas_scsi_host.c |   41 +++++++++++++++++++++++++++++++++++++++
 include/scsi/libsas.h               |    2 ++
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f869fba..583d249 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -24,6 +24,8 @@
  */
 
 #include 
+#include 
+#include 
 
 #include "sas_internal.h"
 
@@ -1050,6 +1052,45 @@ void sas_target_destroy(struct scsi_target *starget)
 	return;
 }
 
+static void sas_parse_addr(u8 *sas_addr, const char *p)
+{
+	int i;
+	for (i = 0; i < SAS_ADDR_SIZE; i++) {
+		u8 h, l;
+		if (!*p)
+			break;
+		h = isdigit(*p) ? *p-'0' : toupper(*p)-'A'+10;
+		p++;
+		l = isdigit(*p) ? *p-'0' : toupper(*p)-'A'+10;
+		p++;
+		sas_addr[i] = (h<<4) | l;
+	}
+}
+
+#define SAS_STRING_ADDR_SIZE	16
+
+int sas_request_addr(struct Scsi_Host *shost, u8 *addr)
+{
+	int res;
+	const struct firmware *fw;
+
+	res = request_firmware(&fw, "sas_addr", &shost->shost_gendev);
+	if (res)
+		return res;
+
+	if (fw->size < SAS_STRING_ADDR_SIZE) {
+		res = -ENODEV;
+		goto out;
+	}
+
+	sas_parse_addr(addr, fw->data);
+
+out:
+	release_firmware(fw);
+	return res;
+}
+EXPORT_SYMBOL_GPL(sas_request_addr);
+
 EXPORT_SYMBOL_GPL(sas_queuecommand);
 EXPORT_SYMBOL_GPL(sas_target_alloc);
 EXPORT_SYMBOL_GPL(sas_slave_configure);
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 3ffd6b5..5f183de 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -676,4 +676,6 @@ extern int sas_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
 extern void sas_ssp_task_response(struct device *dev, struct sas_task *task,
 				  struct ssp_response_iu *iu);
 
+int sas_request_addr(struct Scsi_Host *shost, u8 *addr);
+
 #endif /* _SASLIB_H_ */


[PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load)

2008-02-19 Thread Darrick J. Wong
If we send an ABORT_TASK ascb that doesn't return within the timeout period,
we should not free that ascb because the sequencer is still holding onto it.
Hopefully it will fix what James Bottomley describes below:

On Tue, Feb 19, 2008 at 10:22:20AM -0600, James Bottomley wrote:

> Unfortunately, there's a bug in TMF timeout handling in the driver, it
> leaves the sequencer entry pending, but frees the ascb.  If the
> sequencer ever picks this up it will get very confused, as it does a
> while down in the trace:
> 
> > aic94xx: BUG:sequencer:dl:no ascb?!
> > aic94xx: BUG:sequencer:dl:no ascb?!
> 
> That's where the sequencer adds an ascb to the done list that we've
> already freed.  From this point on confusion reigns and the error
> handler eventually offlines the device.
> 
> I'll see if I can come up with patches to fix this ... or at least
> mitigate the problems it causes.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/aic94xx/aic94xx_tmf.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx_tmf.c b/drivers/scsi/aic94xx/aic94xx_tmf.c
index b52124f..4b24bd3 100644
--- a/drivers/scsi/aic94xx/aic94xx_tmf.c
+++ b/drivers/scsi/aic94xx/aic94xx_tmf.c
@@ -463,7 +463,7 @@ int asd_abort_task(struct sas_task *task)
   AIC94XX_SCB_TIMEOUT);
spin_lock_irqsave(&task->task_state_lock, flags);
if (leftover < 1)
-   res = TMF_RESP_FUNC_FAILED;
+   goto out_not_reported;
if (task->task_state_flags & SAS_TASK_STATE_DONE)
res = TMF_RESP_FUNC_COMPLETE;
spin_unlock_irqrestore(&task->task_state_lock, flags);
@@ -487,6 +487,11 @@ out:
asd_ascb_free(ascb);
ASD_DPRINTK("task 0x%p aborted, res: 0x%x\n", task, res);
return res;
+
+out_not_reported:
+   spin_unlock_irqrestore(&task->task_state_lock, flags);
+   ASD_DPRINTK("task 0x%p aborted? but not reported.\n", task);
+   return res;
 }
 
 /**


Re: [patch 03/13] mptbase: reset ioc initiator during PCI resume

2008-02-07 Thread Darrick J. Wong
On Thu, Feb 07, 2008 at 06:41:25PM -0600, James Bottomley wrote:
> On Mon, 2008-02-04 at 23:53 -0800, [EMAIL PROTECTED] wrote:
> > From: "Darrick J. Wong" <[EMAIL PROTECTED]>
> > 
> > It appears that the LSI SAS 1064E chip needs to be reset after a
> > suspend/resume cycle before the driver attempts further communications with
> > the chip.  Without this patch, resuming the chip results in this error
> > message being printed repeatedly and no more disk I/O.
> > 
> > mptbase: ioc0: ERROR - Invalid IOC facts reply, msgLength=0 offsetof=6!
> > 
> > So far it seems to fix suspend/resume on all the MPT Fusion cards I have
> > (SAS and U320 SCSI) but since I don't know the internals of that chip I
> > can't say for sure if this is a proper fix.
> > 
> > Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
> > Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> 
> Ping on this, please Eric.

As far as I can tell, Eric isn't really involved with this patch
anymore, and handed it over to [EMAIL PROTECTED]  I received email
from him (her?  Apologies, I'm not sufficiently familiar with Indian
names) this morning saying that a modified version of it would go out to
linux-scsi in a day or two.

--D


Re: aic94xx: failing on high load (another data point)

2008-01-30 Thread Darrick J. Wong
On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
> 
> V28.  My controller functions well with a single drive (low-medium load).  
> Unfortunately, all attempts to get the mirrors in sync fail and usually hang 
> the whole box.

Adaptec posted a V30 sequencer on their website; does that fix the
problems?

http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm

--D


Re: aic94xx: failing on high load

2008-01-14 Thread Darrick J. Wong
On Mon, Jan 14, 2008 at 03:49:16PM +0100, Jan Sembera wrote:
> Hi,
> 
>   we have array of 16 SAS disks connected to Adaptec controllers
> ...
> this elsewhere and I was recommended to send it to linux-scsi.

Hmm... I think Peter Bogdanovic was hitting this error recently (cc'd).
There are a lot of PRIMITIVE_RECVD messages in the log, which make me
wonder if the expander is being flaky or something?  The commands that
start timing out under heavy load followed by the repeated broadcasts
might be indicative of that, since the sequencer firmware and the kernel
driver are up to date.  Unfortunately, I don't have any LSI expanders...

--D


Re: [PATCH] libsas: fix sense_buffer overrun

2008-01-14 Thread Darrick J. Wong
Looks sane to me;
Acked-by: Darrick J. Wong <[EMAIL PROTECTED]>
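
For anyone skimming the one-line diff quoted below, here is a minimal
userspace sketch of why the copy length has to be the smaller of the two
sizes.  SENSE_BUF_SIZE and copy_sense() are stand-ins invented for this
example, not the kernel's definitions.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the destination sense buffer size in this sketch. */
#define SENSE_BUF_SIZE 96

/*
 * Copy at most SENSE_BUF_SIZE bytes of valid sense data into dst.
 * Taking the larger of the two sizes instead would write past dst
 * whenever valid exceeds SENSE_BUF_SIZE, and would read past the
 * valid data whenever it is shorter.
 * Returns the number of bytes copied.
 */
static size_t copy_sense(unsigned char *dst, const unsigned char *src,
			 size_t valid)
{
	size_t n = valid < SENSE_BUF_SIZE ? valid : SENSE_BUF_SIZE;

	memcpy(dst, src, n);
	return n;
}
```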

--D

On Sun, Jan 13, 2008 at 02:20:18AM +0900, FUJITA Tomonori wrote:
> 
> Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
> ---
>  drivers/scsi/libsas/sas_scsi_host.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
> index b784089..828fed1 100644
> --- a/drivers/scsi/libsas/sas_scsi_host.c
> +++ b/drivers/scsi/libsas/sas_scsi_host.c
> @@ -108,7 +108,7 @@ static void sas_scsi_task_done(struct sas_task *task)
>   break;
>   case SAM_CHECK_COND:
>   memcpy(sc->sense_buffer, ts->buf,
> -max(SCSI_SENSE_BUFFERSIZE, ts->buf_valid_size));
> +min(SCSI_SENSE_BUFFERSIZE, ts->buf_valid_size));
>   stat = SAM_CHECK_COND;
>   break;
>   default:
> -- 
> 1.5.3.4
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libsas: don't use made up error codes

2007-12-31 Thread Darrick J. Wong
On Sun, Dec 30, 2007 at 12:37:31PM -0600, James Bottomley wrote:
> This is bad for two reasons:
> 
>  1. If they're returned to outside applications, no-one knows what
> they mean.
>  2. Eventually they'll clash with the ever expanding standard error
> codes.
> 
> The problem error code in question is ETASK.  I've replaced this by
> ECOMM (communications error on send) a network error code that seems to
> most closely relay what ETASK meant.

Yay, cleanups :)

Acked-by: Darrick J. Wong <[EMAIL PROTECTED]>


Re: [PATCH] drivers/scsi/: Spelling fixes

2007-12-17 Thread Darrick J. Wong
On Mon, Dec 17, 2007 at 11:40:14AM -0800, Joe Perches wrote:

>  drivers/scsi/scsi_transport_sas.c |2 +-

SAS bits are
Acked-by: Darrick J. Wong <[EMAIL PROTECTED]>


Re: [PATCH] libsas: Don't issue commands to devices that have been hot-removed.

2007-12-04 Thread Darrick J. Wong
On Tue, Dec 04, 2007 at 05:48:33PM -0500, Jeff Garzik wrote:

> As an aside, issues like this really really imply a need to move libsas 
> away from the old libata EH stuff (like brking did with ipr, in patches).

Hm... does the new libata EH handle the case of "device was
unplugged, don't bother trying to send any more commands"?

In general, I agree that sas-ata should adopt the new EH.
Unfortunately, I believe the old way of sas-ata configuring ATA ports is
somehow not compatible with the new EH stuff and causes a crash during
the device probe with my patch to move sas-ata to the new EH.  If I
apply the patch that migrates sas-ata to use brking's latest ata-sas
configuration mechanism (the one that creates real ata_hosts), I see
(a) lots and lots of ATA hosts getting created (one per ATA port;
possibly undesirable if you've a SAS topology with a lot of SATA disks)
and (b) NCQ disks don't seem to work if you unplug the disk and plug
it back in (unless NCQ is disabled entirely).  Jeff, by any chance have
you tried plugging SATA devices into your SAS controllers?

James Bottomley wondered if it would be easier to have sas-ata call only
into the parts of libata that convert SCSI commands to ATA taskfiles,
though I'm unsure how many wormy cans that would open.

--D


[PATCH] libsas: Don't issue commands to devices that have been hot-removed.

2007-12-04 Thread Darrick J. Wong
Hrm... does this patch help?  You'll get a bunch of ATA/SAS disk errors
printed to the screen if you yank the disk, but at least libsas won't
get stuck waiting for the cache-flush commands to time out.
---
sd will get hung up issuing commands to flush write cache if a SAS device
is unplugged without warning.  Change libsas to reject commands to domain
devices that have already gone away.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/libsas/sas_ata.c   |4 
 drivers/scsi/libsas/sas_expander.c  |3 +++
 drivers/scsi/libsas/sas_port.c  |2 ++
 drivers/scsi/libsas/sas_scsi_host.c |7 +++
 include/scsi/libsas.h   |1 +
 5 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 0829b55..f5e5213 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -161,6 +161,10 @@ static unsigned int sas_ata_qc_issue(struct ata_queued_cmd *qc)
unsigned int num = 0;
unsigned int xfer = 0;
 
+   /* If the device fell off, no sense in issuing commands */
+   if (dev->gone)
+   return AC_ERR_SYSTEM;
+
task = sas_alloc_task(GFP_ATOMIC);
if (!task)
return AC_ERR_SYSTEM;
diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c
index 27674fe..4ba4d2a 100644
--- a/drivers/scsi/libsas/sas_expander.c
+++ b/drivers/scsi/libsas/sas_expander.c
@@ -1680,6 +1680,7 @@ static void sas_unregister_ex_tree(struct domain_device *dev)
struct domain_device *child, *n;
 
list_for_each_entry_safe(child, n, &ex->children, siblings) {
+   child->gone = 1;
if (child->dev_type == EDGE_DEV ||
child->dev_type == FANOUT_DEV)
sas_unregister_ex_tree(child);
@@ -1699,6 +1700,7 @@ static void sas_unregister_devs_sas_addr(struct domain_device *parent,
list_for_each_entry_safe(child, n, &ex_dev->children, siblings) {
if (SAS_ADDR(child->sas_addr) ==
SAS_ADDR(phy->attached_sas_addr)) {
+   child->gone = 1;
if (child->dev_type == EDGE_DEV ||
child->dev_type == FANOUT_DEV)
sas_unregister_ex_tree(child);
@@ -1707,6 +1709,7 @@ static void sas_unregister_devs_sas_addr(struct domain_device *parent,
break;
}
}
+   parent->gone = 1;
sas_disable_routing(parent, phy->attached_sas_addr);
memset(phy->attached_sas_addr, 0, SAS_ADDR_SIZE);
sas_port_delete_phy(phy->port, phy->phy);
diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c
index b6f0243..2e82097 100644
--- a/drivers/scsi/libsas/sas_port.c
+++ b/drivers/scsi/libsas/sas_port.c
@@ -144,6 +144,8 @@ void sas_deform_port(struct asd_sas_phy *phy)
port->port_dev->pathways--;
 
if (port->num_phys == 1) {
+   if (port->port_dev)
+   port->port_dev->gone = 1;
sas_unregister_domain_devices(port);
sas_port_delete(port->port);
port->port = NULL;
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index c29ba47..61d2679 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -228,6 +228,13 @@ int sas_queuecommand(struct scsi_cmnd *cmd,
goto out;
}
 
+   /* If the device fell off, no sense in issuing commands */
+   if (dev->gone) {
+   cmd->result = DID_BAD_TARGET << 16;
+   scsi_done(cmd);
+   goto out;
+   }
+
res = -ENOMEM;
task = sas_create_task(cmd, dev, GFP_ATOMIC);
if (!task)
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 8ad7465..73c5b15 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -207,6 +207,7 @@ struct domain_device {
 };
 
 void *lldd_dev;
+   int gone;
 };
 
 struct sas_discovery_event {
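
The idea behind the patch above can be modelled in a few lines of userspace C: once a device is marked as gone, the queueing path rejects commands up front instead of sending them to hardware that is no longer there. This is only a sketch; struct demo_device, the result codes, and the function names are hypothetical and do not correspond to kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; not the kernel's values. */
enum demo_result { DEMO_OK = 0, DEMO_BAD_TARGET = 1 };

/* Simplified stand-in for struct domain_device. */
struct demo_device {
	int gone;	/* set once the device is known to be unplugged */
	int issued;	/* number of commands handed to the "hardware" */
};

/* Mark the device as removed; later commands are rejected. */
static void demo_mark_gone(struct demo_device *dev)
{
	assert(dev != NULL);
	dev->gone = 1;
}

/* Reject early if the device fell off, otherwise "issue" the command. */
static enum demo_result demo_queuecommand(struct demo_device *dev)
{
	assert(dev != NULL);
	if (dev->gone)
		return DEMO_BAD_TARGET;
	dev->issued++;
	return DEMO_OK;
}
```

The early return is what keeps a cache-flush command from waiting for a timeout on a disk that has already been pulled.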


Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives

2007-12-03 Thread Darrick J. Wong
On Mon, Dec 03, 2007 at 02:43:09PM -0500, Jeff Garzik wrote:

> But what do you mean by "device removal code can get hung up"?  That sounds 
> like a bug we should fix.

At the moment, libsas' sas_rphy_remove function doesn't distinguish between
removing a device before or after the disk has been disconnected.
Hence, sd_shutdown tries to tell the disk to flush the write cache, even
in the case that the disk is already gone.  Maybe the solution is to
modify aic94xx to remove the device's DDB registration prior to sending
the "device gone" event to libsas so that all subsequent commands bounce
with "no such device" instead of going out to lunch.

(I'll look into this later, as I myself am going out to lunch right now.)

--D


Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives

2007-12-03 Thread Darrick J. Wong
On Mon, Dec 03, 2007 at 05:09:54PM +0100, Krzysztof Błaszkowski wrote:
> 
> I also noticed another failure when I removed a drive. Nothing reported the
> event (i.e. the block device and corresponding sg were still registered), so
> I ran dd on this truly "virtual" drive.
> 
> dd reached D state (as did scsi_wq). I think that shouldn't happen, whether
> it was an AIC failure or an LSI expander failure.

"It's wireless!" ;)

Seriously, though, it's a good idea to tell the kernel that you're
about to unplug a disk before actually doing it:

echo 1 > /sys/block/sdX/device/delete

This way, the kernel can tell the disk to flush its caches long before
power actually gets removed.  Otherwise, the device removal code can
get hung up just like you observed, and whatever's in the write cache
may or may not actually get written to the media.

--D
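
For anyone who wants to do the equivalent of the echo command above from a program, a minimal sketch follows. It simply writes "1" to whatever attribute path the caller supplies; the real path (for example under /sys/block/sdX/device/) and the required privileges are up to the caller, and demo_write_delete_attr() is a name made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1\n" to the given attribute file, as "echo 1 > path" would.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int demo_write_delete_attr(const char *path)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = (fputs("1\n", f) >= 0);
	/* fclose() flushes the buffer, so its result matters too. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```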


Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives

2007-11-30 Thread Darrick J. Wong
On Fri, Nov 30, 2007 at 10:22:07AM +0100, Krzysztof Błaszkowski wrote:
> Hello all,
> 
> I noticed this according to syslog. Furthermore, if aic94xx is connected to a
> single SATA drive only, then there is no crash, but the device is not
> recognized either (mysterious: "ERROR: Unidentified device type 5").

A substantial number of bugfixes (as well as SATA support) went into the
aic94xx/libsas code between .22 and .23; could you please give that a
try?

Also, what kind of devices are attached when the system crashes?  From
that stack trace it looks like the software thought there was a SATA
disk attached to an expander...?

--D


[PATCH 2/2] libsas: Use new ATA configuration mechanism

2007-11-12 Thread Darrick J. Wong
Update sas_ata to use the new ata_sas_rphy mechanisms as provided by
Brian King, and simplify ATA device discovery...

WARNING WARNING WARNING!  This patch is experimental, use at your own
risk.

Comments-requested-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/libsas/sas_ata.c   |  206 +--
 drivers/scsi/libsas/sas_discover.c  |4 +
 drivers/scsi/libsas/sas_scsi_host.c |   37 +-
 include/scsi/libsas.h   |4 -
 4 files changed, 91 insertions(+), 160 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index a9925d5..c6b4213 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -35,6 +35,13 @@
 #include "../scsi_transport_api.h"
 #include 
 
+struct sas_ata_descr {
+   struct ata_sas_rphy rphy;
+   struct scsi_host_template sht;
+};
+
+#define ata_rphy_to_descr(x) container_of((x), struct sas_ata_descr, rphy)
+
 static int sas_issue_ata_srst(struct domain_device *dev);
 
 static enum ata_completion_errors sas_to_ata_err(struct task_status_struct *ts)
@@ -323,55 +330,6 @@ static void sas_ata_tf_read(struct ata_port *ap, struct ata_taskfile *tf)
memcpy(tf, &dev->sata_dev.tf, sizeof (*tf));
 }
 
-static int sas_ata_scr_write(struct ata_port *ap, unsigned int sc_reg_in,
- u32 val)
-{
-   struct domain_device *dev = ap->private_data;
-
-   SAS_DPRINTK("STUB %s\n", __FUNCTION__);
-   switch (sc_reg_in) {
-   case SCR_STATUS:
-   dev->sata_dev.sstatus = val;
-   break;
-   case SCR_CONTROL:
-   dev->sata_dev.scontrol = val;
-   break;
-   case SCR_ERROR:
-   dev->sata_dev.serror = val;
-   break;
-   case SCR_ACTIVE:
-   dev->sata_dev.ap->link.sactive = val;
-   break;
-   default:
-   return -EINVAL;
-   }
-   return 0;
-}
-
-static int sas_ata_scr_read(struct ata_port *ap, unsigned int sc_reg_in,
-   u32 *val)
-{
-   struct domain_device *dev = ap->private_data;
-
-   SAS_DPRINTK("STUB %s\n", __FUNCTION__);
-   switch (sc_reg_in) {
-   case SCR_STATUS:
-   *val = dev->sata_dev.sstatus;
-   return 0;
-   case SCR_CONTROL:
-   *val = dev->sata_dev.scontrol;
-   return 0;
-   case SCR_ERROR:
-   *val = dev->sata_dev.serror;
-   return 0;
-   case SCR_ACTIVE:
-   *val = dev->sata_dev.ap->link.sactive;
-   return 0;
-   default:
-   return -EINVAL;
-   }
-}
-
 static struct ata_port_operations sas_sata_ops = {
.check_status   = sas_ata_check_status,
.check_altstatus= sas_ata_check_status,
@@ -385,8 +343,6 @@ static struct ata_port_operations sas_sata_ops = {
.qc_issue   = sas_ata_qc_issue,
.port_start = ata_sas_port_start,
.port_stop  = ata_sas_port_stop,
-   .scr_read   = sas_ata_scr_read,
-   .scr_write  = sas_ata_scr_write
 };
 
 static struct ata_port_info sata_port_info = {
@@ -398,33 +354,6 @@ static struct ata_port_info sata_port_info = {
.port_ops = &sas_sata_ops
 };
 
-int sas_ata_init_host_and_port(struct domain_device *found_dev,
-  struct scsi_target *starget)
-{
-   struct Scsi_Host *shost = dev_to_shost(&starget->dev);
-   struct sas_ha_struct *ha = SHOST_TO_SAS_HA(shost);
-   struct ata_port *ap;
-
-   ata_host_init(&found_dev->sata_dev.ata_host,
- ha->dev,
- sata_port_info.flags,
- &sas_sata_ops);
-   ap = ata_sas_port_alloc(&found_dev->sata_dev.ata_host,
-   &sata_port_info,
-   shost);
-   if (!ap) {
-   SAS_DPRINTK("ata_sas_port_alloc failed.\n");
-   return -ENODEV;
-   }
-
-   ap->private_data = found_dev;
-   ap->cbl = ATA_CBL_SATA;
-   ap->scsi_host = shost;
-   found_dev->sata_dev.ap = ap;
-
-   return 0;
-}
-
 void sas_ata_task_abort(struct sas_task *task)
 {
struct ata_queued_cmd *qc = task->uldd_task;
@@ -601,50 +530,6 @@ out:
 }
 
 /* -- SATA -- */
-
-static void sas_get_ata_command_set(struct domain_device *dev)
-{
-   struct dev_to_host_fis *fis =
-   (struct dev_to_host_fis *) dev->frame_rcvd;
-
-   if ((fis->sector_count == 1 && /* ATA *
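
The ata_rphy_to_descr() macro added near the top of this patch relies on the container_of() idiom: given a pointer to a member embedded in a larger structure, recover a pointer to the enclosing structure. A userspace approximation is sketched below. The demo_ types are hypothetical stand-ins, and this simplified macro omits the type checking that the kernel's version performs.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace approximation of container_of(); no type checking. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for the embedded structure. */
struct demo_rphy {
	int id;
};

/* Hypothetical stand-in for the enclosing descriptor. */
struct demo_descr {
	int template_id;	/* plays the role of the other member */
	struct demo_rphy rphy;	/* embedded member handed out to other code */
};

/* Recover the enclosing descriptor from a pointer to its embedded rphy. */
static struct demo_descr *demo_rphy_to_descr(struct demo_rphy *rphy)
{
	assert(rphy != NULL);
	return demo_container_of(rphy, struct demo_descr, rphy);
}
```

The embedded member is deliberately placed second here so that the pointer adjustment is non-zero.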

[PATCH 1/2] libsas: Convert ATA bridge to use new EH

2007-11-12 Thread Darrick J. Wong
Migrate the sas_ata bridge to use the new libata EH strategy, and
finally implement correct software reset.

WARNING WARNING WARNING!  This patch is for experimental use only; it is
nowhere near complete!  Especially the sas_ata_freeze() function.  This
patch may eat your data and kill your trees.

jgarzik: If an ATA command was in progress at the time of a port freeze,
can it complete after thawing?  (Does that even make sense?)

Comments-requested-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/libsas/sas_ata.c |   86 ++---
 1 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 0829b55..a9925d5 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -35,6 +35,8 @@
 #include "../scsi_transport_api.h"
 #include 
 
+static int sas_issue_ata_srst(struct domain_device *dev);
+
 static enum ata_completion_errors sas_to_ata_err(struct task_status_struct *ts)
 {
/* Cheesy attempt to translate SAS errors into ATA.  Hah! */
@@ -233,37 +235,58 @@ static u8 sas_ata_check_status(struct ata_port *ap)
return dev->sata_dev.tf.command;
 }
 
-static void sas_ata_phy_reset(struct ata_port *ap)
+static void sas_ata_freeze(struct ata_port *ap)
 {
-   struct domain_device *dev = ap->private_data;
-   struct sas_internal *i =
-   to_sas_internal(dev->port->ha->core.shost->transportt);
-   int res = 0;
+   /* reroute qc_done for all qc's on this port to a dumb free func */
+   /* i wonder if we can get away with throwing out anything that
+* completes in this time frame, or if we must find the commands
+* that are in progress and cancel only those? */
+   printk(KERN_ERR "%s: STUB\n", __FUNCTION__);
+}
 
-   if (i->dft->lldd_I_T_nexus_reset)
-   res = i->dft->lldd_I_T_nexus_reset(dev);
+static void sas_ata_thaw(struct ata_port *ap)
+{
+   /* empty */
+   printk(KERN_ERR "%s: STUB\n", __FUNCTION__);
+}
 
-   if (res)
-   SAS_DPRINTK("%s: Unable to reset I T nexus?\n", __FUNCTION__);
+static int sas_ata_soft_reset(struct ata_link *link, unsigned int *classes,
+  unsigned long deadline)
+{
+   struct ata_port *ap = link->ap;
+   struct domain_device *dev = ap->private_data;
+   int res;
 
+   /* Send SRST to device */
+   res = sas_issue_ata_srst(dev);
+   printk(KERN_ERR "srst 0 returns %d\n", res);
+
+   /* Set new device type */
switch (dev->sata_dev.command_set) {
case ATA_COMMAND_SET:
SAS_DPRINTK("%s: Found ATA device.\n", __FUNCTION__);
-   ap->link.device[0].class = ATA_DEV_ATA;
+   *classes = ATA_DEV_ATA;
break;
case ATAPI_COMMAND_SET:
SAS_DPRINTK("%s: Found ATAPI device.\n", __FUNCTION__);
-   ap->link.device[0].class = ATA_DEV_ATAPI;
+   *classes = ATA_DEV_ATAPI;
break;
default:
SAS_DPRINTK("%s: Unknown SATA command set: %d.\n",
__FUNCTION__,
dev->sata_dev.command_set);
-   ap->link.device[0].class = ATA_DEV_UNKNOWN;
-   break;
+   *classes = ATA_DEV_UNKNOWN;
+   break;
}
 
-   ap->cbl = ATA_CBL_SATA;
+   /* FIXME: What if SRST fails? */
+   return 0;
+}
+
+static void sas_ata_error_handler(struct ata_port *ap)
+{
+   ata_do_eh(ap, NULL, sas_ata_soft_reset, NULL, NULL);
+   //uh... hopefully there's no commands left in here?
 }
 
 static void sas_ata_post_internal(struct ata_queued_cmd *qc)
@@ -353,7 +376,9 @@ static struct ata_port_operations sas_sata_ops = {
.check_status   = sas_ata_check_status,
.check_altstatus= sas_ata_check_status,
.dev_select = ata_noop_dev_select,
-   .phy_reset  = sas_ata_phy_reset,
+   .error_handler  = sas_ata_error_handler,
+   .freeze = sas_ata_freeze,
+   .thaw   = sas_ata_thaw,
.post_internal_cmd  = sas_ata_post_internal,
.tf_read= sas_ata_tf_read,
.qc_prep= ata_noop_qc_prep,
@@ -658,6 +683,37 @@ out:
return res;
 }
 
+static int sas_issue_ata_srst(struct domain_device *dev)
+{
+   int res = 0;
+   struct sas_task *task;
+   struct dev_to_host_fis *d2h_fis = (struct dev_to_host_fis *)
+   &dev->frame_rcvd[0];
+
+   res = -ENOMEM;
+   task = sas_alloc_task(GFP_KERNEL);
+   if (!task)
+   

[PATCH 2/2] libsas: Fix various sparse complaints

2007-11-05 Thread Darrick J. Wong
Annotate sas_queuecommand with locking details, and clean up a few
more sparse warnings about static/non-static declarations.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/libsas/sas_scsi_host.c |6 +-
 include/scsi/libsas.h   |4 +---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index 0fa0296..c29ba47 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -202,6 +202,10 @@ int sas_queue_up(struct sas_task *task)
  */
 int sas_queuecommand(struct scsi_cmnd *cmd,
 void (*scsi_done)(struct scsi_cmnd *))
+   __releases(host->host_lock)
+   __acquires(dev->sata_dev.ap->lock)
+   __releases(dev->sata_dev.ap->lock)
+   __acquires(host->host_lock)
 {
int res = 0;
struct domain_device *dev = cmd_to_domain_dev(cmd);
@@ -412,7 +416,7 @@ static int sas_recover_I_T(struct domain_device *dev)
 }
 
 /* Find the sas_phy that's attached to this device */
-struct sas_phy *find_local_sas_phy(struct domain_device *dev)
+static struct sas_phy *find_local_sas_phy(struct domain_device *dev)
 {
struct domain_device *pdev = dev->parent;
struct ex_phy *exphy = NULL;
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index fe24bbc..cd11fe2 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -563,7 +563,7 @@ struct sas_task {
struct work_struct abort_work;
 };
 
-
+extern struct kmem_cache *sas_task_cache;
 
 #define SAS_TASK_STATE_PENDING  1
 #define SAS_TASK_STATE_DONE 2
@@ -573,7 +573,6 @@ struct sas_task {
 
 static inline struct sas_task *sas_alloc_task(gfp_t flags)
 {
-   extern struct kmem_cache *sas_task_cache;
struct sas_task *task = kmem_cache_zalloc(sas_task_cache, flags);
 
if (task) {
@@ -590,7 +589,6 @@ static inline struct sas_task *sas_alloc_task(gfp_t flags)
 static inline void sas_free_task(struct sas_task *task)
 {
if (task) {
-   extern struct kmem_cache *sas_task_cache;
BUG_ON(!list_empty(&task->list));
kmem_cache_free(sas_task_cache, task);
}
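
As background on the locking annotations in this patch: as far as I recall, the kernel headers define __acquires() and __releases() as context attributes only when sparse is running (it defines __CHECKER__), and as nothing for a normal compiler. The sketch below reproduces that pattern under demo_ names so it builds outside the kernel; struct demo_lock and demo_queue() are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Annotations are visible to sparse only and vanish for a normal compiler. */
#ifdef __CHECKER__
# define demo_acquires(x)	__attribute__((context(x, 0, 1)))
# define demo_releases(x)	__attribute__((context(x, 1, 0)))
#else
# define demo_acquires(x)
# define demo_releases(x)
#endif

/* Toy lock: just a flag, enough to show the calling convention. */
struct demo_lock {
	int held;
};

/*
 * Called with the lock held; drops it around the slow part and retakes
 * it before returning, which is what the annotations document.
 */
static int demo_queue(struct demo_lock *l, int value)
	demo_releases(l)
	demo_acquires(l)
{
	int res;

	assert(l != NULL && l->held);
	l->held = 0;		/* "unlock" */
	res = value * 2;	/* work done without the lock */
	l->held = 1;		/* "lock" again before returning */
	return res;
}
```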


[PATCH 1/2] libsas: Convert sas_proto users to sas_protocol

2007-11-05 Thread Darrick J. Wong
sparse complains about the mixing of enums in libsas.  Since the
underlying numeric values of both enums are the same, combine them
to get rid of the warning.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/aic94xx/aic94xx_dev.c  |6 +++---
 drivers/scsi/aic94xx/aic94xx_dump.c |4 ++--
 drivers/scsi/aic94xx/aic94xx_hwi.c  |2 +-
 drivers/scsi/aic94xx/aic94xx_scb.c  |6 +++---
 drivers/scsi/aic94xx/aic94xx_task.c |   30 +++---
 drivers/scsi/aic94xx/aic94xx_tmf.c  |   12 ++--
 drivers/scsi/libsas/sas_discover.c  |2 +-
 drivers/scsi/libsas/sas_expander.c  |6 +++---
 drivers/scsi/libsas/sas_internal.h  |2 +-
 include/scsi/libsas.h   |   18 +-
 include/scsi/sas.h  |   13 ++---
 include/scsi/scsi_transport_sas.h   |8 +---
 12 files changed, 51 insertions(+), 58 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx_dev.c b/drivers/scsi/aic94xx/aic94xx_dev.c
index 3dce618..72042ca 100644
--- a/drivers/scsi/aic94xx/aic94xx_dev.c
+++ b/drivers/scsi/aic94xx/aic94xx_dev.c
@@ -165,7 +165,7 @@ static int asd_init_target_ddb(struct domain_device *dev)
if (dev->port->oob_mode != SATA_OOB_MODE) {
flags |= OPEN_REQUIRED;
if ((dev->dev_type == SATA_DEV) ||
-   (dev->tproto & SAS_PROTO_STP)) {
+   (dev->tproto & SAS_PROTOCOL_STP)) {
struct smp_resp *rps_resp = &dev->sata_dev.rps_resp;
if (rps_resp->frame_type == SMP_RESPONSE &&
rps_resp->function == SMP_REPORT_PHY_SATA &&
@@ -193,7 +193,7 @@ static int asd_init_target_ddb(struct domain_device *dev)
asd_ddbsite_write_byte(asd_ha, ddb, DDB_TARG_FLAGS, flags);
 
flags = 0;
-   if (dev->tproto & SAS_PROTO_STP)
+   if (dev->tproto & SAS_PROTOCOL_STP)
flags |= STP_CL_POL_NO_TX;
asd_ddbsite_write_byte(asd_ha, ddb, DDB_TARG_FLAGS2, flags);
 
@@ -201,7 +201,7 @@ static int asd_init_target_ddb(struct domain_device *dev)
asd_ddbsite_write_word(asd_ha, ddb, SEND_QUEUE_TAIL, 0x);
asd_ddbsite_write_word(asd_ha, ddb, SISTER_DDB, 0x);
 
-   if (dev->dev_type == SATA_DEV || (dev->tproto & SAS_PROTO_STP)) {
+   if (dev->dev_type == SATA_DEV || (dev->tproto & SAS_PROTOCOL_STP)) {
i = asd_init_sata(dev);
if (i < 0) {
asd_free_ddb(asd_ha, ddb);
diff --git a/drivers/scsi/aic94xx/aic94xx_dump.c b/drivers/scsi/aic94xx/aic94xx_dump.c
index 6bd8e30..3d8c4ff 100644
--- a/drivers/scsi/aic94xx/aic94xx_dump.c
+++ b/drivers/scsi/aic94xx/aic94xx_dump.c
@@ -903,11 +903,11 @@ void asd_dump_frame_rcvd(struct asd_phy *phy,
int i;
 
switch ((dl->status_block[1] & 0x70) >> 3) {
-   case SAS_PROTO_STP:
+   case SAS_PROTOCOL_STP:
ASD_DPRINTK("STP proto device-to-host FIS:\n");
break;
default:
-   case SAS_PROTO_SSP:
+   case SAS_PROTOCOL_SSP:
ASD_DPRINTK("SAS proto IDENTIFY:\n");
break;
}
diff --git a/drivers/scsi/aic94xx/aic94xx_hwi.c b/drivers/scsi/aic94xx/aic94xx_hwi.c
index fb2be39..940a207 100644
--- a/drivers/scsi/aic94xx/aic94xx_hwi.c
+++ b/drivers/scsi/aic94xx/aic94xx_hwi.c
@@ -90,7 +90,7 @@ static int asd_init_phy(struct asd_phy *phy)
 
sas_phy->enabled = 1;
sas_phy->class = SAS;
-   sas_phy->iproto = SAS_PROTO_ALL;
+   sas_phy->iproto = SAS_PROTOCOL_ALL;
sas_phy->tproto = 0;
sas_phy->type = PHY_TYPE_PHYSICAL;
sas_phy->role = PHY_ROLE_INITIATOR;
diff --git a/drivers/scsi/aic94xx/aic94xx_scb.c b/drivers/scsi/aic94xx/aic94xx_scb.c
index db6ab1a..0febad4 100644
--- a/drivers/scsi/aic94xx/aic94xx_scb.c
+++ b/drivers/scsi/aic94xx/aic94xx_scb.c
@@ -788,12 +788,12 @@ void asd_build_control_phy(struct asd_ascb *ascb, int phy_id, u8 subfunc)
 
/* initiator port settings are in the hi nibble */
if (phy->sas_phy.role == PHY_ROLE_INITIATOR)
-   control_phy->port_type = SAS_PROTO_ALL << 4;
+   control_phy->port_type = SAS_PROTOCOL_ALL << 4;
else if (phy->sas_phy.role == PHY_ROLE_TARGET)
-   control_phy->port_type = SAS_PROTO_ALL;
+   control_phy->port_type = SAS_PROTOCOL_ALL;
else
control_phy->port_type =
-   (SAS_PROTO_ALL << 4) | SAS_PROTO_ALL;
+   (SAS_PROTOCOL_ALL << 4) | SAS_PROTOCOL_ALL;
 
/* link reset retries, this should be nominal */
control_phy->link_reset_retries = 10;
diff --git a/d

Re: [2.6 patch] scsi/aic94xx/: cleanups

2007-11-05 Thread Darrick J. Wong
On Mon, Nov 05, 2007 at 06:07:29PM +0100, Adrian Bunk wrote:
> This patch contains the following cleanups:
> - static functions in .c files shouldn't be marked inline
> - make needlessly global code static
> - #if 0 unused code

asd_unpause_lseq can be removed; the other if 0'd functions are debug
functions and can probably stay.

Otherwise, ack.

--D


Re: [PATCH] aic94xx: Use request_firmware() to provide SAS address if the adapter lacks one

2007-10-09 Thread Darrick J. Wong
On Tue, Oct 09, 2007 at 09:41:47AM -0700, Andrew Vasquez wrote:
> On Tue, 09 Oct 2007, James Smart wrote:
> 
> >  Why do you prefer request_firmware() vs something over sysfs ?
> > 
> >  Do environments like the kdump kernel also have access to data needed
> >  by request_firmware() ?

Assuming the driver-loading parts of the kdump kernel's initrd are the
same (udev, bunch of modules, firmwares, etc) as the regular kernel's
initrd, this shouldn't be a problem.

In the specific case of aic94xx, one needs request_firmware() and
associated infrastructure to load firmware blobs into the controller in
order to issue any I/O at all.

> There's already much in the way of automation and infrastructure
> present in supporting the request_firmware() interfaces (perhaps not
> the best of names) which can provide for a level of flexibility beyond
> a basic 'soft_port_name' interface.
> 
> Though I don't see why both can't coexist cleanly -- I take it the use
> case you are considering is: software recognizes no valid WWPN
> available, query via request_firmware() fails, software halts
> initialization (rather than fail), and awaits the admin to poke
> '0x123456.. > /sys/.../fc_host/soft_port_name', causing a ping to the
> driver and continuation of initialization with requested portname?

Hmm... could we use such a sysfs attribute to reassign adapter WWNs at
arbitrary times?  Is that even a good idea?

--D


[PATCH 2/2] aic94xx: Use sas_request_addr() to provide SAS addr if the adapter lacks one

2007-10-08 Thread Darrick J. Wong
If the aic94xx chip doesn't have a SAS address in the chip's flash memory,
make libsas get one for us.  Also clean out some old code that had been
used to do this in the past.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/aic94xx/aic94xx.h  |   16 
 drivers/scsi/aic94xx/aic94xx_hwi.c  |   21 ++---
 drivers/scsi/aic94xx/aic94xx_init.c |2 --
 3 files changed, 10 insertions(+), 29 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx.h b/drivers/scsi/aic94xx/aic94xx.h
index 32f513b..aee235f 100644
--- a/drivers/scsi/aic94xx/aic94xx.h
+++ b/drivers/scsi/aic94xx/aic94xx.h
@@ -58,7 +58,6 @@
 
 extern struct kmem_cache *asd_dma_token_cache;
 extern struct kmem_cache *asd_ascb_cache;
-extern char sas_addr_str[2*SAS_ADDR_SIZE + 1];
 
 static inline void asd_stringify_sas_addr(char *p, const u8 *sas_addr)
 {
@@ -68,21 +67,6 @@ static inline void asd_stringify_sas_addr(char *p, const u8 *sas_addr)
*p = '\0';
 }
 
-static inline void asd_destringify_sas_addr(u8 *sas_addr, const char *p)
-{
-   int i;
-   for (i = 0; i < SAS_ADDR_SIZE; i++) {
-   u8 h, l;
-   if (!*p)
-   break;
-   h = isdigit(*p) ? *p-'0' : *p-'A'+10;
-   p++;
-   l = isdigit(*p) ? *p-'0' : *p-'A'+10;
-   p++;
-   sas_addr[i] = (h<<4) | l;
-   }
-}
-
 struct asd_ha_struct;
 struct asd_ascb;
 
diff --git a/drivers/scsi/aic94xx/aic94xx_hwi.c b/drivers/scsi/aic94xx/aic94xx_hwi.c
index 0cd7eed..1dc5400 100644
--- a/drivers/scsi/aic94xx/aic94xx_hwi.c
+++ b/drivers/scsi/aic94xx/aic94xx_hwi.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "aic94xx.h"
 #include "aic94xx_reg.h"
@@ -38,16 +39,14 @@ u32 MBAR0_SWB_SIZE;
 
 /* -- Initialization -- */
 
-static void asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
+static int asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
 {
-   extern char sas_addr_str[];
-   /* If the user has specified a WWN it overrides other settings
-*/
-   if (sas_addr_str[0] != '\0')
-   asd_destringify_sas_addr(asd_ha->hw_prof.sas_addr,
-sas_addr_str);
-   else if (asd_ha->hw_prof.sas_addr[0] != 0)
-   asd_stringify_sas_addr(sas_addr_str, asd_ha->hw_prof.sas_addr);
+   /* adapter came with a sas address */
+   if (asd_ha->hw_prof.sas_addr[0])
+   return 0;
+
+   return sas_request_addr(asd_ha->sas_ha.core.shost,
+   asd_ha->hw_prof.sas_addr);
 }
 
 static void asd_propagate_sas_addr(struct asd_ha_struct *asd_ha)
@@ -657,8 +657,7 @@ int asd_init_hw(struct asd_ha_struct *asd_ha)
 
asd_init_ctxmem(asd_ha);
 
-   asd_get_user_sas_addr(asd_ha);
-   if (!asd_ha->hw_prof.sas_addr[0]) {
+   if (asd_get_user_sas_addr(asd_ha)) {
asd_printk("No SAS Address provided for %s\n",
   pci_name(asd_ha->pcidev));
err = -ENODEV;
diff --git a/drivers/scsi/aic94xx/aic94xx_init.c b/drivers/scsi/aic94xx/aic94xx_init.c
index b70d6e7..5c99f27 100644
--- a/drivers/scsi/aic94xx/aic94xx_init.c
+++ b/drivers/scsi/aic94xx/aic94xx_init.c
@@ -54,8 +54,6 @@ MODULE_PARM_DESC(collector, "\n"
"\tThe aic94xx SAS LLDD supports both modes.\n"
"\tDefault: 0 (Direct Mode).\n");
 
-char sas_addr_str[2*SAS_ADDR_SIZE + 1] = "";
-
 static struct scsi_transport_template *aic94xx_transport_template;
 static int asd_scan_finished(struct Scsi_Host *, unsigned long);
 static void asd_scan_start(struct Scsi_Host *);


[PATCH 1/2] libsas: Provide a transport-level facility to request SAS addrs

2007-10-08 Thread Darrick J. Wong
Use the request_firmware() interface to get a SAS address from userspace.
This way, there's no debate as to who or how an address gets generated;
it's up to the administrator to provide one if the driver can't find one
on its own.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/libsas/sas_scsi_host.c |   41 +++
 include/scsi/libsas.h   |3 +++
 2 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index 7663841..0fa0296 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -24,6 +24,8 @@
  */
 
 #include 
+#include 
+#include 
 
 #include "sas_internal.h"
 
@@ -1047,6 +1049,45 @@ void sas_target_destroy(struct scsi_target *starget)
return;
 }
 
+static void sas_parse_addr(u8 *sas_addr, const char *p)
+{
+   int i;
+   for (i = 0; i < SAS_ADDR_SIZE; i++) {
+   u8 h, l;
+   if (!*p)
+   break;
+   h = isdigit(*p) ? *p-'0' : toupper(*p)-'A'+10;
+   p++;
+   l = isdigit(*p) ? *p-'0' : toupper(*p)-'A'+10;
+   p++;
+   sas_addr[i] = (h<<4) | l;
+   }
+}
+
+#define SAS_STRING_ADDR_SIZE   16
+
+int sas_request_addr(struct Scsi_Host *shost, u8 *addr)
+{
+   int res;
+   const struct firmware *fw;
+
+   res = request_firmware(&fw, "sas_addr", &shost->shost_gendev);
+   if (res)
+   return res;
+
+   if (fw->size < SAS_STRING_ADDR_SIZE) {
+   res = -ENODEV;
+   goto out;
+   }
+
+   sas_parse_addr(addr, fw->data);
+
+out:
+   release_firmware(fw);
+   return res;
+}
+EXPORT_SYMBOL_GPL(sas_request_addr);
+
 EXPORT_SYMBOL_GPL(sas_queuecommand);
 EXPORT_SYMBOL_GPL(sas_target_alloc);
 EXPORT_SYMBOL_GPL(sas_slave_configure);
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 8dda2d6..58aa2aa 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -676,4 +676,7 @@ extern int sas_ioctl(struct scsi_device *sdev, int cmd, void __user *arg);
 
 extern int sas_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
   struct request *req);
+
+int sas_request_addr(struct Scsi_Host *shost, u8 *addr);
+
 #endif /* _SASLIB_H_ */
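
The size check and hex parsing in sas_request_addr() and sas_parse_addr() above can be exercised outside the kernel with a sketch like the following. Note that this is a variation, not the patch's behaviour: it additionally rejects non-hex characters, which the patch does not attempt. All demo_ names are invented for this example, and the address used in any test input is a made-up value.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SAS_ADDR_SIZE	8	/* bytes in a SAS address */
#define DEMO_STRING_ADDR_SIZE	16	/* hex digits needed to express it */

/* Convert one hex digit to its value, or return -1 if it is not hex. */
static int demo_hex_value(unsigned char c)
{
	if (isdigit(c))
		return c - '0';
	c = (unsigned char)toupper(c);
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse the first 16 hex digits of a blob into an 8-byte address.
 * Returns 0 on success, -1 if the blob is too short or contains a
 * non-hex character. On failure, addr may be partially written.
 */
static int demo_parse_sas_addr(uint8_t *addr, const unsigned char *data,
			       size_t size)
{
	size_t i;

	assert(addr != NULL && data != NULL);
	if (size < DEMO_STRING_ADDR_SIZE)
		return -1;

	for (i = 0; i < DEMO_SAS_ADDR_SIZE; i++) {
		int h = demo_hex_value(data[2 * i]);
		int l = demo_hex_value(data[2 * i + 1]);

		if (h < 0 || l < 0)
			return -1;
		addr[i] = (uint8_t)((h << 4) | l);
	}
	return 0;
}
```

Checking the blob size before parsing mirrors the fw->size test in the patch; anything after the first 16 characters, such as a trailing newline, is ignored.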


Re: [PATCH] aic94xx: Use request_firmware() to provide SAS address if the adapter lacks one

2007-10-08 Thread Darrick J. Wong
On Mon, Oct 08, 2007 at 03:48:32PM -0700, Andrew Vasquez wrote:

> So how about factoring that out to a transport-level interface.  How
> about something along the lines of the following patch, whereby the
> software driver upon detecting no valid WWPN, makes an upcall to each
> interface's 'request_wwn()'.  The data passed in from shost_gendev
> should be enough for some helper script to cull relevent device bits
> and perhaps offer some level of persistence...  Off base?

Hrm... jejb made a remark that it might be better to pass the
scsi_host's device into request_firmware() as your example does, so I'll
pitch in a patch to do likewise with libsas--the scsi_host knows the
actual device it's coming from, and userland can sort that all out later
anyway via DEVPATH.

I suppose one could also have multiple scsi_hosts per PCI device, which
means that my first patch would stumble horribly in more than a few
cases.

> Darrick, forgive the FC example, I don't do SAS...

That's ok, I don't do FC. :)  Looks mostly good to me...

--D


[PATCH] aic94xx: Use request_firmware() to provide SAS address if the adapter lacks one

2007-10-08 Thread Darrick J. Wong
If the aic94xx chip doesn't have a SAS address in the chip's flash memory,
use the request_firmware() interface to get one from userspace.  This
way, there's no debate as to who or how an address gets generated--it's
totally up to the administrator to provide it if the card doesn't have one.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/aic94xx/aic94xx.h  |1 -
 drivers/scsi/aic94xx/aic94xx_hwi.c  |   40 +--
 drivers/scsi/aic94xx/aic94xx_init.c |2 --
 3 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx.h b/drivers/scsi/aic94xx/aic94xx.h
index 32f513b..935d558 100644
--- a/drivers/scsi/aic94xx/aic94xx.h
+++ b/drivers/scsi/aic94xx/aic94xx.h
@@ -58,7 +58,6 @@
 
 extern struct kmem_cache *asd_dma_token_cache;
 extern struct kmem_cache *asd_ascb_cache;
-extern char sas_addr_str[2*SAS_ADDR_SIZE + 1];
 
 static inline void asd_stringify_sas_addr(char *p, const u8 *sas_addr)
 {
diff --git a/drivers/scsi/aic94xx/aic94xx_hwi.c b/drivers/scsi/aic94xx/aic94xx_hwi.c
index 0cd7eed..82a12cc 100644
--- a/drivers/scsi/aic94xx/aic94xx_hwi.c
+++ b/drivers/scsi/aic94xx/aic94xx_hwi.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "aic94xx.h"
 #include "aic94xx_reg.h"
@@ -38,16 +39,34 @@ u32 MBAR0_SWB_SIZE;
 
 /* -- Initialization -- */
 
-static void asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
+#define SAS_STRING_ADDR_SIZE   16
+static int asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
 {
-	extern char sas_addr_str[];
-	/* If the user has specified a WWN it overrides other settings
-	 */
-	if (sas_addr_str[0] != '\0')
-		asd_destringify_sas_addr(asd_ha->hw_prof.sas_addr,
-					 sas_addr_str);
-	else if (asd_ha->hw_prof.sas_addr[0] != 0)
-		asd_stringify_sas_addr(sas_addr_str, asd_ha->hw_prof.sas_addr);
+	const struct firmware *fw;
+	int res;
+
+	/* adapter came with a sas address */
+	if (asd_ha->hw_prof.sas_addr[0])
+		return 0;
+
+	ASD_DPRINTK("No address found for %s; asking for one...\n",
+		    pci_name(asd_ha->pcidev));
+
+	/* else go ask userspace */
+	res = request_firmware(&fw, "sas_addr", &asd_ha->pcidev->dev);
+	if (res)
+		return res;
+
+	if (fw->size < SAS_STRING_ADDR_SIZE) {
+		res = -ENODEV;
+		goto out;
+	}
+
+	asd_destringify_sas_addr(asd_ha->hw_prof.sas_addr, fw->data);
+
+out:
+	release_firmware(fw);
+	return res;
 }
 
 static void asd_propagate_sas_addr(struct asd_ha_struct *asd_ha)
@@ -657,8 +676,7 @@ int asd_init_hw(struct asd_ha_struct *asd_ha)
 
asd_init_ctxmem(asd_ha);
 
-   asd_get_user_sas_addr(asd_ha);
-   if (!asd_ha->hw_prof.sas_addr[0]) {
+   if (asd_get_user_sas_addr(asd_ha)) {
asd_printk("No SAS Address provided for %s\n",
   pci_name(asd_ha->pcidev));
err = -ENODEV;
diff --git a/drivers/scsi/aic94xx/aic94xx_init.c b/drivers/scsi/aic94xx/aic94xx_init.c
index b70d6e7..5c99f27 100644
--- a/drivers/scsi/aic94xx/aic94xx_init.c
+++ b/drivers/scsi/aic94xx/aic94xx_init.c
@@ -54,8 +54,6 @@ MODULE_PARM_DESC(collector, "\n"
"\tThe aic94xx SAS LLDD supports both modes.\n"
"\tDefault: 0 (Direct Mode).\n");
 
-char sas_addr_str[2*SAS_ADDR_SIZE + 1] = "";
-
 static struct scsi_transport_template *aic94xx_transport_template;
 static int asd_scan_finished(struct Scsi_Host *, unsigned long);
 static void asd_scan_start(struct Scsi_Host *);


Re: [patch 16/17] mptbase: reset ioc initiator during PCI resume

2007-10-02 Thread Darrick J. Wong
On Tue, Oct 02, 2007 at 04:51:48PM -0600, Moore, Eric wrote:

> I replied to this thread a couple times last week, and no response from
> Darrick.   I doubt this is required because the MESSAGE_UNIT_RESET is
> issued from inside mpt_do_ioc_recovery.  I need some logs with debug
> enabled.   Darrick did you see my email?

Yep.  Replied to it, too.  Apparently it never got to you, so I've
attached it below.

--D

-

On Thu, Sep 20, 2007 at 07:06:35PM -0600, Moore, Eric wrote:
> Darrick - MESSAGE_UNIT_RESET is already issued from inside
> mpt_do_ioc_recovery(), so you don't need to send this in advance of
> that.  You will find that occurring from the function MakeIocReady.
> Anyways... would it be possible for you to enable debug logging so I can
> see what problem you're having?   I suggest MPT_DEBUG and MPT_DEBUG_INIT.
> If it's possible for you to manually load mptbase, that way you can set
> the command line option. 

I took a look at MakeIocReady(), and this section caught my eye:

	/* Is it already READY? */
	if (!statefault && (ioc_state & MPI_IOC_STATE_MASK) == MPI_IOC_STATE_READY)
		return 0;

So I turned on a whole lot more debugging (mpt_debug_level=65535), and
caught this from the dhsprintk() just above that code snippet:

mptbase::MakeIocReady, ioc0 [raw] state=1000

state=1000 seems to correspond with MPI_IOC_STATE_READY, which means
that the adapter isn't getting reset because the chip claims to be
ready.  It doesn't seem to be ready, as demonstrated by the original error
message that I reported with the patch.  I'll append the log entries
pertaining to mpt to the end of this message.

--D

(Driver sign-on message if you were curious)

[  164.467481] Fusion MPT base driver 3.04.05
[  164.471706] Copyright (c) 1999-2007 LSI Logic Corporation
[  164.492483] Fusion MPT SAS Host driver 3.04.05
[  167.066482] ACPI: PCI Interrupt :0c:03.0[A] -> <6>ACPI: PCI Interrupt :01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
[  167.066534] mptbase: Initiating ioc0 bringup
[  167.761481] ioc0: LSISAS1064E B0: Capabilities={Initiator}
[  178.681050] scsi6 : ioc0: LSISAS1064E B0, FwRev=00060200h, Ports=1, MaxQ=511, IRQ=16
[  178.741821] scsi 6:0:0:0: Direct-Access IBM-ESXS GNA073C3ESTT0Z N BH0C PQ: 0 ANSI: 5
[  178.816476] sd 6:0:0:0: [sda] 143374000 512-byte hardware sectors (73407 MB)
[  178.825198] sd 6:0:0:0: [sda] Write Protect is off
[  178.830088] sd 6:0:0:0: [sda] Mode Sense: d3 00 10 08
[  178.831204] sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  178.845101] sd 6:0:0:0: [sda] 143374000 512-byte hardware sectors (73407 MB)
[  178.853483] sd 6:0:0:0: [sda] Write Protect is off
[  178.858343] sd 6:0:0:0: [sda] Mode Sense: d3 00 10 08
[  178.859961] sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  178.869069]  sda: sda1 sda2 sda3 sda4
[  178.877690] sd 6:0:0:0: [sda] Attached SCSI disk
[  178.912356] sd 6:0:0:0: Attached scsi generic sg0 type 0

(put system to sleep)

[  821.678155] mptbase: ioc0: pci-suspend: pdev=0x81003f64a000, slot=:01:00.0, Entering operating state [D3]
[  821.678195] mptbase: ioc0: Sending IOC reset(0x40)!
[  821.813585] mptbase: ioc0: WaitForDoorbell ACK (count=16)
[  821.814120] ACPI: PCI interrupt for device :01:00.0 disabled

(wake system up)

[  891.307583] mptbase: ioc0: pci-resume: pdev=0x81003f64a000, slot=:01:00.0, Previous operating state [D3]
[  891.431146] PM: Writing back config space on device :01:00.0 at offset 1 (was 10, writing 100107)
[  891.431174] ACPI: PCI Interrupt :01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
[  891.431179] mptbase: ioc0: pci-resume: ioc-state=0x1,doorbell=0x1000
[  891.431182] mptbase: Initiating ioc0 recovery
[  891.431184] mptbase::MakeIocReady, ioc0 [raw] state=1000
[  891.431187] mptbase: ioc0: Sending get IocFacts request req_sz=12 reply_sz=80
[  894.723823] mptbase: ioc0: WaitForDoorbell INT (cnt=412) howlong=5
[  894.723826] mptbase: ioc0: HandShake request start reqBytes=12, WaitCnt=412
[  894.723830] mptbase: ioc0: Sending get IocFacts request req_sz=12 reply_sz=80
[  894.731815] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.731817] mptbase: ioc0: HandShake request start reqBytes=12, WaitCnt=1
[  894.739806] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.747799] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.755791] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.763781] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.763784] mptbase: ioc0: Handshake request frame (@810028c81918) header
[  894.763786] mptbase: ioc0: HandShake request post done, WaitCnt=0
[  894.763789] mptbase: ioc0: WaitForDoorbell INT (cnt=0) howlong=5
[  894.771775] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.771778] mptbase: ioc0: WaitCnt=1 First handshake reply word=0300
[  894.779766] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.77976

Re: [PATCH] aic94xx: fix SMP request DMA direction

2007-09-30 Thread Darrick J. Wong
On Sat, Sep 29, 2007 at 02:25:33AM -0400, Jeff Garzik wrote:
> Muli Ben-Yehuda wrote:
>> On Fri, Sep 28, 2007 at 04:55:34PM -0700, Darrick J. Wong wrote:
>>> On Thu, Sep 27, 2007 at 10:33:41PM -0400, Jeff Garzik wrote:
>>>> Unless I'm missing something, the SMP request goes /to/ the PCI device 
>>>> :)
>>>>
>>>> Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>
>>> ACK; builds ok and SMP commands seem to work ok (not that they
>>> didn't before).
>> Could this explain some weirdness we were seeing with aic94xx and
>> Calgary/CalIOC2 enabled, or are SMP commands not likely to be used in
>> normal operation? We map the IOMMU entries differently for FROMDEVICE
>> (RW) and TODEVICE(RO).
>
> SMP == scsi management == not used during normal data transfer.
>
> It could certainly explain flakiness if you have expanders, though

Actually, SMP commands are used during device discovery to find things
attached to expanders, so it seems likely that "it blows up almost
immediately after loading the module" symptoms are a result of this bug.

That said, the bug that Jeff fixed resulted in extra permissions (+w)
being set for the SMP request buffer, so that's probably why I've never
seen any problems manifesting on x260/x3800 systems.

(Unless the CalIOC2 has a write only mode?)
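
To make the "extra permissions" argument concrete, here is a toy model in
plain C.  It only encodes the policy Muli describes above (FROMDEVICE mapped
RW, TODEVICE mapped RO); the enum, the permission bits and both function names
are made up for illustration and are not the Calgary or PCI DMA API.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for PCI_DMA_TODEVICE / PCI_DMA_FROMDEVICE. */
enum dma_dir { DIR_TO_DEVICE, DIR_FROM_DEVICE };

/* Hypothetical IOMMU permission bits, from the device's point of view. */
#define PERM_DEV_READ   0x1u  /* device may read host memory */
#define PERM_DEV_WRITE  0x2u  /* device may write host memory */

/* Toy version of the mapping policy: FROMDEVICE -> RW, TODEVICE -> RO. */
static unsigned int iommu_perms(enum dma_dir dir)
{
	if (dir == DIR_FROM_DEVICE)
		return PERM_DEV_READ | PERM_DEV_WRITE;
	return PERM_DEV_READ;
}

/* The adapter has to read the SMP request buffer in order to send it. */
static bool device_can_fetch_request(enum dma_dir mapped_as)
{
	return (iommu_perms(mapped_as) & PERM_DEV_READ) != 0;
}
```

Under this model a request buffer mapped with the wrong direction is still
readable by the adapter, it just carries a write permission it does not need.
An IOMMU that mapped FROMDEVICE as write-only would behave differently.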

--D


Re: [PATCH] aic94xx: fix SMP request DMA direction

2007-09-28 Thread Darrick J. Wong
On Thu, Sep 27, 2007 at 10:33:41PM -0400, Jeff Garzik wrote:
> 
> Unless I'm missing something, the SMP request goes /to/ the PCI device :)
> 
> Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>

ACK; builds ok and SMP commands seem to work ok (not that they didn't
before).

--Darrick

> ---
>  drivers/scsi/aic94xx/aic94xx_task.c |4 -
>  2 files changed, 83 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/scsi/aic94xx/aic94xx_task.c b/drivers/scsi/aic94xx/aic94xx_task.c
> index d5d8cab..ab13824 100644
> --- a/drivers/scsi/aic94xx/aic94xx_task.c
> +++ b/drivers/scsi/aic94xx/aic94xx_task.c
> @@ -451,7 +451,7 @@ static int asd_build_smp_ascb(struct asd_ascb *ascb, struct sas_task *task,
>   struct scb *scb;
> 
>   pci_map_sg(asd_ha->pcidev, &task->smp_task.smp_req, 1,
> -PCI_DMA_FROMDEVICE);
> +PCI_DMA_TODEVICE);
>   pci_map_sg(asd_ha->pcidev, &task->smp_task.smp_resp, 1,
>  PCI_DMA_FROMDEVICE);
> 
> @@ -486,7 +486,7 @@ static void asd_unbuild_smp_ascb(struct asd_ascb *a)
> 
>   BUG_ON(!task);
>   pci_unmap_sg(a->ha->pcidev, &task->smp_task.smp_req, 1,
> -  PCI_DMA_FROMDEVICE);
> +  PCI_DMA_TODEVICE);
>   pci_unmap_sg(a->ha->pcidev, &task->smp_task.smp_resp, 1,
>PCI_DMA_FROMDEVICE);
>  }


Re: [PATCH] mptbase: Reset ioc initiator during PCI resume

2007-09-24 Thread Darrick J. Wong
On Thu, Sep 20, 2007 at 07:06:35PM -0600, Moore, Eric wrote:
> Darrick - MESSAGE_UNIT_RESET is already issued from inside
> mpt_do_ioc_recovery(), so you don't need to send this in advance of
> that.  You will find that occurring from the function MakeIocReady.
> Anyways... would it be possible for you to enable debug logging so I can
> see what problem you're having?   I suggest MPT_DEBUG and MPT_DEBUG_INIT.
> If it's possible for you to manually load mptbase, that way you can set
> the command line option. 

I took a look at MakeIocReady(), and this section caught my eye:

	/* Is it already READY? */
	if (!statefault && (ioc_state & MPI_IOC_STATE_MASK) == MPI_IOC_STATE_READY)
		return 0;

So I turned on a whole lot more debugging (mpt_debug_level=65535), and
caught this from the dhsprintk() just above that code snippet:

mptbase::MakeIocReady, ioc0 [raw] state=1000

state=1000 seems to correspond with MPI_IOC_STATE_READY, which means
that the adapter isn't getting reset because the chip claims to be
ready.  It doesn't seem to be ready, as demonstrated by the original error
message that I reported with the patch.  I'll append the log entries
pertaining to mpt to the end of this message.
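
A small user-space sketch of that check, for anyone who wants to poke at the
numbers.  The mask, shift and READY values below are assumptions recalled from
the MPI headers and should be verified against mpi.h; ioc_state() and
skips_reset() are hypothetical names.  The sketch also assumes the logged
doorbell value is really 0x10000000, which fits the pci-resume line in the log
reporting ioc-state=0x1 for the same doorbell.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; check them against the MPI headers before relying on them. */
#define IOC_STATE_MASK   0xF0000000u
#define IOC_STATE_SHIFT  28
#define IOC_STATE_READY  0x10000000u

/* Extract the state field from a raw doorbell value. */
static uint32_t ioc_state(uint32_t doorbell)
{
	return doorbell & IOC_STATE_MASK;
}

/*
 * Mirrors the early return quoted above: if the chip reports READY and no
 * fault was seen, the caller returns without resetting the IOC.
 */
static bool skips_reset(uint32_t doorbell, bool statefault)
{
	return !statefault && ioc_state(doorbell) == IOC_STATE_READY;
}
```

With a doorbell of 0x10000000 and no fault, skips_reset() is true, which is
the path I believe the resume code is taking.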

--D

(Driver sign-on message if you were curious)

[  164.467481] Fusion MPT base driver 3.04.05
[  164.471706] Copyright (c) 1999-2007 LSI Logic Corporation
[  164.492483] Fusion MPT SAS Host driver 3.04.05
[  167.066482] ACPI: PCI Interrupt :0c:03.0[A] -> <6>ACPI: PCI Interrupt :01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
[  167.066534] mptbase: Initiating ioc0 bringup
[  167.761481] ioc0: LSISAS1064E B0: Capabilities={Initiator}
[  178.681050] scsi6 : ioc0: LSISAS1064E B0, FwRev=00060200h, Ports=1, MaxQ=511, IRQ=16
[  178.741821] scsi 6:0:0:0: Direct-Access IBM-ESXS GNA073C3ESTT0Z N BH0C PQ: 0 ANSI: 5
[  178.816476] sd 6:0:0:0: [sda] 143374000 512-byte hardware sectors (73407 MB)
[  178.825198] sd 6:0:0:0: [sda] Write Protect is off
[  178.830088] sd 6:0:0:0: [sda] Mode Sense: d3 00 10 08
[  178.831204] sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  178.845101] sd 6:0:0:0: [sda] 143374000 512-byte hardware sectors (73407 MB)
[  178.853483] sd 6:0:0:0: [sda] Write Protect is off
[  178.858343] sd 6:0:0:0: [sda] Mode Sense: d3 00 10 08
[  178.859961] sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  178.869069]  sda: sda1 sda2 sda3 sda4
[  178.877690] sd 6:0:0:0: [sda] Attached SCSI disk
[  178.912356] sd 6:0:0:0: Attached scsi generic sg0 type 0

(put system to sleep)

[  821.678155] mptbase: ioc0: pci-suspend: pdev=0x81003f64a000, slot=:01:00.0, Entering operating state [D3]
[  821.678195] mptbase: ioc0: Sending IOC reset(0x40)!
[  821.813585] mptbase: ioc0: WaitForDoorbell ACK (count=16)
[  821.814120] ACPI: PCI interrupt for device :01:00.0 disabled

(wake system up)

[  891.307583] mptbase: ioc0: pci-resume: pdev=0x81003f64a000, slot=:01:00.0, Previous operating state [D3]
[  891.431146] PM: Writing back config space on device :01:00.0 at offset 1 (was 10, writing 100107)
[  891.431174] ACPI: PCI Interrupt :01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
[  891.431179] mptbase: ioc0: pci-resume: ioc-state=0x1,doorbell=0x1000
[  891.431182] mptbase: Initiating ioc0 recovery
[  891.431184] mptbase::MakeIocReady, ioc0 [raw] state=1000
[  891.431187] mptbase: ioc0: Sending get IocFacts request req_sz=12 reply_sz=80
[  894.723823] mptbase: ioc0: WaitForDoorbell INT (cnt=412) howlong=5
[  894.723826] mptbase: ioc0: HandShake request start reqBytes=12, WaitCnt=412
[  894.723830] mptbase: ioc0: Sending get IocFacts request req_sz=12 reply_sz=80
[  894.731815] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.731817] mptbase: ioc0: HandShake request start reqBytes=12, WaitCnt=1
[  894.739806] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.747799] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.755791] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.763781] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.763784] mptbase: ioc0: Handshake request frame (@810028c81918) header
[  894.763786] mptbase: ioc0: HandShake request post done, WaitCnt=0
[  894.763789] mptbase: ioc0: WaitForDoorbell INT (cnt=0) howlong=5
[  894.771775] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.771778] mptbase: ioc0: WaitCnt=1 First handshake reply word=0300
[  894.779766] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.779769] mptbase: ioc0: Got Handshake reply:
[  894.779770] mptbase: ioc0: WaitForDoorbell REPLY WaitCnt=1 (sz=1)
[  894.779772] mptbase: ioc0: HandShake reply count=1
[  894.779775] mptbase: ioc0: ERROR - Invalid IOC facts reply, msgLength=0 offsetof=6!



[PATCH] Clean up IOC reset code to obey coding style

2007-09-20 Thread Darrick J. Wong
Randy Dunlap scolded me for introducing poorly styled code.  Since it
was a copy-and-paste block from mpt_suspend(), fix both.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/message/fusion/mptbase.c |6 ++
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/message/fusion/mptbase.c b/drivers/message/fusion/mptbase.c
index 40b8b41..2952a54 100644
--- a/drivers/message/fusion/mptbase.c
+++ b/drivers/message/fusion/mptbase.c
@@ -1721,10 +1721,9 @@ mpt_suspend(struct pci_dev *pdev, pm_message_t state)
pci_save_state(pdev);
 
/* put ioc into READY_STATE */
-   if(SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, CAN_SLEEP)) {
+   if (SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, CAN_SLEEP))
printk(MYIOC_s_ERR_FMT
"pci-suspend:  IOC msg unit reset failed!\n", ioc->name);
-   }
 
/* disable interrupts */
CHIPREG_WRITE32(&ioc->chip->IntMask, 0x);
@@ -1773,10 +1772,9 @@ mpt_resume(struct pci_dev *pdev)
CHIPREG_READ32(&ioc->chip->Doorbell));
 
/* put ioc into READY_STATE */
-   if(SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, CAN_SLEEP)) {
+   if (SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, CAN_SLEEP))
printk(MYIOC_s_ERR_FMT
"pci-resume:  IOC msg unit reset failed!\n", ioc->name);
-   }
 
/* bring ioc to operational state */
if ((recovery_state = mpt_do_ioc_recovery(ioc,


[PATCH] mptbase: Reset ioc initiator during PCI resume

2007-09-20 Thread Darrick J. Wong
It appears that the LSI SAS 1064E chip needs to be reset after a
suspend/resume cycle before the driver attempts further communications with
the chip.  Without this patch, resuming the chip results in this error
message being printed repeatedly and no more disk I/O.

mptbase: ioc0: ERROR - Invalid IOC facts reply, msgLength=0 offsetof=6!

So far it seems to fix suspend/resume on all the MPT Fusion cards I have
(SAS and U320 SCSI) but since I don't know the internals of that chip I
can't say for sure if this is a proper fix.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/message/fusion/mptbase.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/message/fusion/mptbase.c b/drivers/message/fusion/mptbase.c
index 414c109..97895bd 100644
--- a/drivers/message/fusion/mptbase.c
+++ b/drivers/message/fusion/mptbase.c
@@ -1772,6 +1772,12 @@ mpt_resume(struct pci_dev *pdev)
(mpt_GetIocState(ioc, 1) >> MPI_IOC_STATE_SHIFT),
CHIPREG_READ32(&ioc->chip->Doorbell));
 
+   /* put ioc into READY_STATE */
+   if(SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, CAN_SLEEP)) {
+   printk(MYIOC_s_ERR_FMT
+   "pci-resume:  IOC msg unit reset failed!\n", ioc->name);
+   }
+
/* bring ioc to operational state */
if ((recovery_state = mpt_do_ioc_recovery(ioc,
MPT_HOSTEVENT_IOC_RECOVER, CAN_SLEEP)) != 0) {


[PATCH] libsas: SMP request handler shouldn't crash when rphy is NULL

2007-07-24 Thread Darrick J. Wong
sas_smp_handler crashes when smp utils are used with an aic94xx host
because certain devices (the sas_host itself, specifically) lack rphy
structures.  No rphy means no SMP target support, but we shouldn't crash
here.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/libsas/sas_expander.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c
index b500f0c..8603ae6 100644
--- a/drivers/scsi/libsas/sas_expander.c
+++ b/drivers/scsi/libsas/sas_expander.c
@@ -1879,7 +1879,7 @@ int sas_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
struct request *req)
 {
struct domain_device *dev;
-   int ret, type = rphy->identify.device_type;
+   int ret, type;
struct request *rsp = req->next_rq;
 
if (!rsp) {
@@ -1888,12 +1888,13 @@ int sas_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
return -EINVAL;
}
 
-   /* seems aic94xx doesn't support */
+   /* no rphy means no smp target support (ie aic94xx host) */
if (!rphy) {
printk("%s: can we send a smp request to a host?\n",
   __FUNCTION__);
return -EINVAL;
}
+   type = rphy->identify.device_type;
 
if (type != SAS_EDGE_EXPANDER_DEVICE &&
type != SAS_FANOUT_EXPANDER_DEVICE) {
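
The shape of the fix, reduced to a stand-alone sketch: the field is read only
after the pointer has been checked.  The struct layouts, the enum values and
smp_target_check() are simplified, hypothetical stand-ins for the libsas
types, chosen only so that the example compiles on its own.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the libsas structures. */
struct identify { int device_type; };
struct rphy { struct identify identify; };

/* Illustrative device types; the real enum lives in the SAS headers. */
enum { DEV_END = 1, DEV_EDGE_EXPANDER = 2, DEV_FANOUT_EXPANDER = 3 };

/* Return 0 if rphy is an expander we could send SMP to, -EINVAL otherwise. */
static int smp_target_check(const struct rphy *rphy)
{
	int type;

	/* no rphy (e.g. the host itself): bail out before touching it */
	if (!rphy)
		return -EINVAL;

	/* only now is it safe to dereference */
	type = rphy->identify.device_type;

	if (type != DEV_EDGE_EXPANDER && type != DEV_FANOUT_EXPANDER)
		return -EINVAL;

	return 0;
}
```

The original code initialized type in the declaration, so the dereference ran
before the NULL test ever had a chance to reject the request.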


Re: [PATCH] dtc: Coding police and printk levels

2007-06-22 Thread Darrick J. Wong
On Fri, Jun 22, 2007 at 02:26:29PM +0100, Alan Cox wrote:
> @@ -244,7 +242,7 @@
>   if (check_signature(base + signatures[sig].offset, signatures[sig].string, strlen(signatures[sig].string))) {
>   addr = bases[current_base].address;
>  #if (DTCDEBUG & DTCDEBUG_INIT)
> - printk("scsi-dtc : detected board.\n");
> + printk(KERB_DEBUG "scsi-dtc : detected board.\n");

I think you meant KERN_DEBUG ?

--D


Re: Patch added to scsi-pending-2.6: [SCSI] libsas: convert to use the data buffer accessors

2007-05-29 Thread Darrick J. Wong
On Sun, May 27, 2007 at 05:37:43PM +, James Bottomley wrote:
> [SCSI] libsas: convert to use the data buffer accessors

> This patch is pending because it requires ACKs from:
> 
> Darrick J. Wong <[EMAIL PROTECTED]>

ACK.

--D


[PATCH] aic94xx: asd_clear_nexus should fail if the cleared task does not complete

2007-05-16 Thread Darrick J. Wong
Every so often, the driver will call asd_clear_nexus to clean out a task.
It is supposed to be the case that the CLEAR NEXUS does not go on the done
list until after the task itself has been put on the done list, but for
some reason this doesn't always happen.  Thus, the
wait_for_completion_timeout call times out, and we return success.  This
makes libsas free the task even though the task hasn't completed, leading
to a BUG_ON message from aic94xx_hwi.c around line 341.  We should return
failure from asd_clear_nexus so that libsas tries again; at a bare minimum
it shouldn't be freeing active tasks.  I _think_ this will fix one of
the SCB timeout crash problems (though I've not been able to reproduce
it lately...)

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/aic94xx/aic94xx_tmf.c |   14 ++
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx_tmf.c b/drivers/scsi/aic94xx/aic94xx_tmf.c
index 9a14a6d..c0d0b7d 100644
--- a/drivers/scsi/aic94xx/aic94xx_tmf.c
+++ b/drivers/scsi/aic94xx/aic94xx_tmf.c
@@ -290,6 +290,7 @@ static void asd_tmf_tasklet_complete(str
 static inline int asd_clear_nexus(struct sas_task *task)
 {
int res = TMF_RESP_FUNC_FAILED;
+   int leftover;
struct asd_ascb *tascb = task->lldd_task;
unsigned long flags;
 
@@ -298,10 +299,12 @@ static inline int asd_clear_nexus(struct
res = asd_clear_nexus_tag(task);
else
res = asd_clear_nexus_index(task);
-   wait_for_completion_timeout(&tascb->completion,
-   AIC94XX_SCB_TIMEOUT);
+   leftover = wait_for_completion_timeout(&tascb->completion,
+  AIC94XX_SCB_TIMEOUT);
ASD_DPRINTK("came back from clear nexus\n");
spin_lock_irqsave(&task->task_state_lock, flags);
+   if (leftover < 1)
+   res = TMF_RESP_FUNC_FAILED;
if (task->task_state_flags & SAS_TASK_STATE_DONE)
res = TMF_RESP_FUNC_COMPLETE;
spin_unlock_irqrestore(&task->task_state_lock, flags);
@@ -350,6 +353,7 @@ int asd_abort_task(struct sas_task *task
unsigned long flags;
struct asd_ascb *ascb = NULL;
struct scb *scb;
+   int leftover;
 
spin_lock_irqsave(&task->task_state_lock, flags);
if (task->task_state_flags & SAS_TASK_STATE_DONE) {
@@ -455,9 +459,11 @@ int asd_abort_task(struct sas_task *task
break;
case TF_TMF_TASK_DONE + 0xFF00: /* done but not reported yet */
res = TMF_RESP_FUNC_FAILED;
-   wait_for_completion_timeout(&tascb->completion,
-   AIC94XX_SCB_TIMEOUT);
+   leftover = wait_for_completion_timeout(&tascb->completion,
+  AIC94XX_SCB_TIMEOUT);
spin_lock_irqsave(&task->task_state_lock, flags);
+   if (leftover < 1)
+   res = TMF_RESP_FUNC_FAILED;
if (task->task_state_flags & SAS_TASK_STATE_DONE)
res = TMF_RESP_FUNC_COMPLETE;
spin_unlock_irqrestore(&task->task_state_lock, flags);
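
The decision both hunks make after the wait can be written as a small pure
function, which may be easier to reason about than the diff.
wait_for_completion_timeout() returns 0 when it times out and the remaining
jiffies (at least 1) otherwise.  The enum and clear_nexus_result() are
hypothetical names for illustration, and the enum values are not the real
TMF_RESP_* codes.

```c
#include <stdbool.h>

/* Hypothetical response codes; the real TMF_RESP_* values differ. */
enum tmf_resp { TMF_COMPLETE, TMF_FAILED };

/*
 * issued:    result of sending the CLEAR NEXUS itself
 * leftover:  return value of the timed wait (0 means it timed out)
 * task_done: whether the task has the DONE flag set afterwards
 */
static enum tmf_resp clear_nexus_result(enum tmf_resp issued,
					unsigned long leftover,
					bool task_done)
{
	enum tmf_resp res = issued;

	/* the wait timed out: the cleared task never completed */
	if (leftover < 1)
		res = TMF_FAILED;

	/* a task that did reach the done list overrides everything else */
	if (task_done)
		res = TMF_COMPLETE;

	return res;
}
```

So a timeout with the task still outstanding now reports failure, while a task
that completed is still reported as complete even if the wait itself ran out.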


Re: [PATCH] aacraid: superfluous adapter reset for IBM 8 series ServeRAID controllers

2007-05-02 Thread Darrick J. Wong
Darrick J. Wong wrote:
> Salyzyn, Mark wrote:
>> The kexec patch introduced a superfluous (and otherwise inert) reset of
>> some adapters. The register can have a hardware default value that has
>> zeros for the undefined interrupts. This patch refines the test of the
>> interrupt enable register to focus on only the interrupts that affect
>> the driver in order to detect if an incomplete shutdown of the Adapter
>> had occurred (kdump).
> 
> Tests out ok on the affected machines, so:

/me shoves foot in mouth.  Crashes on a 2410SA aacraid card; OIMR =
0xF7.  De-ack.

--D


Re: [PATCH] aacraid: superfluous adapter reset for IBM 8 series ServeRAID controllers

2007-05-01 Thread Darrick J. Wong
Salyzyn, Mark wrote:
> The kexec patch introduced a superfluous (and otherwise inert) reset of
> some adapters. The register can have a hardware default value that has
> zeros for the undefined interrupts. This patch refines the test of the
> interrupt enable register to focus on only the interrupts that affect
> the driver in order to detect if an incomplete shutdown of the Adapter
> had occurred (kdump).

Tests out ok on the affected machines, so:

Acked-by: Darrick J. Wong <[EMAIL PROTECTED]>

--D


Re: Kernel crash with AIC94xx (one step forward, hope it's lucky)

2007-05-01 Thread Darrick J. Wong
Constantin Teodorescu wrote:

> 03:02:15 kernel: [ cut here ]
> 03:02:15 kernel: kernel BUG at drivers/scsi/aic94xx/aic94xx_hwi.h:354!

On the odd chance you still have this controller (and have the time to
test out patches), would you mind applying this patch:

http://sweaglesw.net/~djwong/docs/17-aic94xx-hwi-bugon_1.patch

and reporting back to me what happens?

--D


Re: [PATCH] aacraid: Initialize rx/rkt function pointers before calling them

2007-04-27 Thread Darrick J. Wong
Salyzyn, Mark wrote:
> As an option for a patch (later), what was the actual value of the
> Munit.OIMR register (on the x3550 and the x3650 please, just in case)?

0xF.

--D


Re: [PATCH] aacraid: Initialize rx/rkt function pointers before calling them

2007-04-27 Thread Darrick J. Wong
Salyzyn, Mark wrote:

> In my unit tests of aacraid_kexec_5.patch, restart was not called for
> normal operations. If you are just doing a normal boot, what conditions
> are causing restart to be called in your case? Is it a warm restart?
> Some kind of operation that leaves the Adapter in an initialized state,
> or a bug in the driver making sure that interrupts are disabled when
> shut down. Inquiring minds want to know!

This is a normal boot of a "Serveraid 8k-l" on an IBM x3550.  One
wrinkle in the configuration is that the system is booted off the
network, though I don't see how that would affect the aacraid's state.
It looks like the MUnit.OIMR test just after the "Failure to reset here
is an option..." comment is succeeding.  The crash seems to happen
regardless of whether we had just done a warm or cold boot.  The option
ROM had run during POST, if that makes any difference.  No kexec/kdump
have been configured.  For that matter, neither kexec nor kdump have
ever been run in the lifetime of the machine.

Also observed on an IBM x3650.

--D


[PATCH] aacraid: Initialize rx/rkt function pointers before calling them

2007-04-26 Thread Darrick J. Wong
Commit 8418852d11f0bbaeebeedd4243560d8fdc85410d to scsi-misc resulted in
the substitution of calls to rx_sync_cmd with a function pointer
abstraction.  aac_rx_restart_adapter requires a pointer to a sync_cmd
function, which is not set up before its first invocation.  That causes
the driver to crash at startup.  Move the initializers (we need both
rx_sync_cmd and enable_int pointers) further up to precede the
restart_adapter call.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/aacraid/rx.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/aacraid/rx.c b/drivers/scsi/aacraid/rx.c
index 0c71315..b7810d6 100644
--- a/drivers/scsi/aacraid/rx.c
+++ b/drivers/scsi/aacraid/rx.c
@@ -537,6 +537,8 @@ int _aac_rx_init(struct aac_dev *dev)
printk(KERN_WARNING "%s: unable to map adapter.\n", name);
goto error_iounmap;
}
+   dev->a_ops.adapter_sync_cmd = rx_sync_cmd;
+   aac_adapter_comm(dev, AAC_COMM_PRODUCER);
 
/* Failure to reset here is an option ... */
dev->OIMR = status = rx_readb (dev, MUnit.OIMR);
@@ -598,7 +600,6 @@ int _aac_rx_init(struct aac_dev *dev)
dev->a_ops.adapter_interrupt = aac_rx_interrupt_adapter;
dev->a_ops.adapter_disable_int = aac_rx_disable_interrupt;
dev->a_ops.adapter_notify = aac_rx_notify_adapter;
-   dev->a_ops.adapter_sync_cmd = rx_sync_cmd;
dev->a_ops.adapter_check_health = aac_rx_check_health;
dev->a_ops.adapter_restart = aac_rx_restart_adapter;
 
@@ -606,7 +607,6 @@ int _aac_rx_init(struct aac_dev *dev)
 *  First clear out all interrupts.  Then enable the one's that we
 *  can handle.
 */
-   aac_adapter_comm(dev, AAC_COMM_PRODUCER);
aac_adapter_disable_int(dev);
rx_writel(dev, MUnit.ODR, 0x);
aac_adapter_enable_int(dev);
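
The ordering problem described in this patch can be reproduced outside the
kernel.  What follows is a minimal userspace sketch, not the aacraid code:
demo_dev, demo_ops and the demo_* functions are hypothetical names made up
for illustration, and the NULL check in demo_restart() is there only so the
failure can be observed as a return value rather than as a crash.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver's structures; not the real aacraid types. */
struct demo_dev;

struct demo_ops {
	int (*sync_cmd)(struct demo_dev *dev, int cmd);
};

struct demo_dev {
	struct demo_ops ops;
	int last_cmd;
};

static int demo_sync_cmd(struct demo_dev *dev, int cmd)
{
	dev->last_cmd = cmd;
	return 0;
}

/* The restart path relies on ops.sync_cmd having been assigned already. */
static int demo_restart(struct demo_dev *dev)
{
	if (dev->ops.sync_cmd == NULL)
		return -1;	/* guard added for the demo; calling through NULL would crash */
	return dev->ops.sync_cmd(dev, 42);
}

/* Broken order: the restart runs before the pointer is set up. */
int demo_init_broken(struct demo_dev *dev)
{
	int rc;

	dev->ops.sync_cmd = NULL;
	dev->last_cmd = 0;
	rc = demo_restart(dev);
	dev->ops.sync_cmd = demo_sync_cmd;	/* too late for the call above */
	return rc;
}

/* Fixed order: assign the pointer first, then restart. */
int demo_init_fixed(struct demo_dev *dev)
{
	dev->ops.sync_cmd = demo_sync_cmd;
	dev->last_cmd = 0;
	return demo_restart(dev);
}
```

Calling demo_init_broken() on a fresh struct demo_dev reports the failure,
while demo_init_fixed() succeeds because the assignment comes before the
first use, which is the same reordering the patch applies.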


Re: Kernel crash with AIC94xx (one step forward, hope it's lucky)

2007-04-26 Thread Darrick J. Wong
Constantin Teodorescu wrote:

> So ... should I ask for other controller quotation ?
> Could you recommend me a good SAS controller, with 8 internal ports,
> supporting Linux , with 99.% reliability ? :-)
> 
> I have the following options : Intel® RAID Controller SRCSAS18E
> (Parowan)  and   LSI MegaRAID SAS 8408E
> 
> so ... your bet ? :-)

I don't know anything about either of those controllers, though the LSI
1068E has worked quite reliably for me.  I decline to make any
statements about 99.% reliability, however.

--D



Re: aic94xx driver woes

2007-03-31 Thread Darrick J. Wong
Douglas Gilbert wrote:

> So that is almost 12 months that I have been reporting
> this driver as broken. Is it just me or my hardware?

I seem to recall you saying that the LSI Fusion card was plugged into
the same expander as the 48300?  If so, does unplugging the Fusion card
from the expander make it work?

> aic94xx: Found sequencer Firmware version 1.1 (V17/10c6)

Have you tried the V30 sequencer?

--D


[PATCH 2/2]: sas_ata: Don't reset the phy in post_internal_command

2007-02-22 Thread Darrick J. Wong
We don't need to reset the SAS phy in sas_ata_post_internal; all
that is necessary is to clear out the task from the SAS HA.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/libsas/sas_ata.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index c92f4b6..d91c5ba 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -281,11 +281,6 @@ static void sas_ata_post_internal(struct
 
qc->driver_data = NULL;
if (task) {
-   /* Should this be a AT(API) device reset? */
-   spin_lock_irqsave(&task->task_state_lock, flags);
-   task->task_state_flags |= SAS_TASK_NEED_DEV_RESET;
-   spin_unlock_irqrestore(&task->task_state_lock, flags);
-
task->uldd_task = NULL;
__sas_task_abort(task);
}


[PATCH 1/2] sas_ata: Rename ata_queued_cmd->lldd_task to driver_data

2007-02-22 Thread Darrick J. Wong
Per Tejun's request, rename the lldd_task field and add comments about it.

Signed-off-by: Darrick J. Wong <[EMAIL PROTECTED]>
---

 drivers/scsi/libsas/sas_ata.c |    8 ++++----
 include/linux/libata.h        |    4 +++-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 2db2589..c92f4b6 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -122,7 +122,7 @@ static void sas_ata_task_done(struct sas
}
}
 
-   qc->lldd_task = NULL;
+   qc->driver_data = NULL;
if (qc->scsicmd)
ASSIGN_SAS_TASK(qc->scsicmd, NULL);
ata_qc_complete(qc);
@@ -192,7 +192,7 @@ static unsigned int sas_ata_qc_issue(str
task->scatter = qc->__sg;
task->ata_task.retry_count = 1;
task->task_state_flags = SAS_TASK_STATE_PENDING;
-   qc->lldd_task = task;
+   qc->driver_data = task;
 
switch (qc->tf.protocol) {
case ATA_PROT_NCQ:
@@ -276,10 +276,10 @@ static void sas_ata_post_internal(struct
 * bother with sas_ata_task_done.  But we still
 * ought to abort the task.
 */
-   struct sas_task *task = qc->lldd_task;
+   struct sas_task *task = qc->driver_data;
unsigned long flags;
 
-   qc->lldd_task = NULL;
+   qc->driver_data = NULL;
if (task) {
/* Should this be a AT(API) device reset? */
spin_lock_irqsave(&task->task_state_lock, flags);
diff --git a/include/linux/libata.h b/include/linux/libata.h
index a20646c..a8eafc7 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -445,7 +445,9 @@ struct ata_queued_cmd {
ata_qc_cb_t complete_fn;
 
void*private_data;
-   void*lldd_task;
+
+   /* This is owned by a low level libata client */
+   void*driver_data;
 };
 
 struct ata_port_stats {


Re: Please help if u can.

2007-02-21 Thread Darrick J. Wong
John Scarpa wrote:
> First a very big thanks to all of u! I have been suffering a serious
> lack of sleep problem lately..  i should have noticed that by whom has
> been submitting the past 500 fixes and updates!
> 
> Quick question, is the driver still consider experimental??

Very much so.  The SAS bits are fairly stable nowadays, but the rest is
still YMWV. :)

> the guys i
> work with say it doesn't support sata drives and it's still experimental

SATA support is under development.  Patches exist in the git tree here:
http://www.kernel.org/git/?p=linux/kernel/git/jejb/aic94xx-sas-2.6.git;a=summary

> so don't use it.  And i can't find anything on the state of this driver.
> 
> PS.  I should have said i dropped that aic94xx-seq.fw in
> /lib,/lib/firmware,/lib64,/lib64/firmware  (still have yet to get this
> sucker to work)

Yes, you need a udev that's new enough to know how to handle the
firmware loading interface.  Typically, udev will load firmware from
/lib/firmware, though I suppose that depends on the distribution.  I'm not
sure whether RH/Fedora support fw loading; newer Ubuntu-E and SuSE do...

--D


Re: Please help if u can.

2007-02-20 Thread Darrick J. Wong
Douglas Gilbert wrote:

> It would be reasonable to assume that Luben is the maintainer
> of this code although the MAINTAINERS file has no entry
> for the aic94xx driver.
> 
> This code was effectively removed from Luben's control
> about 18 months ago and has passed through several sets
> of hands since then. None of the people concerned want
> to identify themselves in the source or explain
> what has been done in writing. Why?

Laziness, in my case.  I suppose it would be useful to document the fact
that I've made changes to libsas/aic94xx.  Though the "what has been
done" part ... I was hoping the commit messages would suffice.

> The existing copyright notices should remain but what
> about the massive changes to that driver in 2006?
> Where is the indication of whom John should contact?
>
> Perhaps something like:
>  "For maintenance of this driver contact
>   linux-scsi@vger.kernel.org"

Alexis can answer this part.

> FYI John, Luben maintains his own version of the
> aic94xx and that version holds the sequencer firmware
> in a binary blob within the driver. Due to kernel policy,
> that blob was moved to separate file by someone else. The
> error that you are reporting suggests that the firmware
> file for the aic94xx cannot be found. On my system
> the file that you may be missing looks like this:
> # ls -l /lib/firmware/aic94xx-seq.fw
> -rw-rw-r-- 1 root root 22622 Aug 29 17:36 /lib/firmware/aic94xx-seq.fw

http://kernel.org/pub/linux/kernel/people/jejb/aic94xx-seq.fw

--D


Re: BUG in libata from ata_sas_port_alloc

2007-02-15 Thread Darrick J. Wong
James Bottomley wrote:

> The problem is that memory obtained by devm_kzalloc() cannot be returned
> by kfree() ... they come from different allocation lists.  The solution
> is probably to have a corresponding ata_probe_ent_free(), I just don't
> exactly see how to tell if the object came from the devm_kzalloc or not
> (unless it gets marked).

Just a shot in the dark, but could we simply make whatever changes are
necessary to make all sas-ata LLDDs managed and then use devm_kzalloc?
Though (and I may be totally wrong here) if it's the case that
devres_head is made (or not made) to be part of a list _only_ before we
reach ata_probe_ent_alloc, we could put a similar if check into the free
function.
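
To make the "marked object" idea from the quoted mail concrete, here is a
rough userspace sketch.  It is not libata code and uses no kernel APIs: the
demo_* names, the origin field and the counter are all hypothetical, with
calloc()/free() standing in for the two allocation paths.  The point is
only that a free routine can branch on a tag recorded at allocation time.

```c
#include <stdlib.h>

/* Hypothetical tag recording which allocation path produced the object. */
enum demo_origin { DEMO_PLAIN, DEMO_MANAGED };

struct demo_probe_ent {
	enum demo_origin origin;
	int payload;
};

/* Toy stand-in for a managed allocator: counts objects it still owns. */
int demo_managed_live;

struct demo_probe_ent *demo_probe_ent_alloc(int managed)
{
	struct demo_probe_ent *ent = calloc(1, sizeof(*ent));

	if (!ent)
		return NULL;
	if (managed) {
		ent->origin = DEMO_MANAGED;
		demo_managed_live++;
	} else {
		ent->origin = DEMO_PLAIN;
	}
	return ent;
}

/* Returns 1 if the object was freed here, 0 if it was left to its owner. */
int demo_probe_ent_free(struct demo_probe_ent *ent)
{
	if (!ent || ent->origin == DEMO_MANAGED)
		return 0;	/* managed objects are released by their owner */
	free(ent);
	return 1;
}

/* Release path for managed objects, standing in for the owner's cleanup. */
void demo_managed_release(struct demo_probe_ent *ent)
{
	free(ent);
	demo_managed_live--;
}
```

The cost of this approach is one extra field per object; whether that is
acceptable for the real structure is a separate question.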

--D


Re: Any multipath SAS support in Linux?

2007-02-14 Thread Darrick J. Wong
Orion Poplawski wrote:
> I'm thinking about trying to setup a two node HA storage cluster
> connected to an external SAS box.  Is such a thing possible at this time?

I've had success with aic94xx + dm_multipath before.  There has recently
been a bug in the multipath tools wherein it fails to detect the disk type
due to the removal of a "bus" attribute in sysfs, and I don't know if
that's been fixed (aside from building your own with the sysfs part
removed).

Note that success == I set it up, started I/O, yanked some cables, and
it kept chugging. :)

--D

