Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Joel Becker
On Thu, Jan 23, 2014 at 07:55:50AM -0500, Theodore Ts'o wrote:
 On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote:
   
   I expect it would be relatively simple to get large blocksizes working
   on powerpc with 64k PAGE_SIZE.  So before diving in and doing huge
   amounts of work, perhaps someone can do a proof-of-concept on powerpc
   (or ia64) with 64k blocksize.
  
  Reality check: 64k block sizes on 64k page Linux machines have been
  used in production on XFS for at least 10 years. It's exactly the
  same case as 4k block size on 4k page size - one page, one buffer
  head, one filesystem block.
 
 This is true for ext4 as well.  Block size == page size support is
 pretty easy; the hard part is when block size > page size, due to
 assumptions in the VM layer that force the file system to do a lot
 of extra work to fudge around them.  So the real problem comes with
 trying to support 64k block sizes on a 4k page architecture, and
 whether we can do it in a way where every single file system doesn't
 have to do its own specific hacks to work around assumptions made in
 the VM layer.

Yup, ditto for ocfs2.
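
To make the distinction concrete, here is a minimal sketch of the
arithmetic.  The helper name pages_per_block() is made up for
illustration and is not a kernel function.  When block size <= page
size a block never spans more than one page; with 64k blocks on 4k
pages every block spans 16 pages that the page cache would have to
treat as a unit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of pages a single filesystem block spans.
 * A result of 1 is the easy case (block size <= page size);
 * anything larger is the hard case discussed in this thread.
 */
size_t pages_per_block(size_t block_size, size_t page_size)
{
    assert(block_size != 0 && page_size != 0);
    /* Round up so a block smaller than a page still counts as 1. */
    return (block_size + page_size - 1) / page_size;
}
```

For example, pages_per_block(65536, 65536) is the one page, one block
case, while pages_per_block(65536, 4096) is the multi-page case.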

Joel

-- 

One of the symptoms of an approaching nervous breakdown is the
 belief that one's work is terribly important.
 - Bertrand Russell 

http://www.jlbec.org/
jl...@evilplan.org
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Joel Becker
On Wed, Jan 22, 2014 at 10:47:01AM -0800, James Bottomley wrote:
 On Wed, 2014-01-22 at 18:37 +, Chris Mason wrote:
  On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote:
   On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
 [agreement cut because it's boring for the reader]
   Realistically, if you look at what the I/O schedulers output on a
   standard (spinning rust) workload, it's mostly large transfers.
   Obviously these are misaligned at the ends, but we can fix some of that
   in the scheduler.  Particularly if the FS helps us with layout.  My
   instinct tells me that we can fix 99% of this with layout on the FS + io
   schedulers ... the remaining 1% goes to the drive as needing to do RMW
   in the device, but the net impact to our throughput shouldn't be that
   great.
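
A rough way to see which transfers force RMW in the device is to count
the physical sectors that a request covers only partially.  The sketch
below assumes byte offsets and a caller-supplied physical sector size;
the function name is hypothetical and this is not block-layer API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the physical sectors that a write of len bytes at byte
 * offset covers only partially.  Each such sector has to be
 * read, modified and written back by the device.
 */
unsigned rmw_sectors(uint64_t offset, uint64_t len, uint64_t sector)
{
    assert(sector != 0 && len != 0);

    uint64_t end = offset + len;          /* first byte past the write */
    uint64_t first = offset / sector;     /* sector holding first byte */
    uint64_t last = (end - 1) / sector;   /* sector holding last byte  */
    unsigned head = (offset % sector) != 0;
    unsigned tail = (end % sector) != 0;

    /* A write inside one sector is a single RMW at most. */
    if (first == last)
        return (head || tail) ? 1 : 0;
    return head + tail;
}
```

A large transfer costs at most two RMW cycles no matter how long it is,
while a small write that fits inside one sector always costs one unless
it happens to cover that sector exactly.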
  
  There are a few workloads where the VM and the FS would team up to make
  this fairly miserable:
  
  Small files.  Delayed allocation fixes a lot of this, but the VM doesn't
  realize that fileA, fileB, fileC, and fileD all need to be written at
  the same time to avoid RMW.  Btrfs and MD have set up plugging callbacks
  to accumulate full stripes as much as possible, but it still hurts.
  
  Metadata.  These writes are very latency sensitive and we'll gain a lot
  if the FS is explicitly trying to build full sector IOs.
 
 OK, so these two cases I buy ... the question is can we do something
 about them today without increasing the block size?
 
 The metadata problem, in particular, might be block independent: we
 still have a lot of small chunks to write out at fractured locations.
 With a large block size, the FS knows it's been bad and can expect the
 rolled up newspaper, but it's not clear what it could do about it.
 
 The small files issue looks like something we should be tackling today
 since writing out adjacent files would actually help us get bigger
 transfers.

ocfs2 can actually take significant advantage here, because we store
small file data in-inode.  This would grow our in-inode size from ~3K to
~15K or ~63K.  We'd actually have to do more work to start putting more
than one inode in a block (though that would be a promising avenue too,
once the coordination is solved generically).
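
The figures above are consistent with roughly 1K of each inode block
being taken up by inode metadata.  A minimal sketch of that arithmetic,
assuming a flat 1K overhead (inferred from the ~3K figure, not read off
the ocfs2 on-disk structures) and a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-inode metadata overhead; illustrative only. */
#define ASSUMED_INODE_OVERHEAD 1024u

/*
 * Bytes of file data that fit inside the inode when an inode
 * occupies exactly one filesystem block.
 */
size_t inline_data_capacity(size_t block_size)
{
    assert(block_size > ASSUMED_INODE_OVERHEAD);
    return block_size - ASSUMED_INODE_OVERHEAD;
}
```

With 4k, 16k and 64k blocks this gives 3K, 15K and 63K respectively,
matching the estimates above.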

Joel


-- 

One of the symptoms of an approaching nervous breakdown is the
 belief that one's work is terribly important.
 - Bertrand Russell 

http://www.jlbec.org/
jl...@evilplan.org


Re: [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-21 Thread Joel Becker
On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
 One topic that has been lurking forever at the edges is the current
 4k limitation for file system block sizes. Some devices in
 production today and others coming soon have larger sectors and it
 would be interesting to see if it is time to poke at this topic
 again.
 
 LSF/MM seems to be pretty much the only event of the year at which most
 of the key people will be present, so it should be a great topic for a
 joint session.

Oh yes, I want in on this.  We handle 4k/16k/64k pages seamlessly, and
we would want to do the same for larger sectors.  In theory, our code
should handle it with the appropriate defines updated.

Joel



Re: [LSF/MM TOPIC] [ATTEND] persistent memory progress, management of storage & file systems

2014-01-07 Thread Joel Becker
On Mon, Jan 06, 2014 at 05:32:56PM -0500, faibish, sorin wrote:
 Speaking of persistent memory I would like to discuss the PMFS as well as 
 RDMA aspects of the persistent memory model. Also I would like to discuss KV 
 stores and object stores on persistent memory. I was involved in the PMFS as 
 a tester and I found several issues that I would like to discuss with the 
 community. I assume that maybe others from Intel could join this discussion 
 besides Andy and Matt, who already asked for this topic. Thanks

Ooh, and the cluster/remote filesystem stories there (eg, RDMA etc) are
probably pretty cool.

Joel

 
 ./Sorin
 
 -Original Message-
 From: linux-fsdevel-ow...@vger.kernel.org 
 [mailto:linux-fsdevel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
 Sent: Monday, January 06, 2014 5:21 PM
 To: linux-scsi@vger.kernel.org; linux-...@vger.kernel.org; 
 linux...@kvack.org; linux-fsde...@vger.kernel.org; 
 lsf...@lists.linux-foundation.org
 Cc: linux-ker...@vger.kernel.org
 Subject: [LSF/MM TOPIC] [ATTEND] persistent memory progress, management of 
 storage & file systems
 
 
 I would like to attend this year and continue to talk about the work on 
 enabling the new class of persistent memory devices. Specifically, very 
 interested in talking about both using a block driver under our existing 
 stack and also progress at the file system layer (adding xip/mmap tweaks to 
 existing file systems and looking at new file systems).
 
 We also have a lot of work left to do on unifying management, it would be 
 good to resync on that.
 
 Regards,
 
 Ric
 

-- 

 The herd instinct among economists makes sheep look like
 independent thinkers.

http://www.jlbec.org/
jl...@evilplan.org


Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-08 Thread Joel Becker
On Thu, Feb 07, 2013 at 02:12:57PM -0500, Martin K. Petersen wrote:
  Joel == Joel Becker jl...@evilplan.org writes:
 
 Joel I'm happy to chat about it.  Unfortunately, like Darrick says,
 Joel sys_dio() coding hasn't happened.  I do think we're better off
 Joel with some kind of explicit API than some magic state on the file.
 Joel I mean, even something like:
 
 Joel ssize_t write_with_pi(int fd, const void *buf, size_t count,
 Joel   const void *pi, size_t pi_count);
 
 Joel It's not as nice as a non-historical API (eg sys_dio), but it also
 Joel probably plays nicer with buffered I/O.
 
 Pretty much everyone I have talked to that are interested in explicitly
 attaching PI (as opposed to relying on the kernel doing it) are using
 Linux aio.
 
 I am not opposed to having a more read()/write()-like interface as
 well. But I think it's important to cater to the I/O paradigm used by
 the applications interested in this. It's a lot easier to tweak a few
 IOCB fields than it is to rewrite how an application does I/O.

You know I'm not going to argue with this.  I was merely stating that
I'm flexible in how we start :-)

Joel

 
 -- 
 Martin K. Petersen    Oracle Linux Engineering

-- 

Depend on the rabbit's foot if you will, but remember, it didn't
 help the rabbit.
- R. E. Shay

http://www.jlbec.org/
jl...@evilplan.org


[LSF/MM TOPIC][ATTEND] protection information batched I/O interfaces.

2013-02-08 Thread Joel Becker
I'm definitely interested in attending to discuss PI injection from
userspace, batched I/O interfaces, and potential O_DIRECT cleanups.

Joel



Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-08 Thread Joel Becker
On Thu, Feb 07, 2013 at 04:04:36PM -0500, J. Bruce Fields wrote:
 On Thu, Feb 07, 2013 at 09:36:39AM -0800, Joel Becker wrote:
  Dear LSF committee,
  I'd like to explicitly request attendance for this discussion
  :-)
 
 http://marc.info/?l=linux-fsdevel&m=135894412908342&w=2
 
   Also, the way I compile the list of requests is from thread
   heads ...  that means don't send your attendee request as a
   reply to something else either otherwise it might get missed.

Ack.  Sent as such.

Thanks,
Joel

 
 --b.
 
  
  Joel
  
  On Thu, Feb 07, 2013 at 09:27:35AM -0800, Zach Brown wrote:
   On Thu, Feb 07, 2013 at 11:19:59AM -0500, Jeff Moyer wrote:
    Boaz Harrosh bharr...@panasas.com writes:
     
     For aio we just need to add additional fields to an existing structure.
     
     So yeah, I'd be interested in that discussion as well.
    
    Sure, it's easy to start there, but then you eventually end up having to
    add a non-aio interface as well.  Let's not take the latter off the
    table.
   
   I agree that a sync variant shouldn't be ignored, but needing a sync
   interface with PI arguments also shouldn't get in the way of adding
   support to the aio+dio path.  Simply because it's what people use :/.
   
    I'm not sure how that's directly related to aio, but ok.  If we're going
    to rewrite the aio code, I think Zach's acall would be a good start, at
    least on the API front:
      http://lwn.net/Articles/316806/
   
   Yeah, I'm happy to chat about this stuff if people are interested.  I
   think I'd do things differently today than what was done in that aged
   acall prototype.
   
   - z
  
  -- 
  
  You can get more with a kind word and a gun than you can with
   a kind word alone.
   - Al Capone
  
  http://www.jlbec.org/
  jl...@evilplan.org

-- 

You look in her eyes, the music begins to play.
 Hopeless romantics, here we go again.

http://www.jlbec.org/
jl...@evilplan.org


Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-07 Thread Joel Becker
On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote:
 
 On Feb 6, 2013, at 3:24 PM, Darrick J. Wong darrick.w...@oracle.com wrote:
 
  On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote:
  Hi,
  
  I'm interested in discussing how to pass protection information to and from
  userspace.  Maybe Martin could be enlisted for the discussion.
  
  I read that some work has already been done in this area but have not been
  able to locate it.  It looks like the bio-integrity code already makes it
  possible to generate the t10-dif crc in the filesystem.  It would be good to
  be able to get the guard and application tags back out to backup
  applications such as xfsdump.  Enabling other applications to generate their
  own tags in userspace is also interesting.
  
  This one's been on my list for a couple of years (and companies) too.  A few
  years ago Joel Becker had support for it in his sys_dio proposal (that hasn't
  gone anywhere), and more recently I've theorized that we could add a magic
  fcntl/ioctl to make the kernel recognize, say, the first iovec of an O_DIRECT
  *{read,write}v call as the PI buffer, which I think is similar to how DIX
  gets PI data to a disk.  But it's not like I have any code to show for it.
  
  I /think/ it's fairly straightforward to change the directio submit code to
  find the userspace PI buffer and amend the block integrity code to attach our
  own PI buffer.  You'd still have to let the block layer set the sector #
  field, but afaik that won't affect the crc or the app tag.
  
  I hear that the NFS guys want to propose some sort of protocol for
  transmitting PI data (across NFS), but I haven't seen anything concrete yet.
 
 I'm writing a requirements document for the NFS protocol which I can discuss
 at LSF.  The use cases for NFS for now would be virtual disk devices
 (hypervisors) or direct NFS access to storage from user space.
 
 Like everyone else we are waiting for a magical VFS and user space API to
 appear that can pass PI to and from storage.

I'm happy to chat about it.  Unfortunately, like Darrick says, sys_dio()
coding hasn't happened.  I do think we're better off with some kind of
explicit API than some magic state on the file.  I mean, even something
like:

ssize_t write_with_pi(int fd, const void *buf, size_t count,
  const void *pi, size_t pi_count);

It's not as nice as a non-historical API (eg sys_dio), but it also
probably plays nicer with buffered I/O.
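
For a sense of what a caller of such an interface would have to
prepare, here is a minimal sketch of building the pi buffer in
userspace.  It assumes the T10 DIF layout of one 8-byte tuple (16-bit
guard CRC, 16-bit application tag, 32-bit reference tag, all
big-endian) per protection interval.  The helper names are
hypothetical, and write_with_pi() itself is only the proposal above,
not an existing call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PI_TUPLE_SIZE 8u  /* 2-byte guard + 2-byte app tag + 4-byte ref tag */

/* CRC-16/T10-DIF: polynomial 0x8BB7, initial value 0, no reflection. */
uint16_t crc_t10dif(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x8BB7);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Bytes of PI needed for count bytes of data (whole intervals only). */
size_t pi_bytes_needed(size_t count, size_t interval)
{
    assert(interval != 0 && count % interval == 0);
    return (count / interval) * PI_TUPLE_SIZE;
}

/* Fill pi with one tuple per interval of buf. */
void pi_build(uint8_t *pi, const uint8_t *buf, size_t count,
              size_t interval, uint16_t app_tag, uint32_t first_ref)
{
    size_t tuples = pi_bytes_needed(count, interval) / PI_TUPLE_SIZE;

    for (size_t i = 0; i < tuples; i++) {
        uint8_t *t = pi + i * PI_TUPLE_SIZE;
        uint16_t guard = crc_t10dif(buf + i * interval, interval);
        uint32_t ref = first_ref + (uint32_t)i;

        t[0] = (uint8_t)(guard >> 8);    /* guard tag, big-endian */
        t[1] = (uint8_t)(guard & 0xff);
        t[2] = (uint8_t)(app_tag >> 8);  /* application tag */
        t[3] = (uint8_t)(app_tag & 0xff);
        t[4] = (uint8_t)(ref >> 24);     /* reference tag */
        t[5] = (uint8_t)(ref >> 16);
        t[6] = (uint8_t)(ref >> 8);
        t[7] = (uint8_t)(ref & 0xff);
    }
}
```

A caller would size the buffer with pi_bytes_needed(count, 512), fill
it with pi_build(), and pass it as the pi and pi_count arguments.  As
noted earlier in the thread, the block layer would probably still set
the reference (sector number) field itself, so the value written here
may only be a placeholder.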

Joel

 
  Well, I hope I'll scrape together the time to hack together a PoC before
  LSF... on the other hand, I ran the discussion about PI userland interfaces
  at LPC2011 and (shamefully) haven't done anything yet.
  
  end rambling
  
  --D
  
  Regards,
 Ben
 
 -- 
 Chuck Lever
 chuck[dot]lever[at]oracle[dot]com
 
 
 
 

-- 

I think it would be a good idea.  
- Mahatma Gandhi, when asked what he thought of Western
  civilization

http://www.jlbec.org/
jl...@evilplan.org


Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-07 Thread Joel Becker
Dear LSF committee,
I'd like to explicitly request attendance for this discussion
:-)

Joel

On Thu, Feb 07, 2013 at 09:27:35AM -0800, Zach Brown wrote:
 On Thu, Feb 07, 2013 at 11:19:59AM -0500, Jeff Moyer wrote:
  Boaz Harrosh bharr...@panasas.com writes:
   
   For aio we just need to add additional fields to an existing structure.
   
   So yeah, I'd be interested in that discussion as well.
  
  Sure, it's easy to start there, but then you eventually end up having to
  add a non-aio interface as well.  Let's not take the latter off the
  table.
 
 I agree that a sync variant shouldn't be ignored, but needing a sync
 interface with PI arguments also shouldn't get in the way of adding
 support to the aio+dio path.  Simply because it's what people use :/.
 
  I'm not sure how that's directly related to aio, but ok.  If we're going
  to rewrite the aio code, I think Zach's acall would be a good start, at
  least on the API front:
http://lwn.net/Articles/316806/
 
 Yeah, I'm happy to chat about this stuff if people are interested.  I
 think I'd do things differently today than what was done in that aged
 acall prototype.
 
 - z

-- 

You can get more with a kind word and a gun than you can with
 a kind word alone.
 - Al Capone

http://www.jlbec.org/
jl...@evilplan.org