Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, Jan 22, 2014 at 02:34:52PM +0000, Mel Gorman wrote:
On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:
On 01/22/2014 04:34 AM, Mel Gorman wrote:
On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:

One topic that has been lurking forever at the edges is the current 4k limitation for file system block sizes. Some devices in production today and others coming soon have larger sectors and it would be interesting to see if it is time to poke at this topic again.

Large block support was proposed years ago by Christoph Lameter (http://lwn.net/Articles/232757/). I think I was just getting started in the community at the time so I do not recall any of the details. I do believe it motivated an alternative by Nick Piggin called fsblock though (http://lwn.net/Articles/321390/). At the very least it would be nice to know why neither was ever merged, for those of us who were not around at the time and who may not have the chance to dive through mailing list archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring high-order allocations to succeed for the system to behave correctly.

I have a somewhat hazy memory of Andrew warning us that touching this code takes us into dark and scary places.

That is a light summary. As Andrew tends to reject patches with poor documentation in case we forget the details in 6 months, I'm going to guess that he does not remember the details of a discussion from 7-ish years ago. This is where Andrew swoops in with a dazzling display of his eidetic memory just to prove me wrong.

Ric, are there any storage vendors pushing for this right now? Is someone working on this right now or planning to? If they are, have they looked into the history of fsblock (Nick) and large block support (Christoph) to see if they are candidates for forward porting or reimplementation? I ask because without that person there is a risk that the discussion will go as follows:

Topic leader: Does anyone have an objection to supporting larger block sizes than the page size?
Room: Send patches and we'll talk.

So, from someone who was down in the trenches of the large filesystem block size code wars: the main objection to Christoph Lameter's patchset was that it used high-order compound pages in the page cache so that nothing at the filesystem level needed to be changed to support large block sizes. The patch to enable XFS to use 64k block sizes with Christoph's patches simply removed 5 lines of code that limited the block size to PAGE_SIZE. And everything just worked. Given that compound pages are used all over the place now, and we also have page migration, compaction and other MM support that greatly improves high-order memory allocation, perhaps we should revisit this approach.

As to Nick's fsblock rewrite, he basically rewrote all the bufferhead code to handle filesystem blocks larger than a page whilst leaving the page cache untouched, i.e. the complete opposite approach. The problem with this approach is that every filesystem needs to be rewritten to use fsblocks rather than bufferheads. For some filesystems that isn't hard (e.g. ext2), but for filesystems that use bufferheads in the core of their journalling subsystems it's a completely different story. And for filesystems like XFS it doesn't solve any of the problems with using bufferheads that we have now, so it simply introduces a huge amount of IO path rework and validation without providing any advantage from a feature or performance point of view; extent-based filesystems mostly negate the impact of filesystem block size on IO performance anyway...

Realistically, if I'm going to do something in XFS to add block size > page size support, I'm going to do it with something XFS can track through its own journal, so I can add data=journal functionality with the same filesystem block/extent header structures used to track the pages in blocks larger than PAGE_SIZE. And we already have such infrastructure in XFS to support directory blocks larger than the filesystem block size...

FWIW, as to the original large sector size support question, XFS already supports sector sizes up to 32k in size. The limitation is actually a limitation of the journal format, so going larger than that would take some work...

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:
On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote:

[ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my concerns ...

I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster.

Do we even need to do that (eliminate buffer heads)? We cope with 4k-sector-only devices just fine today because the bh mechanisms now operate on top of the page cache and can do the RMW necessary to update a bh in the page cache itself, which allows us to do only 4k chunked writes. So we could keep the bh system and just alter the granularity of the page cache.

We're likely to have people mixing 4K drives and <fill in some other size here> drives on the same box. We could just go with the biggest size and use the existing bh code for the sub-pagesized blocks, but I really hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope with this. It's the variable granularity that's the VM problem.

From a pure code point of view, it may be less work to change it once in the VM. But from an overall system impact point of view, it's a big change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be a good reason to keep it.

The other question is: if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in, what would altering the granularity of the page cache buy us?

The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive.

I agree with all of that, but my question is still: can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today? If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is whether the FS can make use of this layout information *without* changing the page cache granularity. Only if you answer me "no" to this do I think we need to worry about changing page cache granularity.

We already do this today. The problem is that we are limited by the page cache assumption that the block device/filesystem never needs to manage multiple pages as an atomic unit of change. Hence we can't use the generic infrastructure as it stands to handle block/sector sizes larger than a page size...

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
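For reference, the propagation James describes happens through the request queue limits. A minimal sketch of the existing interface (hypothetical driver, sizes assumed purely for illustration):

#include <linux/blkdev.h>

/*
 * Sketch: a block driver advertising a device with 4k logical and
 * 16k physical sectors. Filesystems read these limits back (e.g. via
 * bdev_physical_block_size()) to pick allocation alignment.
 */
static void example_set_queue_limits(struct request_queue *q)
{
        blk_queue_logical_block_size(q, 4096);   /* smallest addressable unit */
        blk_queue_physical_block_size(q, 16384); /* drive's internal sector */
        blk_queue_io_min(q, 16384);  /* minimum I/O that avoids device RMW */
        blk_queue_io_opt(q, 16384);  /* preferred I/O granularity */
}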
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, Jan 22, 2014 at 09:21:40AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote:
On Wed, 2014-01-22 at 15:19 +0000, Mel Gorman wrote:
On Wed, Jan 22, 2014 at 09:58:46AM -0500, Ric Wheeler wrote:
On 01/22/2014 09:34 AM, Mel Gorman wrote:
On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:
On 01/22/2014 04:34 AM, Mel Gorman wrote:
On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:

One topic that has been lurking forever at the edges is the current 4k limitation for file system block sizes. Some devices in production today and others coming soon have larger sectors and it would be interesting to see if it is time to poke at this topic again.

Large block support was proposed years ago by Christoph Lameter (http://lwn.net/Articles/232757/). I think I was just getting started in the community at the time so I do not recall any of the details. I do believe it motivated an alternative by Nick Piggin called fsblock though (http://lwn.net/Articles/321390/). At the very least it would be nice to know why neither was ever merged, for those of us who were not around at the time and who may not have the chance to dive through mailing list archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring high-order allocations to succeed for the system to behave correctly.

I have a somewhat hazy memory of Andrew warning us that touching this code takes us into dark and scary places.

That is a light summary. As Andrew tends to reject patches with poor documentation in case we forget the details in 6 months, I'm going to guess that he does not remember the details of a discussion from 7-ish years ago. This is where Andrew swoops in with a dazzling display of his eidetic memory just to prove me wrong.

Ric, are there any storage vendors pushing for this right now? Is someone working on this right now or planning to? If they are, have they looked into the history of fsblock (Nick) and large block support (Christoph) to see if they are candidates for forward porting or reimplementation? I ask because without that person there is a risk that the discussion will go as follows:

Topic leader: Does anyone have an objection to supporting larger block sizes than the page size?
Room: Send patches and we'll talk.

I will have to see if I can get a storage vendor to make a public statement, but there are vendors hoping to see this land in Linux in the next few years.

What about the second and third questions - is someone working on this right now or planning to? Have they looked into the history of fsblock (Nick) and large block support (Christoph) to see if they are candidates for forward porting or reimplementation?

I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster.

Do we even need to do that (eliminate buffer heads)?

No, the reason bufferheads were replaced was that a bufferhead can only reference a single page. i.e. the structure is that a page can reference multiple bufferheads (block size <= page size), but a bufferhead can't reference multiple pages, which is what is needed for block size > page size. fsblock was designed to handle both cases.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
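The one-way relationship Dave describes is visible in the bufferhead API itself. A short sketch (helper name invented) of the existing structure:

#include <linux/buffer_head.h>

/*
 * Sketch: walk the bufferheads attached to one page. Assumes
 * page_has_buffers(page) is true. The bufferheads form a ring via
 * b_this_page and each covers bh->b_size bytes of this one page
 * (bh->b_page), so a page can carry many bufferheads (block size <=
 * page size), but no bufferhead can span several pages (block size >
 * page size).
 */
static void walk_page_buffers(struct page *page)
{
        struct buffer_head *bh, *head;

        bh = head = page_buffers(page);
        do {
                /* bh->b_page == page here */
                bh = bh->b_this_page;
        } while (bh != head);
}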
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, Jan 22, 2014 at 11:50:02AM -0800, Andrew Morton wrote:
On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley <james.bottom...@hansenpartnership.com> wrote:

But this, I think, is the fundamental point for debate. If we can pull alignment and other tricks to solve 99% of the problem, is there a need for radical VM surgery? Is there anything coming down the pipe in the future that may move the devices ahead of the tricks?

I expect it would be relatively simple to get large blocksizes working on powerpc with 64k PAGE_SIZE. So before diving in and doing huge amounts of work, perhaps someone can do a proof-of-concept on powerpc (or ia64) with 64k blocksize.

Reality check: 64k block sizes on 64k page Linux machines have been used in production on XFS for at least 10 years. It's exactly the same case as 4k block size on 4k page size - one page, one buffer head, one filesystem block.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
[PATCH] scsi-sd: removed unused SD_PASSTHROUGH_RETRIES
From: Sha Zhengju <handai@taobao.com>

Signed-off-by: Sha Zhengju <handai@taobao.com>
---
 drivers/scsi/sd.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 26895ff..3bbe4df 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -24,7 +24,6 @@
  * Number of allowed retries
  */
 #define SD_MAX_RETRIES		5
-#define SD_PASSTHROUGH_RETRIES	1
 #define SD_MAX_MEDIUM_TIMEOUTS	2
 
 /*
--
1.7.9.5
[PATCH] isci: update version to 1.2
The version of the isci driver has not been updated for 2 years - 83 isci commits ago. Suspend/resume support has been implemented and many bugs have been fixed since 1.1. Now update the version to 1.2.

Signed-off-by: Lukasz Dorau <lukasz.do...@intel.com>
Cc: <sta...@vger.kernel.org>
---
 drivers/scsi/isci/init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/isci/init.c b/drivers/scsi/isci/init.c
index d25d0d8..695b34e 100644
--- a/drivers/scsi/isci/init.c
+++ b/drivers/scsi/isci/init.c
@@ -66,7 +66,7 @@
 #include "probe_roms.h"
 
 #define MAJ 1
-#define MIN 1
+#define MIN 2
 #define BUILD 0
 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) "." \
	__stringify(BUILD)
RE: [PATCH] isci: update version to 1.2
On Thursday, January 23, 2014 10:39 AM, Lukasz Dorau <lukasz.do...@intel.com> wrote:

The version of the isci driver has not been updated for 2 years - 83 isci commits ago. Suspend/resume support has been implemented and many bugs have been fixed since 1.1. Now update the version to 1.2.

Signed-off-by: Lukasz Dorau <lukasz.do...@intel.com>
Cc: <sta...@vger.kernel.org>

Oops... By mistake I have sent the wrong version of the patch. I'm sorry. Please disregard it.

Lukasz
[PATCH] isci: update version to 1.2
The version of the isci driver has not been updated for 2 years - 83 isci commits ago. Suspend/resume support has been implemented and many bugs have been fixed since 1.1. Now update the version to 1.2.

Signed-off-by: Lukasz Dorau <lukasz.do...@intel.com>
Signed-off-by: Dave Jiang <dave.ji...@intel.com>
Signed-off-by: Maciej Patelczyk <maciej.patelc...@intel.com>
---
 drivers/scsi/isci/init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/isci/init.c b/drivers/scsi/isci/init.c
index d25d0d8..695b34e 100644
--- a/drivers/scsi/isci/init.c
+++ b/drivers/scsi/isci/init.c
@@ -66,7 +66,7 @@
 #include "probe_roms.h"
 
 #define MAJ 1
-#define MIN 1
+#define MIN 2
 #define BUILD 0
 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) "." \
	__stringify(BUILD)
Re: [usb-storage] Re: usb disk recognized but fails
Whoaa!! I recompiled the master again, but now with a slightly modified configuration - mainly I disabled CONFIG_USB_STORAGE_CYPRESS_ATACB - and it works like a charm! The disk is properly and immediately detected and works!

I also tried to boot the standard kernel and disable loading ums_cypress by putting it on the blacklist, but it didn't work out. The disk wasn't detected at all (no message about the plug-in event nor a report about disk size).

I strongly believe that it is a Linux kernel problem, not the disk's (apart from it possibly needing some quirks). If I remember correctly there hasn't been ums_cypress from the beginning, right? So perhaps the time when it was added corresponds with the time when it last worked for me.

Best regards and thanks for all your help, and wishing for a quick fix in the mainstream,
Milan Svoboda

--- .config.old	2014-01-23 12:57:17.831854511 +0100
+++ .config	2014-01-23 10:13:20.899234729 +0100
@@ -1,6 +1,6 @@
 #
 # Automatically generated file; DO NOT EDIT.
-# Linux/x86 3.12.8-1 Kernel Configuration
+# Linux/x86 3.13.0 Kernel Configuration
 #
 CONFIG_64BIT=y
 CONFIG_X86_64=y
@@ -39,7 +39,6 @@ CONFIG_HAVE_INTEL_TXT=y
 CONFIG_X86_64_SMP=y
 CONFIG_X86_HT=y
 CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
-CONFIG_ARCH_CPU_PROBE_RELEASE=y
 CONFIG_ARCH_SUPPORTS_UPROBES=y
 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
 CONFIG_IRQ_WORK=y
@@ -76,7 +75,6 @@ CONFIG_AUDIT=y
 CONFIG_AUDITSYSCALL=y
 CONFIG_AUDIT_WATCH=y
 CONFIG_AUDIT_TREE=y
-CONFIG_AUDIT_LOGINUID_IMMUTABLE=y
 
 #
 # IRQ subsystem
@@ -143,6 +141,7 @@ CONFIG_IKCONFIG_PROC=y
 CONFIG_LOG_BUF_SHIFT=19
 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
 CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
+CONFIG_ARCH_SUPPORTS_INT128=y
 CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE=y
 CONFIG_ARCH_USES_NUMA_PROT_NONE=y
 CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
@@ -167,7 +166,7 @@ CONFIG_CFS_BANDWIDTH=y
 CONFIG_RT_GROUP_SCHED=y
 CONFIG_BLK_CGROUP=y
 # CONFIG_DEBUG_BLK_CGROUP is not set
-CONFIG_CHECKPOINT_RESTORE=y
+# CONFIG_CHECKPOINT_RESTORE is not set
 CONFIG_NAMESPACES=y
 CONFIG_UTS_NS=y
 CONFIG_IPC_NS=y
@@ -247,7 +246,6 @@ CONFIG_HAVE_OPTPROBES=y
 CONFIG_HAVE_KPROBES_ON_FTRACE=y
 CONFIG_HAVE_ARCH_TRACEHOOK=y
 CONFIG_HAVE_DMA_ATTRS=y
-CONFIG_USE_GENERIC_SMP_HELPERS=y
 CONFIG_GENERIC_SMP_IDLE_THREAD=y
 CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
 CONFIG_HAVE_DMA_API_DEBUG=y
@@ -266,11 +264,18 @@ CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSIO
 CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
 CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
 CONFIG_SECCOMP_FILTER=y
+CONFIG_HAVE_CC_STACKPROTECTOR=y
+# CONFIG_CC_STACKPROTECTOR is not set
+CONFIG_CC_STACKPROTECTOR_NONE=y
+# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
+# CONFIG_CC_STACKPROTECTOR_STRONG is not set
 CONFIG_HAVE_CONTEXT_TRACKING=y
+CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
 CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
 CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
 CONFIG_HAVE_ARCH_SOFT_DIRTY=y
 CONFIG_MODULES_USE_ELF_RELA=y
+CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
 CONFIG_OLD_SIGSUSPEND3=y
 CONFIG_COMPAT_OLD_SIGACTION=y
 
@@ -282,6 +287,7 @@ CONFIG_COMPAT_OLD_SIGACTION=y
 CONFIG_SLABINFO=y
 CONFIG_RT_MUTEXES=y
 CONFIG_BASE_SMALL=0
+# CONFIG_SYSTEM_TRUSTED_KEYRING is not set
 CONFIG_MODULES=y
 CONFIG_MODULE_FORCE_LOAD=y
 CONFIG_MODULE_UNLOAD=y
@@ -453,6 +459,7 @@ CONFIG_MEMORY_HOTPLUG_SPARSE=y
 CONFIG_MEMORY_HOTREMOVE=y
 CONFIG_PAGEFLAGS_EXTENDED=y
 CONFIG_SPLIT_PTLOCK_CPUS=4
+CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
 CONFIG_BALLOON_COMPACTION=y
 CONFIG_COMPACTION=y
 CONFIG_MIGRATION=y
@@ -475,7 +482,6 @@ CONFIG_FRONTSWAP=y
 # CONFIG_CMA is not set
 CONFIG_ZBUD=y
 CONFIG_ZSWAP=y
-CONFIG_MEM_SOFT_DIRTY=y
 CONFIG_X86_CHECK_BIOS_CORRUPTION=y
 CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
 CONFIG_X86_RESERVE_LOW=64
@@ -490,7 +496,6 @@ CONFIG_X86_SMAP=y
 CONFIG_EFI=y
 CONFIG_EFI_STUB=y
 CONFIG_SECCOMP=y
-CONFIG_CC_STACKPROTECTOR=y
 # CONFIG_HZ_100 is not set
 # CONFIG_HZ_250 is not set
 CONFIG_HZ_300=y
@@ -533,13 +538,13 @@ CONFIG_PM_DEBUG=y
 CONFIG_PM_ADVANCED_DEBUG=y
 # CONFIG_PM_TEST_SUSPEND is not set
 CONFIG_PM_SLEEP_DEBUG=y
+# CONFIG_DPM_WATCHDOG is not set
 CONFIG_PM_TRACE=y
 CONFIG_PM_TRACE_RTC=y
 # CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
 CONFIG_ACPI=y
 CONFIG_ACPI_SLEEP=y
 # CONFIG_ACPI_PROCFS is not set
-# CONFIG_ACPI_PROCFS_POWER is not set
 CONFIG_ACPI_EC_DEBUGFS=m
 CONFIG_ACPI_AC=m
 CONFIG_ACPI_BATTERY=m
@@ -555,7 +560,6 @@ CONFIG_ACPI_THERMAL=m
 CONFIG_ACPI_NUMA=y
 # CONFIG_ACPI_CUSTOM_DSDT is not set
 CONFIG_ACPI_INITRD_TABLE_OVERRIDE=y
-CONFIG_ACPI_BLACKLIST_YEAR=0
 # CONFIG_ACPI_DEBUG is not set
 CONFIG_ACPI_PCI_SLOT=y
 CONFIG_X86_PM_TIMER=y
@@ -571,13 +575,13 @@ CONFIG_ACPI_APEI_PCIEAER=y
 CONFIG_ACPI_APEI_MEMORY_FAILURE=y
 CONFIG_ACPI_APEI_EINJ=m
 CONFIG_ACPI_APEI_ERST_DEBUG=m
+# CONFIG_ACPI_EXTLOG is not set
 CONFIG_SFI=y
 
 #
 # CPU Frequency scaling
 #
 CONFIG_CPU_FREQ=y
-CONFIG_CPU_FREQ_TABLE=y
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote:

I expect it would be relatively simple to get large blocksizes working on powerpc with 64k PAGE_SIZE. So before diving in and doing huge amounts of work, perhaps someone can do a proof-of-concept on powerpc (or ia64) with 64k blocksize.

Reality check: 64k block sizes on 64k page Linux machines have been used in production on XFS for at least 10 years. It's exactly the same case as 4k block size on 4k page size - one page, one buffer head, one filesystem block.

This is true for ext4 as well. Block size == page size support is pretty easy; the hard part is when block size > page size, due to assumptions in the VM layer that the FS needs to do a lot of extra work to fudge around.

So the real problem comes with trying to support 64k block sizes on a 4k page architecture, and whether we can do it in a way where every single file system doesn't have to do its own specific hacks to work around assumptions made in the VM layer.

Some of the problems include handling the case where someone dirties a single block in a sparse page, and the FS needs to manually fault in the other 56k of pages around that single page. Or the VM not understanding that page eviction needs to be done in chunks of 64k so we don't have part of the block evicted but not all of it, etc.

- Ted
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:
On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote:

[ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my concerns ...

I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster.

Do we even need to do that (eliminate buffer heads)? We cope with 4k-sector-only devices just fine today because the bh mechanisms now operate on top of the page cache and can do the RMW necessary to update a bh in the page cache itself, which allows us to do only 4k chunked writes. So we could keep the bh system and just alter the granularity of the page cache.

We're likely to have people mixing 4K drives and <fill in some other size here> drives on the same box. We could just go with the biggest size and use the existing bh code for the sub-pagesized blocks, but I really hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope with this. It's the variable granularity that's the VM problem.

From a pure code point of view, it may be less work to change it once in the VM. But from an overall system impact point of view, it's a big change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be a good reason to keep it.

The other question is: if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in, what would altering the granularity of the page cache buy us?

The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive.

I agree with all of that, but my question is still: can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today? If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is whether the FS can make use of this layout information *without* changing the page cache granularity. Only if you answer me "no" to this do I think we need to worry about changing page cache granularity.

We already do this today. The problem is that we are limited by the page cache assumption that the block device/filesystem never needs to manage multiple pages as an atomic unit of change. Hence we can't use the generic infrastructure as it stands to handle block/sector sizes larger than a page size...

If the compound page infrastructure exists today and is usable for this, what else do we need to do? ... because if it's a couple of trivial changes and a few minor patches to filesystems to take advantage of it, we might as well do it anyway. I was only objecting on the grounds that the last time we looked at it, it was major VM surgery. Can someone give a summary of how far away we are from being able to do this with the VM system today, and what extra work is needed (and how big is this piece of work)?

James
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote:
On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:
On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote:

[ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my concerns ...

I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster.

Do we even need to do that (eliminate buffer heads)? We cope with 4k-sector-only devices just fine today because the bh mechanisms now operate on top of the page cache and can do the RMW necessary to update a bh in the page cache itself, which allows us to do only 4k chunked writes. So we could keep the bh system and just alter the granularity of the page cache.

We're likely to have people mixing 4K drives and <fill in some other size here> drives on the same box. We could just go with the biggest size and use the existing bh code for the sub-pagesized blocks, but I really hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope with this. It's the variable granularity that's the VM problem.

From a pure code point of view, it may be less work to change it once in the VM. But from an overall system impact point of view, it's a big change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be a good reason to keep it.

The other question is: if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in, what would altering the granularity of the page cache buy us?

The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive.

I agree with all of that, but my question is still: can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today? If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is whether the FS can make use of this layout information *without* changing the page cache granularity. Only if you answer me "no" to this do I think we need to worry about changing page cache granularity.

We already do this today. The problem is that we are limited by the page cache assumption that the block device/filesystem never needs to manage multiple pages as an atomic unit of change. Hence we can't use the generic infrastructure as it stands to handle block/sector sizes larger than a page size...

If the compound page infrastructure exists today and is usable for this, what else do we need to do? ... because if it's a couple of trivial changes and a few minor patches to filesystems to take advantage of it, we might as well do it anyway.

Do not do this, as there is no guarantee that a compound allocation will succeed. If the allocation fails then it is potentially unrecoverable, because if we can no longer write to storage then you're hosed. If you are now thinking "mempool" then the problem becomes that the system will be in a state of degraded performance for an unknowable length of time and may never recover fully.

64K MMU page size systems get away with this because the blocksize is still <= PAGE_SIZE and no core VM changes are necessary. Critically, pages like the page table pages are the same size as the basic unit of allocation used by the kernel, so external fragmentation simply is not a severe problem.

I was only objecting on the grounds that the last time we looked at it, it was major VM surgery. Can someone give a summary of how far away we are from being able to do this with the VM system today, and what extra work is needed (and how big is this piece of work)?

Offhand, no idea. For fsblock, probably a similar amount of work as had to be done in 2007, and I'd expect it would still run into the filesystem-awareness problems that Dave Chinner pointed out earlier. For large block, it'd hit the same wall that allocations must always succeed. If we want to break the connection between the basic unit of memory managed by the kernel and the MMU page size, then I don't know, but it would be a fairly large amount of surgery and need a lot of design work.
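To make Mel's "allocations must always succeed" point concrete, this is roughly the allocation a 64k page-cache block would require on a 4k PAGE_SIZE system. A hedged sketch (function name invented, gfp policy an assumption):

#include <linux/gfp.h>

/*
 * Sketch: allocate one 64k filesystem block as an order-4 compound
 * page on a 4k PAGE_SIZE system. __GFP_NORETRY fails fast instead of
 * hammering reclaim; whether the caller can tolerate NULL here is
 * exactly the point under debate in this thread.
 */
static struct page *alloc_fs_block(void)
{
        return alloc_pages(GFP_KERNEL | __GFP_COMP | __GFP_NORETRY, 4);
}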
Re: [usb-storage] Re: usb disk recognized but fails
On Thu, 23 Jan 2014, Milan Svoboda wrote:

Whoaa!! I recompiled the master again, but now with a slightly modified configuration - mainly I disabled CONFIG_USB_STORAGE_CYPRESS_ATACB - and it works like a charm! The disk is properly and immediately detected and works!

I don't see how that could have made any difference. The Cypress-ATACB driver works just like the default driver, except for two commands (ATA(12) and ATA(16)), neither of which appeared in the usbmon trace.

Your new config enables CONFIG_USB_STORAGE_DEBUG. More likely that is the reason for the improvement. Try taking out that one setting (don't change anything else) and see what happens.

I also tried to boot the standard kernel and disable loading ums_cypress by putting it on the blacklist, but it didn't work out. The disk wasn't detected at all (no message about the plug-in event nor a report about disk size). I strongly believe that it is a Linux kernel problem, not the disk's (apart from it possibly needing some quirks). If I remember correctly there hasn't been ums_cypress from the beginning, right? So perhaps the time when it was added corresponds with the time when it last worked for me.

What do you mean by "from the beginning"? The ums-cypress driver was added in 2008.

Alan Stern
[PATCH] sym53c8xx_2: Set DID_REQUEUE return code when aborting squeue.
When the controller encounters an error (including QUEUE FULL or BUSY status), it aborts all not-yet-submitted requests in the function sym_dequeue_from_squeue. This function aborts them with DID_SOFT_ERROR.

If the disk has a full tag queue, the request that caused the overflow is aborted with QUEUE FULL status (and the SCSI midlayer properly retries it until it is accepted by the disk), but other requests are aborted with DID_SOFT_ERROR --- for them, the midlayer does just a few retries and then signals the error up to sd. The result is that a disk returning QUEUE FULL causes request failures.

The error was reproduced on 53c895 with a COMPAQ BD03685A24 disk (rebranded ST336607LC) with a command queue of 48 or 64 tags. The disk has 64 tags, but under some access patterns it returns QUEUE FULL when there are fewer than 64 pending tags. The SCSI specification allows returning QUEUE FULL at any time and it is up to the host to retry.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>
Cc: <sta...@vger.kernel.org>
---
 drivers/scsi/sym53c8xx_2/sym_hipd.c | 4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c
===================================================================
--- linux-2.6.36-rc5-fast.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c	2010-09-27 10:25:59.0 +0200
+++ linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c	2010-09-27 10:26:27.0 +0200
@@ -3000,7 +3000,11 @@ sym_dequeue_from_squeue(struct sym_hcb *
 		if ((target == -1 || cp->target == target) &&
 		    (lun == -1 || cp->lun == lun) &&
 		    (task == -1 || cp->tag == task)) {
+#ifdef SYM_OPT_HANDLE_DEVICE_QUEUEING
 			sym_set_cam_status(cp->cmd, DID_SOFT_ERROR);
+#else
+			sym_set_cam_status(cp->cmd, DID_REQUEUE);
+#endif
 			sym_remque(&cp->link_ccbq);
 			sym_insque_tail(&cp->link_ccbq, &np->comp_ccbq);
 		}
Re: Persistent reservation behaviour/compliance with redundant controllers
On 01/07/2014 12:18 PM, Pasi Kärkkäinen wrote:
On Mon, Jan 06, 2014 at 11:53:44PM +0100, Matthias Eble wrote:

I have a "persistent reservations for dummies" document I wrote that I can send you off list, if you like.

I think I know how PRs work. Yet I'd be happy about your document.

I think that document could be helpful for others as well, so please post it to the list :) Thanks! -- Pasi

Apologies for taking so darn long to reply! I have published my SCSI-3 document here: http://www.gonzoleeman.net/documents/scsi-3-pgr-tutorial-v1.0

Feedback welcome.
--
Lee Duncan
SUSE Labs
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Thu, Jan 23, 2014 at 07:55:50AM -0500, Theodore Ts'o wrote:
On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote:

I expect it would be relatively simple to get large blocksizes working on powerpc with 64k PAGE_SIZE. So before diving in and doing huge amounts of work, perhaps someone can do a proof-of-concept on powerpc (or ia64) with 64k blocksize.

Reality check: 64k block sizes on 64k page Linux machines have been used in production on XFS for at least 10 years. It's exactly the same case as 4k block size on 4k page size - one page, one buffer head, one filesystem block.

This is true for ext4 as well. Block size == page size support is pretty easy; the hard part is when block size > page size, due to assumptions in the VM layer that the FS needs to do a lot of extra work to fudge around.

So the real problem comes with trying to support 64k block sizes on a 4k page architecture, and whether we can do it in a way where every single file system doesn't have to do its own specific hacks to work around assumptions made in the VM layer.

Some of the problems include handling the case where someone dirties a single block in a sparse page, and the FS needs to manually fault in the other 56k of pages around that single page. Or the VM not understanding that page eviction needs to be done in chunks of 64k so we don't have part of the block evicted but not all of it, etc.

Right, this is part of the problem that fsblock tried to handle, and some of the nastiness it had was that a page fault only resulted in the individual page being read from the underlying block. This meant that it was entirely possible that the filesystem would need to do RMW cycles in the writeback path itself to handle things like block checksums, copy-on-write, unwritten extent conversion, etc. - i.e. all the stuff that the page cache currently handles by doing RMW cycles at the page level.

The method of using compound pages in the page cache, so that the page cache could do 64k RMW cycles and a filesystem never had to deal with new issues like the above, was one of the reasons that approach is so appealing to us filesystem people. ;)

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Thu, 2014-01-23 at 16:44 +0000, Mel Gorman wrote:
On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote:
On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:
On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote:

[ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my concerns ...

I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster.

Do we even need to do that (eliminate buffer heads)? We cope with 4k-sector-only devices just fine today because the bh mechanisms now operate on top of the page cache and can do the RMW necessary to update a bh in the page cache itself, which allows us to do only 4k chunked writes. So we could keep the bh system and just alter the granularity of the page cache.

We're likely to have people mixing 4K drives and <fill in some other size here> drives on the same box. We could just go with the biggest size and use the existing bh code for the sub-pagesized blocks, but I really hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope with this. It's the variable granularity that's the VM problem.

From a pure code point of view, it may be less work to change it once in the VM. But from an overall system impact point of view, it's a big change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be a good reason to keep it.

The other question is: if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in, what would altering the granularity of the page cache buy us?

The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive.

I agree with all of that, but my question is still: can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today? If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is whether the FS can make use of this layout information *without* changing the page cache granularity. Only if you answer me "no" to this do I think we need to worry about changing page cache granularity.

We already do this today. The problem is that we are limited by the page cache assumption that the block device/filesystem never needs to manage multiple pages as an atomic unit of change. Hence we can't use the generic infrastructure as it stands to handle block/sector sizes larger than a page size...

If the compound page infrastructure exists today and is usable for this, what else do we need to do? ... because if it's a couple of trivial changes and a few minor patches to filesystems to take advantage of it, we might as well do it anyway.

Do not do this, as there is no guarantee that a compound allocation will succeed.

I presume this is because, in the current implementation, compound pages have to be physically contiguous. For increasing granularity in the page cache, we don't necessarily need this ... however, getting writeout to work properly without physically contiguous pages would be a bit more challenging (but not impossible) to solve.

If the allocation fails then it is potentially unrecoverable, because if we can no longer write to storage then you're hosed. If you are now thinking "mempool" then the problem becomes that the system will be in a state of degraded performance for an unknowable length of time and may never recover fully. 64K MMU page size systems get away with this because the blocksize is still <= PAGE_SIZE and no core VM changes are necessary. Critically, pages like the page table pages are the same size as the basic unit of allocation used by the kernel, so external fragmentation simply is not a severe problem.

Right, I understand this ... but we still need to wonder about what it would take. Even the simple "fail a compound page allocation" case gets treated in the kernel the same ...
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Thu, Jan 23, 2014 at 04:44:38PM +0000, Mel Gorman wrote:
On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote:
On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:

The other question is: if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in, what would altering the granularity of the page cache buy us?

The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive.

I agree with all of that, but my question is still: can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today? If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is whether the FS can make use of this layout information *without* changing the page cache granularity. Only if you answer me "no" to this do I think we need to worry about changing page cache granularity.

We already do this today. The problem is that we are limited by the page cache assumption that the block device/filesystem never needs to manage multiple pages as an atomic unit of change. Hence we can't use the generic infrastructure as it stands to handle block/sector sizes larger than a page size...

If the compound page infrastructure exists today and is usable for this, what else do we need to do? ... because if it's a couple of trivial changes and a few minor patches to filesystems to take advantage of it, we might as well do it anyway.

Do not do this, as there is no guarantee that a compound allocation will succeed. If the allocation fails then it is potentially unrecoverable, because if we can no longer write to storage then you're hosed. If you are now thinking "mempool" then the problem becomes that the system will be in a state of degraded performance for an unknowable length of time and may never recover fully.

We are talking about page cache allocation here, not something deep down inside the IO path that requires mempools to guarantee IO completion. IOWs, we have an *existing error path* to return ENOMEM to userspace when page cache allocation fails.

64K MMU page size systems get away with this because the blocksize is still <= PAGE_SIZE and no core VM changes are necessary. Critically, pages like the page table pages are the same size as the basic unit of allocation used by the kernel, so external fragmentation simply is not a severe problem.

Christoph's old patches didn't need 64k MMU page sizes to work. IIRC, the compound page was mapped into the page cache as individual 4k pages. Any change of state on the child pages followed the back pointer to the head of the compound page and changed the state of that page. On page faults, the individual 4k pages were mapped to userspace rather than the compound page, so there was no userspace-visible change, either.

The question I had at the time that was never answered was this: if pages are faulted and mapped individually through their own ptes, why did the compound pages need to be contiguous? Copy-in/out through read/write was still done at PAGE_SIZE granularity, mmap mappings were still at PAGE_SIZE granularity, so why can't we build a compound page for the page cache out of discontiguous pages?

FWIW, XFS has long used discontiguous pages for large block support in metadata. Some of that is vmapped to make metadata processing simple.

The point of this is that we don't need *contiguous* compound pages in the page cache if we can map them into userspace as individual PAGE_SIZE pages. Only the page cache management needs to handle the groups of pages that make up a filesystem block as a compound page...

I was only objecting on the grounds that the last time we looked at it, it was major VM surgery. Can someone give a summary of how far away we are from being able to do this with the VM system today, and what extra work is needed (and how big is this piece of work)?

Offhand, no idea. For fsblock, probably a similar amount of work as had to be done in 2007, and I'd expect it would still run into the filesystem-awareness problems that Dave Chinner pointed out earlier. For large block, it'd hit the same wall that allocations must always succeed. If we want to break the connection between the basic unit of memory managed by the kernel and the MMU page size, then I don't know, but it would be a fairly large amount of surgery and need a lot of design work.

Here's the patch that Christoph wrote back in 2007 to add PAGE_SIZE-based mmap...
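The XFS trick Dave refers to can be sketched with the existing vmap() interface (function name invented, error handling elided): physically discontiguous pages are given one contiguous kernel virtual range so a large metadata block can be processed linearly, then released with vunmap().

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Sketch: present nr discontiguous pages as one linear kernel buffer.
 * Pair the returned address with vunmap() when done with the block.
 */
static void *map_large_block(struct page **pages, unsigned int nr)
{
        return vmap(pages, nr, VM_MAP, PAGE_KERNEL);
}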
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 22 Jan 2014, Mel Gorman wrote:

Don't get me wrong, I'm interested in the topic, but I severely doubt I'd have the capacity to research the background of this in advance. It's also unlikely that I'd work on it in the future without throwing out my current TODO list. In an ideal world someone will have done the legwork in advance of LSF/MM to help drive the topic.

I can give an overview of the history and the challenges of the approaches if needed.
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 22 Jan 2014, Mel Gorman wrote:

Large block support was proposed years ago by Christoph Lameter (http://lwn.net/Articles/232757/). I think I was just getting started in the community at the time so I do not recall any of the details. I do believe it motivated an alternative by Nick Piggin called fsblock though (http://lwn.net/Articles/321390/). At the very least it would be nice to know why neither was ever merged, for those of us who were not around at the time and who may not have the chance to dive through mailing list archives between now and March.

It was rejected first because of the necessity of higher-order page allocations. Nick and I then added ways to virtually map higher-order pages if the page allocator could no longer provide those. All of this required changes to the basic page cache operations. I added a way for the mapping to indicate an order for an address range, and then modified the page cache operations to be able to operate on pages of any order. The patchset that introduced the ability to specify different orders for the page cache address ranges was not accepted by Andrew because he thought there was no chance for the rest of the modifications to become acceptable.
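To make the shape of that interface concrete, a hypothetical sketch (these names are invented for illustration, not the ones from the actual patchset): once the mapping records an order, page cache lookups index by compound pages rather than base pages.

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical sketch only: with a per-mapping order, the page cache
 * index for a file position shifts by PAGE_SHIFT + order, so one
 * cache entry (a compound page) covers 2^order base pages.
 */
static pgoff_t mapping_page_index(loff_t pos, unsigned int order)
{
        return pos >> (PAGE_SHIFT + order);
}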
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Thu, 23 Jan 2014, James Bottomley wrote:

If the compound page infrastructure exists today and is usable for this, what else do we need to do? ... because if it's a couple of trivial changes and a few minor patches to filesystems to take advantage of it, we might as well do it anyway. I was only objecting on the grounds that the last time we looked at it, it was major VM surgery. Can someone give a summary of how far away we are from being able to do this with the VM system today, and what extra work is needed (and how big is this piece of work)?

The main problem for me was the page cache. The VM would not be such a problem. Changing the page cache functions required updates to many filesystems.
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Thu, Jan 23, 2014 at 07:55:50AM -0500, Theodore Ts'o wrote:
On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote:

I expect it would be relatively simple to get large blocksizes working on powerpc with 64k PAGE_SIZE. So before diving in and doing huge amounts of work, perhaps someone can do a proof-of-concept on powerpc (or ia64) with 64k blocksize.

Reality check: 64k block sizes on 64k page Linux machines have been used in production on XFS for at least 10 years. It's exactly the same case as 4k block size on 4k page size - one page, one buffer head, one filesystem block.

This is true for ext4 as well. Block size == page size support is pretty easy; the hard part is when block size > page size, due to assumptions in the VM layer that the FS needs to do a lot of extra work to fudge around.

So the real problem comes with trying to support 64k block sizes on a 4k page architecture, and whether we can do it in a way where every single file system doesn't have to do its own specific hacks to work around assumptions made in the VM layer.

Yup, ditto for ocfs2.

Joel
--
"One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important."
- Bertrand Russell
http://www.jlbec.org/
jl...@evilplan.org
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, Jan 22, 2014 at 10:47:01AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:37 +0000, Chris Mason wrote:
On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:

[agreement cut because it's boring for the reader]

Realistically, if you look at what the I/O schedulers output on a standard (spinning rust) workload, it's mostly large transfers. Obviously these are misaligned at the ends, but we can fix some of that in the scheduler, particularly if the FS helps us with layout. My instinct tells me that we can fix 99% of this with layout on the FS + io schedulers ... the remaining 1% goes to the drive as needing to do RMW in the device, but the net impact to our throughput shouldn't be that great.

There are a few workloads where the VM and the FS would team up to make this fairly miserable:

Small files. Delayed allocation fixes a lot of this, but the VM doesn't realize that fileA, fileB, fileC, and fileD all need to be written at the same time to avoid RMW. Btrfs and MD have set up plugging callbacks to accumulate full stripes as much as possible, but it still hurts.

Metadata. These writes are very latency sensitive and we'll gain a lot if the FS is explicitly trying to build full sector IOs.

OK, so these two cases I buy ... the question is can we do something about them today without increasing the block size? The metadata problem, in particular, might be block independent: we still have a lot of small chunks to write out at fractured locations. With a large block size, the FS knows it's been bad and can expect the rolled-up newspaper, but it's not clear what it could do about it. The small files issue looks like something we should be tackling today, since writing out adjacent files would actually help us get bigger transfers.

ocfs2 can actually take significant advantage here, because we store small-file data in-inode. This would grow our in-inode size from ~3K to ~15K or ~63K. We'd actually have to do more work to start putting more than one inode in a block (though that would be a promising avenue too, once the coordination is solved generically).

Joel
--
"One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important."
- Bertrand Russell
http://www.jlbec.org/
jl...@evilplan.org
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Thu, 2014-01-23 at 13:27 -0800, Joel Becker wrote:
On Wed, Jan 22, 2014 at 10:47:01AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:37 +0000, Chris Mason wrote:
On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:

[agreement cut because it's boring for the reader]

Realistically, if you look at what the I/O schedulers output on a standard (spinning rust) workload, it's mostly large transfers. Obviously these are misaligned at the ends, but we can fix some of that in the scheduler, particularly if the FS helps us with layout. My instinct tells me that we can fix 99% of this with layout on the FS + io schedulers ... the remaining 1% goes to the drive as needing to do RMW in the device, but the net impact to our throughput shouldn't be that great.

There are a few workloads where the VM and the FS would team up to make this fairly miserable:

Small files. Delayed allocation fixes a lot of this, but the VM doesn't realize that fileA, fileB, fileC, and fileD all need to be written at the same time to avoid RMW. Btrfs and MD have set up plugging callbacks to accumulate full stripes as much as possible, but it still hurts.

Metadata. These writes are very latency sensitive and we'll gain a lot if the FS is explicitly trying to build full sector IOs.

OK, so these two cases I buy ... the question is can we do something about them today without increasing the block size? The metadata problem, in particular, might be block independent: we still have a lot of small chunks to write out at fractured locations. With a large block size, the FS knows it's been bad and can expect the rolled-up newspaper, but it's not clear what it could do about it. The small files issue looks like something we should be tackling today, since writing out adjacent files would actually help us get bigger transfers.

ocfs2 can actually take significant advantage here, because we store small-file data in-inode. This would grow our in-inode size from ~3K to ~15K or ~63K. We'd actually have to do more work to start putting more than one inode in a block (though that would be a promising avenue too, once the coordination is solved generically).

Btrfs already defaults to 16K metadata and can go as high as 64k. The part we don't do is multi-page sectors for data blocks. I'd tend to leverage the read/modify/write engine from the raid code for that.

-chris
[PATCH 4/5] ia64 simscsi: fix race condition and simplify the code
The simscsi driver processes requests in the request routine and then offloads the completion callback to a tasklet. This is buggy because there is parallel, unsynchronized access to the completion queue from the request routine and from the tasklet. With the current SCSI architecture, requests can be completed directly from the request routine, so I removed the tasklet code.

Signed-off-by: Mikulas Patocka mpato...@redhat.com

---
 arch/ia64/hp/sim/simscsi.c | 34 ++--------------------------------
 1 file changed, 2 insertions(+), 32 deletions(-)

Index: linux-2.6-ia64/arch/ia64/hp/sim/simscsi.c
===================================================================
--- linux-2.6-ia64.orig/arch/ia64/hp/sim/simscsi.c	2014-01-24 01:23:08.000000000 +0100
+++ linux-2.6-ia64/arch/ia64/hp/sim/simscsi.c	2014-01-24 01:26:16.000000000 +0100
@@ -47,9 +47,6 @@
 static struct Scsi_Host *host;
 
-static void simscsi_interrupt (unsigned long val);
-static DECLARE_TASKLET(simscsi_tasklet, simscsi_interrupt, 0);
-
 struct disk_req {
 	unsigned long addr;
 	unsigned len;
@@ -64,13 +61,6 @@
 static int desc[16] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
 };
 
-static struct queue_entry {
-	struct scsi_cmnd *sc;
-} queue[SIMSCSI_REQ_QUEUE_LEN];
-
-static int rd, wr;
-static atomic_t num_reqs = ATOMIC_INIT(0);
-
 /* base name for default disks */
 static char *simscsi_root = DEFAULT_SIMSCSI_ROOT;
@@ -95,21 +85,6 @@ simscsi_setup (char *s)
 __setup("simscsi=", simscsi_setup);
 
-static void
-simscsi_interrupt (unsigned long val)
-{
-	struct scsi_cmnd *sc;
-
-	while ((sc = queue[rd].sc) != NULL) {
-		atomic_dec(&num_reqs);
-		queue[rd].sc = NULL;
-		if (DBG)
-			printk("simscsi_interrupt: done with %ld\n", sc->serial_number);
-		(*sc->scsi_done)(sc);
-		rd = (rd + 1) % SIMSCSI_REQ_QUEUE_LEN;
-	}
-}
-
 static int
 simscsi_biosparam (struct scsi_device *sdev, struct block_device *n,
 		sector_t capacity, int ip[])
@@ -315,14 +290,9 @@ simscsi_queuecommand_lck (struct scsi_cm
 		sc->sense_buffer[0] = 0x70;
 		sc->sense_buffer[2] = 0x00;
 	}
-	if (atomic_read(&num_reqs) >= SIMSCSI_REQ_QUEUE_LEN) {
-		panic("Attempt to queue command while command is pending!!");
-	}
-	atomic_inc(&num_reqs);
-	queue[wr].sc = sc;
-	wr = (wr + 1) % SIMSCSI_REQ_QUEUE_LEN;
-	tasklet_schedule(&simscsi_tasklet);
+	(*sc->scsi_done)(sc);
+
 	return 0;
 }
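The pattern the patch adopts, completing a SCSI command synchronously from the queuecommand routine, looks roughly like the sketch below. The demo_ names are illustrative, not simscsi code; DEF_SCSI_QCMD is the standard wrapper that supplies the locked entry point and passes the done callback to the _lck variant.

#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

/* Fill in the result and invoke the completion callback directly in
 * the caller's context; no private queue or tasklet is required. */
static int demo_queuecommand_lck(struct scsi_cmnd *sc,
				 void (*done)(struct scsi_cmnd *))
{
	sc->result = DID_OK << 16;	/* command was handled inline */
	done(sc);			/* synchronous completion is allowed */
	return 0;
}

static DEF_SCSI_QCMD(demo_queuecommand)

Because the completion now runs in the same context as the request routine, the unsynchronized reader/writer indices on the private queue simply cease to exist, which is the whole fix.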
Re: [LSF/MM ATTEND] interest in blk-mq, scsi-mq, dm-cache, dm-thinp, dm-*
On 01/13/2014 05:36 AM, Hannes Reinecke wrote:

On 01/10/2014 07:27 PM, Mike Snitzer wrote:

I would like to attend to participate in discussions related to the topics listed in the subject. As a maintainer of DM I'd be interested to learn about and discuss areas that should become a development focus in the months following LSF.

+1

I've been thinking about (re-)implementing multipathing on top of blk-mq, and would like to discuss the feasibility of doing so. There are some design decisions in blk-mq (e.g. statically allocating the number of queues) which do not play well with that.

I have been thinking about going in a completely different direction. The thing about dm-multipath is that being request based adds the extra queue locking, and that of course is bad. In our testing it is a major perf issue. We do get things like I/O scheduling out of it, though.

If we went back to bio-based multipathing, it turns out that once SCSI also supports multiqueue it all works pretty nicely. There is room for improvement in general, like making some dm allocations NUMA/CPU aware, but the request_queue locking issues we have go away and it is very simple code-wise.

Alternatively, we could make request-based dm-multipath:

1. aware of underlying multiqueue devices. So basically keep what we have, more or less, but have dm-multipath build a request that can be sent to a multiqueue device and then call blk_mq_insert_request. This would all be hidden behind nice interfaces that conceal whether the underlying device is multiqueue or not.

2. do multiqueue itself (so implement map_queue, queue_rq, etc.) while also making it aware of underlying multiqueue devices.

#1 just keeps the existing request spin_lock problem, so there is not much point other than just getting things working. #2 is a good deal of work, and what does it end up buying us over just making multipath bio based? We lose iosched support. If we are going to make advanced multiqueue ioschedulers that rely on request structs, then #2 could be useful.
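For reference, a minimal sketch of what option #2 would entail under the early-2014 blk-mq interface (queue_rq plus map_queue in struct blk_mq_ops, with blk_mq_map_queue as the stock mapping). The dm_mpath_* names are hypothetical; nothing like this was implemented.

#include <linux/blk-mq.h>

/* Hypothetical: select a path for the request, then dispatch it to
 * the underlying device's hardware queue. */
static int dm_mpath_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	/* path selection and dispatch to the chosen device would go here */
	return BLK_MQ_RQ_QUEUE_OK;
}

static struct blk_mq_ops dm_mpath_mq_ops = {
	.queue_rq  = dm_mpath_queue_rq,
	.map_queue = blk_mq_map_queue,	/* default ctx-to-hctx mapping */
};

The work Mike alludes to is everything hiding behind that queue_rq comment: per-path state, retry/failover, and coping with paths appearing and disappearing under a statically sized queue map.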
Re: [LSF/MM ATTEND] interest in blk-mq, scsi-mq, dm-cache, dm-thinp, dm-*
On 01/24/2014 03:37 AM, Mike Christie wrote:

On 01/13/2014 05:36 AM, Hannes Reinecke wrote:

On 01/10/2014 07:27 PM, Mike Snitzer wrote:

I would like to attend to participate in discussions related to the topics listed in the subject. As a maintainer of DM I'd be interested to learn about and discuss areas that should become a development focus in the months following LSF.

+1

I've been thinking about (re-)implementing multipathing on top of blk-mq, and would like to discuss the feasibility of doing so. There are some design decisions in blk-mq (e.g. statically allocating the number of queues) which do not play well with that.

I have been thinking about going in a completely different direction. The thing about dm-multipath is that being request based adds the extra queue locking, and that of course is bad. In our testing it is a major perf issue. We do get things like I/O scheduling out of it, though.

Indeed. And without that we cannot do true load balancing.

If we went back to bio-based multipathing, it turns out that once SCSI also supports multiqueue it all works pretty nicely. There is room for improvement in general, like making some dm allocations NUMA/CPU aware, but the request_queue locking issues we have go away and it is very simple code-wise.

If and when. The main issue I see with that is that it might take some time (if ever) for SCSI LLDDs to go fully multiqueue. In fact, I strongly suspect that only newer LLDDs will ever support multiqueue; for the older cards the HW interface is tied too closely to single-queue operation.

Alternatively, we could make request-based dm-multipath:

1. aware of underlying multiqueue devices. So basically keep what we have, more or less, but have dm-multipath build a request that can be sent to a multiqueue device and then call blk_mq_insert_request. This would all be hidden behind nice interfaces that conceal whether the underlying device is multiqueue or not.

2. do multiqueue itself (so implement map_queue, queue_rq, etc.) while also making it aware of underlying multiqueue devices.

#1 just keeps the existing request spin_lock problem, so there is not much point other than just getting things working. #2 is a good deal of work, and what does it end up buying us over just making multipath bio based? We lose iosched support. If we are going to make advanced multiqueue ioschedulers that rely on request structs, then #2 could be useful.

Obviously we need iosched support when going multiqueue; I wouldn't dream of dropping it.

So my overall idea here is to move multipath over to blk-mq, mapping each path to one queue. (As mentioned above, currently every FC HBA exposes a single HW queue anyway.) The ioschedulers would be moved into the map_queue function. This approach has several issues which I would like to discuss:

- blk-mq ctx allocation currently is static. This doesn't play well with multipathing, where paths (= queues) might get configured on the fly.

- Queues might be coming from different HBAs; one would need to audit the blk-mq code to see whether that's possible.

Cheers,

Hannes
--
Dr. Hannes Reinecke		zSeries & Storage
h...@suse.de			+49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
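For contrast with the request-based options above, here is a minimal sketch of the bio-based approach Mike describes, assuming the 3.x device-mapper target API; demo_mpath, demo_path, and choose_path are hypothetical stand-ins for real path-selector state, not dm-mpath code.

#include <linux/device-mapper.h>

struct demo_path { struct block_device *bdev; };
struct demo_mpath { struct demo_path *paths; unsigned nr; unsigned next; };

/* Trivial round-robin selector; a real one would weigh path state. */
static struct demo_path *choose_path(struct demo_mpath *m)
{
	return &m->paths[m->next++ % m->nr];
}

/* A bio-based target only remaps the bio; no request_queue lock is
 * taken at this layer, which is the locking win Mike refers to. */
static int demo_mpath_map(struct dm_target *ti, struct bio *bio)
{
	struct demo_mpath *m = ti->private;

	bio->bi_bdev = choose_path(m)->bdev;	/* redirect to the chosen path */
	return DM_MAPIO_REMAPPED;		/* dm core submits it onward */
}

The flip side, as discussed above, is that remapping individual bios gives the multipath layer no request structs to schedule, so I/O scheduling and request-level load balancing have to happen below, in the per-path queues.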
RE: [PATCH 1/6] megaraid_sas: Do not wait forever
Hannes: We have already worked on the wait_event usage in megasas_issue_blocked_cmd. That code will be posted by LSI once we receive test results from the LSI Q/A team.

If you look at the current OCR code in the Linux driver, we do re-send the IOCTL command. For product reasons, MR does not want IOCTLs to time out. That is why, even if the FW faults, the driver performs OCR and re-sends all outstanding management commands (IOCTLs count as management commands). Just for information, see the snippet below from the OCR code:

	/* Re-fire management commands */
	for (j = 0; j < instance->max_fw_cmds; j++) {
		cmd_fusion = fusion->cmd_list[j];
		if (cmd_fusion->sync_cmd_idx != (u32)ULONG_MAX) {
			cmd_mfi = instance->cmd_list[cmd_fusion->sync_cmd_idx];
			if (cmd_mfi->frame->dcmd.opcode == MR_DCMD_LD_MAP_GET_INFO) {
				megasas_return_cmd(instance, cmd_mfi);
				megasas_return_cmd_fusion(instance, cmd_fusion);

The current MR driver is not designed to add timeouts to the DCMD and IOCTL paths. [I added a timeout only for a limited set of DCMDs, which are harmless to continue after a timeout.] For now you can skip this patch; we will be submitting a patch to fix the same issue. But note that we cannot convert every wait_event to wait_event_timeout due to the day-1 design; we will try to cover wait_event_timeout for the valid cases.

Kashyap

-----Original Message-----
From: Hannes Reinecke [mailto:h...@suse.de]
Sent: Thursday, January 16, 2014 3:56 PM
To: James Bottomley
Cc: linux-scsi@vger.kernel.org; Hannes Reinecke; Desai, Kashyap; Adam Radford
Subject: [PATCH 1/6] megaraid_sas: Do not wait forever

If the firmware is incommunicado for whatever reason, the driver will wait forever during initialisation, causing all sorts of hangcheck timers to trigger. We should rather wait for a defined time and give up on the command if no response is received.

Cc: Kashyap Desai kashyap.de...@lsi.com
Cc: Adam Radford aradf...@gmail.com
Signed-off-by: Hannes Reinecke h...@suse.de
---
 drivers/scsi/megaraid/megaraid_sas_base.c | 43 +++++++++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 3b7ad10..95d4e5c 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -911,9 +911,11 @@ megasas_issue_blocked_cmd(struct megasas_instance *instance,
 
 	instance->instancet->issue_dcmd(instance, cmd);
 
-	wait_event(instance->int_cmd_wait_q, cmd->cmd_status != ENODATA);
+	wait_event_timeout(instance->int_cmd_wait_q,
+			   cmd->cmd_status != ENODATA,
+			   MEGASAS_INTERNAL_CMD_WAIT_TIME * HZ);
 
-	return 0;
+	return cmd->cmd_status == ENODATA ? -ENODATA : 0;
 }
 
 /**
@@ -932,11 +934,12 @@ megasas_issue_blocked_abort_cmd(struct megasas_instance *instance,
 {
 	struct megasas_cmd *cmd;
 	struct megasas_abort_frame *abort_fr;
+	int status;
 
 	cmd = megasas_get_cmd(instance);
 
 	if (!cmd)
-		return -1;
+		return -ENOMEM;
 
 	abort_fr = &cmd->frame->abort;
@@ -960,11 +963,14 @@ megasas_issue_blocked_abort_cmd(struct megasas_instance *instance,
 	/*
 	 * Wait for this cmd to complete
 	 */
-	wait_event(instance->abort_cmd_wait_q, cmd->cmd_status != 0xFF);
+	wait_event_timeout(instance->abort_cmd_wait_q,
+			   cmd->cmd_status != 0xFF,
+			   MEGASAS_INTERNAL_CMD_WAIT_TIME * HZ);
 	cmd->sync_cmd = 0;
+	status = cmd->cmd_status;
 	megasas_return_cmd(instance, cmd);
 
-	return 0;
+	return status == 0xFF ? -ENODATA : 0;
 }
 
 /**
@@ -3902,6 +3908,7 @@ megasas_get_seq_num(struct megasas_instance *instance,
 	struct megasas_dcmd_frame *dcmd;
 	struct megasas_evt_log_info *el_info;
 	dma_addr_t el_info_h = 0;
+	int rc;
 
 	cmd = megasas_get_cmd(instance);
@@ -3933,23 +3940,23 @@ megasas_get_seq_num(struct megasas_instance *instance,
 	dcmd->sgl.sge32[0].phys_addr = cpu_to_le32(el_info_h);
 	dcmd->sgl.sge32[0].length = cpu_to_le32(sizeof(struct megasas_evt_log_info));
 
-	megasas_issue_blocked_cmd(instance, cmd);
-
-	/*
-	 * Copy the data back into callers buffer
-	 */
-	eli->newest_seq_num = le32_to_cpu(el_info->newest_seq_num);
-	eli->oldest_seq_num = le32_to_cpu(el_info->oldest_seq_num);
-	eli->clear_seq_num = le32_to_cpu(el_info->clear_seq_num);
-	eli->shutdown_seq_num = le32_to_cpu(el_info->shutdown_seq_num);
-	eli->boot_seq_num = le32_to_cpu(el_info->boot_seq_num);
-
+	rc = megasas_issue_blocked_cmd(instance, cmd);
+	if (!rc) {
+		/*
+		 * Copy the
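The conversion in the patch relies on the standard wait_event_timeout() semantics: it returns 0 if the condition is still false when the timeout expires, and the remaining jiffies otherwise. A minimal sketch of that pattern follows; the demo_ helper is hypothetical, and the patch itself detects the timeout via cmd->cmd_status rather than the return value.

#include <linux/wait.h>
#include <linux/errno.h>

/* Wait a bounded time for the firmware to update cmd_status; report
 * -ETIMEDOUT instead of blocking forever if it never answers. */
static int demo_wait_blocked_cmd(struct megasas_instance *instance,
				 struct megasas_cmd *cmd)
{
	long left = wait_event_timeout(instance->int_cmd_wait_q,
				       cmd->cmd_status != ENODATA,
				       MEGASAS_INTERNAL_CMD_WAIT_TIME * HZ);

	return left ? 0 : -ETIMEDOUT;	/* 0 jiffies left means we timed out */
}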