Re: [OpenIndiana-discuss] Firefox 20

2013-04-16 Thread Apostolos Syropoulos

 I tried as well with the tar.bz2 package, and it is the same crash and core 
 dump.

The problem has been fixed; the cause was a compiler bug. I just wonder
why they keep compiling with Solaris Studio?

 
A.S.

--
Apostolos Syropoulos
Xanthi, Greece


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Firefox 20

2013-04-16 Thread Bob Friesenhahn

On Tue, 16 Apr 2013, Apostolos Syropoulos wrote:




I tried as well with the tar.bz2 package, and it is the same crash and core
dump.


The problem has been fixed; the cause was a compiler bug. I just wonder
why they keep compiling with Solaris Studio?


It avoids needing to depend on GCC run-time libraries, for which there 
is no useful standard on Solaris.  It also helps keep the code 
honest by ensuring that it is compiled with at least three different 
compilers.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

2013-04-16 Thread Edward Ned Harvey (openindiana)
 From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
 
 It would be difficult to believe that 10Gbit Ethernet offers better
 bandwidth than 56Gbit InfiniBand (the current offering).  The switching
 model is quite similar.  The main reason why IB offers better latency
 is a better HBA hardware interface and a specialized stack.  5X is 5X.

Put another way, the reason InfiniBand has so much higher throughput and
lower latency than Ethernet is that the switching (at the physical layer) is
completely different, and messages are passed directly from user level to
RAM on the remote system via RDMA, bypassing the OSI layer model and other
kernel overhead.  I read a paper from VMware in which they implemented RDMA
over Ethernet and doubled the speed of vMotion (though it was still slower
than InfiniBand, by something like 4x).

Besides bypassing the OSI layers and kernel latency, IB latency is lower
because Ethernet switches use store-and-forward buffering managed by the
backplane in the switch: a sender sends a packet to a buffer on the switch,
which pushes it through the backplane and finally into another buffer on the
destination port.  IB uses cross-bar, or cut-through, switching, in which
the sending host channel adapter signals the destination address to the
switch and then waits for the channel to be opened.  Once the channel is
opened, it stays open, and the switch in between does little more than
signal amplification (plus additional virtual lanes for congestion
management and other functions).  The sender writes directly to RAM on the
destination via RDMA, with no buffering in between and no OSI layer model to
traverse; hence the much lower latency.
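
To put rough numbers on the store-and-forward penalty, here is a
back-of-the-envelope sketch in Python; the frame size, header size, hop
counts and link speed are illustrative assumptions, not measurements of any
particular switch:

# Rough end-to-end delivery time for one frame crossing several switches.
# Store-and-forward re-serializes the whole frame at every hop; cut-through
# is pipelined, so each switch only adds roughly a header's worth of delay.
# All figures are assumptions for illustration.

FRAME_BYTES = 1500      # assumed full-size Ethernet frame
HEADER_BYTES = 64       # assumed lookup portion for a cut-through switch
LINK_BPS = 10e9         # 10 Gbit/s links throughout

def t(nbytes, bps=LINK_BPS):
    """Serialization time in seconds for nbytes on a bps link."""
    return nbytes * 8 / bps

for switches in (1, 2, 4):
    links = switches + 1
    store_fwd = links * t(FRAME_BYTES)            # full buffering per hop
    cut_thru = t(FRAME_BYTES) + switches * t(HEADER_BYTES)
    print("%d switch(es): store-and-forward ~%.2f us, cut-through ~%.2f us"
          % (switches, store_fwd * 1e6, cut_thru * 1e6))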

IB also has native link aggregation into data-striped lanes, hence the 1x,
4x, and 12x designations and the 40Gbit specifications.  Something similar
is quasi-possible in Ethernet via LACP, but it is not as good and not the
same thing.  IB guarantees that packets are delivered in order, with native
congestion control, whereas Ethernet may drop packets and leave TCP to
detect and retransmit them...

Ethernet includes a lot of support for IP addressing and for variable link
speeds (some 10Gbit, some 10/100, some 1G, etc.), all of it asynchronous.
For these reasons, IB is not a suitable replacement for the IP
communications done on Ethernet, with lots of variable peer-to-peer and
broadcast traffic.  IB is designed for networks where systems establish
connections to other systems and those connections remain mostly static:
primarily clustering and storage networks, not primarily TCP/IP.


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

2013-04-16 Thread Doug Hughes
Some of these points are a bit dated; allow me to make some updates.  I'm
sure you are aware that most 10-gig switches these days are cut-through and
not store-and-forward: that's Arista, HP, Dell Force10, Mellanox, and
IBM/Blade.  Cisco has a mix of things, but they aren't really in the
low-latency space.  The 10G and 40G port-to-port forwarding is in
nanoseconds.  Deep buffering is mostly reserved for carrier operations these
days, and even there it is becoming less common because of the toll it takes
on things like IP video and VoIP.  Buffers are still good for web farms, and
to a certain extent for storage servers or WAN links where there is a high
degree of contention from disparate traffic.
  
At the physical level, the signalling of IB compared to Ethernet (10G+) is
very similar, which is why Mellanox can make a single chip that does 10Gbit
and 40Gbit Ethernet, and QDR and FDR InfiniBand, on any port.
There are also a fair number of vendors that support RDMA in Ethernet NICs
now, such as SolarFlare with its Onload technology.

The main reason for the lowest achievable latency is higher speed: latency
is roughly proportional to the inverse of bandwidth.  But the higher-level
protocols that you stack on top contribute much more than the hardware's
theoretical minimums or maximums.  TCP/IP is a killer in terms of added
overhead; that's why there are protocols like iSER, SRP, and friends.  RDMA
is much faster than going through the kernel overhead of TCP session setup
and the other host-side user/kernel boundaries and buffering.  PCI latency
is also higher than the port-to-port latency of a good 10G switch, never
mind 40G or FDR InfiniBand.
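
As a rough illustration of both points, i.e. that wire time shrinks with
link speed while a fixed software overhead quickly dominates, here is a toy
Python model; the per-message stack costs and message size are assumed
placeholders, not benchmarks:

# Toy model of one small message transfer: total latency = fixed software
# overhead (stack, syscalls, interrupts) + serialization time on the wire.
# The overhead and size figures are assumptions for illustration only.

MSG_BYTES = 4096

LINKS_GBPS = {"10GbE": 10, "40GbE": 40, "FDR IB (56G)": 56}
STACKS_US = {"kernel TCP/IP": 15.0, "RDMA / kernel bypass": 1.0}

for stack, sw_us in STACKS_US.items():
    for link, gbps in LINKS_GBPS.items():
        wire_us = MSG_BYTES * 8 / (gbps * 1e9) * 1e6
        total = sw_us + wire_us
        print("%-20s over %-12s: wire %5.2f us + stack %5.2f us = %6.2f us"
              % (stack, link, wire_us, sw_us, total))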

InfiniBand even exposes a special layer, called Verbs, that you can write
custom protocols against to lower latency further.

InfiniBand is inherently a layer-1 and layer-2 protocol, and the subnet
manager (software) is responsible for setting up all virtual circuits
(routes between hosts on the fabric) and for rerouting when a path goes bad.
Also, the link aggregation, as you mention, is rock solid and amazingly
good; auto-rerouting is fabulous and super fast.  But you don't get layer 3.
TCP over IB works out of the box but adds large overhead.  Still, it makes
it possible to have native IB and IP over IB, with gateways to a TCP
network, over a single cable.  That's pretty cool.


Sent from my android device.


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Mehmet Erol Sanliturk
I am not an expert on this subject, but based on my reading of e-mails on
various mailing lists and of some relevant Wikipedia pages about SSD drives,
the following disadvantages are mentioned (even for enterprise-labeled
drives):


SSD units are very vulnerable to power cuts during operation, up to complete
failure that leaves them unusable, or complete loss of data.

MLC (Multi-Level Cell) SSD units have a short lifetime if they are
continuously written (they are more suitable for write-once, in a
limited-number-of-writes sense, read-many use).

SLC (Single-Level Cell) SSD units have a much longer life span, but they are
expensive compared to MLC SSD units.

SSD units may fail from write wear at an unexpected time, making them very
unreliable for mission-critical work.

Because of the above points (which may perhaps be wrong), I would personally
select rotating-platter SAS disks, and so far I have not bought any SSDs for
these reasons.

The above points are a possible set of disadvantages to consider.


Thank you very much.

Mehmet Erol Sanliturk
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jay Heyl
On Mon, Apr 15, 2013 at 5:00 AM, Edward Ned Harvey (openindiana) 
openindi...@nedharvey.com wrote:


 So I'm just assuming you're going to build a pool out of SSD's, mirrored,
 perhaps even 3-way mirrors.  No cache/log devices.  All the ram you can fit
 into the system.


What would be the logic behind mirrored SSD arrays? With spinning platters
the mirrors improve performance by allowing the fastest of the mirrors to
respond to a particular command to be the one that defines throughput. With
SSDs, they all should respond in basically the same time. There is no
latency due to head movement or waiting for the proper spot on the disc to
rotate under the heads. The improvement in read performance seen in
mirrored spinning platters should not be present with SSDs. Admittedly,
this is from a purely theoretical perspective. I've never assembled an SSD
array to compare mirrored vs RAID-Zx performance. I'm curious if you're
aware of something I'm overlooking.
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-16 19:17, Mehmet Erol Sanliturk wrote:

I am not an expert on this subject, but based on my reading of e-mails on
various mailing lists and of some relevant Wikipedia pages about SSD drives,
the following disadvantages are mentioned (even for enterprise-labeled
drives):



My awareness of the subject is of a similar nature, but with different
conclusions in places... Here goes:



SSD units are very vulnerable to power cuts during operation, up to complete
failure that leaves them unusable, or complete loss of data.


Yes, maybe, for some vendors. Information is scarce about which ones are
better in practical reliability, leading to requests for info like this
thread.

Still, some vendors make a living by selling expensive gear into critical
workloads, and are thought to perform well. One factor (though not always a
guarantee) in a proper shutdown after a power cut is the presence of either
batteries/accumulators or capacitors, which power the device long enough for
it to save its caches, metadata, etc. Then the mileage varies as to how well
each vendor does it.



MLC (Multi-Level Cell) SSD units have a short lifetime if they are
continuously written (they are more suitable for write-once, in a
limited-number-of-writes sense, read-many use).

SLC (Single-Level Cell) SSD units have a much longer life span, but they are
expensive compared to MLC SSD units.


I hear SLC is also faster due to its simpler design. The price stems from
needing more cells than MLC to implement the same number of storage bits.
There are also some new designs like eMLC, which are young and untested but
are said to offer MLC prices with SLC reliability.

As feature sizes shrink with each process generation, diffusion and Brownian
motion of atoms play an increasingly large role. Indeed, while early SSDs
boasted tens or hundreds of thousands of rewrite cycles, nowadays 5-10k is
considered good. But the devices are faster.



SSD units may fail from write wear at an unexpected time, making them very
unreliable for mission-critical work.


For this reason there is over-provisioning: the SSD firmware detects
unreliable chips and excludes them from use, relocating data onto spare
chips. There is also wear-leveling, where the firmware tries to make sure
that all chips are utilized more or less equally, so that on average the
device lives longer. Basically, an SSD (unlike a normal USB flash key)
implements a RAID over tens of chips, with intimate knowledge of and
diagnostic mechanisms for the storage pieces.

Overall, vendors now often rate their devices in gigabytes written over the
device's lifetime, or in full rewrites of the device. Last year we had a
similar discussion on-list regarding the then-new Intel DC S3700
http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/50424
and it struck me that in practical terms they boasted an endurance rating of
10 drive writes/day over 5 years. That is a lot for many use-cases. They are
also relatively pricey, at about $2.5/GB linearly from 100G to 800G devices
(in a local webshop here).
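
To translate that rating into total bytes written, simple arithmetic in
Python (the capacities and the ~$2.5/GB figure are just the ones quoted
above):

# Convert an endurance rating of "10 drive writes per day for 5 years"
# into total data written, for the capacities mentioned above.

DRIVE_WRITES_PER_DAY = 10
YEARS = 5

for capacity_gb in (100, 200, 400, 800):
    total_tb = capacity_gb * DRIVE_WRITES_PER_DAY * 365 * YEARS / 1000.0
    price = capacity_gb * 2.5   # roughly $2.5 per GB, as quoted above
    print("%4d GB drive: ~%8.0f TB written over %d years (~$%.0f)"
          % (capacity_gb, total_tb, YEARS, price))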



Because of the above points (which may perhaps be wrong), I would personally
select rotating-platter SAS disks, and so far I have not bought any SSDs for
these reasons.

The above points are a possible set of disadvantages to consider.


The points are not wrong in general, and there are any number of examples
where bad things do happen. But there are devices which are said to
successfully work around the fundamental drawbacks with other technology,
such as firmware and capacitors and so on.

It is indeed not yet a subject and market to be careless with, taking just
any device off the shelf and expecting it to perform well and live long. It
is also beneficial to do some homework during system configuration and to
reduce unnecessary writes to the SSDs: by moving logs out of the rpool,
disabling atime updates and so on. There are things an SSD is good for, and
some things HDDs are better at (or are commonly thought to be), namely price
and longevity past infant mortality. The choice of components depends on
expected system utilization and performance requirements, and on how much
you're ready to pay for it.

All that said, I haven't actually touched an SSD so far, mostly for
financial reasons with both the day-job and home rigs...

//Jim

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Andrew Gabriel

Mehmet Erol Sanliturk wrote:

I am not an expert on this subject, but based on my reading of e-mails on
various mailing lists and of some relevant Wikipedia pages about SSD drives,
the following disadvantages are mentioned (even for enterprise-labeled
drives):


SSD units are very vulnerable to power cuts during operation, up to complete
failure that leaves them unusable, or complete loss of data.


That's why some of them include their own momentary power store, or in
some systems, the system has a momentary power store to keep them powered
for a period after the last write operation.


MLC (Multi-Level Cell) SSD units have a short lifetime if they are
continuously written (they are more suitable for write-once, in a
limited-number-of-writes sense, read-many use).

SLC (Single-Level Cell) SSD units have a much longer life span, but they are
expensive compared to MLC SSD units.

SSD units may fail from write wear at an unexpected time, making them very
unreliable for mission-critical work.


All the Enterprise grade SSDs I've used can tell you how far through their
life they are (in terms of write wearing). Some of the monitoring tools pick
this up and warn you when you're down to some threshold, such as 20% left.

Secondly, when they wear out, they fail to write (effectively become write
protected). So you find out before they confirm committing your data, and
you can still read all the data back.

This is generally the complete opposite of the failure modes of hard drives,
although like any device, the SSD might fail for other reasons.

I have not played with consumer grade drives.


Because of the above points (which may perhaps be wrong), I would personally
select rotating-platter SAS disks, and so far I have not bought any SSDs for
these reasons.

The above points are a possible set of disadvantages to consider.


The extra cost of using loads of short-stroked 15k drives to get anywhere
near SSD performance is generally prohibitive.

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jay Heyl
On Tue, Apr 16, 2013 at 11:54 AM, Jim Klimov jimkli...@cos.ru wrote:

 On 2013-04-16 20:30, Jay Heyl wrote:

 What would be the logic behind mirrored SSD arrays? With spinning platters
 the mirrors improve performance by allowing the fastest of the mirrors to
 respond to a particular command to be the one that defines throughput.
 With


 Well, to think up a rationale: it is quite possible to saturate a bus
 or an HBA with SSDs, leading to increased latency under intense IO simply
 because some tasks (data packets) are queued up waiting for the bottleneck
 to clear. If the other side of the mirror has a different connection
 (another HBA, another PCI bus), then IOs can go there, increasing overall
 performance.


This strikes me as a strong argument for carefully planning the arrangement
of storage devices of any sort in relation to HBAs and buses. It seems
significantly less strong as an argument for a mirror _maybe_ having a
different connection and responding faster.

My question about the rationale behind the suggestion of mirrored SSD
arrays was really meant to be more in relation to the question from the OP.
I don't see how mirrored arrays of SSDs would be effective in his
situation.

Personally, I'd go with RAID-Z2 or RAID-Z3 unless the computational load on
the CPU is especially high. This would give you as good as or better fault
protection than mirrors at significantly less cost. Indeed, given his
scenario of write early, read often later on, I might even be tempted to go
for the new TLC SSDs from Samsung. For this particular use the much reduced
lifetime of the devices would probably not be a factor at all. OTOH,
given the almost-no-limits budget, shaving $100 here or there is probably
not a big consideration. (And just to be clear, I would NOT recommend the
TLC SSDs for a more general solution. It was specifically the write-few,
read-many scenario that made me think of them.)
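
For comparison, here is a small sketch of usable capacity versus fault
tolerance for a few layouts, assuming 12 drives of 500 GB each; the drive
count, size and layouts are illustrative assumptions only:

# Usable capacity and fault tolerance for a hypothetical 12-drive pool of
# 500 GB drives.  Purely combinatorial; ignores ZFS metadata, slop space
# and raidz padding.

DRIVES = 12
SIZE_GB = 500

layouts = [
    # (name, usable fraction of raw space, failures tolerated per vdev)
    ("6 x 2-way mirror",  0.5,       1),
    ("4 x 3-way mirror",  1.0 / 3,   2),
    ("1 x raidz2 (10+2)", 10.0 / 12, 2),
    ("1 x raidz3 (9+3)",  9.0 / 12,  3),
]

for name, frac, failures in layouts:
    usable_tb = DRIVES * SIZE_GB * frac / 1000.0
    print("%-19s usable ~%4.1f TB, tolerates %d failure(s) per vdev"
          % (name, usable_tb, failures))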

Basically, this answer stems from logic which applies to why would we
 need 6Gbit/s on HDDs? Indeed, HDDs won't likely saturate their buses
 with even sequential reads. The link speed really applies to the bursts
 of IO between the system and HDD's caches. Double bus speed roughly
 halves the time a HDD needs to keep the bus busy for its portion of IO.
 And when there are hundreds of disks sharing a resource (an expander
 for example), this begins to matter.


It's actually not all that difficult to saturate a 6Gb/s pathway with ZFS
when there are multiple storage devices on the other end of that path. No
single HDD today is going to come close to needing that full 6Gb/s, but put
four or five of them hanging off that same path and that ultra-super
highway starts looking pretty congested. Put SSDs on the other end and the
6Gb/s pathway is going to quickly become your bottleneck.
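
Rough numbers behind that congestion argument, assuming a single shared
6 Gb/s path (for example one lane of an expander uplink) and typical
streaming rates; the per-device figures are assumptions, not measurements:

# How many devices it takes to fill a single shared 6 Gb/s path.
# 8b/10b encoding leaves roughly 600 MB/s of payload bandwidth.
# Per-device streaming rates below are rough assumptions.

LINK_MB_S = 6e9 / 10 / 1e6      # 6 Gb/s, 10 bits per byte -> ~600 MB/s

devices = {"7200rpm HDD (~170 MB/s)": 170, "SATA SSD (~500 MB/s)": 500}

for name, rate in devices.items():
    n_to_saturate = LINK_MB_S / rate
    print("%-25s: %.1f of them saturate a shared ~%.0f MB/s path"
          % (name, n_to_saturate, LINK_MB_S))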
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Bob Friesenhahn

On Tue, 16 Apr 2013, Jay Heyl wrote:


It's actually not all that difficult to saturate a 6Gb/s pathway with ZFS
when there are multiple storage devices on the other end of that path. No
single HDD today is going to come close to needing that full 6Gb/s, but put
four or five of them hanging off that same path and that ultra-super
highway starts looking pretty congested. Put SSDs on the other end and the
6Gb/s pathway is going to quickly become your bottleneck.


SATA and SAS are dedicated point-to-point interfaces so there is no 
additive bottleneck with more drives as long as the devices are 
directly connected.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Sašo Kiselkov
On 04/16/2013 10:57 PM, Bob Friesenhahn wrote:
 On Tue, 16 Apr 2013, Jay Heyl wrote:

 It's actually not all that difficult to saturate a 6Gb/s pathway with ZFS
 when there are multiple storage devices on the other end of that path. No
 single HDD today is going to come close to needing that full 6Gb/s,
 but put
 four or five of them hanging off that same path and that ultra-super
 highway starts looking pretty congested. Put SSDs on the other end and
 the
 6Gb/s pathway is going to quickly become your bottleneck.
 
 SATA and SAS are dedicated point-to-point interfaces so there is no
 additive bottleneck with more drives as long as the devices are directly
 connected.

Not true. Modern flash storage is quite capable of saturating a 6 Gbps
SATA link. SAS has an advantage here, being dual-port natively with
active-active load balancing deployed as standard practice. Also please
note that SATA is half-duplex, whereas SAS is full-duplex.

The problem with SATA vs SAS for flash storage is that there are, as
yet, no flash devices of the NL-SAS kind. By this I mean drives that
are only about 10-20% more expensive than their SATA counterparts,
offering native SAS connectivity, but not top-notch enterprise
features and/or performance. This situation existed in HDDs not long
ago: you had 7k2 SATA and 10k/15k SAS, but no 7k2 SAS. That's why we had
to do all that nonsense with SAS to SATA interposers (I have an old Sun
J4200 with 1TB SATA drives that had an interposer on each of the 12
drives). Since then, NL-SAS largely made this route obsolete, so now I
just buy 7k2 NL-SAS drives and skip the whole interposer thing.

Now if any of the big storage drive makers got their shit together and
started offering flash storage with native SAS at slightly above SATA
prices, I'd be delighted. Trouble is, the manufacturers seem to be trying to
position SAS SSDs as even more expensive (and higher-performing) than SAS
HDDs. When I can buy a 512GB SATA SSD for the price of a 600GB SAS drive,
that seems a strange proposition indeed...

--
Saso

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Timothy Coalson
On Tue, Apr 16, 2013 at 3:48 PM, Jay Heyl j...@frelled.us wrote:

 My question about the rationale behind the suggestion of mirrored SSD
 arrays was really meant to be more in relation to the question from the OP.
 I don't see how mirrored arrays of SSDs would be effective in his
 situation.


There is another detail here to keep in mind: ZFS checks checksums on every
read from storage. With raid-zn used with block sizes that give it more
capacity than mirroring (that is, data blocks large enough to be split
across multiple data sectors and therefore devices, instead of the
degenerate case of a single data sector plus parity sector(s); the OP
mentioned 32K blocks, so they should get split), each random filesystem read
that isn't cached hits a large number of devices in a raid-zn vdev, but only
one device in a mirror vdev (unless ZFS splits these reads across mirrors,
but even then fewer devices are hit).  If you are limited by the IOPS of the
devices, this could make raid-zn slower.
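
A crude way to put numbers on that theory (the per-device IOPS figure and
the pool shapes below are assumptions; the model also ignores caching and
I/O aggregation entirely):

# Crude random-read IOPS ceiling: if every uncached read of one block
# touches 'per_read' devices, the pool tops out at roughly
# total_devices * per_device_iops / per_read.
# Per-device IOPS and pool shapes are assumptions for illustration.

PER_DEVICE_IOPS = 50000      # assumed SSD random-read IOPS
TOTAL_DEVICES = 12

pools = {
    "6 x 2-way mirror": 1,   # one side of one mirror serves each read
    "2 x raidz2 (4+2)": 4,   # all data drives of one vdev serve each read
    "1 x raidz2 (10+2)": 10,
}

for name, per_read in pools.items():
    ceiling = TOTAL_DEVICES * PER_DEVICE_IOPS / per_read
    print("%-18s ~%8.0f random reads/s ceiling" % (name, ceiling))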

Disclaimer: this is theory, I haven't tested this in practice, nor have I
done any math to see if it should matter to SSDs.  However, since it is a
configuration question rather than a hardware question, it may be possible
to acquire (some of) the hardware first and test both setups before
deciding.

Tim
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Sašo Kiselkov
On 04/16/2013 11:25 PM, Timothy Coalson wrote:
 On Tue, Apr 16, 2013 at 3:48 PM, Jay Heyl j...@frelled.us wrote:
 
 My question about the rationale behind the suggestion of mirrored SSD
 arrays was really meant to be more in relation to the question from the OP.
 I don't see how mirrored arrays of SSDs would be effective in his
 situation.

 
 There is another detail here to keep in mind: ZFS checks checksums on every
 read from storage, and with raid-zn used with block sizes that give it more
 capacity than mirroring (that is, data blocks are large enough that they
 get split across multiple data sectors and therefore devices, instead of
 degenerate single data sector plus parity sector(s) - OP mentioned 32K
 blocks, so they should get split), this means each random filesystem read
 that isn't cached hits a large number of devices in a raid-zn vdev, but
 only one device in a mirror vdev (unless ZFS splits these reads across
 mirrors, but even then it is still fewer devices hit).  If you are limited
 by IOPS of the devices, then this could make raid-zn slower.

If you are IOPS constrained, then yes, raid-zn will be slower, simply
because any read needs to hit all data drives in the stripe. This is
even worse on writes if the raidz has bad geometry (number of data
drives isn't a power of 2).

Cheers,
--
Saso

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Timothy Coalson
On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov skiselkov...@gmail.comwrote:

 If you are IOPS constrained, then yes, raid-zn will be slower, simply
 because any read needs to hit all data drives in the stripe. This is
 even worse on writes if the raidz has bad geometry (number of data
 drives isn't a power of 2).


Off topic slightly, but I have always wondered about this: what exactly
causes geometries with a non-power-of-2 number of data drives (plus parity)
to be slower, and by how much?  I tested for this effect with some consumer
drives, comparing 8+2 and 10+2, and didn't see much of a penalty (though the
only random test I did was read; our workload is highly sequential so it
wasn't important).

Tim
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Sašo Kiselkov
On 04/16/2013 11:37 PM, Timothy Coalson wrote:
 On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov skiselkov...@gmail.comwrote:
 
 If you are IOPS constrained, then yes, raid-zn will be slower, simply
 because any read needs to hit all data drives in the stripe. This is
 even worse on writes if the raidz has bad geometry (number of data
 drives isn't a power of 2).

 
 Off topic slightly, but I have always wondered at this - what exactly
 causes non-power of 2 plus number of parities geometries to be slower, and
 by how much?  I tested for this effect with some consumer drives, comparing
 8+2 and 10+2, and didn't see much of a penalty (though the only random test
 I did was read, our workload is highly sequential so it wasn't important).

Because a non-power-of-2 number of data drives causes a read-modify-write
sequence on (almost) every write. HDDs are block devices and can only write
in increments of their sector size (512 bytes, or nowadays often 4096
bytes). Using your example above: divide a 128k block by 8 and you get 8x
16k updates, all nicely aligned on 512-byte boundaries, so your drives can
write them in one go. Divide by 10 and you get an ugly 12.8k, which means
that if your drives are of the 512-byte-sector variety, they write 25 full
512-byte sectors and then, for the last partial sector, they first need to
fetch the sector from the platter, modify it in memory and write it out
again.
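
The division in question, spelled out in a few lines of Python (this just
reproduces the arithmetic above for a 128k block and 512-byte sectors; see
Richard's correction further down for how raidz actually allocates):

# Split a 128 KiB block across N data drives and check whether each
# drive's share is a whole number of 512-byte sectors.

BLOCK = 128 * 1024
SECTOR = 512

for data_drives in (8, 10):
    share = BLOCK / float(data_drives)
    sectors = share / SECTOR
    aligned = share % SECTOR == 0
    print("%2d data drives: %7.1f bytes per drive (%5.2f sectors) %s"
          % (data_drives, share, sectors,
             "aligned" if aligned else "-> partial sector, read-modify-write"))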

I said almost every write is affected, but this largely depends on
your workload. If your writes are large async writes, then this RMW
cycle only happens at the end of the transaction commit (simplifying a
bit, but you get the idea), which is pretty small. However, if you are
doing many small updates in different locations (e.g. writing the ZIL),
this can significantly amplify the load.

Cheers,
--
Saso

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jay Heyl
On Tue, Apr 16, 2013 at 2:25 PM, Timothy Coalson tsc...@mst.edu wrote:

 On Tue, Apr 16, 2013 at 3:48 PM, Jay Heyl j...@frelled.us wrote:

  My question about the rationale behind the suggestion of mirrored SSD
  arrays was really meant to be more in relation to the question from the
 OP.
  I don't see how mirrored arrays of SSDs would be effective in his
  situation.
 

 There is another detail here to keep in mind: ZFS checks checksums on every
 read from storage, and with raid-zn used with block sizes that give it more
 capacity than mirroring (that is, data blocks are large enough that they
 get split across multiple data sectors and therefore devices, instead of
 degenerate single data sector plus parity sector(s) - OP mentioned 32K
 blocks, so they should get split), this means each random filesystem read
 that isn't cached hits a large number of devices in a raid-zn vdev, but
 only one device in a mirror vdev (unless ZFS splits these reads across
 mirrors, but even then it is still fewer devices hit).  If you are limited
 by IOPS of the devices, then this could make raid-zn slower.


I'm getting a sense of comparing apples to oranges here, but I do see your
point about the raid-zn always requiring reads from more devices due to the
parity. OTOH, it was my impression that read operations on n-way mirrors
are always issued to each of the 'n' mirrors. Just for the sake of
argument, let's say we need room for 1TB of storage. For raid-z2 we use
4x500GB devices. For the mirrored setup we have two mirrors each with
2x500GB devices. Reads to the raid-z2 system will hit four devices. If my
assumption is correct, reads to the mirrored system will also hit four
devices. If we go to a 3-way mirror, reads would hit six devices.

In all but degenerate cases, mirrored arrangements are going to include
more drives for the same amount of usable storage, so it seems they should
result in more devices being hit for both read and write. Or am I wrong
about reads being issued in parallel to all the mirrors in the array?
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Timothy Coalson
On Tue, Apr 16, 2013 at 4:44 PM, Sašo Kiselkov skiselkov...@gmail.comwrote:

 On 04/16/2013 11:37 PM, Timothy Coalson wrote:
  On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov skiselkov...@gmail.com
 wrote:
 
  If you are IOPS constrained, then yes, raid-zn will be slower, simply
  because any read needs to hit all data drives in the stripe. This is
  even worse on writes if the raidz has bad geometry (number of data
  drives isn't a power of 2).
 
 
  Off topic slightly, but I have always wondered at this - what exactly
  causes non-power of 2 plus number of parities geometries to be slower,
 and
  by how much?  I tested for this effect with some consumer drives,
 comparing
  8+2 and 10+2, and didn't see much of a penalty (though the only random
 test
  I did was read, our workload is highly sequential so it wasn't
 important).

 Because a non-power-of-2 number of drives causes a read-modify-write
 sequence on every (almost) write. HDDs are block devices and they can
 only ever write in increments of their sector size (512 bytes or
 nowadays often 4096 bytes). Using your example above, you divide a 128k
 block by 8, you get 8x16k updates - all nicely aligned on 512 byte
 boundaries, so your drives can write that in one go. If you divide by
 10, you get an ugly 12.8k, which means if your drives are of the
 512-byte sector variety, they write 24x 512 sectors and then for the
 last partial sector write, they first need to fetch the sector from the
 patter, modify if in memory and then write it out again.

 I said almost every write is affected, but this largely depends on
 your workload. If your writes are large async writes, then this RMW
 cycle only happens at the end of the transaction commit (simplifying a
 bit, but you get the idea), which is pretty small. However, if you are
 doing many small updates in different locations (e.g. writing the ZIL),
 this can significantly amplify the load.


Okay, I get that carryover from a partial stripe causes problems; that
makes sense, and it at least has implications for space efficiency, given
that ZFS mainly uses power-of-2 block sizes.  However, I was not under the
impression that ZFS ever uses partial sectors; rather, it uses fewer devices
in the final stripe, i.e. it would be split 10+2, 10+2, ..., 6+2.  If what
you say were true, I'm not sure how ZFS both manages to address halfway
through a sector (if it must keep that old partial sector, it must be used
somewhere, yes?) and yet has problems with changing sector sizes (the
infamous ashift).  Are you perhaps thinking of block-device-style software
RAID, where you need to ensure that even non-useful bits have correct parity
computed?

Tim
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread alka
ZFS data blocks are also a power of two, which means that if you have
1, 2, 4, 8, 16, 32, ... data disks, every write is spread evenly over all
disks.

If you add one disk, e.g. going from 8 to 9 data disks, then on any given
read/write one disk is not used.
Does that mean 9 data disks are slower than 8 disks?

No, 9 disks are faster; maybe not 1/9 faster, but faster.
So treat this more as a myth.

Add the RAID redundancy disks to that count: for example, with 8 data disks,
add one disk for Z1 (9), 2 disks for Z2 (10), or 3 disks for Z3 (11) disks
per vdev.




Am 16.04.2013 um 23:37 schrieb Timothy Coalson:

 On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov skiselkov...@gmail.comwrote:
 
 If you are IOPS constrained, then yes, raid-zn will be slower, simply
 because any read needs to hit all data drives in the stripe. This is
 even worse on writes if the raidz has bad geometry (number of data
 drives isn't a power of 2).
 
 
 Off topic slightly, but I have always wondered at this - what exactly
 causes non-power of 2 plus number of parities geometries to be slower, and
 by how much?  I tested for this effect with some consumer drives, comparing
 8+2 and 10+2, and didn't see much of a penalty (though the only random test
 I did was read, our workload is highly sequential so it wasn't important).
 
 Tim
 ___
 OpenIndiana-discuss mailing list
 OpenIndiana-discuss@openindiana.org
 http://openindiana.org/mailman/listinfo/openindiana-discuss

--


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Richard Elling
clarification below...

On Apr 16, 2013, at 2:44 PM, Sašo Kiselkov skiselkov...@gmail.com wrote:

 On 04/16/2013 11:37 PM, Timothy Coalson wrote:
 On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov skiselkov...@gmail.comwrote:
 
 If you are IOPS constrained, then yes, raid-zn will be slower, simply
 because any read needs to hit all data drives in the stripe. This is
 even worse on writes if the raidz has bad geometry (number of data
 drives isn't a power of 2).
 
 
 Off topic slightly, but I have always wondered at this - what exactly
 causes non-power of 2 plus number of parities geometries to be slower, and
 by how much?  I tested for this effect with some consumer drives, comparing
 8+2 and 10+2, and didn't see much of a penalty (though the only random test
 I did was read, our workload is highly sequential so it wasn't important).

This makes sense, even for more random workloads.

 
 Because a non-power-of-2 number of drives causes a read-modify-write
 sequence on every (almost) write. HDDs are block devices and they can
 only ever write in increments of their sector size (512 bytes or
 nowadays often 4096 bytes). Using your example above, you divide a 128k
 block by 8, you get 8x16k updates - all nicely aligned on 512 byte
 boundaries, so your drives can write that in one go. If you divide by
 10, you get an ugly 12.8k, which means if your drives are of the
 512-byte sector variety, they write 24x 512 sectors and then for the
 last partial sector write, they first need to fetch the sector from the
 patter, modify if in memory and then write it out again.

This is true for RAID-5/6, but it is not true for ZFS or raidz. Though it has 
been
a few years, I did a bunch of tests and found no correlation between the number
of disks in the set (within boundaries as described in the man page) and random
performance for raidz. This is not the case for RAID-5/6 where pathologically
bad performance is easy to create if you know the number of disks and stripe 
width.
 -- richard

 
 I said almost every write is affected, but this largely depends on
 your workload. If your writes are large async writes, then this RMW
 cycle only happens at the end of the transaction commit (simplifying a
 bit, but you get the idea), which is pretty small. However, if you are
 doing many small updates in different locations (e.g. writing the ZIL),
 this can significantly amplify the load.
 
 Cheers,
 --
 Saso
 
 ___
 OpenIndiana-discuss mailing list
 OpenIndiana-discuss@openindiana.org
 http://openindiana.org/mailman/listinfo/openindiana-discuss

--

richard.ell...@richardelling.com
+1-760-896-4422



___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Bob Friesenhahn

On Tue, 16 Apr 2013, Sašo Kiselkov wrote:


SATA and SAS are dedicated point-to-point interfaces so there is no
additive bottleneck with more drives as long as the devices are directly
connected.


Not true. Modern flash storage is quite capable of saturating a 6 Gbps
SATA link. SAS has an advantage here, being dual-port natively with
active-active load balancing deployed as standard practice. Also please
note that SATA is half-duplex, whereas SAS is full-duplex.


You did not describe how my statement about the bottleneck not being
additive is wrong.  That is different from per-drive bandwidth being
insufficient for the latest SSDs.  Please expound on the "Not true".


SAS/SATA are not like old parallel SCSI.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-16 23:56, Jay Heyl wrote:

result in more devices being hit for both read and write. Or am I wrong
about reads being issued in parallel to all the mirrors in the array?


Yes, in the normal case (not scrubbing, which makes a point of reading
everything) this assumption is wrong. Writes do hit all devices (mirror
halves or raid disks), but reads should be in parallel: for mechanical HDDs
this allows average read speeds to double (or triple for 3-way mirrors,
etc.), because different spindles begin using their heads in shorter strokes
around different areas, if there are enough concurrent randomly placed
reads.

Due to the ZFS data structure, you know your target block's expected
checksum before you read its data (from either mirror half, or from the
data disks in a raidzN); ZFS then calculates the checksum of the data it has
read and combined, and only if there is a mismatch does it have to read from
the other disks for redundancy (and fix the detected broken part).

HTH,
//Jim


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Timothy Coalson
On Tue, Apr 16, 2013 at 6:01 PM, Jim Klimov jimkli...@cos.ru wrote:

 On 2013-04-16 23:56, Jay Heyl wrote:

 result in more devices being hit for both read and write. Or am I wrong
 about reads being issued in parallel to all the mirrors in the array?


 Yes, in normal case (not scrubbing which makes a point of reading
 everything) this assumption is wrong. Writes do hit all devices
 (mirror halves or raid disks), but reads should be in parallel.
 For mechanical HDDs this allows to double average read speeds
 (or triple for 3-way mirrors, etc.) because different spindles
 begin using their heads in shorter strokes around different areas,
 if there are enough concurrent randomly placed reads.


There is another part to his question, specifically whether a single random
read that falls within one block of the file hits more than one top level
vdev - to put it another way, whether a single block of a file is striped
across top level vdevs.  I believe every block is allocated from one and
only one vdev (blocks with ditto copies allocate multiple blocks, ideally
from different vdevs, but this is not the same thing), such that every read
that hits only one file block goes to only one top level vdev unless
something goes wrong badly enough to need a ditto copy.

Tim
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-16 23:37, Timothy Coalson wrote:

On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov skiselkov...@gmail.comwrote:


If you are IOPS constrained, then yes, raid-zn will be slower, simply
because any read needs to hit all data drives in the stripe. This is
even worse on writes if the raidz has bad geometry (number of data
drives isn't a power of 2).



Off topic slightly, but I have always wondered at this - what exactly
causes non-power of 2 plus number of parities geometries to be slower, and
by how much?  I tested for this effect with some consumer drives, comparing
8+2 and 10+2, and didn't see much of a penalty (though the only random test
I did was read, our workload is highly sequential so it wasn't important).



My take on this is not that these geometries are slower, but that they may
be less efficient in terms of data-storage overhead.

Say you write a 16-sector block of userdata to your arrays.
In the case of 8+2 that would be two full stripes of parity and data.
In the case of 9+2 that would be a 9+2 and a 7+2 stripe. Access to this data
is less balanced, placing more load on the disks which hold 2 sectors of
this block and less on those which hold only one. It is even worse when
(e.g. due to compression) you have 1 or 2 userdata sectors remaining on a
second stripe, but must still provide the 2 or 3 sectors of redundancy for
this mini-stripe.

Also, as I found, ZFS raidzN takes precautions not to leave potentially
unusable holes (i.e. 1 or 2 free sectors, where you can't fit parity and
data), so it will allocate full stripes when you have sufficiently unlucky
stripe lengths just a few sectors shorter than full (i.e. the 7+2 above
would likely be allocated as 9+2 with zeroed-out extra sectors)... These
things do add up to gigabytes, though they can happen on power-of-two-sized
arrays with compression just as easily (I found this with a 6-disk raidz2,
with 4 data disks).
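
A small sketch of that accounting (it only counts data and parity sectors
row by row, in the simplified picture described above; it does not model the
extra padding that the real allocator may add):

# Lay a block of `data_sectors` out over `width` data disks plus `parity`
# parity disks, row by row, and count what gets written.

def layout(data_sectors, width, parity):
    rows = []
    left = data_sectors
    while left > 0:
        d = min(width, left)
        rows.append((d, parity))
        left -= d
    total = sum(d + p for d, p in rows)
    return rows, total

for width in (8, 9):
    rows, total = layout(16, width, 2)
    desc = " + ".join("%d+%d" % (d, p) for d, p in rows)
    print("16 data sectors on %d+2: rows %s, %d sectors written in total"
          % (width, desc, total))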

Jim


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-17 01:12, Timothy Coalson wrote:

On Tue, Apr 16, 2013 at 6:01 PM, Jim Klimov jimkli...@cos.ru wrote:


On 2013-04-16 23:56, Jay Heyl wrote:


result in more devices being hit for both read and write. Or am I wrong
about reads being issued in parallel to all the mirrors in the array?



Yes, in normal case (not scrubbing which makes a point of reading
everything) this assumption is wrong. Writes do hit all devices
(mirror halves or raid disks), but reads should be in parallel.
For mechanical HDDs this allows to double average read speeds
(or triple for 3-way mirrors, etc.) because different spindles
begin using their heads in shorter strokes around different areas,
if there are enough concurrent randomly placed reads.



There is another part to his question, specifically whether a single random
read that falls within one block of the file hits more than one top level
vdev - to put it another way, whether a single block of a file is striped
across top level vdevs.  I believe every block is allocated from one and
only one vdev (blocks with ditto copies allocate multiple blocks, ideally
from different vdevs, but this is not the same thing), such that every read
that hits only one file block goes to only one top level vdev unless
something goes wrong badly enough to need a ditto copy.


I believe so too... I think striping over top-level vdevs is subject to
many tuning and algorithmic influences, and is not as simple as even-odd
IOs. Also, IIRC there is some amount of data (several MBytes) that is
preferably sent as sequential IO to one top-level vdev before striping moves
on to another, in order to better exploit the strength of fast sequential IO
versus the lag of seeking.

But I may be way off-track here with this belief ;)

Jim


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Sašo Kiselkov
On 04/17/2013 12:08 AM, Richard Elling wrote:
 clarification below...
 
 On Apr 16, 2013, at 2:44 PM, Sašo Kiselkov skiselkov...@gmail.com wrote:
 
 On 04/16/2013 11:37 PM, Timothy Coalson wrote:
 On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov 
 skiselkov...@gmail.comwrote:

 If you are IOPS constrained, then yes, raid-zn will be slower, simply
 because any read needs to hit all data drives in the stripe. This is
 even worse on writes if the raidz has bad geometry (number of data
 drives isn't a power of 2).


 Off topic slightly, but I have always wondered at this - what exactly
 causes non-power of 2 plus number of parities geometries to be slower, and
 by how much?  I tested for this effect with some consumer drives, comparing
 8+2 and 10+2, and didn't see much of a penalty (though the only random test
 I did was read, our workload is highly sequential so it wasn't important).
 
 This makes sense, even for more random workloads.
 

 Because a non-power-of-2 number of drives causes a read-modify-write
 sequence on every (almost) write. HDDs are block devices and they can
 only ever write in increments of their sector size (512 bytes or
 nowadays often 4096 bytes). Using your example above, you divide a 128k
 block by 8, you get 8x16k updates - all nicely aligned on 512 byte
 boundaries, so your drives can write that in one go. If you divide by
 10, you get an ugly 12.8k, which means if your drives are of the
 512-byte sector variety, they write 24x 512 sectors and then for the
 last partial sector write, they first need to fetch the sector from the
 patter, modify if in memory and then write it out again.
 
 This is true for RAID-5/6, but it is not true for ZFS or raidz. Though it has 
 been
 a few years, I did a bunch of tests and found no correlation between the 
 number
 of disks in the set (within boundaries as described in the man page) and 
 random
 performance for raidz. This is not the case for RAID-5/6 where pathologically
 bad performance is easy to create if you know the number of disks and stripe 
 width.
  -- richard

You are right, and I think I already know where I went wrong, though
I'll need to check raidz_map_alloc to confirm. If memory serves me
right, raidz actually splits the I/O up so that each stripe component is
simply length-aligned and padded out to complete a full sector
(otherwise the zio_vdev_child_io would fail in a block-alignment
assertion in zio_create here:

zio_create(zio_t *pio, spa_t *spa,...
{
  ..
  ASSERT(P2PHASE(size, SPA_MINBLOCKSIZE) == 0);
  ..

I was probably misremembering the power-of-2 rule from a discussion
about 4k sector drives. There the amount of wasted space can be
significant, especially on small-block data, e.g. the default 8k
volblocksize not being able to scale beyond 2 data drives + parity.
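
A quick illustration of that small-block effect on 4k-sector (ashift=12)
drives, using the commonly described raidz rule of thumb of one set of
parity sectors per row plus rounding the allocation up to a multiple of
(nparity + 1); this is a sketch under that assumption, not an authoritative
model of the allocator:

# Allocation of one 8 KiB block on ashift=12 (4 KiB sectors) raidz.
# Rule of thumb used here: parity sectors per row of data, then the total
# rounded up to a multiple of (nparity + 1).  A sketch, not gospel.

SECTOR = 4096
BLOCK = 8192

def raidz_alloc_sectors(data_sectors, data_drives, nparity):
    rows = -(-data_sectors // data_drives)        # ceiling division
    total = data_sectors + rows * nparity         # data + parity sectors
    mult = nparity + 1
    return -(-total // mult) * mult               # round up to a multiple

data_sectors = BLOCK // SECTOR                    # = 2 for an 8 KiB block
for nparity, widths in ((1, (2, 4, 8)), (2, (4, 8))):
    for width in widths:
        alloc = raidz_alloc_sectors(data_sectors, width, nparity)
        eff = 100.0 * data_sectors / alloc
        print("raidz%d, %2d data drives: %d sectors allocated for 2 data "
              "sectors (%.0f%% efficient)" % (nparity, width, alloc, eff))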

Cheers,
--
Saso

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Edward Ned Harvey (openindiana)
 From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
 
 If you are IOPS constrained, then yes, raid-zn will be slower, simply
 because any read needs to hit all data drives in the stripe. 

Saso, I would expect you to know the answer to this question, probably:
I have heard that raidz is more similar to raid-1e than raid-5.  Meaning, when 
you write data to raidz, it doesn't get striped across all devices in the raidz 
vdev...  Rather, two copies of the data get written to any of the available 
devices in the raidz.  Can you confirm?

If the behavior is to stripe across all the devices in the raidz, then the 
raidz iops really can't exceed that of a single device, because you have to 
wait for every device to respond before you have a complete block of data.  But 
if it's more like raid-1e and individual devices can read independently of each 
other, then at least theoretically, the raidz with n-devices in it could return 
iops performance on-par with n-times a single disk. 


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jay Heyl
On Tue, Apr 16, 2013 at 4:01 PM, Jim Klimov jimkli...@cos.ru wrote:

 On 2013-04-16 23:56, Jay Heyl wrote:

 result in more devices being hit for both read and write. Or am I wrong
 about reads being issued in parallel to all the mirrors in the array?


 Yes, in normal case (not scrubbing which makes a point of reading
 everything) this assumption is wrong. Writes do hit all devices
 (mirror halves or raid disks), but reads should be in parallel.
 For mechanical HDDs this allows to double average read speeds
 (or triple for 3-way mirrors, etc.) because different spindles
 begin using their heads in shorter strokes around different areas,
 if there are enough concurrent randomly placed reads.


Not to get into bickering about semantics, but I asked, Or am I wrong
about reads being issued in parallel to all the mirrors in the array?, to
which you replied, Yes, in normal case... this assumption is wrong... but
reads should be in parallel. (Ellipses intended for clarity, not argument
munging.) If reads are in parallel, then it seems as though my assumption
is correct. I realize the system will discard data from all but the first
reads and that using only the first response can improve performance, but
in terms of number of IOPs, which is where I intended to go with this, it
seems to me the mirrored system will have at least as many if not more than
the raid-zn system.

Or have I completely misunderstood what you intended to say?
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Edward Ned Harvey (openindiana)
 From: Jay Heyl [mailto:j...@frelled.us]
 
  So I'm just assuming you're going to build a pool out of SSD's, mirrored,
  perhaps even 3-way mirrors.  No cache/log devices.  All the ram you can fit
  into the system.
 
 What would be the logic behind mirrored SSD arrays? With spinning platters
 the mirrors improve performance by allowing the fastest of the mirrors to
 respond to a particular command to be the one that defines throughput.

When you read from a mirror, ZFS doesn't read the same data from both sides
of the mirror simultaneously and let them race, wasting bus and memory
bandwidth in an attempt to gain slightly lower latency.  If you have a
single thread doing serial reads, I also have no reason to believe that ZFS
reads stripes from multiple sides of the mirror to accelerate it; rather, it
relies on the striping across multiple mirrors or vdevs.

But if you have multiple threads requesting independent random read
operations on the same mirror, I have measured that you get very nearly n
times the random read performance of a single disk by using an n-way mirror
and at least n or 2n independent random read threads.


 There is no
 latency due to head movement or waiting for the proper spot on the disc to
 rotate under the heads. 

Nothing, including ZFS, has in-depth enough knowledge of the inner drive
geometry to know how long it takes for the rotational latency to come
around.  Also, rotational latency is almost nothing compared to a head seek.
For this reason, short-stroking makes a big difference when you have a data
usage pattern that can easily be confined to a small number of adjacent
tracks.  I believe that if you use an HDD as a log device it effectively
ends up short-stroking itself, but I don't actually know.  Also, this is
really a completely separate subject.


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Richard Elling
For the context of ZPL, easy answer below :-) ...

On Apr 16, 2013, at 4:12 PM, Timothy Coalson tsc...@mst.edu wrote:

 On Tue, Apr 16, 2013 at 6:01 PM, Jim Klimov jimkli...@cos.ru wrote:
 
 On 2013-04-16 23:56, Jay Heyl wrote:
 
 result in more devices being hit for both read and write. Or am I wrong
 about reads being issued in parallel to all the mirrors in the array?
 
 
 Yes, in normal case (not scrubbing which makes a point of reading
 everything) this assumption is wrong. Writes do hit all devices
 (mirror halves or raid disks), but reads should be in parallel.
 For mechanical HDDs this allows to double average read speeds
 (or triple for 3-way mirrors, etc.) because different spindles
 begin using their heads in shorter strokes around different areas,
 if there are enough concurrent randomly placed reads.
 
 
 There is another part to his question, specifically whether a single random
 read that falls within one block of the file hits more than one top level
 vdev -

No.

 to put it another way, whether a single block of a file is striped
 across top level vdevs.  I believe every block is allocated from one and
 only one vdev (blocks with ditto copies allocate multiple blocks, ideally
 from different vdevs, but this is not the same thing), such that every read
 that hits only one file block goes to only one top level vdev unless
 something goes wrong badly enough to need a ditto copy.

Correct.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422
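
One way to see the block-to-vdev mapping directly on disk is with zdb - a
diagnostic sketch only, with a hypothetical pool/dataset name and file path,
and output details that vary between illumos builds:

  # a file's ZFS object number equals its inode number
  ls -i /tank/data/bigfile

  # dump that object's block pointers; each DVA prints as vdev:offset:asize,
  # and a block without ditto copies references a single top-level vdev id
  zdb -dddddd tank/data <object-number>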



___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Edward Ned Harvey (openindiana)
 From: Mehmet Erol Sanliturk [mailto:m.e.sanlit...@gmail.com]
 
 SSD units are very vulnerable to power cuts during operation, up to complete
 failure (the unit can no longer be used at all) or complete loss of data.

If there are any drives out there that fail so dramatically, they are junk and 
the exception.  Just imagine how foolish the engineers would have to be: "Power 
loss?  I didn't think of that...  Complete drive failure on power loss is 
acceptable behavior."  Definitely an inaccurate generalization about SSDs.  
There is nothing inherent about flash memory, as compared to magnetic material, 
that would cause such a thing.

I repeat:  I'm not saying there's no such thing as an SSD that has such a 
problem.  I'm saying if there is, it's junk.  And you can safely assume any 
good drive doesn't have that problem.


 MLC (Multi-Level Cell) SSD units have a short lifetime if they are
 continuously written (they are more suitable for write once - in a limited
 number of writes sense - read many).

It's a fact that NAND has a finite number of write cycles, and it gets slower 
to write the more times it has been rewritten.  It is also a fact that when 
SSDs were first introduced to the commodity market about 11 years ago, they 
failed quickly because OSes (Windows) kept writing the same sectors over and 
over.  But manufacturers have long since been aware of this problem, and solved 
it with overprovisioning and wear-leveling.

Similar to ZFS copy-on-write, which can keep addressing blocks logically while 
quietly placing the data in different sectors behind the scenes, SSDs with 
wear-leveling silently remap logical sectors to different flash pages as they 
are written.


 SSD units may fail due to write wear at an unexpected time, making them
 very unreliable for mission-critical work.

Every page has a write counter, which is used to predict failure.  A very 
predictable, and very much *not* unexpected time.
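
For what it's worth, most SSDs expose that wear accounting through SMART.  A
hedged sketch (smartmontools has to be installed, the device path is a
placeholder, and the exact attribute name - Media_Wearout_Indicator,
Wear_Leveling_Count, and so on - varies by vendor):

  # dump all SMART attributes and look for the vendor's wear/endurance entry
  smartctl -a /dev/rdsk/c5t1d0

  # some controllers need the device type spelled out, e.g. SATA via SAT:
  smartctl -a -d sat,12 /dev/rdsk/c5t1d0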


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-17 02:10, Jay Heyl wrote:

Not to get into bickering about semantics, but I asked, "Or am I wrong
about reads being issued in parallel to all the mirrors in the array?", to
which you replied, "Yes, in normal case... this assumption is wrong... but
reads should be in parallel." (Ellipses intended for clarity, not argument
munging.) If reads are in parallel, then it seems as though my assumption
is correct. I realize the system will discard data from all but the first
read and that using only the first response can improve performance, but
in terms of the number of IOPS, which is where I intended to go with this, it
seems to me the mirrored system will have at least as many as, if not more
than, the raid-zn system.

Or have I completely misunderstood what you intended to say?


Um, right... I got torn between several emails and forgot the details
of one. So, here's what I replied to with poor wording - *I thought you
meant* "a single read request from a program would be redirected as a
series of parallel requests to mirror components asking for the same
data, whichever one answers first" - that is what the "no ... wrong" in
my reply referred to. Unless the first device to answer returns garbage
(something that doesn't match the expected checksum), other copies are
not read as part of this request.

Now, if there are many requests issued on the system simultaneously,
which is most often the case, then reads from different requests are
directed to different disks, but again - one read goes to one disk,
except in pathological cases. It is likely that the system selects the
disk to read from based, in part, on its expectation of where that
disk's head is (i.e. the disk whose last requested LBA is nearest to
the LBA we want now) in order to minimize latency and unproductive
time losses. Thus sequential reads, where requests for nearby sectors
come in succession, are likely to be satisfied by a single disk in the
mirror, leaving the other disks available to satisfy other reads.

Copies of a write request, however, are sent to all disks and committed
(flushed) before a synchronous request is acknowledged as completed
(for example, the write-and-commit of a TXG, a transaction group).

Hope this makes my point clearer, it is late here ;)
//Jim


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jesus Cea

On 17/04/13 02:10, Jay Heyl wrote:
 Not to get into bickering about semantics, but I asked, "Or am I
 wrong about reads being issued in parallel to all the mirrors in
 the array?"

Each read is issued to only one (let's say, random) disk in the mirror,
unless the read is faulty (fails its checksum).

You can check this easily with zpool iostat -v.
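
For example (pool name and interval are placeholders; the interesting part is
which disks accumulate read operations):

  # with a single sequential reader, one half of a mirror tends to take most
  # of the reads; with many concurrent readers, all halves do
  zpool iostat -v tank 5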

--
Jesús Cea Avión _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
Twitter: @jcea_/_/_/_/  _/_/_/_/_/
jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
Things are not so easy  _/_/  _/_/_/_/  _/_/_/_/  _/_/
My name is Dump, Core Dump   _/_/_/_/_/_/  _/_/  _/_/
Love is putting your happiness in the happiness of another - Leibniz

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] building a new box soon- HDD concerns and recommendations for virtual serving

2013-04-16 Thread Carl Brewer
Further to my original post, I have a new (desktop, I know ... but I am 
on a tight budget) Intel MB with an i5-3750 CPU and 32 GB of desktop 
RAM.  Booting the 151a7 live DVD shows that it thinks it's a 32-bit 
system (huh?). It recognises almost all the devices when I run the 
device manager, but doesn't see any HDDs; they're just not there at 
all.  At boot time it says they're too big for a 32-bit kernel.  Why is 
it booting a 32-bit kernel anyway?  Maybe some BIOS thing I need to set? 
I booted the first (default) image that the live CD displays.
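
A quick sanity check (a sketch assuming the standard illumos isainfo tool is
available in the live environment) is to ask which kernel actually booted:

  # prints the kernel's instruction set; expect "64-bit amd64 kernel modules"
  # on a 64-bit boot, "32-bit i386 kernel modules" on a 32-bit one
  isainfo -kv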


The MB is a DH77EB which is an H77 Express chipset.

As well as thinking it's 32-bit, it also doesn't see any drives at all. 
I have a single new Seagate 2TB (4k blocks?) ST2000DM001 in it that the 
live CD doesn't see at all.  I know there was a recent thread on 
installing to these drives that seemed inconclusive as to how best to 
install to them.  I want to have:


2 x 2TB HDDs for rpool (ZFS mirror)
4 x 2TB HDDs to get at least a 4TB mirror (or is RAID-Z a better option?)

Would I be better off with some 500GB HDDs for the rpool?  And while I 
fiddle with this thing, is there any way to get the live CD installer to 
work with these drives without poking around with some other OS to 
partition the drive?
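
For reference, a sketch of the 4-disk data pool as two mirrored pairs (device
names are placeholders, and the installer creates rpool itself):

  # two 2-way mirrors striped together: roughly 4TB usable from 4 x 2TB disks
  zpool create tank mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0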


help!

Thank you

Carl




___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss