[Gluster-users] data going missing with 3.3.x

2013-10-07 Thread Joe Landman

Hi folks

  We ran into this issue a few years ago, filed bugs, and were told 
that it had been fixed.


  Unfortunately, we still have customers using 3.3.x (x=1 ...) who are 
running into data loss, in the sense that data gets written and 
then disappears from the volumes.  We developed some tools a while ago 
to help with these issues 
(http://download.scalableinformatics.com/gluster/utils/data_mover.pl), 
and it looks like these problems are still biting a few of our users.


  Does anyone else run into this?  I am just trying to get a sense of 
how widespread it is, and to figure out exactly what sort of debugging 
we want to do.  Updating to a newer GlusterFS may not be an option for 
these users.


  Thanks.

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/siflash
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster Install and Hardware expert.

2013-09-26 Thread Joe Landman

On 09/26/2013 12:38 PM, William John van Jaarsveldt wrote:

Hi,

I am looking for a Gluster install and hardware expert, somebody I can
contract to advise and help with a gluster installation.

The system we are planning will run in production; I don't have the
time to actually play and test to the full extent I need.

It will be a Debian Wheezy based server system.

Anybody know of a company I can contract?


If you were looking for complete solutions, I know of someone we could 
recommend (/grin).  If you simply want a local consultant in the same 
time zone to handle issues, this may be a good place to ask.  I suspect 
the latter is the case.








--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/siflash
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] ACL issue with 3.4.0 GA + XFS + native client?

2013-07-23 Thread Joe Landman

On 07/23/2013 03:55 PM, Nicholas Majeran wrote:

FWIW, I tried to disable the parameter on a stopped volume, which was
successful.  I then started the volume and I could get/set the ACLs
normally.

I'm going to try the same procedure on the gv0 volume that threw the
error previously.
Thanks.


The inconsistent version possibility worries me.  What do glusterfs -V 
and glusterfsd -V report?


[root@crunch glusterfs-3.4.0]# glusterfs -V
glusterfs 3.4.0 built on Jul 23 2013 16:02:16
...

[root@crunch glusterfs-3.4.0]# glusterfsd -V
glusterfs 3.4.0 built on Jul 23 2013 16:02:16
...

for each machine that is a brick (glusterfsd) and each client (glusterfs)?

We've had problems in the past with incomplete updates between major 
version revisions, to the point that we had to completely scrub any old 
libraries and config files off some units during updates (3.2.x and 
3.3.x time periods).
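
For reference, a minimal way to check for that kind of version skew 
across nodes (the host names and ssh access below are placeholders for 
the actual cluster):

# compare glusterfs/glusterfsd builds on every brick server and client
for h in server1 server2 client1 ; do
    echo "== $h =="
    ssh $h 'glusterfs -V | head -1 ; glusterfsd -V | head -1'
done
# any host whose lines differ from the rest points at an incomplete update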




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/siflash
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Joe Landman

On 07/10/2013 03:18 PM, Joe Julian wrote:


The "small file" complaint is all about latency though. There's very
little disk overhead (all inode lookups) to doing a self-heal check. "ls
-l" on a 50k file directory and nearly all the delay is from network RTT
for self-heal checks (check that with wireshark).



Try it with localhost.  Build a small test gluster brick, take 
networking out of the loop, create 50k files, and launch the self heal. 
 RTT is part of it, but not the majority (last I checked it wasn't a 
significant fraction relative to other metadata bits).


I did an experiment with 3.3.x a while ago with 2x ramdisks: I created a 
set of files, looped them back with losetup, built xfs file systems atop 
them, mirrored them with glusterfs, and then set about doing 
metadata/small-file heavy workloads.  Performance was still abysmal. 
Pretty sure none of that was RTT.  Definitely a stack traversal problem, 
but I didn't trace it far enough back to be definitively sure where it was.
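
A rough sketch of that kind of ramdisk experiment (sizes, paths, and the 
volume name below are illustrative, not the exact setup used):

# two loopback-backed xfs bricks living in tmpfs
mkdir -p /mnt/ram1 /mnt/ram2 /bricks/b1 /bricks/b2 /mnt/ramtest
mount -t tmpfs -o size=4g tmpfs /mnt/ram1
mount -t tmpfs -o size=4g tmpfs /mnt/ram2
dd if=/dev/zero of=/mnt/ram1/img bs=1M count=3800
dd if=/dev/zero of=/mnt/ram2/img bs=1M count=3800
losetup /dev/loop1 /mnt/ram1/img
losetup /dev/loop2 /mnt/ram2/img
mkfs.xfs /dev/loop1 ; mkfs.xfs /dev/loop2
mount /dev/loop1 /bricks/b1 ; mount /dev/loop2 /bricks/b2
# mirror the two bricks with gluster on a single host, then mount locally
gluster volume create ramtest replica 2 $(hostname):/bricks/b1/brick \
    $(hostname):/bricks/b2/brick
gluster volume start ramtest
mount -t glusterfs $(hostname):/ramtest /mnt/ramtest
# now drive a small-file / metadata heavy workload against /mnt/ramtest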



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/siflash
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Joe Landman

On 07/10/2013 02:36 PM, Joe Julian wrote:


1) http://www.solarflare.com makes sub microsecond latency adapters that
can utilize a userspace driver pinned to the cpu doing the request
eliminating a context switch


We've used open-onload in the past on Solarflare hardware.  And with 
GlusterFS.


Just say no.  Seriously.  You don't want to go there.


2) http://www.aristanetworks.com/en/products/7100t is a 2.5 microsecond
switch


Neither choice will impact overall performance much for GlusterFS, even 
in heavily loaded situations.


What impacts performance more than anything else is node/brick design, 
implementation, and specific choices in that mix.  Storage latency, 
bandwidth, and overall design will be more impactful than low latency 
networking.  Distribution, kernel, and filesystem choices (including 
layout, lower level features, etc.) will matter significantly more than 
low latency networking.  You can completely remove the networking impact 
by trying your changes out on localhost, and seeing what impact your 
design changes have.


If you don't start out with a fast box, you are not going to have fast 
aggregated storage.  This observation has not changed since the pre-2.0 
GlusterFS days (it's as true today as it was years ago).


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/siflash
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Meta

2013-01-22 Thread Joe Landman

On 01/22/2013 09:28 AM, F. Ozbek wrote:


However, it just turns out that we have the data and the tests, so we will
post it here. I have this feeling that the moment we do, Jeff will start


Please provide more information on the "data and the tests".  What are 
they, what do they entail, what is meant by failing, passing, etc.?


This information is helpful to everyone, regardless of which systems do 
poorly/well.


OTOH, please be prepared for a fairly intensive look at your testing 
methodology.  We've found in our own experience that unless the tests 
really do what they are purported to do, end users wind up 
generating less than valuable data, and subsequently, decisions based 
upon it are, as often as not, fundamentally flawed.


I cannot tell you how many times we've dealt with flawed tests that 
didn't come close to measuring what people thought they did.  It's quite 
amusing to be attacked with the results of these tests as well.  Using poor 
tests and then bashing vendors with them is more of a reflection on the 
user than on the vendor.


Honestly, we have some issues with Gluster that we've raised off list 
with John Mark and others (not Jeff, but I should make the points with 
him as well).  There are reasonable and valid critiques of it, and it is 
not appropriate for all workloads.   There are good elements to it, and 
... less good ... elements to it, in implementation, design, etc.


I agree with Jeff that it's bad form to come on the list and say "Gluster 
fails, X works" in general.  It's far more constructive to come on the 
list and say "these are the tests we use, and these are the results.  
Gluster does well here and here, X does well here and here."  Freedom of 
speech isn't relevant here; the mailing list and product are privately 
owned, and there is no presumption of such freedom in this case.  I'd 
urge you to respect the other list members and participants by 
positively contributing as noted above.  The "gluster fails, X rulez" 
approach doesn't quite fit this.


So ... may I request that, before you respond to further posts on this 
topic, you create a post with your tests, how you ran them, your 
hardware configs, your software stack elements (kernel, net/IB, ...), 
details of the tests, and details of the results?  Without this, I am hard 
pressed to take further posts seriously.


There are alternatives to Gluster.  The ones we use/deploy include Ceph, 
Fraunhofer, Lustre, and others.  We did review MooseFS, mostly for a set 
of media customers.  It had some positive elements, but we found that 
performance was underwhelming for our streaming and reliability tests 
(c.f. http://download.scalableinformatics.com/disk_stress_tests/fio/ ).  
Our hardware was our JackRabbit units and our siFlash units (links not 
provided so as to avoid spamming).  Native system performance was 
2.5GB/s for the JackRabbit, about 8GB/s for the siFlash.  GlusterFS got 
me to 2GB/s on JackRabbit, and 3.5GB/s on siFlash.  MooseFS, when we 
tested (about a year ago), was about 400-500 MB/s on JackRabbit, and 
about 600 MB/s on siFlash.  We tried some networked tests to multiple 
clients (and John Mark has an email from me around that time) where we 
were sustaining 2+ GB/s across 2x JackRabbit units with GlusterFS.  I've 
never been able to get above 700 MB/s with MooseFS on any of our test 
cases.  I've had tests fail on MooseFS, usually when a network port 
becomes overloaded; its response to this was anything but graceful.
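
For context, a streaming test along these lines can be expressed as a 
single fio command (the parameters below are illustrative, not the exact 
job behind the linked results):

# large sequential writes from several workers against a mounted volume
fio --name=stream-write --directory=/mnt/glusterfs --rw=write \
    --bs=1m --size=16g --numjobs=4 --ioengine=libaio --direct=1 \
    --iodepth=16 --group_reporting
# swap --rw=write for --rw=read to measure the streaming read side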


We had considered using it with some customers, but figured we should 
wait for it to mature some more.  We feel the same way about btrfs and, 
until recently, about Ceph.  The latter two have been coming along 
nicely.  Ceph is deployable.


W.r.t. Gluster, it has been getting better, with a few caveats (again, 
John Mark knows what I am talking about).  It's not perfect for 
everything, but it's quite good at what it does.


Regards,

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Very slow directory listing and high CPU usage on replicated volume

2012-11-06 Thread Joe Landman

On 11/06/2012 04:35 AM, Fernando Frediani (Qube) wrote:

Joe,

I don't think we have to accept this, as this is not an acceptable thing.


I understand your unhappiness with it.  But it's "free", and you sometimes 
have to accept what you get for "free".



I have seen countless people complaining about this problem for a
while and it seems no improvements have been done. The thing about the
ramdisk, although it might help, looks more like chewing gum. I have seen
other distributed filesystems that don't suffer from the same problem,
so why does Gluster have to?


This goes to some aspect of the implementation.  FUSE makes metadata ops 
(and other very small IOs) problematic (as in time consuming).  There 
are no easy fixes for this, without engineering a new kernel subsystem 
(unlikely) to incorporate Gluster, or redesigning FUSE so this is not an 
issue.  I am not sure either is likely.


Red Hat may be willing to talk to you about these if you give them money 
for subscriptions.  They eventually relented on xfs.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Very slow directory listing and high CPU usage on replicated volume

2012-11-05 Thread Joe Landman

On 11/05/2012 09:57 AM, harry mangalam wrote:

Jeff Darcy wrote a nice piece in his hekafs blog about 'the importance of
keeping things sequential' which is essentially about the contention for heads
between data io and journal io.

(also congrats on the Linux Journal article on the glupy python/gluster
approach).

We've been experimenting with SSDs on ZFS (using the SSDs for the ZIL
(journal)) and while it's provided a little bit of a boost, it has not been
dramatic.  Ditto XFS.  However, we did not stress it at all with heavy loads


An issue you have to worry about is whether the SSD streaming read/write 
path is around the same speed as the spinning rust performance.  If so, 
this design would be a wash at best.


Also, if this is under Linux, the ZFS pathways may not be terribly well 
optimized.



in a gluster env and I'm now thinking that there is where you would see the
improvement. (see Jeff's graph about how the diff in threads/load affects
IOPS).

Is anyone running a gluster system with the underlying XFS writing the journal
to SSDs?  If so, any improvement?  I would have expected to hear about this as
a recommended architecture for gluster if it had performed MUCH better, but


Yes, we've done this, and do this on occasion.  No, there's no dramatic 
speed boost for most use cases.
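
For reference, the usual way to put the XFS journal on an SSD is an 
external log device set at mkfs/mount time; a sketch (device names and 
sizes are placeholders, and RAID stripe options are omitted here):

# /dev/sdb1 is a small SSD partition for the log, /dev/md0 the data array
mkfs.xfs -l logdev=/dev/sdb1,size=128m /dev/md0
mount -o logdev=/dev/sdb1,noatime,inode64,logbsize=256k /dev/md0 /data/brick1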


Unfortunately, heavy metadata ops on GlusterFS are going to be slow, and 
we simply have to accept that for the near term.  This appears to be 
independent of the particular file system, or even storage technology. 
If you aren't doing metadata heavy ops, then you should be in good 
shape.  It appears that mirroring magnifies the cost of metadata heavy 
ops significantly.


For laughs, about a year ago, we set up large ram disks (tmpfs) in a 
cluster, put a loopback device on them, then a file system, then 
GlusterFS atop this.  Should have been very fast for metadata ops.  But 
it wasn't.  Gave some improvement, but not significant enough that we'd 
recommend doing "heroic" designs like this.


If your workloads are metadata heavy, we'd recommend local IO, and if 
you are mostly small IO, an SSD.





--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/siflash
phone: +1 734 786 8423 x121
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] GlusterFS performance

2012-09-26 Thread Joe Landman

On 09/26/2012 05:28 PM, John Mark Walker wrote:



On Wed, Sep 26, 2012 at 2:23 PM, Berend de Boer <ber...@pobox.com> wrote:

 > "John" == John Mark Walker <johnm...@johnmark.org> writes:

 John> This is different from any other benchmark I've seen. I
 John> haven't seen that much of a disparity before.

What benchmarks? Steve's experience is very similar to what everyone
sees when trying out gluster.


Have you seen write:read ratios > 5:1? I certainly haven't. I have seen
discrepancies, sure, but not by that much.


I've seen stuff like this.  Looks like a caching issue (gluster client) 
among other things.


Read performance with the gluster client isn't that good; write 
performance (effectively write caching at the brick layer) is pretty good.


I know it's a generalization, but this is basically what we see.  In the 
best case scenario, we can tune it pretty hard to get within 50% of 
native speed.  But it takes lots of work to get it to that point, as 
well as an application which streams large IO.  Small IO is (still) 
bad on the system IMO.


I've not explored the 3.3.x caching behavior (largely turned it off in 
3.2.x and previous due to bugs which impacted behavior).
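
For reference, the client-side caching translators being referred to can 
be toggled per volume; a sketch (the volume name is a placeholder, and 
the option names should be checked against the release in use):

gluster volume info myvol                            # see current settings
gluster volume set myvol performance.io-cache off
gluster volume set myvol performance.read-ahead off
gluster volume set myvol performance.quick-read off
gluster volume set myvol performance.stat-prefetch off
# write-behind is what gives the good write numbers; leave it on unless
# specifically chasing a write-caching bug:
# gluster volume set myvol performance.write-behind off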





--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] Ran into a problem replacing a failed unit

2012-09-04 Thread Joe Landman
We have a server that died, and are dropping in a new server to replace 
it.  Same name/IP address (hard requirement, cannot be changed).


Everything is ready, but upon starting gluster, and peer probing the new 
server, I can't do a replace-brick.  It tells me


   dv4-4-10g, is not a friend

Any clues on what to do?  This is 3.2.7 (cannot update to 3.3 for a 
number of reasons).


Worst case, we could tear down gluster and rebuild the gluster file 
system atop it, though this seems rather extreme.  But I'll have to do 
that tonight if nothing else works (time constraints on the part of the 
user).
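
A commonly used recipe for this situation (replacement server keeping the 
same name/IP), assuming 3.2.x keeps its working state under /etc/glusterd: 
the surviving peers still have the dead server's old UUID on file, while 
the freshly installed glusterd generated a new one, which is one reason 
the host can show up as "not a friend".

gluster peer status              # on a surviving node: note the dead server's UUID
service glusterd stop            # on the replacement server
# set UUID=<UUID noted above> in /etc/glusterd/glusterd.info on the replacement
service glusterd start
gluster peer status              # the replacement should now rejoin as a friend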



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] XFS and MD RAID

2012-08-29 Thread Joe Landman

On 08/29/2012 03:48 AM, Brian Candler wrote:

Does anyone have any experience running gluster with XFS and MD RAID as the


Lots


backend, and/or LSI HBAs, especially bad experience?


It's pretty solid as long as your hardware/drivers/kernel revs are solid, 
and this requires updated firmware.  We've found modern LSI HBA and 
RAID gear have had issues with occasional "events" that seem to be more 
firmware bugs or driver bugs than anything else.  The gear is stable for 
very light usage, but when pushed hard (without driver/fw updates), it 
does crash, hard, often with corruption.




In a test setup (Ubuntu 12.04, gluster 3.3.0, 24 x SATA HD on LSI Megaraid
controllers, MD RAID) I can cause XFS corruption just by throwing some
bonnie++ load at the array - locally without gluster.  This happens within
hours.  The same test run over a week doesn't corrupt with ext4.


Which kernel?  I can't say I've ever seen XFS corruption from light use. 
It usually takes some significant failure to cause this: an iffy 
driver, a bad disk, etc.


The ext4 comparison might not be apt.  Ext4 isn't designed for parallel 
IO workloads, while xfs is.  Chances are you are tickling a 
driver/kernel bug with the higher amount of work being done in xfs 
versus ext4.




I've just been bitten by this in production too on a gluster brick I hadn't
converted to ext4.  I have the details I can post separately if you wish,
but the main symptoms were XFS timeout errors and stack traces in dmesg, and
xfs corruption (requiring a reboot and xfs_repair showing lots of errors,
almost certainly some data loss).

However, this leaves me with some unpalatable conclusions and I'm not sure
where to go from here.

(1) XFS is a shonky filesystem, at least in the version supplied in Ubuntu
kernels.  This seems unlikely given its pedigree and the fact that it is
heavily endorsed by Red Hat for their storage appliance.


Uh ... no.  Its pretty much the best/only choice for large storage 
systems out there.  Almost 20 years old at this point, making its first 
appearance in Irix in 1995 time frame or so, moving to Linux a few years 
later.  Its many things, but crappy ain't one of them.




(2) Heavy write load in XFS is tickling a bug lower down in the stack
(either MD RAID or LSI mpt2sas driver/firmware), but heavy write load in
ext4 doesn't.  This would have to be a gross error such as blocks queued for
write being thrown away without being sent to the drive.


xfs is a parallel IO file system, ext4 is not.  There is a very good 
chance you are tickling a bug lower in the stack.  Which LSI HBA or RAID 
are you using?  How have you set this up?  What kernel rev, and what's the


modinfo mpt2sas
lspci
uname -a

output?
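
A few more data points worth collecting alongside those (standard tools; 
the device names and mount point below are examples):

cat /proc/mdstat                  # MD array state and any rebuild activity
mdadm --detail /dev/md0           # per-array detail
xfs_info /mount/point             # fs geometry (su/sw vs the MD chunk size)
dmesg | grep -iE 'mpt2sas|xfs'    # driver resets, task aborts, xfs shutdowns
smartctl -a /dev/sdX              # per-disk health (needs smartmontools)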



I guess this is plausible - perhaps the usage pattern of write barriers is
different for example.  However I don't want to point the finger there
without direct evidence either.  There are no block I/O error events logged
in dmesg.


It's very different.  XFS is pretty good about not corrupting things; the 
file system shuts down if it detects that it is corrupt.  So if the 
in-memory image of the current state at the moment of sync is not matched 
by what's on the platters/SSD chips, then chances are you have a problem 
in that pathway.




The only way I can think of pinning this down is to find out what's the
smallest MD RAID array I can reproduce the problem with, then try to build a
new system with a different controller card (as MD RAID + JBOD, and/or as a
hardware RAID array)


This would be a good start.



However while I try to see what I can do for that, I would be grateful for
any other experience people have in this area.


We've had lots of problems with LSI drivers/FW before rev 11.x.y.z .

FWIW:  We have siCluster storage customers with exactly these types of 
designs, with uptimes measurable in hundreds of days, using Gluster atop 
XFS atop MD RAID on our units.  We also have customers who tickle 
obscure and hard to reproduce bugs, causing crashes.  It's not frequent, 
but it does happen.  Not with the file system, but usually with the 
network drivers or overloaded NFS servers.




Many thanks,

Brian.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] gluster server overload; recovers, now "Transport endpoint is not connected" for some files

2012-08-01 Thread Joe Landman

On 08/01/2012 09:20 PM, Harry Mangalam wrote:

[...]


which implies that some of the errors are not fixable.

Is there a best-practices solution  for this problem?  I suspect this
is one of the most common problems to affect an operating gluster fs.


This is one of the issues that caused us to develop some tools to find 
missing files and add them back into the file system.  We've had to do 
this with the 3.2.x series.  We haven't had many 3.3 deployments; I had 
hoped that this would have been fixed.


Search back in the archives for some of our tools.  Last year, June/July 
time period.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] RDMA "not fully supported" by GlusterFS 3.3.0 ?!

2012-07-16 Thread Joe Landman

On 07/16/2012 03:39 PM, Anand Avati wrote:


It only means we had to push out RDMA support to 3.3.1 (or 3.3.2) for
internal resource scheduling reasons.


Ok, thanks.



Avati




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] RDMA "not fully supported" by GlusterFS 3.3.0 ?!

2012-07-16 Thread Joe Landman

On 07/16/2012 12:16 PM, Philippe Muller wrote:


Here is what I found: - On page 123 of the "GlusterFS Administration
Guide 3.3.0", a small note saying: "NOTE: with 3.3.0 release,
transport type 'rdma' and 'tcp,rdma' are not fully supported."


I don't see this indicated in the 3.2.x series, though arguably, it 
didn't work well (tcp,rdma or even pure rdma).  Last time it worked well 
for us was the 3.0.x series.


I definitely see it now in the 3.3.0 docs.

Oh well.

Should we assume this is a feature deprecation and RDMA support will be 
removed going forward?  Need to know soon for planning purposes ...


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] RAID options for Gluster

2012-06-14 Thread Joe Landman

On 06/14/2012 07:06 AM, Fernando Frediani (Qube) wrote:

I think this discussion probably came up here already, but I couldn’t
find much in the archives. Would you be able to comment on or correct
whatever might look wrong?

What options do people think are more adequate to use with Gluster in terms
of RAID underneath, with a good balance between cost, usable space and
performance? I have thought about two main options with their Pros and Cons.

*No RAID (individual hot swappable disks):*

Each disk is a brick individually (server:/disk1, server:/disk2, etc) so
no RAID controller is required. As the data is replicated, if one fails
the data must exist on another disk on another node.


For this to work well, you need the ability to mark a disk as failed and 
as ready for removal, or to migrate all data on a disk over to a new 
disk.  Gluster only has the last capability, and doesn't have the rest. 
 You still need additional support in the OS and tool sets.


The tools we've developed for DeltaV and siFlash help in this regard, 
though I wouldn't suggest using Gluster in this mode.
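
The one capability Gluster does have here, migrating a brick's data to a 
brick on a new disk, looks roughly like this (volume, host, and paths are 
placeholders):

# move everything off the failing disk's brick onto the replacement disk
gluster volume replace-brick myvol server1:/disk3 server1:/disk3new start
gluster volume replace-brick myvol server1:/disk3 server1:/disk3new status
# once the migration shows complete:
gluster volume replace-brick myvol server1:/disk3 server1:/disk3new commit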





_Pros_:

Cheaper to build as there is no cost for an expensive RAID controller.


If a $500USD RAID adapter saves you $1000USD of time/expense over its 
lifetime due to failed disk alerts, hot swap autoconfiguration, etc., is 
it "really" expensive?  Of course, if you are at a university where you 
have infinite amounts of cheap labor, sure, it's expensive: cheaper to 
manage by throwing grad/undergrad students at it than it is to manage 
with an HBA.


That is, the word "expensive" has different meanings in different 
contexts ... and in storage, the $500USD adapter may easily help reduce 
costs elsewhere in the system (usually in disk lifecycle management, 
as RAID's major purpose in life is to give you, the administrator, a 
fighting chance to replace a failed device before you lose your data).




Improved performance as writes have to be done only on a single disk not
in the entire RAID5/6 Array.


Good for tiny writes.  Bad for larger writes (>64kB)



Make better usage of the Raw space as there is no disk for parity on a
RAID 5/6

_Cons_:

If a failed disk gets replaced the data need to be replicated over the
network (not a big deal if using Infiniband or 1Gbps+ Network)


For a 100 MB/s pipe (streaming disk read, which you don't normally get 
when you copy random files to/from disk), 1 GB = 10 seconds.  1 TB = 
10,000 seconds.  This is the best case scenario.  In reality, you will 
get some fractional portion of that disk read/write speed.  So expect 
10,000 seconds as the most optimistic (and unrealistic) estimate ... a 
lower bound on time.




The biggest file size is the size of one disk if using a volume type
Distributed.


For some users this is not a problem, though several years ago we had 
users wanting to read/write *single* TB-sized files.




In this case does anyone know if when replacing a failed disk does it
need to be manually formatted and mounted ?


In this model, yes.  This is why the RAID adapter saves time unless you 
have written/purchased "expensive" tools to do similar things.




*RAID Controller:*

Using a RAID controller with battery backup can improve the performance,
especially caching the writes in the controller’s memory, but in the end
one single array means the equivalent performance of one disk for each
brick. Also RAID requires having either 1 or 2 disks for parity. If using


For large reads/writes, you typically get N* (N disks reduced by number 
of parity disks and hot spares) disk performance.  For small 
reads/writes you get 1 disk (or less) performance.  Basically optimal 
read/write will be in multiples of the stripe width.  Optimizing stripe 
width and chunk sizes for various applications is something of a black 
art, in that overoptimization for one size/app will negatively impact 
another.
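
As an illustration of the stripe alignment in question (the numbers here 
are an example, not a recommendation): for a 10-disk RAID6 (8 data 
spindles) with a 256 KB chunk, the filesystem can be told the geometry so 
that full-stripe writes line up:

# 8 data spindles * 256k chunk = 2 MB full stripe width
mkfs.xfs -d su=256k,sw=8 /dev/sdb
# optimal large IO is then issued in multiples of 2 MB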




very cheap disks, probably better to use RAID 6; if using better quality
ones, RAID 5 should be fine as, again, the data is replicated to
another RAID 5 on another node.


If you have more than 6TB of data, use RAID6 or RAID10.  RAID5 shouldn't 
be used for TB class storage for units with UCE rates more than 10^-17 
(you would hit a UCE on rebuild for a failed drive, which would take out 
all your data ... not nice).




_Pros_:

Can create larger array as a single brick in order to fit bigger files
for when using Distributed volume type.

Disk rebuild should be quicker (and more automated?)


More generally, management is nearly automatic, modulo physically 
replacing a drive.




_Cons_:

Extra cost of the RAID controller.


It's a cost-benefit analysis, and for lower end storage units, the CBE 
almost always is in favor of a reasonable RAID design.




Performance of the array is equivalent to a single disk + RAID controller
caching features.


No ... see above.



RAID doesn’t scale well beyond ~16 disks


16 disks is the absolute maximum we would ever tie to a single RAID (or 
HBA).  Most RAID pro

Re: [Gluster-users] question on prospective release dates

2012-05-30 Thread Joe Landman

On 05/30/2012 10:32 AM, John Mark Walker wrote:

It would appear that it will be released this week, and probably tomorrow at 
the latest.


Excellent.  Thanks!



-JM


- Original Message -

We are curious about the 3.3 GA date.  Beta1 was released to the world
(according to the src date) 20-July-2011.  As this is within 60 days
(right now) of a 1 year anniversary of the beta, is there an expected GA
date for the 3.3 release?  We have customers who want to pilot test
Gluster, and we are trying to figure out which code to have them start
with.  We've deployed every gluster from 2.0.x onwards ... and I'd like
to avoid some of the issues we ran into with the 3.0->3.1->3.2 chain.
I'd like to suggest 3.3-qaX, but I need to have a rough guess at what
the GA date is for 3.3 so that they are not stuck deploying a qa
release.

Thanks!

Joe




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] question on prospective release dates

2012-05-30 Thread Joe Landman
We are curious about the 3.3 GA date.  Beta1 was released to the world 
(according to the src date) 20-July-2011.  As this is within 60 days 
(right now) of a 1 year anniversary of the beta, is there an expected GA 
date for the 3.3 release?  We have customers who want to pilot test 
Gluster, and we are trying to figure out which code to have them start 
with.  We've deployed every gluster from 2.0.x onwards ... and I'd like 
to avoid some of the issues we ran into with the 3.0->3.1->3.2 chain. 
I'd like to suggest 3.3-qaX, but I need to have a rough guess at what 
the GA date is for 3.3 so that they are not stuck deploying a qa release.


Thanks!

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] IPoIB Volume (3.3b3) started but not online, not mountable

2012-04-18 Thread Joe Landman

On 04/18/2012 06:58 PM, Harry Mangalam wrote:

And one more observation that will probably be obvious in retrospect. If
you enable auth.allow (on 3.3b3), it will do reverse lookups to verify
hostnames so it will be more complicated to share an IPoIB gluster
volume to IPoEth clients.

I had been overriding DNS entries with /etc/hosts entries, but the
auth.allow option will prevent that hack.

If anyone knows how to share an IPoIB volume to ethernet clients in a
more formally correct way, I'd be happy to learn of it.


After dealing with problems in multi-modal networks with slightly 
different naming schemes, I don't recommend using tcp and RDMA together 
(or even IPoIB with eth) for Gluster.  Very long, very painful saga. 
Executive summary:  here be dragons.


Also, IPoIB is very leaky.  So under heavy load, you can find your 
servers starting to run out of memory.  We've seen this with OFED 
through 1.5.3.x and Glusters as late as 3.2.6.


We'd recommend sticking to one fabric for the moment with Gluster.  Use 
real tcp with a 10 or 40 GbE backbone.  Far fewer problems.  Much less 
excitement.


Regards,

Joe


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] 3.3beta question: Can we drain a brick yet for removal?

2012-03-30 Thread Joe Landman
This is an important bit of functionality, one we are running head first 
into.


Here's the scenario:  You have N bricks in your system.  You want to 
reduce this by one (or more), and you have enough storage to shrink this.


Can we do something like a

gluster volume decommission-brick volume_name brick1:/path1 ...

so that it starts moving data off the brick onto other bricks, and upon 
completion, it marks the brick as ready to be removed?


Assume this is a very useful use case.  Telling a user that they have 
mandatory down time (e.g. missing some files) to do this now is not a 
selling point.


Is this, or something very close to this in 3.3beta?
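
For what it's worth, 3.3 does gain a staged remove-brick along these lines 
(sketch below; volume and brick names are placeholders, and the exact 
semantics should be checked against the 3.3 docs):

# start draining data off the brick
gluster volume remove-brick myvol server4:/data/brick start
# watch the rebalance-style migration
gluster volume remove-brick myvol server4:/data/brick status
# when it reports completed, actually drop the brick from the volume
gluster volume remove-brick myvol server4:/data/brick commit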

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] "Structure needs cleaning" error

2012-02-26 Thread Joe Landman

On 02/26/2012 01:35 PM, Patrick Haley wrote:


Hi,

I went on the particular server on which the file in question
resides. I checked the RAID and all the disks show up as OK.
I looked in dmesg and did not see anything on xfs. To confirm
this absence, I did "grep -in xfs dmesg" and that also came
up empty. "grep -in fs dmesg" (no "x") returns


type

mount

and see what the file system is mounted as.

The "structure needs cleaning" error in xfs means an xfs_repair is needed. 
Not sure if this message is used by other file systems.
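
The usual repair sequence, assuming the brick really is xfs (the device 
and mount point below are placeholders):

umount /data/brick1          # stop gluster/whatever has the brick busy first
xfs_repair -n /dev/sdb1      # dry run: report what would be fixed
xfs_repair /dev/sdb1         # actual repair; -L only if the log is damaged
mount /dev/sdb1 /data/brick1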



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] "Structure needs cleaning" error

2012-02-26 Thread Joe Landman

On 02/26/2012 12:54 PM, Patrick Haley wrote:


Hi,

We have recently upgraded our gluster to 3.2.5 and have
encountered the following error.  Gluster seems somehow
confused about one of the files it should be serving up,
specifically
/projects/philex/PE/2010/Oct18/arch07/BalbacFull_250_200_03Mar_3.png

If I go to that directory and simply do an ls *.png I get

ls: BalbacFull_250_200_03Mar_3.png: Structure needs cleaning


This is usually what happens when you have an underlying xfs file system 
as your backing store, and have had a failure which has shut the xfs 
file system down.


Have a look in dmesg for the servers, and see if xfs has shut down.

Reasons for xfs shutdown include a) bugs in xfs/kernel, b) storage 
device/RAID failure.


Our experience is that "b" happens far more often than "a", though "a" 
does happen (especially on some kernels).


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] default cluster.stripe-block-size for striped volumes on 3.0.x vs 3.3 beta (128kb), performance change if i reduce to a smaller block size?

2012-02-24 Thread Joe Landman

On 02/24/2012 01:50 AM, Sabuj Pattanayek wrote:

This seems to be a bug in XFS as Joe pointed out :

http://oss.sgi.com/archives/xfs/2011-06/msg00233.html


This was in a different context though. Are your files sparse by default?


http://stackoverflow.com/questions/6940516/create-sparse-file-with-alternate-data-and-hole-on-ext3-and-xfs

It seems to be there in XFS available natively in RHEL6 and RHEL5


Yes.



On Thu, Feb 23, 2012 at 5:12 PM, Sabuj Pattanayek  wrote:

Hi,

I've been migrating data from an old striped 3.0.x gluster install to
a 3.3 beta install. I copied all the data to a regular XFS partition
(4K blocksize) from the old gluster striped volume and it totaled
9.2TB. With the old setup I used the following option in a "volume
stripe" block in the configuration file in a client :

volume stripe
  type cluster/stripe
  option block-size 2MB
  subvolumes 
end-volume

IIRC, the data was using up about the same space on the old striped
volume (9.2T) . While copying the data back to the new v3.3 striped
gluster volume on the same 5 servers/same brick filesystems (XFS w/4K
blocksize), I noticed that the amount stored on disk increased by 5x


512B blocks are 1/8 the size of the 4096B blocks, so a scheme where 512B 
blocks are naively replaced by 4096B blocks should net an 8x space 
change if that is the issue.


.


Currently if I do a du -sh on the gluster fuse mount of the new
striped volume I get 4.3TB (I haven't finished copying all 9.2TB of
data over, stopped it prematurely because it's going to use up all the
physical disk it seems if I let it keep going). However, if I do a du
-sh at the filesystem / brick level on each of the 5 directories on
the 5 servers that store the striped data, it shows that each one is
storing 4.1TB. So basically, 4.3TB of data from a 4K block size FS
took up 20.5TB of storage on a 128KB block size striped gluster


So you have 5 servers, each storing a portion of a stripe.  You get a 5x 
change in allocation?  This sounds less like an xfs issue and more like 
a gluster allocation issue.  I've not looked lately at the stripe code, 
but it may allocate the same space on each node, using the access 
pattern for performance.



volume. What is the correlation between the " option block-size"
setting on client configs in cluster/stripe blocks in 3.0.x vs the
cluster.stripe-block-size parameter in 3.3? If these settings are
talking about what I think they mean, then basically a file that is 1M
in size would be written out to the stripe in 128KB chunks across N
servers, i.e. 128/N KB of data per brick? What happens when the stripe
block size isn't evenly divisible by N (e.g. 128/5 = 25.6). If the old
block-size and new stripe-block-size options are describing the same
thing, then wouldn't a 2MB block size from the old config cause more
storage to be used up vs a 128KB block size?

Thanks,
Sabuj

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] >1PB

2012-02-14 Thread Joe Landman

On 02/06/2012 01:41 PM, Andrew Holway wrote:

Hi,

Can anyone tell me about their experiences with 1PB and larger gluster
installations. We are thinking of doing 1PB with 10x 100TB storage
servers and would appreciate some reassurance :). We plan to scale
to above and beyond.


Greetings Andrew:

  We've got a bit of experience in scaling to (many large) bricks for 
gluster.  The big issues are reliable/working network connections, 
RAIDs, etc.  Currently you should look carefully at some sort of 
replication mechanism (if you are storing 1PB of data, recovery time is 
likely important).


  100TB bricks aren't a problem (as long as they are built sanely), 
though I'd probably suggest subdividing smaller so you can take better 
advantage of finer grain parallelism.  The issue on design of the bricks 
goes to the bandwidth and the height of the storage bandwidth wall (time 
required to read/write the entire 100TB).  For most designs we see 
(cascading arrays with SAS/FC), this is going to be problematic at best. 
But that's another story (and we are biased because of what we do).





--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] RFC: need better monitoring/failure signaling

2012-02-13 Thread Joe Landman

Hi folks


  Just had a failure this morning which didn't leave much in the way of 
useful logs ... a gluster process started running up CPU and ignoring 
input.  No log details, and a simple kill and restart fixed it.


  A few days ago, some compute node clients connected via Infiniband 
could see 5 of 6 bricks, though all the rest of the systems could see 
all 6.  Restarting the mount (umount -l /path ; sleep 30 ; mount /path) 
'fixed' it.


  The problem was that no one knew that there was a problem; the logs 
were (nearly) useless for problem determination.  We had to look at the 
overall system.


  What I'd like to request comments and thoughts on is whether or not 
we can extract an external signal of some sort upon detection of an 
issue.  So in the event of a problem, an external program is run with an 
error number, some text, etc., sort of like what mdadm does for MD RAID 
units.  Alternatively, a nice simple monitoring port of some sort, which 
we can open, and read until EOF, which reports current (error) state, 
would be tremendously helpful.


  What we are looking for is basically a way to monitor the system. 
Not performance monitoring, but health monitoring.


  Yes, we can work on a hacked up version of this ... I've done 
something like this in the past.  What we want is to figure out how to 
expose enough of what we need to create a reasonable "health" monitor 
for bricks.


  I know there is a nagios plugin of some sort, and other similar 
tools.  What I am looking for is to get a discussion going on what the 
capability for this should be minimally composed of.  Given the layered 
nature of gluster, it might be harder to pass errors up and down through 
translator layers.  But if we could connect something to the logging 
system to specifically signal important events, to some place other than 
the log, and do so in real time (again, the mdadm model is perfect), 
then we are in good shape.  I don't know if this is showing up in 3.3, 
though this type of monitoring capability seems to be an obvious fit 
going forward.


  Unfortunately, this is something of a problem (monitoring gluster 
health), and I don't see easy answers save building a log parser at this 
time.  So something that lets us periodically inquire as to volume/brick 
health/availability would be (very) useful.
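
Pending something built in, a crude log-tail hook in the spirit of mdadm's 
--program option might look like the sketch below (the log path and the 
handler script are placeholders):

#!/bin/bash
# hand gluster error/critical log lines to an external alert handler
LOG=/var/log/glusterfs/etc-glusterfs-glusterd.vol.log
HANDLER=/usr/local/sbin/gluster-alert     # e.g. mail, SNMP trap, nagios check
tail -Fn0 "$LOG" | while read -r line; do
    case "$line" in
        *"] E ["*|*"] C ["*) "$HANDLER" "$line" ;;   # E = error, C = critical
    esac
done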



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Sorry for x-post: RDMA Mounts drop with "transport endpoints not connected"

2012-02-01 Thread Joe Landman

On 02/01/2012 04:49 PM, Brian Smith wrote:

Having serious issues w/ glusterfs 3.2.5 over rdma. Clients are
periodically dropping off with "transport endpoint not connected". Any
help would be appreciated. Environment is HPC. GlusterFS is being used
as a shared /work|/scratch directory. Standard distributed volume
configuration. Nothing fancy.

Pastie log snippet is here: http://pastie.org/3291330

Any help would be appreciated!




What OS, kernel rev, OFED, etc.  What HCAs, switch, etc.

What does ibv_devinfo report for nodes experiencing the transport 
endpoint issue?


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] The Future of GlusterFS Webcast Recording

2012-01-27 Thread Joe Landman

On 01/27/2012 07:02 PM, John Mark Walker wrote:

Greetings,

As you may have heard, we had a webcast yesterday going over changes to the 
GlusterFS project. You can see the slides here:

http://redhatstorage.redhat.com/2012/01/27/the-future-of-glusterfs-slides/



I missed it, but one of my team did go.  Reading the slides now.  Extra 
points for the Monty Python paraphrase ...




There's also a recording of said webcast. This link takes you to an embedded 
player:

https://redhat.webex.com/redhat/lsr.php?AT=pb&SP=EC&rID=5637302&rKey=E39F770CBFA9E679


And this link is for the MP4 download:

https://redhat.webex.com/redhat/lsr.php?AT=pb&SP=EC&rID=5637457&rKey=792D93DF5BBC4551


-JM
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Best practices?

2012-01-24 Thread Joe Landman

On 01/24/2012 01:18 PM, John Mark Walker wrote:



- Original Message -

Aside from this, carving a storage system into 16TB chunks is a terrible
thing to do to a large file system capable unit.  ext4 (if you follow
the previous link) still really doesn't do 16TB+ stably.  Things break.
It will eventually get there, but it's not there now.  xfs has been
doing large file systems for more than a decade.



Well said. And someday, btrfs will be a worthy option as well.


There is this brand spanking new btrfsck !  Though there are the 
occasional "OMG IT ATE MY FILE SYSTEM" posts to the list, which ... 
while helpful to the devs, won't inspire confidence yet ...


... but it will get there.



-JM



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Best practices?

2012-01-24 Thread Joe Landman

On 01/24/2012 01:07 PM, Whit Blauvelt wrote:


So is the preference now that even for workloads _not_ involving huge files,
XFS is better? For non-huge-file systems is Ext4 more likely to break, or
suffer in performance speed?


ext4 has some significant serialization in its journaling and other 
pathways.  It's great for boot drives.  Not so great when you have many 
simultaneous processes hammering on it.


Someone posted results that literally spoke for themselves last year, as 
a response to a statement I made to that effect.  With 8 simultaneous 
readers/writers, ext4 was taking something like 2x the time that xfs was 
(going on memory here, so don't take this as precise) to perform 
the operations.  With 16 readers and writers and up, it gets very pronounced.


xfs is designed for parallel IO workloads.  ext4 isn't, and it shows 
under load.


Aside from this, carving a storage system into 16TB chunks is a terrible 
thing to do to a large file system capable unit.  ext4 (if you follow 
the previous link) still really doesn't do 16TB+ stably.  Things break. 
It will eventually get there, but it's not there now.  xfs has been 
doing large file systems for more than a decade.




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] A "cool" failure mode which gives no real useful information at debug level ... and how to fix

2012-01-18 Thread Joe Landman

On 01/18/2012 01:37 PM, John Mark Walker wrote:

Along those lines, insert key->value pairs into your logs, and then
run something like Splunk or logstash over them. Can be an easy way
to do performance monitoring and analytics.



Yeah.  (sigh) I have too much coding to do.  Let me see if there is a
real easy way to do this over the next few days.  I haven't looked at 
the code for 3.2.5 (not much since 3.2 dropped), so it's possible 
everything we want is in there; it just needs a push in the right direction.




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] A "cool" failure mode which gives no real useful information at debug level ... and how to fix

2012-01-18 Thread Joe Landman

On 01/18/2012 01:29 PM, Daniel Taylor wrote:

Thanks. We saw something very similar with root filesystem damage on one
of our nodes locking access to the clusters it was a member of.

Better logging wouldn't have helped there, since it was clobbering the
glusterd logfile, but it does make me wonder if it isn't possible to get
smarter error messages for host filesystem access issues?


Yeah ...

I might start going through the code and add bunches of

if ((fd = open (path, flags)) < 0) {
        /* log which path failed and why, e.g. strerror(errno) */
}

crap if it's not in there now.  My code (mostly Perl these days, though 
some C and others) tends to have that, as I like our customers to call 
us up and tell us "hey, the code said it can't write to file /x/y/z 
because the permissions are wrong, and we need to change ownership ... 
what does that mean?".  Makes support much easier.







--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] A "cool" failure mode which gives no real useful information at debug level ... and how to fix

2012-01-18 Thread Joe Landman
Ok, so there you are with a gluster file system, that just had a RAID 
issue on the backing store.  You fixed the backing store, rebooted, and 
are now trying to bring gluster daemon up.


And it doesn't come up.

Ok, no problem.  Run it by hand, turn on debugging, turn off 
backgrounding.  Capture all the output.


strace /opt/glusterfs/3.2.5/sbin/glusterd --no-daemon \
--log-level=DEBUG  > out 2>&1 &

Then looking at the out file, near the end we see ...

writev(4, [{"/opt/glusterfs/3.2.5/sbin/gluste"..., 34}, {"(", 1}, 
{"glusterfs_volumes_init", 22}, {"+0x", 3}, {"18b", 3}, {")", 1}, 
{"[0x", 3}, {"4045ab", 6}, {"]\n", 2}], 9) = 75
writev(4, [{"/opt/glusterfs/3.2.5/sbin/gluste"..., 34}, {"(", 1}, 
{"main", 4}, {"+0x", 3}, {"448", 3}, {")", 1}, {"[0x", 3}, {"405658", 
6}, {"]\n", 2}], 9) = 57
writev(4, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_start_main", 
17}, {"+0x", 3}, {"f4", 2}, {")", 1}, {"[0x", 3}, {"3f6961d994", 10}, 
{"]\n", 2}], 9) = 55
writev(4, [{"/opt/glusterfs/3.2.5/sbin/gluste"..., 34}, {"[0x", 3}, 
{"403739", 6}, {"]\n", 2}], 4) = 45

write(4, "-\n", 10) = 10
rt_sigaction(SIGSEGV, {SIG_DFL, [SEGV], SA_RESTORER|SA_RESTART, 
0x3f696302d0}, {0x7fb75cf66de0, [SEGV], SA_RESTORER|SA_RESTART, 
0x3f696302d0}, 8) = 0

tgkill(6353, 6353, SIGSEGV) = 0
rt_sigreturn(0x18d1)= 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV (core dumped) +++


... er ... uh  ok ...

Not so helpful.  All we know is that we have a SEGV.  This usually 
happens when a program starts stamping on memory (or similar things) 
it's really not supposed to touch.


I sanity checked this binary.  Same md5 signature as other working 
binaries on its peers.


Ok ... short of running a debug build and getting a stack trace at the 
SEGV, I opted for the slightly simpler version of things (i.e. one that 
didn't force me to recompile).  I assumed that for some reason, even 
though there was no crash, the rootfs somehow was corrupted.


Ok, this was a WAG.  Bear with me.  Turns out to have been right.

Did a

tune2fs -c 20 /dev/md0
tune2fs -C 21 /dev/md0

and rebooted.  This forces a check (fsck) of the file system attached to 
/dev/md0 ("/" in this case) at the next boot, since the mount count is 
now past the maximum.  Basically I was trying to take all variables off 
the table with this.


Sure enough, on restart, glusterd fired up correctly with no problem. 
From this I surmise that somehow, glusterd was trying to work with an 
area of the fs that was broken.  Not gluster being broken, but the root 
file system that gluster uses for pipes/files/etc. The failure cascaded, 
and glusterd not lighting up was merely a symptom of something else.


Just as an FYI for people working on debugging this.  For future 
reference, I think we may be tweaking the configure process to make sure 
we build a glusterfs with all debugging options on, just in case I need 
to run it in a debugger.  Will see if this materially impacts performance.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] gluster rdma

2012-01-16 Thread Joe Landman

On 01/16/2012 06:29 PM, Derek Yarnell wrote:

Hi,

So I wanted to test a gluster install w/ RDMA only support.  RDMA is
working with a successful running of ib_write_bw test between both
nodes.  After I start the gluster daemons I can no longer run the
ib_write_bw tests and also gluster is showing errors on startup,



1st question:  Did you upgrade from earlier 3.x?

2nd question:  what does your mount command look like?

For 3.2.x (x>2 or so), you need to mount

mount -o intr -t glusterfs server:/volume.rdma /mnt

for a volume.  Note the ".rdma".  Once you have this, make sure it is 
actually mounted: do a df -h and see whether it reports a "transport 
endpoint is not connected" error.




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Got one that stumped me

2011-12-02 Thread Joe Landman

On 12/02/2011 01:34 PM, Jeff Darcy wrote:

On Fri, 02 Dec 2011 13:24:26 -0500
Joe Landman  wrote:


Can't start a volume.

[root@sfsccl03 ~]# gluster volume start brick1
brick: sfsccl03:/data/brick-sdc2/glusterfs/dht, path creation failed,
reason: No such file or directory

But ...


[root@sfsccl03 ~]# ls -alF /data/brick-sdc2/glusterfs
total 0
drwxr-xr-x 4 root root  27 Dec  2 13:00 ./
drwxr-xr-x 4 root root 107 Jul  5 11:55 ../
drwxrwxrwt 7 root root  61 Sep 15 11:35 dht/
drwxr-xr-x 2 root root   6 Dec  2 13:00 dht2/


The mode on /data/brick-sdc2/glusterfs/dht - especially the sticky bit
set - seems odd.  Have you looked at the xattrs on that?


Not yet.  We tend to make all top level glusterfs directories 1777 mode, 
otherwise there are often complications when using the file system.
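
(For reference, dumping them is just a getfattr as root on the brick 
directory, e.g.:

getfattr -d -m . -e hex /data/brick-sdc2/glusterfs/dht

which shows the trusted.* attributes gluster keeps on that path.)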



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Got one that stumped me

2011-12-02 Thread Joe Landman

On 12/02/2011 01:29 PM, Mohit Anchlia wrote:

Is this normal?

Brick3: sfsccl03:/data/brick-sdc2/glusterfs/dht
Brick4: sfsccl03:/data/brick-sdd2/glusterfs/dht


Two different bricks.  One is sdc2 one is sdd2


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] Got one that stumped me

2011-12-02 Thread Joe Landman

Can't start a volume.

[root@sfsccl03 ~]# gluster volume start brick1
brick: sfsccl03:/data/brick-sdc2/glusterfs/dht, path creation failed, 
reason: No such file or directory


But ...


[root@sfsccl03 ~]# ls -alF /data/brick-sdc2/glusterfs
total 0
drwxr-xr-x 4 root root  27 Dec  2 13:00 ./
drwxr-xr-x 4 root root 107 Jul  5 11:55 ../
drwxrwxrwt 7 root root  61 Sep 15 11:35 dht/
drwxr-xr-x 2 root root   6 Dec  2 13:00 dht2/

So it is there.

[root@sfsccl03 ~]# ls -alF /data/brick-sdc2/glusterfs/dht
total 128
drwxrwxrwt    7 root   root        61 Sep 15 11:35 ./
drwxr-xr-x    4 root   root        27 Dec  2 13:00 ../
drwxr-xr-x 1230 root   root     65536 Oct 24 09:14 equity/
drwxr-xr-x 1740 oracle root     65536 Nov 30 23:33 opra/
drwxr-xr-x   35 oracle oinstall   501 Jul  9 17:07 tag/
drwxr-xr-x   11 root   root       126 Jul  1 08:51 taq/
drwxr-xr-x    2 root   root        34 Jul 11 19:44 test/


and it is readable.

More info:

[root@sfsccl03 ~]# gluster volume info brick1

Volume Name: brick1
Type: Distribute
Status: Stopped
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: sfsccl01:/data/glusterfs/dht
Brick2: sfsccl02:/data/glusterfs/dht
Brick3: sfsccl03:/data/brick-sdc2/glusterfs/dht
Brick4: sfsccl03:/data/brick-sdd2/glusterfs/dht

[root@sfsccl03 ~]# gluster peer status
Number of Peers: 2

Hostname: sfsccl02
Uuid: 6e72d1a8-bdeb-4bfb-806c-7fa8b98cb697
State: Peer in Cluster (Connected)

Hostname: sfsccl01
Uuid: 116197cd-5dfe-4881-85ad-5de2be484ba6
State: Peer in Cluster (Connected)

a volume reset doesn't help.

[root@sfsccl03 ~]# gluster volume reset brick1
reset volume successful

[root@sfsccl03 ~]# gluster volume start brick1
brick: sfsccl03:/data/brick-sdc2/glusterfs/dht, path creation failed, 
reason: No such file or directory


New volume creation also fails.

[root@sfsccl03 ~]# gluster volume create brick2 transport tcp 
sfsccl01:/data/glusterfs/dht2 sfsccl03:/data/brick-sdc2/glusterfs/dht2 
sfsccl02:/data/glusterfs/dht2 sfsccl03:/data/brick-sdd2/glusterfs/dht2
brick: sfsccl03:/data/brick-sdc2/glusterfs/dht2, path creation failed, 
reason: No such file or directory


Not good.

Taking out the 03 machine

[root@sfsccl03 ~]# gluster volume create brick2 transport tcp 
sfsccl01:/data/glusterfs/dht2 sfsccl02:/data/glusterfs/dht2
Creation of volume brick2 has been successful. Please start the volume 
to access data.


I am wondering if I should remove the 03 machine from the volume, start 
it up with 01 and 02, and then add the 03 machine in, after forcing the 
volume back up.  Any thoughts?


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] glusterfs over rdma ... not.

2011-11-04 Thread Joe Landman

On 11/04/2011 09:06 PM, Harry Mangalam wrote:

OK - finished some tests over tcp and ironed out a lot of problems. rdma
is next; should be snap now

[I must admit that this is my 1st foray into the land of IB, so some of
the following may be obvious to a non-naive admin..]

except that while I can create and start the volume with rdma as transport:


First things first:


ibv_devinfo

What does it report?



==

root@pbs3:~

622 $ gluster volume info glrdma

Volume Name: glrdma

Type: Distribute

Status: Started

Number of Bricks: 4

Transport-type: tcp,rdma

Bricks:

Brick1: pbs1:/data2

Brick2: pbs2:/data2

Brick3: pbs3:/data2

Brick4: pbs3:/data

==

I can't mount the damn thing. This seems to be a fairly frequent problem
according to google.. Again, all servers and clients are ubuntu
10.04.3/64b, running self-compiled 3.3b1.


try a

showmount -e pbs2

Also, did you enable the auth bits as last time?



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] auth.allow behavior?

2011-11-01 Thread Joe Landman

On 11/01/2011 08:21 PM, Harry Mangalam wrote:


So:

- auth.allow supports multiple addresses IF they're separated with ','
but not with spaces (not documented, more silent failures)

- volumes have to be stopped and restarted to propagate these changes
(another silent .. if not failure, the absence of a warning that
indicates that on auth.allow (and all other option?) changes, you have
to restart the volume to activate it.

This does not seem to be documented in the Gluster_FS_3.2_Admin_Guide
that I have. I guess I should be going thru the wiki on this...


The documentation does need work.  There are a few "old hands" around 
here who can help :)
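
For the record, the sequence that ends up working is roughly:

gluster volume set g6 auth.allow '192.168.*,128.*'   # comma separated, no spaces
gluster volume stop g6
gluster volume start g6

with the stop/start needed to make the change take effect, per your 
findings above.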




Joe gets an ebeer (12pack!) :)



(in best Homer Simpson voice) "mm e-beer!"


Thanks Joe!


No problem.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] auth.allow behavior?

2011-11-01 Thread Joe Landman

On 11/01/2011 06:09 PM, Harry Mangalam wrote:

Hi All,

I have the following volume which I'm trying to mount on some cluster
nodes for yet more testing. The cluster nodes are running CentOS and the
gluster 3.3b1 utilities have been self-compiled from source.

The gluster volume (g6) worked oK when enabled on other Ubuntu-based
client nodes.

The gluster volume is being served from a Ubuntu 10.04.3 server with 6
bricks all running the same gluster 3.3b1 release, self-compiled and
installed.

$ gluster volume info

Volume Name: g6

Type: Distribute

Status: Started

Number of Bricks: 6

Transport-type: tcp

Bricks:

Brick1: pbs1:/data2

Brick2: pbs2:/data2

Brick3: pbs3:/data2

Brick4: pbs3:/data

Brick5: dabrick:/data2

Brick6: hef:/data2

Options Reconfigured:

auth.allow: 128.*

However, when I try to mount that same volume from these new nodes,
mount completes as if it succeeds, but a 'df' from that node hangs on
hitting the glusterfs entry.

The client log starts up OK and then logs failures:


[ ... ]


What else sets the authentication / permission correctly?


gluster volume set g6 auth.allow 192.168.*,128.*

(or similar)



--

Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine

[ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487

MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

--

This signature has been OCCUPIED!



___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Some questions about theoretical gluster failures.

2011-10-25 Thread Joe Landman

On 10/25/2011 10:01 PM, Harry Mangalam wrote:


- what happens in a distributed system if a node goes down? Does the
rest of the system keep working with the files on that brick unavailable
until it comes back or is the filesystem corrupted? In my testing, it
seemed that the system indeed kept working and added files to the
remaining systems, but that files that were hashed to the failed volume
were unavailable (of course).


This is basically it.



- is there a head node? the system is distributed but you're mounting a


Only if you mount via nfs, though technically you can mount it from any 
server.  If you mount via gluster client, just point it at any of the 
servers.  In the nfs case, if the mount server goes away, so does access 
unless you remount.  In the glusterfs case, if the mount server goes 
away, the other servers can continue talking with the client.
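
As a concrete example (volume and server names made up here):

# native client: server1 only hands out the volume file, after which the
# client talks to all of the bricks directly
mount -t glusterfs server1:/vol0 /mnt/vol0

# nfs: all traffic flows through server1, and if server1 dies the mount
# hangs until you remount against another server
mount -t nfs -o vers=3 server1:/vol0 /mnt/vol0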



specific node for the glusterfs mount - if that node goes down, is the
whole filesystem hosed or is that node reference really a group
reference and the gluster filesystem continues with the loss of that
node's files? ie can any gluster node replace a mountpoint node and does
that happen transparently? (I haven't tested this).


You can mount from any node, but the mount target has to be specifically 
unmounted/remounted under nfs (umount -l is your friend).  With the 
GlusterFS client it's less of an issue.


This said, I don't know many people using the nfs client version.  I 
haven't tested 3.2.4's server, but through 3.2.3, we can crash the NFS 
server with a moderate load.



- can you intermix distributed and mirrored volumes? This is of


Not sure what you mean by intermix ... but yes, you can have multiple 
(many) volumes of all different types coming from the same storage 
units under different volume names.



particular interest since some of our users want to have replicated data
and some don't care.

Many thanks

hjm

--

Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine

[ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487

MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

--

This signature has been OCCUPIED!



___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] tons and tons of clients, oh my!

2011-10-20 Thread Joe Landman

On 10/20/2011 06:29 PM, Luis Cerezo wrote:

Hello gluster-verse!

I'm about to see if GlusterFS can handle a large amount of clients, this
was not at all in plans when we initially selected and setup our current
configuration.

What sort of experience do you (the collective "you" as in y'all) have
with a large client to storage brick server ratio? (~1330:1) Where do
you see things going awnry?


We've seen client-to-brick ratios of up to 300:1 in the past.  For large 
files, we saw lots of contention.  Contention drastically reduces 
performance.



Most of this will be reads and locks on small files and dirs. our setup
is 3xstorage node servers in a pure distribute config.


You do not want to even consider small files ... this is not a good use 
case for a cluster file system in general.  The only real way to deal 
with this is to put your backing file system on SSD or Flash, and have a 
very fast network.  Even then performance is not going to be that good.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Need help with optimizing GlusterFS for Apache

2011-10-18 Thread Joe Landman

On 10/18/2011 06:14 AM, Robert Krig wrote:


I think I'm going to have to abandon GlusterFS for our Image files. The
performance is abysmal. I've tried all sorts of settings, but at some
point the http process just keeps spawning more and more processess
because clients are waiting because the directory can't be read, since
glusterfs is busy.
We're not even reaching 500 apache requests per second and already
apache locks up.

I'm pretty sure it can't be the hardware, since we're talking about a 12
Core Hyperthreading Xeon CPU, with 48GB of ram and 30TB of storage in a
hardware Raid.


From our experience, and please don't take this incorrectly, the vast 
majority of storage users (and for that matter, storage companies) don't 
know how to design their RAIDs to their needs.  A "fast" CPU (12 core 
Xeon would be X5650 or higher) won't impact small file read speed all 
that much.  48 GB of ram could, if you can cache enough of your small files.


What you need, for your small random file reads, is an SSD or Flash 
cache.  It has to be large enough that it's relevant for your use case. 
I am not sure what your working set size is for your images, but you can 
buy them from small 300GB units through several 10s of TB.  Small random 
file performance is extremely good, and you can put gluster atop it as a 
file system if you wish to run the images off the cache ... or you can 
use it as a block level cache, which you then need to warm up prior to 
inital use (and then adjust after changes).



I realise that GlusterFS is not ideal for many small files, but this is
beyond ridiculous. It certainly doesn't help that the documentation
doesn't even properly explain how to activate different translators, or
where exactly to edit them by hand in the config files.

If anyone has any suggestions, I'd be happy to hear them.


See above.  As noted, most people (and companies) do anywhere from a bad 
to terrible job on storage system design.  No one should be using a 
large RAID5 or RAID6 for small random file reads.  It's simply the wrong 
design.  I am guessing it's unlikely that you have a RAID10, but even 
with that, you are going to be rate limited by the number of drives you 
have and their roughly 100 IOPS each.


This particular problem isn't likely Gluster's fault.  It is likely your 
storage design.  I'd suggest doing a quick test using fio to ascertain 
how many random read IOPs you can get out of your file system.  If you 
want to handle 500 apache requests per second, how many IOPs does this 
imply (how many files does each request require to fulfill)?  Chances 
are that you exceed the IOP capacity of your storage by several times.
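
Something along these lines (a sketch; adjust the directory and size to 
roughly match your working set):

fio --name=randread --directory=/path/to/backing/fs --rw=randread \
    --bs=4k --size=4g --numjobs=8 --iodepth=16 --ioengine=libaio \
    --direct=1 --group_reporting

The read IOPS number it reports, divided by the number of files each 
page request touches, gives a rough ceiling on requests per second.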


Your best bet is either a caching system, or putting the small randomly 
accessed image files on SSD or Flash, and using that.  Try that before 
you abandon Gluster.




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Red Hat..

2011-10-04 Thread Joe Landman

On 10/04/2011 11:33 AM, Nathan Stratton wrote:

On Tue, 4 Oct 2011, Luis Cerezo wrote:


congrats folks.


Ditto. Just further validates technology some of us have grown to love
over the last few years.


+1




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

2011-09-29 Thread Joe Landman

On 09/29/2011 01:44 PM, David Miller wrote:

On Thu, Sep 29, 2011 at 1:32 PM, David Miller mailto:davi...@gmail.com>> wrote:

Couldn't you  accomplish the same thing with flashcache?
https://github.com/facebook/flashcache/


I should expand on that a little bit.  Flashcache is a kernel module
created by Facebook that uses the device mapper interface in Linux to
provide a ssd cache layer to any block device.

What I think would be interesting is using flashcache with a pcie ssd as
the caching device.  That would add about $500-$600 to the cost of each
brick node but should be able to buffer the active IO from the spinning
media pretty well.


Erp ... low end PCIe flash with decent performance starts much higher 
than $500-600 USD.



Somthing like this.
http://www.amazon.com/OCZ-Technology-Drive-240GB-Express/dp/B0058RECUE
or something from FusionIO if you want something that's aimed more at
the enterprise.


Flashcache is reasonably good, but there are many variables in using it, 
and it's designed for a different use case.  For most people the 
writeback mode may be reasonable, but other use cases would require 
different configs.


This said, please understand that these (flashcache, L2ARC, and other 
similar things) are *not* silver bullets (i.e. not magical things that 
will instantly make something far better, at no cost/effort).  They do 
introduce additional complexity, and additional tuning points.


The thing you cannot get rid of, the network traversal, is implicated in 
much of the performance degradation for small files.  Putting the file 
system on a RAM disk (if possible, tmpfs doesn't support xattrs), 
wouldn't make the system much faster for small files.  Eliminating the 
network traversal and doing local distributed caching of metadata on the 
client side ... could ... but this would be a huge new complication, and 
I'd argue that it probably isn't worth it.


For the near term, small file performance is going to be bad.  You 
might be able to play some games to make this performance better (L2ARC 
etc. could help in some aspects, but they won't be universally much better).


What matters most is very good design on the storage backend (we are 
biased due to what it is we sell/support), very good networking, and 
very good gluster implementation/tuning.  It's really easy to hit very slow 
performance by missing critical elements.  We field many inquiries which 
usually start out with "we built our own and the performance isn't that 
good."  You won't get good performance on the cluster file system if the 
underlying file system and storage design isn't going to give it to you 
in the first place.


This said, please understand that there is a (significant) performance 
cost to all those nice features in ZFS.  And there is a reason why it's 
not generally considered a high performance file system.  So if you 
start building with it, you shouldn't necessarily think that the whole 
is going to be faster than the sum of the parts.  Might be worse.


This is a caution from someone who has tested/shipped many different 
file systems in the past.  ZFS included, on Solaris and other machines. 
 There is a very significant performance penalty one pays for using 
some of these features.  You have to decide if this penalty is worth it.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

2011-09-29 Thread Joe Landman

On 09/29/2011 12:38 PM, paul simpson wrote:

been reading this thread - quite fascinating.

zfsonlinux + gluster looks like an intriguing combination.  i'm
interested in your findings to date; specifically would the zfs L2ARC
(with SSDs) speed up underlying gluster operations?  it sounds like it
could be a potent mix.


Just don't minimize the legal risk issue.  It's very hard for a vendor to 
ship/support this due to the potential risk.  It's arguably hard for a 
user to deploy ZFS on Linux due to the risk, unless they have a way to 
argue that they are not violating licensing (you can't intermix GPL and 
CDDL and ship/support it) for commercial purposes.


Lots of folks can't claim the type of cover that a national lab can 
claim (researching storage models).  You have to decide if the risk is 
worth it.


If you were to do this, I'd suggest going the Illumos/OpenIndiana or BSD 
route.  Yeah, work still needs to be done to get Gluster to build there, 
but the licensing is on firmer ground (hard to claim that an "open 
source" license such as CDDL does not mean what it says).


Understand where you stand first.  Speak to a lawyer type first.  Make 
sure you won't have issues.


And do remember that while Oracle and Netapp have (for the moment) 
de-escalated hostilities, Oracle did not provide indemnity to non-Oracle 
customers.  So Netapp (and others) *can* resume their actions.  A 
question was asked why not go after Nexenta versus others.  Simple. 
There are many others (i.e. more potential licensing/legal fees) as 
compared to a single Nexenta.  It's arguably less about rights than it 
is about revenue from legal action.  But that stuff does happen ...


Oracle is probably the only one who can ship anything ZFS safely.  And 
I'd guess that they are perfectly happy with that situation.




regards,

-paul



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] ZFS + Linux + Glusterfs for a production ready 100+ TB NAS on cloud

2011-09-25 Thread Joe Landman

On 09/25/2011 03:56 AM, Di Pe wrote:


So far the discussion has been focusing on XFS vs ZFS. I admit that I
am a fan of ZFS and I have only used XFS for performance reasons on
mysql servers where it did well. When I read something like this
http://oss.sgi.com/archives/xfs/2011-08/msg00320.html that makes me
not want to use XFS for big data. You can assume that this is a real


This is a corner case bug, and one we are hoping we can get more data to 
the XFS team for.  They asked for specific information that we couldn't 
provide (as we had to fix the problem).  Note: other file systems which 
allow for sparse files *may* have similar issues.  We haven't tried yet.


The issues with ZFS on Linux have to do with legal hazards.  Neither 
Oracle, nor those who claim ZFS violates their patents, would be happy 
to see license violations, or further deployment of ZFS on Linux.  I 
know the national labs in the US are happily doing the integration from 
source.  But I don't think Oracle and the patent holders would sit idly 
by while others do this.  So you'd need to use a ZFS based system such 
as Solaris 11 express to be able to use it without hassle.  BSD and 
Illumos may work without issue as well, and should be somewhat better on 
the legal front than Linux + ZFS.  I am obviously not a lawyer, and you 
should consult one before you proceed down this route.



recent bug because Joe is a smart guy who knows exactly what he is
doing. Joe and the Gluster guys are vendors who can work around these
issues and provide support. If XFS is the choice, may be you should
hire them for this gig.

ZFS typically does not have these FS repair issues in the first place.
The motivation of Lawrence Livermore for porting ZFS to Linux was
quite clear:

http://zfsonlinux.org/docs/SC10_BoF_ZFS_on_Linux_for_Lustre.pdf

OK, they have 50PB and we are talking about much smaller deployments.
However some of the limitations they report I can confirm. Also,
recovering from a drive failure with this whole LVM/Linux Raid stuff
is unpredictable. Hot swapping does not always work and if you
prioritize the re-sync of data to the new drive you can strangle the
entire box (by default the priority of the re-sync process is low on
linux). If you are a Linux expert you can handle this kind of stuff
(or hire someone) but if you ever want to give this setup to a Storage
Administrator you better give them something that they can use with
confidence (may be less of an issue in the cloud).
Compare to this to ZFS: re-silvering works with a very predictable
result and timing. There is a ton of info out there on this topic.  I
think that gluster users may be getting around many of the linux raid
issues by simply taking the entire node down (which is ok in mirrored
node settings) or by using hardware raid controllers. (which are often
not available in the cloud )


There are definite advantages to better technology.  But the issue in 
this case is the legal baggage that goes along with them.


BTRFS may, eventually, be a better choice.  The national labs can do 
this with something of an immunity to prosecution for license violation, 
by claiming the work is part of a research project, and won't actively 
be used in a way that would harm Oracle's interests.  And it would be 
... bad ... for Oracle (and others) to sue the government over a 
relatively trivial violation.


Until Oracle comes out with an absolute declaration that it's OK to use 
ZFS with Linux in a commercial setting ... yeah ... most vendors are 
gonna stay away from that scenario.



Some in the Linux community seem to be slightly opposed to ZFS (I
assume because of the licensing issue) and make sometimes odd
suggestions ("You should use BTRFS").


Licensing mainly.  BTRFS has a better design, but it's not ready yet. 
Won't be for a while.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] GlusterFS performance, concurrency and I/O blocking

2011-08-23 Thread Joe Landman

On 08/23/2011 02:17 PM, Ken Randall wrote:

Hi everybody!  Love this community, and I love GlusterFS.


[...]


Before I get to my suspicion of what's happening, keep in mind that we
have 50+ million files (over hundreds of thousands of directories), most
of them are small, and each page request will pull in upwards of 10-40
supporting assets (images, Flash files, CSS, JS, etc.).  We also have
people executing directory listings whenever they're editing their site,
as they choose images, etc. to insert onto the page.  We're also
exporting the volume to CIFS so our Windows servers can access the
GlusterFS client on the Linux machines in the cluster.  The Samba
settings on there were tweaked to the hilt as well, turning off
case-insensitivity, bumping up caches and async IO, etc.

It appears as if GlusterFS has some kind of I/O blocking going on.
Whenever a directory listing is being pieced together, it noticeably
slows down (or stops?) other operations through the same client.  For a
high-concurrency app like ours where the storage backend needs to be
able to pull off 10 to 100 directory listings a second, and 5,000 to
10,000 IOPS overall, it's easy to see how perf would degrade if my
blocking suspicion is correct.  The biggest culprit, in my guess, is the
directory listing.  Executing one makes things drag.  I've been able to
demonstrate that through a simple script.  And we're running some pretty
monster machines with 24 cores, 24 GB RAM, etc.


[...]



Am I way off?  Does GlusterFS block on directory listings (getdents) or
any other operations?  If so, is there a way to enable the database
equivalent of "dirty reads" so it doesn't block?


What was the back end file system and how did you construct it?

Many folks use the ext* series based upon recommendations from the 
company.  ext* is highly contra-indicated for highly parallel 
operations.  It simply cannot keep up with file systems designed for this.


Another area is the extended attributes.  If you size the system or 
the extended attributes wrong, you will lose performance.
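
(On XFS, for example, that usually means building with larger inodes so 
the gluster xattrs stay inside the inode rather than spilling into 
extra blocks, e.g. something like

mkfs.xfs -i size=512 /dev/sdX    # plus su/sw matched to the RAID stripe geometry

on the brick devices.)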


And (note we are biased given what we build/sell) hardware matters. 
The vast majority of hardware we've seen people stand up themselves for 
this has been badly underpowered for massive scale IO.  It usually 
starts with someone spec'ing a 6G backplane, and rapidly goes south from 
there.  It doesn't matter how fast the software layers are, if the 
hardware simply cannot keep up.  In the vast majority of cases where 
we've been called in to look at someone's setup, yeah, the hardware 
played a huge role in the (lack of) performance.




Ken



___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Very bad performance /w glusterfs. Am I missing something?

2011-08-16 Thread Joe Landman

On 08/11/2011 10:36 AM, Jean-Francois Chevrette wrote:

Hello everyone,

I have just began playing with GlusterFS 3.2 on a debian squeeze
system. This system is a powerful quad-core xeon with 12GB of RAM and
two 300GB SAS 15k drives configured as a RAID-1 on an Adaptec 5405
controller. Both servers are connected through a crossover cable on
gigabit ethernet ports.

I installed the latest GlusterFS 3.2.2 release from the provided
debian package.

As an initial test, I've created a simple brick on my first node:

gluster volume create brick transport tcp node1.internal:/brick

I started the volume and mounted it locally

mount -t glusterfs 127.0.0.1:/brick /mnt/brick

I ran an iozone test on both the underlying partition and the
glusterfs mountpoint. Here are my results for the random write test
(results are in ops/sec):


[...]


(sorry if the formatting is messed)


Any ideas why I am getting such bad results? My volume is not even
replicated or distributed yet!


You are not getting "bad" results.  The results from the local fs w/o 
gluster are likely completely cached.  This is a very small test, and 
chances are your IOs aren't even making it out to the device before the 
test completes.


The only test in your results which is likely generating any sort of 
realistic IO is that very last row and last column data size.


A 15k RPM disk will do ~300 IOPs, which is about what you should see per 
unit.  For a RAID1 across 2 such disks, you should get (depending upon 
how you built the RAID1 and what the underlying RAID system is), from 
150-600 IOPs in most cases.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] 3.2.2 Performance Issue

2011-08-11 Thread Joe Landman

On 08/11/2011 09:11 AM, Burnash, James wrote:

Cogently put and helpful, Joe. Thanks. I'm filing this under "good
answers to frequently asked technical questions". You have a number
of spots in that archive already :-)


Thanks :)

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] 3.2.2 Performance Issue

2011-08-11 Thread Joe Landman

On 08/11/2011 08:28 AM, Stephan von Krawczynski wrote:

On Wed, 10 Aug 2011 12:08:39 -0700
Mohit Anchlia  wrote:


Did you run dd tests on all your servers? Could it be one of the disk is slower?

On Wed, Aug 10, 2011 at 10:51 AM, Joey McDonald  wrote:

Hi Joe, thanks for your response!



An order of magnitude slower with replication. What's going on I wonder?
Thanks for any suggestions.


You are dealing with contention for Gigabit bandwidth.  Replication will
do that, and will be pronounced over 1GbE.  Much less of an issue over 10GbE
or Infiniband.


If that was a GBit contention you can check out by spreading your boxes
over different switches. That should prevent a contention problem.
Unfortunately I can tell you it did not help on our side, so we doubt the
explanation.


Contention over a single port won't be helped by increasing the number 
of switches.


This isn't that hard to work out for yourself.  If you have 1 constant 
stream over a 100MB/s link, you'll get close to 100MB/s.


Now have 2 streams operating over this link.  Assuming good balance, and 
a 100% duty cycle (50% per stream), you'll get, again, 100MB/s used, or 
50 MB/s per client.


Now have 6 streams over this link. Assuming good balance, and a 100% 
duty cycle (16.7% per stream), you'll get, again, 100MB/s used, or 16.7 
MB/s per client.  Which, is curiously close to what was observed with a 
replica 6.  For replica 2, it should be closer to 1/2 the bandwidth.


Note that this analysis assumes full duplex gigabit.  Half duplex would 
 divide these results in half.


Note also that these numbers assume reasonably good gigabit NICs.  Some 
of the lower end NICs, like the Broadcoms shipped in Dell and HP units, 
might not behave as well under load.


The point of this analysis is that it is very easy to run out of 
bandwidth on gigabit networks, so you shouldn't be surprised when it 
happens.  You are contending for a fixed (relatively small) resource 
among N requesters.  On average, you should expect 1/N of the bandwidth 
per requester, clients included.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] RDMA/Infiniband

2011-08-11 Thread Joe Landman

On 08/11/2011 08:06 AM, David Pusch wrote:

Hello again,
a quick question. If I use infiniband to connect 2 nodes in a Cluster.
Do I add the peers via their guid or via an IP? Also when creating a
volume, do I add the bricks based on guid or IP?


IP
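
That is, an IP address the nodes can reach each other on (the IPoIB 
address if IB is your only network), e.g. something like this, with 
addresses and paths made up here:

gluster peer probe 192.168.10.2
gluster volume create testvol transport rdma 192.168.10.1:/data/brick 192.168.10.2:/data/brick

The IB GUIDs are never used at the gluster level.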

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] 3.2.2 Performance Issue

2011-08-10 Thread Joe Landman

On 08/10/2011 01:41 PM, Joey McDonald wrote:



An order of magnitude slower with replication. What's going on I wonder?
Thanks for any suggestions.


You are dealing with contention for Gigabit bandwidth.  Replication will 
do that, and will be pronounced over 1GbE.  Much less of an issue over 
10GbE or Infiniband.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster Installation and Benchmarks

2011-08-10 Thread Joe Landman

On 08/10/2011 10:18 AM, David Pusch wrote:

@Joseph: Come again please? I didn't quite catch what you wanted to tell me.


With N replicas sharing 1 GbE link, you will get (assuming 100% link 
utilization) 1/N of the bandwidth per replica.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster Installation and Benchmarks

2011-08-10 Thread Joe Landman

On 08/10/2011 08:22 AM, David Pusch wrote:

Hello again,
we now did another test where we mounted the Volume on the client and
shut down all Servers but one. We then transferred a 1 GB test file to
the Volume. The transfer took around 10 seconds. We then brought up
another server from the Cluster and again transferred a 1 GB file.
Transfer time now was roughly 20 seconds. We proceeded in this manner
for the last two servers, and each time the transfer time increased by
ca. 10 seconds.
I hope someone can make sense of this and maybe help with this problem.


Replica 6 and a gigabit network, where each brick is communicating with 
clients and other bricks via the same single gigabit network.


This performance is designed in (to the network design, not gluster's 
design).  It shouldn't be surprising to you.
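
Back of the envelope: 1 GB over a ~100 MB/s gigabit link is about 10 
seconds for one copy.  The native client writes one copy to every live 
replica over that same link, so each additional server you bring up adds 
roughly another 10 seconds ... which is exactly the progression you 
measured.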




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] scrub as in zfs

2011-08-08 Thread Joe Landman

On 08/08/2011 04:00 AM, Uwe Kastens wrote:

Hi again,

If one thinks about a large amount of data, maybe as a replacement
for tapes. Will auto heal of gluster help with data corruption
problems? I would expect that, but only, if the files are accessed on
a regular basis.

As far as I  have seen, there is no regular scrub mechanism like in
zfs?


In hardware and software RAID, this mechanism exists.
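
On Linux md, for example, a scrub pass is kicked off with

echo check > /sys/block/md0/md/sync_action

and most distros ship a cron job that runs this on a schedule (checkarray, 
raid-check, or similar).  Hardware RAID controllers have an equivalent 
"patrol read" / verify function.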


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] gluster usable as open source solution

2011-08-04 Thread Joe Landman

On 08/04/2011 06:38 AM, Uwe Kastens wrote:

Hi,

I looked at gluster over the past year. It looks nice but the commercial
option is not so interesting, since it is not possible to evaluate a
storage solution within 30 days. More than one any other storage
platform its a matter of trust, if the scaling is working.


Greetings Uwe

  You can evaluate it as long as needed using the open source code. 
The evaluation is really of the support and service side of things.  We 
could have more discussion offline if you wish, so as not to spam the 
group.  I thought you had decided upon Nexenta though ... has this 
changed?  I'd certainly like to hear more about your consideration of 
Nexenta and Gluster.




So my questions to this mailinglist are:
- Anybody using the open source edition in a bigger production
environment? How is the expierence over a longer time?


  A number of our customers are doing this, ranging from smaller 10TB 
shops through a few 500TB shops.



- Since gluster seems only to offer support within the enterprise
version. Anybody out there how is supporting the open source edition?


  Support is customizable to your needs.  This is a conversation best 
had offline.


  Regards,

Joe


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] 3.0.5 RDMA clients seem broken

2011-07-25 Thread Joe Landman
Just ran through a long exercise with a customer on trying to find out 
why only 1.6M files out of 4+M files appeared when using RDMA glusterfs 
clients.


The tools we put together previously were all about trying to work 
around other issues (not gfid, though it looks like they might have 
utility there).


It looks like the issue was/is in the RDMA mounting.  I don't know why, 
but the same file system mounted from the gluster client over tcp saw 
all the files, but when mounted over IB, we did not see all the files. 
More disconcerting was that several clients saw radically different 
numbers of files.


Just a note ... if you use 3.0.5, it looks like RDMA is broken.  Caching 
is (badly) broken there as well, but that's a whole other story.


n.b. we are updating them to 3.2.2 tomorrow.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Tools for the admin

2011-07-21 Thread Joe Landman

On 07/21/2011 05:07 PM, Whit Blauvelt wrote:

On Thu, Jul 14, 2011 at 12:15:00AM -0400, Joe Landman wrote:


Tool #2: scan_gluster.pl


Joe,

Thanks for this contributions.

scan_gluster.pl seems to depend on File::ExtAttr, which when I try to
install it from cpan to Perl v5.10.1 fails at the "make" step for reasons
cpan leaves unclear. Is there some prerequisite that might not be properly
requested?

Looking at this now because I just had a simple Gluster replication setup
blow up in a nonobvious way. Not obvious from the logs anyhow. It may well
be because I switched it from 3.1.3 to 3.1.5 a few days back. Anyway, it
gives me a good excuse to learn how to diagnose a good failure, so I thought
I'd start by seeing what your tools can show.



Depending upon distro, you would need libattr, attr-dev, etc.  Sadly, 
there is little/no consistency in naming these.
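
As a rough guide (package names vary by distro and release):

# Debian/Ubuntu
apt-get install attr libattr1-dev
# RHEL/CentOS
yum install attr libattr-devel

then retry the cpan install of File::ExtAttr.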


If this turns out to be an issue, I'll see if I can code around it, but 
doing a sub process for each file would be seriously expensive for even 
reasonable sized file systems.



Best,
Whit



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Anyone have that find command handy

2011-07-21 Thread Joe Landman

On 07/21/2011 12:35 PM, Mohit Anchlia wrote:

Did this occur after adding or removing a brick and then running rebalance?


Adding and rebalance.



On Thu, Jul 21, 2011 at 8:30 AM, Joe Landman
  wrote:

On 07/21/2011 11:21 AM, Joe Landman wrote:


On 07/21/2011 11:17 AM, Burnash, James wrote:


Hi Joe.



Found it:

   find $local -type f -perm +01000 -exec rm -v '{}' \;

for $local



Before I answer - can you share your volume info dump?


This is

[root@X02 ~]# gluster volume info

Volume Name: brick1
Type: Distribute
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: X01:/data/glusterfs/dht
Brick2: X02:/data/glusterfs/dht
Brick3: X03:/data/brick-sdc2/glusterfs/dht
Brick4: X03:/data/brick-sdd2/glusterfs/dht




My understanding is that those "link" files are valid if you have my
kind of config - which is distributed-replicate across 2 mirrored
pairs of backed servers. If the request for a file come into the pair
that do not have that file physically on their storage, the "link"
file is created to point to the actual location on the other mirror.
At least ... that is what I think it's supposed to do ...


They are a known bug from 2.0.x time frame. I am searching my emails
from 2 years ago for the magic command that removes them.



James Burnash Unix Engineer Knight Capital Group


-Original Message- From: gluster-users-boun...@gluster.org
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Joe Landman
Sent: Thursday, July 21, 2011 11:11 AM To: gluster-users Subject:
[Gluster-users] Anyone have that find command handy

Ran into a situation that I thought had been corrected in the 2.0.x
time frame.

[root@X03 ~]# ls -alF
/data/brick-sdc2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats
/data/brick-sdd2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip
---------T 1 root root 0 Jul 19 18:08
/data/brick-sdc2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats
---------T 1 root root 0 Jul 21 10:48
/data/brick-sdd2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip

and the real files

[root@X02 ~]# ls -alF
/data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_*
-rw-r--r-- 1 oracle oinstall 984227914 Jul 21 11:05
/data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip
-rw-r--r-- 1 root root 17028780 Jul 20 13:00
/data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats


Note the permissions on the first as compared to the second. This
came from a rebalance operation with 3.1.5.

Anyone have that handy

find -perm  -exec rm {}

command handy so we can scan for and remove the ghost files? I can
search for it in my old emails, just figured I'd ask.

-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc.
email: land...@scalableinformatics.com web :
http://scalableinformatics.com
http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121
fax : +1 866 888 3112 cell : +1 734 612 4615
___ Gluster-users mailing
list Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


DISCLAIMER: This e-mail, and any attachments thereto, is intended
only for use by the addressee(s) named herein and may contain legally
privileged and/or confidential information. If you are not the
intended recipient of this e-mail, you are hereby notified that any
dissemination, distribution or copying of this e-mail, and any
attachments thereto, is strictly prohibited. If you have received
this in error, please immediately notify me and permanently delete
the original and any copy of any e-mail and any printout thereof.
E-mail transmission cannot be guaranteed to be secure or error-free.
The sender therefore does not accept liability for any errors or
omissions in the contents of this message which arise as a result of
e-mail transmission. NOTICE REGARDING PRIVACY AND CONFIDENTIALITY
Knight Capital Group may, at its discretion, monitor and review the
content of all e-mail communications. http://www.knight.com






--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Anyone have that find command handy

2011-07-21 Thread Joe Landman

On 07/21/2011 11:21 AM, Joe Landman wrote:

On 07/21/2011 11:17 AM, Burnash, James wrote:

Hi Joe.



Found it:

   find $local -type f -perm +01000 -exec rm -v '{}' \;

for $local



Before I answer - can you share your volume info dump?


This is

[root@X02 ~]# gluster volume info

Volume Name: brick1
Type: Distribute
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: X01:/data/glusterfs/dht
Brick2: X02:/data/glusterfs/dht
Brick3: X03:/data/brick-sdc2/glusterfs/dht
Brick4: X03:/data/brick-sdd2/glusterfs/dht




My understanding is that those "link" files are valid if you have my
kind of config - which is distributed-replicate across 2 mirrored
pairs of backed servers. If the request for a file come into the pair
that do not have that file physically on their storage, the "link"
file is created to point to the actual location on the other mirror.
At least ... that is what I think it's supposed to do ...


They are a known bug from 2.0.x time frame. I am searching my emails
from 2 years ago for the magic command that removes them.



James Burnash Unix Engineer Knight Capital Group


-Original Message- From: gluster-users-boun...@gluster.org
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Joe Landman
Sent: Thursday, July 21, 2011 11:11 AM To: gluster-users Subject:
[Gluster-users] Anyone have that find command handy

Ran into a situation that I thought had been corrected in the 2.0.x
time frame.

[root@X03 ~]# ls -alF
/data/brick-sdc2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats
/data/brick-sdd2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip
---------T 1 root root 0 Jul 19 18:08
/data/brick-sdc2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats
---------T 1 root root 0 Jul 21 10:48
/data/brick-sdd2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip

and the real files

[root@X02 ~]# ls -alF
/data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_*
-rw-r--r-- 1 oracle oinstall 984227914 Jul 21 11:05
/data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip
-rw-r--r-- 1 root root 17028780 Jul 20 13:00
/data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats


Note the permissions on the first as compared to the second. This
came from a rebalance operation with 3.1.5.

Anyone have that handy

find -perm  -exec rm {}

command handy so we can scan for and remove the ghost files? I can
search for it in my old emails, just figured I'd ask.

-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc.
email: land...@scalableinformatics.com web :
http://scalableinformatics.com
http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121
fax : +1 866 888 3112 cell : +1 734 612 4615
___ Gluster-users mailing
list Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


DISCLAIMER: This e-mail, and any attachments thereto, is intended
only for use by the addressee(s) named herein and may contain legally
privileged and/or confidential information. If you are not the
intended recipient of this e-mail, you are hereby notified that any
dissemination, distribution or copying of this e-mail, and any
attachments thereto, is strictly prohibited. If you have received
this in error, please immediately notify me and permanently delete
the original and any copy of any e-mail and any printout thereof.
E-mail transmission cannot be guaranteed to be secure or error-free.
The sender therefore does not accept liability for any errors or
omissions in the contents of this message which arise as a result of
e-mail transmission. NOTICE REGARDING PRIVACY AND CONFIDENTIALITY
Knight Capital Group may, at its discretion, monitor and review the
content of all e-mail communications. http://www.knight.com






--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Anyone have that find command handy

2011-07-21 Thread Joe Landman

On 07/21/2011 11:17 AM, Burnash, James wrote:

Hi Joe.

Before I answer - can you share your volume info dump?


This is

[root@X02 ~]# gluster volume info

Volume Name: brick1
Type: Distribute
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: X01:/data/glusterfs/dht
Brick2: X02:/data/glusterfs/dht
Brick3: X03:/data/brick-sdc2/glusterfs/dht
Brick4: X03:/data/brick-sdd2/glusterfs/dht




My understanding is that those "link" files are valid if you have my
kind of config - which is distributed-replicate across 2 mirrored
pairs of backed servers. If the request for a file come into the pair
that do not have that file physically on their storage, the "link"
file is created to point to the actual location on the other mirror.
At least ... that is what I think it's supposed to do ...


They are a known bug from 2.0.x time frame.  I am searching my emails 
from 2 years ago for the magic command that removes them.
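
If memory serves, it was built around the fact that these stale linkfiles
are zero length and carry only the sticky bit (the ---------T entries
above), so something along these lines should list them.  This is a
from-memory sketch, not the exact command from the old thread -- run it
with just -print first, and only add the -exec rm once you are happy with
what it finds:

   find /data/brick-sdc2/glusterfs/dht /data/brick-sdd2/glusterfs/dht \
        -type f -perm 1000 -size 0 -print

   # once verified, append:  -exec rm -v {} \;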




James Burnash Unix Engineer Knight Capital Group


-Original Message- From: gluster-users-boun...@gluster.org
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Joe Landman
Sent: Thursday, July 21, 2011 11:11 AM To: gluster-users Subject:
[Gluster-users] Anyone have that find command handy

Ran into a situation that I thought had been corrected in the 2.0.x
time frame.

[root@X03 ~]# ls -alF
/data/brick-sdc2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats
/data/brick-sdd2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip

---------T 1 root root 0 Jul 19 18:08 /data/brick-sdc2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats
---------T 1 root root 0 Jul 21 10:48 /data/brick-sdd2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip

and the real files

[root@X02 ~]# ls -alF /data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_*
-rw-r--r-- 1 oracle oinstall 984227914 Jul 21 11:05 /data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip
-rw-r--r-- 1 root   root      17028780 Jul 20 13:00 /data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats

 Note the permissions on the first as compared to the second.  This
came from a rebalance operation with 3.1.5.

Anyone have that handy

find -perm  -exec rm {}

command handy so we can scan for and remove the ghost files?  I can
search for it in my old emails, just figured I'd ask.

-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc.
email: land...@scalableinformatics.com web  :
http://scalableinformatics.com
http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121
fax  : +1 866 888 3112 cell : +1 734 612 4615
___ Gluster-users mailing
list Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users





--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] Anyone have that find command handy

2011-07-21 Thread Joe Landman
Ran into a situation that I thought had been corrected in the 2.0.x time 
frame.


[root@X03 ~]# ls -alF
/data/brick-sdc2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats
/data/brick-sdd2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip
---------T 1 root root 0 Jul 19 18:08 /data/brick-sdc2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats
---------T 1 root root 0 Jul 21 10:48 /data/brick-sdd2/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip


and the real files

[root@X02 ~]# ls -alF /data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_*
-rw-r--r-- 1 oracle oinstall 984227914 Jul 21 11:05 /data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip
-rw-r--r-- 1 root   root      17028780 Jul 20 13:00 /data/glusterfs/dht/opra/20110502/options_20110502_opra_ch_015_.dat.zip.stats


Note the permissions on the first as compared to the second.  This came 
from a rebalance operation with 3.1.5.


Anyone have that handy

find -perm  -exec rm {}

command handy so we can scan for and remove the ghost files?  I can 
search for it in my old emails, just figured I'd ask.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS

2011-07-17 Thread Joe Landman

On 07/17/2011 11:19 PM, Ken Randall wrote:

Joe,

Thank you for your response.  After seeing what you wrote, I bumped up
the performance.cache-size up to 4096MB, the max allowed, and ran into
the same wall.


Hmmm ...



I wouldn't think that any SMB caching would help in this case, since the
same Samba server on top of the raw Gluster data wasn't exhibiting any
trouble, or am I deceived?


Samba could cache better so it didn't have to hit Gluster so hard.


I haven't used strace before, but I ran it on the glusterfs process, and
saw a lot of:
epoll_wait(3, {{EPOLLIN, {u32=9, u64=9}}}, 257, 4294967295) = 1
readv(9, [{"\200\0\16,", 4}], 1)= 4
readv(9, [{"\0\n;\227\0\0\0\1", 8}], 1) = 8
readv(9,
[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\31\0\0\0\0\0\0\0\1\0\0\0\0"..., 
3620}],
1) = 1436
readv(9, 0xa90b1b8, 1)  = -1 EAGAIN (Resource
temporarily unavailable)


Interesting ... I am not sure why it's reporting an EAGAIN for readv, 
other than it can't fill the vector from the read.



And when I ran it on smbd, I saw a constant stream of this kind of activity:
getdents(29, /* 25 entries */, 32768)   = 840
getdents(29, /* 25 entries */, 32768)   = 856
getdents(29, /* 25 entries */, 32768)   = 848
getdents(29, /* 24 entries */, 32768)   = 856
getdents(29, /* 25 entries */, 32768)   = 864
getdents(29, /* 24 entries */, 32768)   = 832
getdents(29, /* 25 entries */, 32768)   = 832
getdents(29, /* 24 entries */, 32768)   = 856
getdents(29, /* 25 entries */, 32768)   = 840
getdents(29, /* 24 entries */, 32768)   = 832
getdents(29, /* 25 entries */, 32768)   = 784
getdents(29, /* 25 entries */, 32768)   = 824
getdents(29, /* 25 entries */, 32768)   = 808
getdents(29, /* 25 entries */, 32768)   = 840
getdents(29, /* 25 entries */, 32768)   = 864
getdents(29, /* 25 entries */, 32768)   = 872
getdents(29, /* 25 entries */, 32768)   = 832
getdents(29, /* 24 entries */, 32768)   = 832
getdents(29, /* 25 entries */, 32768)   = 840
getdents(29, /* 25 entries */, 32768)   = 824
getdents(29, /* 25 entries */, 32768)   = 824
getdents(29, /* 24 entries */, 32768)   = 864
getdents(29, /* 25 entries */, 32768)   = 848
getdents(29, /* 24 entries */, 32768)   = 840


Get directory entries.  This is the stuff that NTFS is caching for its 
web server, and it appears Samba is not.


Try

aio read size = 32768
csc policy = documents
dfree cache time = 60
directory name cache size = 10
fake oplocks = yes
getwd cache = yes
level2 oplocks = yes
max stat cache size = 16384
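
(Roughly where those land in smb.conf -- a sketch only; I am not certain
off-hand which of them are global-only, so run testparm after editing.
The share name and path below are placeholders:)

   [global]
       getwd cache = yes
       max stat cache size = 16384

   [glustershare]
       path = /mnt/glusterfs
       # the remaining options from the list above (aio read size,
       # csc policy, dfree cache time, directory name cache size,
       # fake oplocks, level2 oplocks) are per-share and go here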


That chunk would get repeated over and over and over again as fast as
the screen could go; only occasionally (every 5-10 seconds or so)
would you see anything that you'd normally expect to see, such as:
close(29)   = 0
stat("Storage/01", 0x7fff07dae870) = -1 ENOENT (No such file or directory)
write(23,
"\0\0\0#\377SMB24\0\0\300\210A\310\0\0\0\0\0\0\0\0\0\0\0\0\1\0d\233"...,
39) = 39
select(38, [5 20 23 27 30 31 35 36 37], [], NULL, {60, 0}) = 1 (in [23],
left {60, 0})
read(23, "\0\0\0x", 4)  = 4
read(23,
"\377SMB2\0\0\0\0\30\7\310\0\0\0\0\0\0\0\0\0\0\0\0\1\0\250P\273\0[8"...,
120) = 120
stat("Storage", {st_mode=S_IFDIR|0755, st_size=1581056, ...}) = 0
stat("Storage/011235", 0x7fff07dad470) = -1 ENOENT (No such file or
directory)
stat("Storage/011235", 0x7fff07dad470) = -1 ENOENT (No such file or
directory)
open("Storage", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 29
fcntl(29, F_SETFD, FD_CLOEXEC)  = 0

(The no such file or directory part is expected since some of the image
references don't exist.)



Ok.  It looks like Samba is pounding on GlusterFS metadata (getdents). 
GlusterFS doesn't really do a great job in this case ... you have to 
give it help and cache pretty aggressively here.  Samba can do this 
caching to some extent.  You might want to enable stat-cache and fast 
lookups.  These have been problematic for us in the past though, and I'd 
recommend caution.



Ken



___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS

2011-07-17 Thread Joe Landman

On 07/17/2011 08:56 PM, Ken Randall wrote:


You may be asking, why am I asking here instead of on a Samba group, or
even a Windows group?  Here's why:  My control is that I have a Windows
file server that I can swap in Gluster's place, and I'm able to load
that page without it blinking an eye (it actually becomes a test of the
computer that the browser is on).  It does not affect any of the web
servers' in the slightest.  My second control is that I have exported
the raw Gluster data directory as an SMB share (with the same exact
Samba configuration as the Gluster one), and it performs equally as well
as the Windows file server.  I can load the Page of Death with no
consequence.


NTFS with SMB sharing caches everything.  First page load may take a bit 
of time, but subsequent will be running from data stored in RAM.


You can adjust SMB caching and Gluster caching as needed.


I've pushed IO-threads all the way to the maximum 64 without any
benefit.  I can't see anything noteworthy in the Gluster or Samba logs,
but perhaps I am not sure what to look for.


Not likely your issue.  More probably it's the Gluster cache size coupled 
with some CIFS tuning you need.
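
(Both the cache size and the IO thread count are per-volume settings from
the CLI -- the volume name below is a placeholder, and the option names
are from the 3.1.x releases, so double check with gluster volume info
afterwards:)

   gluster volume set VOLNAME performance.cache-size 4096MB
   gluster volume set VOLNAME performance.io-thread-count 64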




Thank you to anybody who can point me the right direction.  I am hoping
I don't have to dive into Wireshark or tcpdump territory, but I'm open
if you can guide the way!  ;)


You might need to strace -P the slow servers.  Would help to know what 
calls they are stuck on.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Tools for the admin

2011-07-14 Thread Joe Landman

On 07/14/2011 04:50 PM, Mohit Anchlia wrote:

It will be good to have these monitoring capabilities rolled in
Gluster so that one can identify issues with GFID, xattr mismatch etc
proactively. Currently, there is no such feature. You find out the
hardway when clients are impacted.


Agreed.  We might be able to do something with fanotify and other types 
of signaling for sanity checking in-situ.


Or possibly, another translator (read after write/sanity check after 
write).  We are looking into building some of our own translators for 
plugging in.  This might not be a bad one to get started with.




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Tools for the admin

2011-07-14 Thread Joe Landman

On 07/14/2011 04:35 PM, John Mark Walker wrote:

Joe - thanks for taking the time to write this up. It sounds like the
issue this is designed to fix is related to the GFID mismatch issue
that we released a preventive fix for today.


Hi John

  We need to get this into testing soon.



The sanity checks could be useful, though. Does today's release
change anything with respect to your tools?


Don't know yet.  If we can get access to more of the internals 
monitoring, we should be able to craft better tools for this.  Our plan 
is to eventually get something like a parallel scan and "fix" tool that 
does best practices for the fix.  Not quite an fsck (that's at a 
different level).  But a real sanity checker.


I'd imagine it would be useful for the CloudFS group too ...



-John Mark Gluster Community Guy


 From:
gluster-users-boun...@gluster.org [gluster-users-boun...@gluster.org]
on behalf of Joe Landman [land...@scalableinformatics.com] Sent:
Wednesday, July 13, 2011 9:15 PM To: gluster-users Subject:
[Gluster-users] Tools for the admin

Hi folks

We have run into a number of problems with missing files (among
other things).  So I went hunting for the files.  Along the way, I
came up with some very simple sanity checks and tools for helping to
correct situations.  They will not work on striped data ... sorry.

Sanity check #1: conservation of number of files

The sum of the number of files on your backing stores (excluding
links and directories) should equal (with possible minor variance due
to gluster internals) the sum of the number of files (excluding links
and directories) in your gluster volumes.

If you have say, 6 bricks, each with nearly 1M files, and a dht
volume built from those bricks, you really ... REALLY ... shouldn't
have only 1.8M files in your volume.  If you do, then some files are
missing from the volume (really).  You can tell what these files are,
as they have no xattr.  Yeah.  Really.

How can you enumerate what you have?

Simple.  Meet file_accounting.pl  (available at
(http://download.scalableinformatics.com/gluster/utils/)

This handy utility will tell you important things about your file
system.

[root@jr4-1 temp]# /data/tiburon/install/scan/file_accounting.pl
--bspath=/data/brick-sdc2/dht/ Number of entries: 944604 Number of
links  : 6711 Number of dir: 102825 Number of files  : 834794

--bspath is the "backing store path", where the files reside.  It
works just as well on your gluster volume, which allows you to
inspect your sanity with appropriate sums.

So you need to copy these into the volume.  And move them out of the
way first  before copy in.

Which leads to tool #2 and #3.  First, you need to scan your backing
store file system for the files.

Tool #2: scan_gluster.pl

/data/tiburon/install/scan/scan_gluster.pl
--bspath=/data/brick-sdc2/dht

/data/brick-sdc2/temp/sdc2.data


will grab lots of nice info about the file, including the
attributes. You can now use grep against the sdc2.data file and look
only for 'attr=,' and those will be things gluster knows varying
degrees about in your file system.  Some things, specifically files
with this condition, yeah, those are missing files.  The ones I was
trying to find.

If you have a user who notes that files occasionally go missing,
yeah, this can help you find them if they exist on the backing store.
Which they probably do.

The next tool is dangerous.  So far all we have done is to scan the
backing store.  Now we are going to make changes.  No, don't worry,
its actually ... almost ... safe.  We do a file move to another
location (preferably on the same device/mount point in the backing
store), then a copy into gluster volume (yes you need to mount it on
your brick nodes). The danger is in modifying a gluster file system
backend.  Don't do this.  Ever.  Unless 3/4 of your files go
missing.

And, by the way, we have a handy dandy --md5 switch on there, if you
want the scan to take forever.

Tool #3: data_mover.pl

This will do the dirty work.  It parses the output of scan_gluster,
and makes changes.  There is a --dryrun option for those who want to
try it, and a -T number   option to specify the number of changes to
make to the file.  Allows you to try it (hence the T ... for TRY) on
some number of files.  It will preserve ownership and permission mask
(ohhh ahhh ... shiny!).  The --tmp option happily sets your temporary
directory. Verbose and debug should be obvious.

nohup ./data_mover.pl --data sdd2-nomd5.data --debug --verbose
--tmp `pwd`/tmp  -T 200>>  out 2>&1&


Note:  all of these tools currently use /opt/scalable/bin/perl as
the interpreter.  This is because our Perl build (5.12.3) includes
all the bits we need to make this work.  If you want to use them, you
are welcome to change /opt/scalable/bin/perl to /usr/bin/perl, and
then you will have to install a few modules

cpan Getopt::Lucid

[Gluster-users] Tools for the admin

2011-07-13 Thread Joe Landman

Hi folks

  We have run into a number of problems with missing files (among other 
things).  So I went hunting for the files.  Along the way, I came up 
with some very simple sanity checks and tools for helping to correct 
situations.  They will not work on striped data ... sorry.


Sanity check #1: conservation of number of files

The sum of the number of files on your backing stores (excluding links 
and directories) should equal (with possible minor variance due to 
gluster internals) the sum of the number of files (excluding links and 
directories) in your gluster volumes.


If you have say, 6 bricks, each with nearly 1M files, and a dht volume 
built from those bricks, you really ... REALLY ... shouldn't have only 
1.8M files in your volume.  If you do, then some files are missing from 
the volume (really).  You can tell what these files are, as they have no 
xattr.  Yeah.  Really.


How can you enumerate what you have?

Simple.  Meet file_accounting.pl  (available at 
(http://download.scalableinformatics.com/gluster/utils/)


This handy utility will tell you important things about your file system.

[root@jr4-1 temp]# /data/tiburon/install/scan/file_accounting.pl 
--bspath=/data/brick-sdc2/dht/

Number of entries: 944604
Number of links  : 6711
Number of dir: 102825
Number of files  : 834794

--bspath is the "backing store path", where the files reside.  It works 
just as well on your gluster volume, which allows you to inspect your 
sanity with appropriate sums.
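
In other words, run it once per backing store and once against the
mounted volume and compare the "Number of files" lines.  Something like
this (brick paths from the example above; the mount point is wherever
your gluster volume is mounted):

   for p in /data/brick-sdc2/dht /data/brick-sdd2/dht ; do
       /data/tiburon/install/scan/file_accounting.pl --bspath=$p
   done
   /data/tiburon/install/scan/file_accounting.pl --bspath=/your/gluster/mount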


So you need to copy these into the volume.  And move them out of the way 
first  before copy in.


Which leads to tool #2 and #3.  First, you need to scan your backing 
store file system for the files.


Tool #2: scan_gluster.pl

/data/tiburon/install/scan/scan_gluster.pl --bspath=/data/brick-sdc2/dht 
> /data/brick-sdc2/temp/sdc2.data


will grab lots of nice info about the file, including the attributes. 
You can now use grep against the sdc2.data file and look only for 
'attr=,' and those will be things gluster knows varying degrees about in 
your file system.  Some things, specifically files with this condition, 
yeah, those are missing files.  The ones I was trying to find.
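
That is, something as simple as this pulls out the suspect entries
(sketch; adjust the paths to wherever you wrote the scan output):

   grep 'attr=,' /data/brick-sdc2/temp/sdc2.data > /data/brick-sdc2/temp/missing-xattr.list
   wc -l /data/brick-sdc2/temp/missing-xattr.list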


If you have a user who notes that files occasionally go missing, yeah, 
this can help you find them if they exist on the backing store.  Which 
they probably do.


The next tool is dangerous.  So far all we have done is to scan the 
backing store.  Now we are going to make changes.  No, don't worry, its 
actually ... almost ... safe.  We do a file move to another location 
(preferably on the same device/mount point in the backing store), then a 
copy into gluster volume (yes you need to mount it on your brick nodes). 
 The danger is in modifying a gluster file system backend.  Don't do 
this.  Ever.  Unless 3/4 of your files go missing.


And, by the way, we have a handy dandy --md5 switch on there, if you 
want the scan to take forever.


Tool #3: data_mover.pl

This will do the dirty work.  It parses the output of scan_gluster, and 
makes changes.  There is a --dryrun option for those who want to try it, 
and a -T number   option to specify the number of changes to make to the 
file.  Allows you to try it (hence the T ... for TRY) on some number of 
files.  It will preserve ownership and permission mask (ohhh ahhh ... 
shiny!).  The --tmp option happily sets your temporary directory. 
Verbose and debug should be obvious.


nohup ./data_mover.pl --data sdd2-nomd5.data --debug --verbose  --tmp 
`pwd`/tmp  -T 200 >> out 2>&1 &



Note:  all of these tools currently use /opt/scalable/bin/perl as the 
interpreter.  This is because our Perl build (5.12.3) includes all the 
bits we need to make this work.  If you want to use them, you are 
welcome to change /opt/scalable/bin/perl to /usr/bin/perl, and then you 
will have to install a few modules


cpan Getopt::Lucid File::ExtAttr

If you have an issue with either, please let me know.

We can turn these into binaries if someone needs.  Source is at

http://download.scalableinformatics.com/gluster/utils/

Let me know (offline) if you run into problems if you decide to give 
them a try.  Note, they are GPL2 (no license tag on them), no warranty, 
and data_mover.pl will MOST DEFINITELY DESTROY DATA.  We aren't liable 
for any damages if you use it.  Caveat Emptor.  Let the admin beware. 
Did I mention that data_mover.pl WILL DESTROY YOUR DATA?   I am not sure 
if I did.  So here it is again.  data_mover.pl WILL DESTROY YOUR DATA.


Don't use these unless you have a backup.  Especially data_mover.pl. 
Because IT WILL MOST DEFINITELY DESTROY YOUR DATA.  Might even bite your 
dog, egg your house, and do all sorts of other nastiness.  It will 
increase entropy in the universe.


But if you are staring at the rear end of 3.8M missing files, wondering 
WTF, mebbe ... that data lossage thing doesn't sound so bad.  Especially 
if you can reverse it.


So feel free to look them over.  I plan to hone them over time

Re: [Gluster-users] Files present on the backend but have become invisible from clients

2011-07-08 Thread Joe Landman
Interestingly, one of our customers is running into this, though with 
DHT only, and on a 3.0.5 system.  Just filed a report #3646 in the 
ticketing system.


Files which are invisible have no xattrs (I do not believe they went 
poking around the backend on their own).  This includes many of the 
directories that are missing ... no xattr.
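
(A quick way to confirm on the brick itself, as root -- if this comes
back with no trusted.* attributes for a file or directory the client
cannot see, that matches what we are describing.  The path is just an
example:)

   getfattr -d -e hex -m . /data/brick-sdc2/dht/path/to/suspect-file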


Are there any curative measures we can take?  Copy out of backend and 
copy back in?





On 06/23/2011 02:05 PM, Burnash, James wrote:

Sorry – volfile:

http://pastebin.com/2hYZw2V9

James Burnash

Unix Engineer

Knight Capital Group

*From:*gluster-users-boun...@gluster.org
[mailto:gluster-users-boun...@gluster.org] *On Behalf Of *Burnash, James
*Sent:* Thursday, June 23, 2011 2:03 PM
*To:* 'Anand Avati'; Jeff Darcy
*Cc:* gluster-users@gluster.org
*Subject:* Re: [Gluster-users] Files present on the backend but have
become invisible from clients

Hi Anand.

Here is my script (no heckling – I know I do not produce beautiful code
:)), and below is the invocation I used:

#!/bin/bash
#
# description: # Gluster server script to check brick extended attributes
#
# J. Burnash 2011
#
# $Id$

ECHO_CMD=""
SORT_CMD="cat"
brick_parent="read-only"

while getopts "a:b:ehsw" flag
do
    case $flag in
        a) attribute=$OPTARG;
           OPTARG="";
           ;;
        b) brick=$OPTARG;
           OPTARG="";
           ;;
        e) ECHO_CMD="-n";
           ;;
        s) SORT_CMD="sort -k 2";
           ;;
        w) brick_parent="read-write"
           OPTARG="";
           ;;
        h) progname=$(basename $0);
           echo "$progname: Usage:";
           echo "$progname <-a attribute> <-b brick> hostname1  ... ";
           echo "where -e puts the hostname at the beginning of the output from the requested comand,";
           echo " -a  sets the extended attribute to be displayed,";
           echo " -b  sets the brick for which extended attributes are to be displayed,";
           echo "and -h gives this help message";
           echo "the brick and attribute arguments may be enclosed in quotes to allow multiple matches for each host";
           echo "Example: $progname 'service nitemond restart' hostname1 "
           exit;
           ;;
        *) echo "Unknown argument $flag";
           ;;
    esac
    # Debug
    #echo "Flag=$flag OPTIND=$OPTIND OPTARG=$OPTARG"
done

# Dispose of processed option arguments:
shift $((OPTIND-1)); OPTIND=1

loop_check -f $ECHO_CMD "cd /export/$brick_parent; for brick_root in $brick; do echo -n \$HOSTNAME; getfattr -d -e hex -m $attribute \$brick_root | xargs echo | sed -e \"s/# file:/ /\" ; done" $@ | awk '{print $2,$3,$1}' | $SORT_CMD

Invocation: check_brick_attrs_ro -e -s -a trusted.afr -b 'g0{1,2}'
jc1letgfs{14,15,17,18}

Thanks,

James Burnash

Unix Engineer

Knight Capital Group

*From:*Anand Avati [mailto:anand.av...@gmail.com]
*Sent:* Thursday, June 23, 2011 6:31 AM
*To:* Jeff Darcy
*Cc:* Burnash, James; gluster-users@gluster.org
*Subject:* Re: [Gluster-users] Files present on the backend but have
become invisible from clients

On Thu, Jun 23, 2011 at 2:10 AM, Jeff Darcy <jda...@redhat.com> wrote:

On 06/22/2011 02:44 PM, Burnash, James wrote:
 > g01/pfs-ro1-client-0=0x jc1letgfs17
 > g01/pfs-ro1-client-0=0x0608 jc1letgfs18
 > g01/pfs-ro1-client-20=0x jc1letgfs14
 > g01/pfs-ro1-client-20=0x0200 jc1letgfs15
 > g02/pfs-ro1-client-2=0x jc1letgfs17
 > g02/pfs-ro1-client-2=0x4504 jc1letgfs18
 > g02/pfs-ro1-client-22=0x jc1letgfs14
 > g02/pfs-ro1-client-22=0x0200 jc1letgfs15

 >
 > Would anybody have any insights as to what is going on here? I'm
 > seeing attributes in my sleep these days ... that cannot be good!

Can you give your script or explain what each of those fields mean and
how they fit into your volume configuration? Also can you post your
client volfile?

Avati




___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   h

Re: [Gluster-users] tuning

2011-07-07 Thread Joe Landman

On 07/07/2011 10:01 AM, Papp Tamas wrote:

On 2011-07-07 15:42, Joe Landman wrote:

The same on the cluster volume is ~50-60 MB/s.



This is what you expect over GE, more towards the lower end of the
range (assuming a non-optimized driver and network stack).


Why? It's GE, 60MB/s is about ~500Mbit. I expect to be at least
~900Mbit. Am I dreaming?:)


This is very dependent upon the IO patterns, driver, switch, port 
contention, ...


We do see 900+ Mb/s (it makes more sense to talk about this in terms of 
MB/s).  Best case you will see over a 1 GbE wire is about 117 MB/s +/- 
some.  To get there, you need to be doing large enough writes/reads for 
it to be meaningful, not suffer port contention, and a few other things.


[...]


I don't understand this. What are your needs, and what are your
expectations based upon those needs? Do you have the right equipment
to meet the needs, or do you need to buy more/better equipment to meet
the needs?


800-1000Mbit/s would be enough to me.


Try removing the channel bond.


Actually I had never tested, that would be the optimal for our purposes.
We do the best from the tools we have, and if it's not enough, the boss
starting to think over it...
I know, this is not the best way, but we have this:)

What can be the bottleneck for our system, GE?
Would 10GE for server help on this much more (single machines to cluster
connection).


This is a complex and hard question to answer, and it involves a deep 
investigation of your system, the IO patterns, etc.







Thank you,

tamas

ps.: What do you mean on optimizing network stack, jumbo frame?
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] tuning

2011-07-07 Thread Joe Landman

On 07/07/2011 04:40 AM, Papp Tamas wrote:


On the node:

$ dd if=/dev/zero of=adsdfgrr bs=128K count=100k oflag=direct
102400+0 records in
102400+0 records out
13421772800 bytes (13 GB) copied, 27.4022 s, 490 MB/s

The same on the cluster volume is ~50-60 MB/s.


This is what you expect over GE, more towards the lower end of the range 
(assuming a non-optimized driver and network stack).




Network layer is GE, nodes are connected with two NICs in bonding.


Bonding will not help single threaded reads/writes.  It could help 
multiple simultaneous reads and writes.  But a single process doing 
reads/writes will go over a single link (in general).



I am absolutely desparated. Is it Ubuntu? Would be better with Fedora?
Or does the Storage Platform run on an optimized kernel or something
like that?


I don't understand this.  What are your needs, and what are your 
expectations based upon those needs?  Do you have the right equipment to 
meet the needs, or do you need to buy more/better equipment to meet the 
needs?



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] warning: pure path resolution

2011-06-15 Thread Joe Landman

On 06/15/2011 10:23 AM, Whit Blauvelt wrote:

On Wed, Jun 15, 2011 at 10:16:02AM -0400, Joe Landman wrote:


As a general rule, the W simply tells you its a warning.


In practice, is it safe to simply ignore all warnings from Gluster?
Different projects have different thresholds between no warning, warning,
and critical messages. Is Gluster's, in your experience, such that warnings
may as well be discarded?

Whit



Ok ... Warnings shouldn't be ignored, just logged. In most cases, they 
will turn out to be nothing.  In some cases, they may turn out to be 
something.  I'll defer to the devs, but our experience suggests that 
warnings that don't develop into E states are things you don't have to 
worry about.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] warning: pure path resolution

2011-06-15 Thread Joe Landman

On 06/15/2011 10:12 AM, paul simpson wrote:

hi,

can anyone please answer this?  it's hard to contribute back to the
community when there's a deathly silence..



As a general rule, the W simply tells you its a warning.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster 3.2.0 and ucarp not working

2011-06-08 Thread Joe Landman

On 06/08/2011 04:37 PM, Joshua Baker-LePain wrote:


BTW: You need a virtual ip for ucarp


As I said, that's what I'm doing now -- using the virtual IP address
managed by ucarp in my fstab line. But Craig Carl from Gluster told the
OP in this thread specifically to mount using the real IP address of a
server when using the GlusterFS client, *not* to use the ucarp VIP.

So I'm officially confused.


GlusterFS client side gets its config from the server, and makes 
connections to each server.  Any of the GlusterFS servers may be used 
for the mount, and the client will connect to all of them.  If one of 
the servers goes away, and you have a replicated or HA setup, you 
shouldn't see any client side issues.


GlusterFS using an NFS client presents a mount point, and a single point 
of connection to the server.  Any of the GlusterFS servers may be used 
for the mount, and the client will connect to all of them.  If one of 
the servers goes away, and you have a replicated or HA setup, you 
shouldn't see any client side issues if you are not attached to that 
server for your mount.  Otherwise, the mount may hang.


Does this make it clearer or less so?

ucarp would be needed for the NFS side of the equation.  round robin DNS 
is useful in both cases.
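
Concretely, the two mount styles look something like this (volume name,
hostnames and the ucarp VIP are placeholders, and the NFS options may
need tweaking, e.g. adding nolock, depending on the client):

   # native client: any server's real name works; the client then
   # connects to every brick itself
   mount -t glusterfs server1:/volname /mnt/volname

   # NFS client: point it at the ucarp-managed VIP so the mount point
   # survives the loss of the server you happened to mount from
   mount -t nfs -o vers=3,tcp vip.example.com:/volname /mnt/volname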







--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Substitute for SMP?

2011-05-27 Thread Joe Landman

On 05/27/2011 11:13 AM, Jon Tegner wrote:


Thanks!

Would you say that the inefficiencies related to the ram disk would
remove all advantages of using ram instead of hard drives (or could it
still be worth a try)?


Depends upon many things.  Best thing to do is test.


As for speed, I would think that latency would be the most critical -
but I don't really know, it is the code of a colleague of mine (I was
trying to come up with a replacement of his SMP-machine which is getting
old).


Ok.  Are they IO bound with small reads/writes, large reads/writes? 
Gluster is very good for the larger file IO and many simultaneous 
(large) IOs.





--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Delay for geo-replication?

2011-05-27 Thread Joe Landman

On 05/27/2011 08:19 AM, Whit Blauvelt wrote:

On Fri, May 27, 2011 at 12:12:31PM +0530, Kaushik BV wrote:


The geo-replication crawls the volume continuosly for changes (does an
intelligent crawl, i.e crawls down the fs hierarchy) only when there are
changes beneath, and records those changes in the slave.


Have you considered using the Linux kernel's INOTIFY function rather than
crawling the directories? In theory it should be more efficient.


FAM did this stuff a decade ago for Irix and Unicos.  I think it was 
ported to Linux and then superseded by INOTIFY and related designs.
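
(For anyone who wants to get a feel for it, inotify-tools makes this easy
to experiment with -- a rough sketch, the brick path is a placeholder,
and note that inotify watches are per directory, so a deep tree needs a
large fs.inotify.max_user_watches:)

   inotifywait -m -r -e create,modify,delete,move /data/brick0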



On the other hand, by crawling the directories do you get the benefit of
triggering any self-heal that may be needed automatically?


I wonder if it makes sense storing a local checksum and time stamp 
(governed by one master time sync unit in the gluster storage cluster), 
that would help determine what has changed.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Substitute for SMP?

2011-05-27 Thread Joe Landman

On 05/27/2011 07:12 AM, Jon Tegner wrote:

A general question, suppose I have a parallel application, using mpi,
where really fast access to the file system is critical.

Would it be stupid to consider a ram disk based setup? Say a 36 port QDR


Ram disks won't work directly, due to lack of locking in tmpfs.  You 
could create a tmpfs, then create a file that fills this up, then a 
loopback device pointing to that file, then build a file system atop 
that, and mount it.  And then mount gluster atop that.
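
Roughly, that stack looks like this (sizes and paths are made up):

   mount -t tmpfs -o size=32g tmpfs /mnt/ram
   dd if=/dev/zero of=/mnt/ram/backing.img bs=1M count=30720
   losetup /dev/loop0 /mnt/ram/backing.img
   mkfs.xfs /dev/loop0
   mkdir -p /bricks/ram0
   mount /dev/loop0 /bricks/ram0
   # ... then export /bricks/ram0 as the gluster brick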


Needless to say, all these layers significantly decrease performance and 
introduce inefficiencies.



infiniband with half of the ports connected to computational nodes and
the other half to gluster nodes?


There may be other options, but the options are not going to be 
cheap/inexpensive.  How fast, and by fast do you mean bandwidth and/or 
latency (e.g. streaming bandwidth or random IOPs)?  What does your IO 
profile look like?


You can get nodes that stream 4.6+ GB/s read, and 3.6+ GB/s writes for 
single readers/writers to single files.  For MPI jobs with single 
readers/writers, this is good.  For very large IO jobs where you need 
10's of GB/s, you probably need a more specific design to your problem.


Regards,

Joe


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] gluster 3.2.0 - totally broken?

2011-05-18 Thread Joe Landman

On 05/18/2011 03:27 PM, Udo Waechter wrote:


On 18.05.2011, at 19:13, Joe Landman wrote:



+1  Folks, get an account there, and report problems, even if you
haven't paid for support.

Second, if you haven't paid for support, and you are using it in a
production environment to either make money or support your
mission, please, help Gluster there as well.  They aren't doing
this project for their own health, they need to show a demand for
this in market from paying customers (just like Redhat, and every
other company).


Here, the question arises what the difference between the paid and
non-paid version is? Is there one? Or do paying customers only get


Yes.


the possibility to call support? If so, what good would it be if


Their issues get priority.


basic functionality does not work. Would one get instant bug-fixes?


There is no such thing as an instant bug fix, as I am sure you are aware.


Do paying customers get better documentation? Do they get a warning
about versions that obviously have problems? What I see from here:
http://www.gluster.com/services/ none of these services would
actually help if the software provided by the company is flawed like
gluster seems to be.

We are using mostly opensource software. In my experience it usually
makes no big difference whether one pays for a product or not. On the
contrary. The commercial software that we use gives us the feeling
that we could call someone and have our problems solved quickly. The
experience is the contrary. Usually one is treated like someone who
does not know a thing and problems or bugs usually do not get solved
quicker. Sometimes it gives us headaches that we do not have someone
to blame for a bug when these turn up in Opensource software. Most
projects that do care such grave bugs are solved quickly. Just our
experience... --udo.


I won't respond to this here, this is for Gluster Inc. to respond to.

I am trying to get someone from the company to hopefully spend a bit 
more time talking about the issues.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] gluster 3.2.0 - totally broken?

2011-05-18 Thread Joe Landman

On 05/18/2011 01:04 PM, Jeff Darcy wrote:

On 05/18/2011 11:09 AM, Burnash, James wrote:


[...]


As the leader for a project based on GlusterFS, I'm also very sensitive
to the stability issue. It is a bit disappointing when every major
release seems to be marked by significant regressions in existing
functionality. It's particular worrying when even community leaders and
distro packagers report serious problems and those take a long time to
resolve. I'd put you in that category, James, along with JoeJulian and
daMaestro just with respect to the 3.1.4 and 3.2 releases. Free or paid,
that's not a nice thing to do to your marquee users, and you're the kind
of people whose interest and support we can hardly afford to lose.  Even
I've been forced to take a second look at alternatives, and I'm just
about the biggest booster Gluster has who's not on their payroll.


Us as well ... we have a product that uses it as its base, so we are 
obviously strong proponents of it.



So how do we deal with these issues *constructively*? Not by
characterizing every release since 2.0.9 as "bogus" that's for sure.


Agreed.  Lets not get on this sort of track.  I expect issues with early 
revs, and I expect things to improve with each rev.  When we find bugs 
we do our best to submit them to bugs.gluster.com.  I'd suggest everyone 
get an account there, and submit your bugs.  Especially if you have a 
replicator.


[...]


The problem I do see, and I do agree with others who've spoken out here,
is primarily one of communication. It's a bit frustrating to see dozens
of geosync/marker/quota patches fly by while a report of a serious bug
isn't even *assigned* (let alone worked on as far as anyone can tell)
for days or even weeks. I can only imagine how it must be for the people
whose filesystems have been totally down for that long, whose bosses are
breathing down their necks and pointedly suggesting that a technology
switch might be in order. We can all help by making sure our bugs are
actually filed on bugs.gluster.com - not just mentioned here or on IRC -


+1  Folks, get an account there, and report problems, even if you 
haven't paid for support.


Second, if you haven't paid for support, and you are using it in a 
production environment to either make money or support your mission, 
please, help Gluster there as well.  They aren't doing this project for 
their own health, they need to show a demand for this in market from 
paying customers (just like Redhat, and every other company).



and by doing our part to provide the developers with the information
they need to reproduce or fix problems. We can help by actually testing
pre-release versions, particularly if our configurations/workloads are
likely to represent known gaps in Gluster's own test coverage. The devs
can help by marking bugs' status/assignment, severity/priority, and
found/fixed versions more consistently. The regression patterns in the
last few releases clearly indicate that more tests are needed in certain
areas such as RDMA and upgrades with existing data.

The key here is that if we want things to change we all need to make it
happen. We can't tell Gluster how to run their business, which includes
how they decide on features or how they allocate resources to new
features vs. bug fixes, but as a community we can give them clear and
unambiguous information about what is holding back more widespread
adoption. It used to be manageability; now it's bugs. We need to be as
specific as we possibly can about which bugs or shortcomings matter to
us, not just vague "it doesn't work" or "it's slow" or "it's not POSIX
enough" kinds of stuff, so that a concrete plan can be made to improve
the situation.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Client and server file "view", different results?! Client can't see the right file.

2011-05-17 Thread Joe Landman

On 05/17/2011 08:08 AM, Martin Schenker wrote:

This is an inherited system, I guess it was set up by hand. I guess I can
switch off these options, but the glusterd service will have to be
restarted, right?!?


Yes.


I'm also getting current error messages like these on the peer pair 3&5:

Pserver3
[2011-05-17 10:06:28.540355] E [rpc-clnt.c:199:call_bail]
0-storage0-client-2: bailing out frame type(GlusterFS 3.1) op(FINODELK(30))
xid = 0x805809xsent = 2011-05-17 09:36:18.393519. timeout = 1800


Hmmm ...  Looks like others have seen this before.  Error message 
suggests some sort of protocol error.


Its in a code path in rpc/rpc-lib/src/rpc-clnt.c and the function is 
named "call_bail".  This code looks like it is part of a timeout 
callback (I am guessing when it doesn't get a response in time, and the 
timer is hard coded to 10 seconds).  There is a note there with a TODO 
about making that configurable.


If the machine is under tremendous load, it is possible that a response 
is delayed more than 10 seconds, so that this portion of the code falls 
through to the timeout, rather than processing an rpc call).




Pserver5
[2011-05-17 10:02:23.738887] E [dht-common.c:1873:dht_getxattr]
0-storage0-dht: layout is NULL
[2011-05-17 10:02:23.738909] W [fuse-bridge.c:2499:fuse_xattr_cbk]
0-glusterfs-fuse: 489090: GETXATTR()
/images/2078/ebb83b05-3a83-9d18-ad8f-8542864da
6ef/hdd-images/21351 =>  -1 (No such file or directory)
[2011-05-17 10:02:23.738954] W [fuse-bridge.c:660:fuse_setattr_cbk]
0-glusterfs-fuse: 489091: SETATTR()
/images/2078/ebb83b05-3a83-9d18-ad8f-8542864da
6ef/hdd-images/21351 =>  -1 (Invalid argument)

Best, Martin

-Original Message-
From: gluster-users-boun...@gluster.org
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Joe Landman
Sent: Tuesday, May 17, 2011 1:54 PM
To: gluster-users@gluster.org
Subject: Re: [Gluster-users] Client and server file "view", different
results?! Client can't see the right file.

On 05/17/2011 01:43 AM, Martin Schenker wrote:

Yes, it is!

Here's the volfile:

cat  /mnt/gluster/brick0/config/vols/storage0/storage0-fuse.vol:

volume storage0-client-0
  type protocol/client
  option remote-host de-dc1-c1-pserver3
  option remote-subvolume /mnt/gluster/brick0/storage
  option transport-type rdma
  option ping-timeout 5
end-volume


Hmmm ... did you create these by hand or using the CLI?

I noticed quick-read and stat-cache on.  We recommend turning both of
them off.  We experienced many issues with them on (from gluster 3.x.y)




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Client and server file "view", different results?! Client can't see the right file.

2011-05-17 Thread Joe Landman

On 05/17/2011 01:43 AM, Martin Schenker wrote:

Yes, it is!

Here's the volfile:

cat  /mnt/gluster/brick0/config/vols/storage0/storage0-fuse.vol:

volume storage0-client-0
 type protocol/client
 option remote-host de-dc1-c1-pserver3
 option remote-subvolume /mnt/gluster/brick0/storage
 option transport-type rdma
 option ping-timeout 5
end-volume


Hmmm ... did you create these by hand or using the CLI?

I noticed quick-read and stat-cache on.  We recommend turning both of 
them off.  We experienced many issues with them on (from gluster 3.x.y)
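
If the volume was created with the CLI, the switches are per volume,
something like this (option names as of the 3.1.x CLI -- verify with
gluster volume info storage0 afterwards):

   gluster volume set storage0 performance.quick-read off
   gluster volume set storage0 performance.stat-prefetch off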


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] [SPAM?] Storage Design Overview

2011-05-11 Thread Joe Landman

On 05/11/2011 10:47 AM, Burnash, James wrote:

Hi Joe.

Your remarks are always useful and informative - thanks.


Thanks for the compliment!


As for our slow network throughput - what I didn't put in our
configuration is that our servers are on 10GBe, but all of our
clients are on 1Gbe because our core network can't (yet) handle the
load of 10GBe clients - just to fill in that data point.


Got it, that makes sense.  We see typically 30-80% of wire speed in 
connections (depending  upon contention and other things).  Your numbers 
fall into that range.



I find your remark about the stability of 3.1.3 reassuring -
considering the painful struggle to get there from the 3.0.4
versions, and the stability issues that I noted in the list over the


3.0.5 has been remarkably stable at customer sites ... no complaints. 
3.1.x has been a struggle until 3.1.2 and 3.1.3.  Ran head first into 
some 3.1.4 issues that I still cannot tell if they were migration issues 
or real bugs.  3.2.0 was a test effort that did not succeed internally, 
so we backed off for the moment.



course of my migration. Your problems with 3.1.4 and 3.2 are enough
reason to not do anymore upgrades to the production systems yet.



We always recommend staging upgrades on test machines if possible. 
Sometimes you get bitten by some nasty bits you were not expecting (not 
with Gluster per se, but with an odd interaction).



As a rule of thumb, I never implement X.0 releases into production
anyhow - even from Redhat ... I have the arrows still sticking out of
my back from doing so in the past :-)


Heh ...

I am pretty happy so far with Centos/RHEL 5.6.  We haven't tested the 
6.0 much yet.  Will do that soon.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] [SPAM?] Storage Design Overview

2011-05-11 Thread Joe Landman

On 05/11/2011 10:22 AM, Burnash, James wrote:

Standard disclaimers apply ... we build really fast storage systems and 
storage clusters and have a financial interest in these things, so take 
what we say in this context.



Answers inline below as well :)

Hope this helps.

James Burnash, Unix Engineering

*From:*gluster-users-boun...@gluster.org
[mailto:gluster-users-boun...@gluster.org] *On Behalf Of *Nyamul Hassan
*Sent:* Wednesday, May 11, 2011 10:04 AM
*To:* gluster-users@gluster.org
*Subject:* Re: [Gluster-users] [SPAM?] Storage Design Overview

Thank you for the prompt and insightful answer, James. My remarks are
inline.

1.Can we mount a GlusterFS on a client and expect it to provide
sustained throughput near wirespeed? 

In your scenario, what were the maximum read speeds that you observed?

 Read (using dd) approximately 60MB/sec to 100MB/sec.


Depends upon many things in a long chain ... network performance, local 
stack performance, remote disk performance, etc.


Our experience has been that the cause of a majority of the lower 
performing situations we have observed in self-designed systems, has 
been a significant (often severe and designed in) bottleneck, that 
actively prevents users from achieving anything more than moderate speed.


We have measured up to 700 MB/s for simple dd's over an SDR Infiniband 
network using gluster 3.1.3, and about 500 MB/s over 10GbE.  It is 
achievable, but you have to start with a good design.  Good designs 
aren't buzzword enabled ... there are methods to the madness as it were. 
  This is what we provide to our customer base.




3.Does it put extra pressure on the client?


Heavy IO will fill up work queue slots in the kernel.  This is true of 
every file system.




Thx for the insight. Can you describe your current deployment a bit
more, like configs of the storage nodes, and the client nodes, and what
type of application you are using it for? Don't want to be too
intrusive, just to get an idea on what others are doing.

All on Gluster 3.1.3


We are also at 3.1.3 in the lab after experiencing problems with 3.1.4 
and 3.2.0.  Have a few bugs filed.




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Best practice to stop the Gluster CLIENT process?

2011-05-06 Thread Joe Landman

On 05/06/2011 03:14 PM, Martin Schenker wrote:

So if I get this right, you'll have to rip the heart out (kill all gluster
processes; server AND client) in order to get to the local server
filesystem.

I had hoped that the client part could be left running (to the second mirror
brick) when doing repairs etc. Looks like a wrong assumption, I guess...


or use fuser/lsof to determine which process is locking which volume. 
Kill only that process.
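
e.g. (the mount point here is just an example):

   fuser -vm /mnt/glusterfs      # lists the processes holding the mount open
   lsof /mnt/glusterfs           # same information, more detail per process
   kill <pid-of-the-offender>    # then kill just that one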



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Best practice to stop the Gluster CLIENT process?

2011-05-06 Thread Joe Landman

On 05/06/2011 02:40 PM, Martin Schenker wrote:

Thanks for all the responses!

That's what I did, umount the client dir. But this STILL left the filesystem
locked... no luck here.

I'll try James' script next.


umount -l /mount/path
killall -15 glusterd
#wait a few seconds
killall -9 glusterd

then a 'ps -ealf| grep gluster' and kill the daemons by hand (-9).


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] GlusterFS Benchmarks

2011-05-04 Thread Joe Landman

On 05/04/2011 03:14 AM, Aleksanyan, Aleksandr wrote:

I test GlusterFS on this equipment:


[...]


Max Write: 1720.80 MiB/sec (1804.39 MB/sec)
Max Read: 1415.64 MiB/sec (1484.40 MB/sec)


hmmm ... seems low.  With 24 bricks we were getting ~10+ GB/s 2 years 
ago on the 2.0.x series of code.  You might have a bottleneck somewhere 
in the Fibre channel portion of things.



Run finished: Tue Oct 19 09:30:34 2010
Why is *read* < *write*?  Is that normal for GlusterFS?


Its generally normal for most cluster/distributed file systems that have 
any sort of write caching (RAID, brick OS write cache, etc.)


You can absorb the write into cache (16 units mean only 10GB ram 
required per unit to cache), and commit it later.


When we do testing on our units, we recommend using data sizes that far 
exceed any conceivable cache.  We regularly do single machine TB sized 
reads and writes (as well as cluster storage reads and writes in the 
1-20TB region) as part of our normal testing regimen.  We recommend 
reporting the non-cached performance numbers as that is what users will 
often see (as a nominal case).


Regards

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Split brain; which file to choose for repair?

2011-05-04 Thread Joe Landman

On 05/04/2011 08:24 AM, Martin Schenker wrote:

Hi all!

Is there anybody who can give some pointers regarding which file to choose
in a "split brain" condition?

What tests do I need to run?


MD5sums.  Did the logs indicate a split brain?  Or are the signatures 
simply different?
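
For example, on each brick server (using the path from your example; 
getfattr comes from the attr package):

md5sum /mnt/gluster/brick0/storage/pserver3-19
getfattr -d -m trusted.afr -e hex /mnt/gluster/brick0/storage/pserver3-19

If the checksums differ, the contents have actually diverged and it is not 
just the changelog xattrs disagreeing.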




What does the hex AFR code actually show? Is there a way to pinpoint the
"better/worse" file for deletion?

On pserver12:

# file: mnt/gluster/brick0/storage/pserver3-19
trusted.afr.storage0-client-5=0x3f01

On pserver13:

# file: mnt/gluster/brick0/storage/pserver3-19
trusted.afr.storage0-client-4=0xd701

These are test files, but I'd like to know what to do in a LIFE situation
which will be just around the corner.

The Timestamps show the same values, so I'm a bit puzzled HOW to choose a
file.


File sizes and time stamps the same?

Hmmm ... this sounds like an underlying caching issue (probably not 
flushed completely/properly on one or more of the units before reboot) 
with the base machine.  Check the battery backup  on the RAID and make 
sure it is functional.


Also, run a file system check on the underlying backend storage.
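
A read-only pass first is the safe way to start (device names are 
placeholders; unmount the brick before any real repair run):

xfs_repair -n /dev/sdX     # xfs: report problems, change nothing
e2fsck -fn /dev/sdX        # ext3/ext4: forced check, no modifications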

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] Hopefully answering some mirroring questions asked here and offline

2011-05-02 Thread Joe Landman

Hi folks

 We've fielded a number of mirroring questions offline as well as 
watched/participated in discussions here.  I thought it was important to 
make sure some of these are answered and searchable on the lists.


 One major question that kept arising was as follows:

q:  If I have a large image file (say a VM vmdk/other format) on a 
mirrored volume, will one small change of a few bytes result in a resync 
of the entire file?


a:  No.

To test this, we created a 20GB file on a mirror volume.

root@metal:/local2/home/landman# ls -alF /mirror1gfs/big.file
-rw-r--r-- 1 root root 21474836490 2011-05-02 12:44 /mirror1gfs/big.file

Then using the following quick and dirty Perl, we appended about 10-20 
bytes to the file.


#!/usr/bin/env perl

my $file=shift;
my $fh;
open($fh,">>".$file);
print $fh "end ".$$."\n";
close($fh);


root@metal:/local2/home/landman# ./app.pl /mirror1gfs/big.file

then I had to write a quick and dirty tail replacement, as I've 
discovered that tail doesn't seek ... (yeah, it started reading every 
'line' of that file ...)


#!/usr/bin/env perl

my $file=shift;
my $fh;
my $buf;

open($fh,"<".$file);
seek $fh,-200,2;
read $fh,$buf,200;
printf "buffer: \'%s\'\n",$buf;
close($fh);


root@metal:/local2/home/landman# ./tail.pl /mirror1gfs/big.file
buffer: 'end 19362'

While running the app.pl, I did not see any massive resyncs.  I had 
dstat running in another window.


You might say that this is irrelevant, as we only appended, and appends 
could be special-cased.


So I wrote a random updater that updated random spots throughout 
the large file (much like a VM vmdk and similar files).



#!/usr/bin/env perl

my $file=shift;
my $fh;
my $buf;
my @stat;
my $loc;

@stat = stat($file);
$loc=   int(rand($stat[7]));
open($fh,">>+".$file);
seek $fh,$loc,0;
printf $fh "I was here!!!";
printf "loc: %i\n",$loc;
close($fh);

root@metal:/local2/home/landman# ./randupd.pl /mirror1gfs/big.file
loc: 17598205436
root@metal:/local2/home/landman# ./randupd.pl /mirror1gfs/big.file
loc: 16468787891
root@metal:/local2/home/landman# ./randupd.pl /mirror1gfs/big.file
loc: 9271612568
root@metal:/local2/home/landman# ./randupd.pl /mirror1gfs/big.file
loc: 1356667302
root@metal:/local2/home/landman# ./randupd.pl /mirror1gfs/big.file
loc: 12365324308
root@metal:/local2/home/landman# ./randupd.pl /mirror1gfs/big.file
loc: 15654714313
root@metal:/local2/home/landman# ./randupd.pl /mirror1gfs/big.file
loc: 10127739152
root@metal:/local2/home/landman# ./randupd.pl /mirror1gfs/big.file
loc: 10259920623

and again, no massive resyncs.

So I think it's fairly safe to say that the concern over massive resyncs 
for small updates is not something we see in the field.


Regards,

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] "Self heal" or sync issues?

2011-04-28 Thread Joe Landman

On 04/28/2011 11:33 AM, Martin Schenker wrote:

With millions of files on a system this is a HUGE overhead. Running the


Using fam or similar, this overhead could be reduced.  But I'd expect 
that the file system should do this correctly.


What specifically was the shutdown/startup sequence?


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] "Self heal" or sync issues?

2011-04-28 Thread Joe Landman

On 04/28/2011 07:54 AM, Martin Schenker wrote:

Hi all!

We're running a 4 node cluster on Gluster 3.1.3 currently.

After a staged server update/reboot only ONE of the 4 servers shows some
mismatches in the file attributes. It shows that 28 files differ from
/0x/ the "all-in-sync" state. No sync or self
heal has happened within the last 16h, we checked last night, this
morning and now.

Even after opening each file with /od -c  | head -2/ the
self-heal/sync process doesn't seem to start.


I think you need something like an

touch `find /glusterfs/mount/point`

or similar to manually trigger the self-heal.
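
With millions of files the backtick form will blow past the argument list 
limit, so a streaming variant is safer (same placeholder mount point):

find /glusterfs/mount/point -noleaf -print0 | xargs --null stat > /dev/null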

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] "Gluster volume show" function?

2011-04-27 Thread Joe Landman

On 04/27/2011 02:11 PM, Mike Hanby wrote:

The way I understand it, "gluster volume info" will only show the
options that have changed. Unfortunately, there's no way to list ALL of
the options, whether they are still at default or have been modified.

There really should be a way to see all of the configurable options and
their current settings.

I think a ticket was already opened on this as a feature request.


I did an informal RFE a while ago.  Not sure if someone opened a formal 
ticket.




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Does gluster make use of a multicore setup? Hardware recs.?

2011-04-27 Thread Joe Landman

On 04/27/2011 04:39 AM, Martin Schenker wrote:

Hi all!

I'm new to the Gluster system and tried to find answers to some simple
questions (and couldn't find the information with Google etc.)

-does Gluster spread it's cpu load across a multicore environment? So


Yes, Gluster is multi-threaded.  You can tune the number of IO threads 
per brick.
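
For example, with the 3.1.x CLI (VOLNAME is a placeholder and 16 is just 
one plausible value):

gluster volume set VOLNAME performance.io-thread-count 16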



does it make sense to have 50 core units as Gluster server? CPU loads


No ... as the rate limiting factor will be the storage units themselves, 
and not the processing behind Gluster.  You could dedicate some cores to 
it, but at some point you are going to run out of IO bandwidth or IOP 
capability before you run out of threading.



seem to go up quite high during file system repairs so spreading /
multithreading should help? What kind of CPUs are working well? How much


That high load is often a result of an IO system that is under load, 
poorly tuned (or poorly designed for the workload).



memory does help the preformance?


Gluster will cache, so depending upon how much of your data is 
anticipated to be "hot", you can adjust from there.
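
For instance, the io-cache size can be bumped per volume (VOLNAME and the 
size are placeholders):

gluster volume set VOLNAME performance.cache-size 1GB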




-Are there any recommendations for commodity hardware? We're thinking of


Well, we are biased ... see .sig :)


36 slot 4U servers, what kind of controllers DO work well for IO speed?


We've had other customers use these, and they've had cooling issues, not 
to mention issues with expander performance.



Any real life experiences? Does it dramatically improve the performance
to increase the number of controllers per disk?


A good design can get you order of magnitude better performance than a 
poor design.  Lots of real world experience with this.




The aim is for a ~80-120T file system with 2-3 bricks.


Hmmm... going wide (more, smaller chassis) for larger scenarios is almost 
always a better move than consolidating into fewer, denser chassis.




Thanks for any feedback!

Best, Martin
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Performance

2011-04-26 Thread Joe Landman

On 04/26/2011 05:48 PM, Mohit Anchlia wrote:

I am not sure how valid this performance url is

http://www.gluster.com/community/documentation/index.php/Guide_to_Optimizing_GlusterFS

Does it make sense to separate out the journal and create mkfs -I 256?

Also, if I already have a file system on a different partition can I
still use it to store journal from other partition without corrupting
the file system?


Journals are small-write heavy.  You really want a raw device for them; 
you do not want file system caching underneath them.


Raw partition for an external journal is best.  Also, understand that 
ext* suffers badly under intense parallel loads.  Keep that in mind as 
you make your file system choice.
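
A sketch of an external ext4 journal setup (device names are placeholders, 
and both commands are destructive):

mke2fs -O journal_dev /dev/sdb1             # dedicate a raw partition to the journal
mkfs.ext4 -J device=/dev/sdb1 /dev/sda1     # point the data file system at it

The xfs equivalent would be mkfs.xfs -l logdev=/dev/sdb1,size=128m /dev/sda1, 
mounted with -o logdev=/dev/sdb1.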




On Thu, Apr 21, 2011 at 7:23 PM, Joe Landman
  wrote:

On 04/21/2011 08:49 PM, Mohit Anchlia wrote:


After lot of digging today finaly figured out that it's not really
using PERC controller but some Fusion MPT. Then it wasn't clear which


PERC is a rebadged LSI based on the 1068E chip.


tool it supports. Finally I installed lsiutil and was able to change
the cache size.

[root@dsdb1 ~]# lspci|grep LSI
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)


  This looks like PERC.  These are roughly equivalent to the LSI 3081 series.
  These are not fast units.  There is a variant of this that does RAID6; 
it's usually available as a software update or plugin module (button?) to this.
  I might be thinking of the 1078 chip though.

  Regardless, these are fairly old designs.



[root@dsdb1 ~]# dd if=/dev/zero of=/data/big.file bs=128k count=40k
oflag=direct
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s

I compared this with SW RAID mdadm that I created yesterday on one of
the servers and I get around 300MB/s. I will test out first with what
we have before destroying and testing with mdadm.


So the software RAID is giving you 300 MB/s and the hardware 'RAID' is
giving you ~181 MB/s?  Seems a pretty simple choice :)

BTW: The 300MB/s could also be a limitation of the PCIe channel interconnect
(or worse, if they hung the chip off a PCIx bridge).  The motherboard
vendors are generally loath to dedicate more than a few PCIe lanes to handling
SATA, Networking, etc.  So typically you wind up with very low powered
'RAID' and 'SATA/SAS' on the motherboard, connected by PCIe x2 or x4 at
most.  A number of motherboards have NICs that are served by a single PCIe
x1 link.


Thanks for your help that led me to this path. Another question I had
was when creating mdadm RAID does it make sense to use multipathing?


Well, for a shared backend over a fabric, I'd say possibly.  For an internal
connected set, I'd say no.  Given what you are doing with Gluster, I'd say
that the additional expense/pain of setting up a multipath scenario probably
isn't worth it.

Gluster lets you get many of these benefits at a higher level in the stack.
  Which to a degree, and in some use cases, obviates the need for
multipathing at a lower level.  I'd still suggest real RAID at the lower
level (RAID6, and sometimes RAID10 make the most sense) for the backing
store.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Client hang on find of directories

2011-04-25 Thread Joe Landman

On 04/25/2011 12:32 PM, Mohit Anchlia wrote:

There is something more to it. I am guessing.


Not really.  Stat-prefetch attempts to keep a copy of metadata local to 
a client for glusterfs mounts, and I think on the mountpoint server for 
the NFS clients.  Avoids the roundtrip stats if possible.  But it also 
means you have to institute some sort of cache coherency mechanism, so 
two different simultaneous accesses to the same file see the same 
metadata from two or more different servers.


I haven't gone into the code yet to look at it lately, but I suspect 
much of the issues we've been running into have been (likely) corner 
cases that the current code isn't handling well.  Turning off that 
translator seems to fix lots of our problems, so we ship systems with it 
off by default now.
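
For reference, the knob is a per-volume option (VOLNAME is a placeholder):

gluster volume set VOLNAME performance.stat-prefetch off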





--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Client hang on find of directories

2011-04-24 Thread Joe Landman

On 04/24/2011 03:03 PM, Burnash, James wrote:

Gluster 3.1.1

CentOS 5.5 (servers), CentOS 5.2 (client).

/pfs2 is the mount point for a Distributed-Replicate volume of 4 servers.

Given this command line executed on the client:

root@jc1lnxsamm46:/root # time find /pfs2/online_archive/2010 -type d -print

and this output:



[...]


Client was originally deployed as GlusterFS 3.0.4, that was uninstalled,
version 3.1.1 was installed, and then later upgraded to 3.1.3.

Any ideas on what is going on here?


Possibly multiple things.  It looks like a slow stat issue, compounded by a 
run-time link issue.  You might need to locate where your glusterfs 
installation lives and make sure its library directories are listed in a 
file such as /etc/ld.so.conf.d/gluster.conf, containing:


/usr/local/lib
/usr/local/lib64

then run

ldconfig -v

and then restart gluster.
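
In other words, something like (assuming a /usr/local install prefix):

printf '/usr/local/lib\n/usr/local/lib64\n' > /etc/ld.so.conf.d/gluster.conf
ldconfig -v | grep -i gluster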

As to the slow aspect, large directories with many stats will take a 
very long time.  At this point in time, we are turning off stat-prefetch 
and a number of other things by default in our deployments due to 
breakage.  This will negatively impact stat performance (requiring at 
least one round trip per stat), and show up as huge time delays in large 
directories.


It might help to turn up debugging on the servers, and pastebin the logs.



Thanks,

James Burnash

Unix Engineering






___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Performance

2011-04-21 Thread Joe Landman

On 04/21/2011 08:49 PM, Mohit Anchlia wrote:

After lot of digging today finaly figured out that it's not really
using PERC controller but some Fusion MPT. Then it wasn't clear which


PERC is a rebadged LSI based on the 1068E chip.


tool it supports. Finally I installed lsiutil and was able to change
the cache size.

[root@dsdb1 ~]# lspci|grep LSI
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)


 This looks like PERC.  These are roughly equivalent to the LSI 3081 
series.  These are not fast units.  There is a variant of this that does 
RAID6; it's usually available as a software update or plugin module 
(button?) to this.  I might be thinking of the 1078 chip though.


 Regardless, these are fairly old designs.



[root@dsdb1 ~]# dd if=/dev/zero of=/data/big.file bs=128k count=40k oflag=direct
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s

I compared this with SW RAID mdadm that I created yesterday on one of
the servers and I get around 300MB/s. I will test out first with what
we have before destroying and testing with mdadm.


So the software RAID is giving you 300 MB/s and the hardware 'RAID' is 
giving you ~181 MB/s?  Seems a pretty simple choice :)


BTW: The 300MB/s could also be a limitation of the PCIe channel 
interconnect (or worse, if they hung the chip off a PCIx bridge).  The 
motherboard vendors are generally loath to dedicate more than a few PCIe 
lanes to handling SATA, networking, etc.  So typically you wind up with 
very low powered 'RAID' and 'SATA/SAS' on the motherboard, connected by 
PCIe x2 or x4 at most.  A number of motherboards have NICs that are 
served by a single PCIe x1 link.



Thanks for your help that led me to this path. Another question I had
was when creating mdadm RAID does it make sense to use multipathing?


Well, for a shared backend over a fabric, I'd say possibly.  For an 
internal connected set, I'd say no.  Given what you are doing with 
Gluster, I'd say that the additional expense/pain of setting up a 
multipath scenario probably isn't worth it.


Gluster lets you get many of these benefits at a higher level in the 
stack.  Which to a degree, and in some use cases, obviates the need for 
multipathing at a lower level.  I'd still suggest real RAID at the lower 
level (RAID6, and sometimes RAID10 make the most sense) for the backing 
store.
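
A bare mdadm sketch of those layouts (device names and counts are 
placeholders; both commands destroy whatever is on the disks):

mdadm --create /dev/md0 --level=6  --chunk=512 --raid-devices=8 /dev/sd[b-i]
mdadm --create /dev/md1 --level=10 --chunk=512 --raid-devices=8 /dev/sd[k-r]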



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] has anyone seen the impact of "running out of space" on a gluster volume?

2011-04-21 Thread Joe Landman
It looks like there are several possible modes, and I am wondering what 
the response to these modes are with gluster.


1) a brick runs out of space during a write.  While there is more space 
available elsewhere in the unit, this one brick is full.


2) the aggregate file system runs out of space.  This is a slight 
variation on the above, in that an allocation should occur on the least 
full brick at some point (when is that crossover?)


So, what should a user that is trying to do a write, where the write 
exceeds the size of the brick, observe in terms of an error return?


ENOSPC? something like this?

A customer just reported a hang for a 100TB file system after filling 
it.  Basically df and related hung.  They are using the native client 
due to some unresolved issues on the NFS server side.  This may be part of 
the issue as well, as I'd expect the NFS server to be somewhat more 
likely to report errors people expect.


Anyone run into this?  We are going to do some experimentation here 
before figuring out if this is something that warrants a bug report, but 
I wanted to see if someone else had seen something like this.


Joe


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Performance

2011-04-20 Thread Joe Landman

On 04/20/2011 07:50 PM, Mohit Anchlia wrote:

I did that but it looks the same. I did get an error even though it
says write-caching is on.

[root@dslg1 ~]# hdparm -W1 /dev/sda

/dev/sda:
  setting drive write-caching to 1 (on)
  HDIO_DRIVE_CMD(setcache) failed: Invalid argument
[root@dslg1 ~]# hdparm /dev/sda


You might need sdparm

sdparm -a /dev/sda | grep WCE

With WCE on I see

[root@smash ~]# sdparm -a /dev/sda | grep WCE
  WCE 1

and with it off, I see

[root@smash ~]# hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

[root@smash ~]# sdparm -a /dev/sda | grep WCE
  WCE 0

You might need to change WCE using

sdparm --set=WCE -a /dev/sda

or similar ...



/dev/sda:
  readonly =  0 (off)
  readahead= 256 (on)
  geometry = 36472/255/63, sectors = 585937500, start = 0
[root@dslg1 ~]#
[root@dslg1 ~]# dd if=/dev/zero of=/dev/sda bs=128k count=1k oflag=direct
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 8.10005 seconds, 16.6 MB/s


On Wed, Apr 20, 2011 at 5:45 PM, Joe Landman
  wrote:

On 04/20/2011 07:28 PM, Mohit Anchlia wrote:


dd of=/dev/null if=/dev/sda bs=128k count=80k iflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 83.8293 seconds, 128 MB/s


Ok, this is closer to what I was expecting (really ~150 MB/s would make more
sense to me, but I can live with 128 MB/s).

The write speed is definitely problematic.  I am wondering if write cache is
off, and other features are turned off in strange ways.

This is a 2 year old SATA disk

[root@smash ~]# dd if=/dev/zero of=/dev/sda2 bs=128k oflag=direct
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 20.8322 s, 103 MB/s

Write cache is enabled.  Turning write cache off (might not be so relevant
for a RAID0),

[root@smash ~]# hdparm -W /dev/sda

/dev/sda:
  write-caching =  1 (on)
[root@smash ~]# hdparm -W0 /dev/sda

/dev/sda:
  setting drive write-caching to 0 (off)
  write-caching =  0 (off)

[root@smash ~]# dd if=/dev/zero of=/dev/sda2 bs=128k oflag=direct
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 155.636 s, 13.8 MB/s

See if you can do an

hdparm -W1 /dev/sda

and see if it has any impact on the write speed.  If you are using a RAID0,
safety isn't so much on your mind anyway, so you can see if you can adjust
your cache settings.  If this doesn't work, you might need to get to the
console and tell it to allow caching.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Performance

2011-04-20 Thread Joe Landman

On 04/20/2011 07:28 PM, Mohit Anchlia wrote:

dd of=/dev/null if=/dev/sda bs=128k count=80k iflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 83.8293 seconds, 128 MB/s


Ok, this is closer to what I was expecting (really ~150 MB/s would make 
more sense to me, but I can live with 128 MB/s).


The write speed is definitely problematic.  I am wondering if write 
cache is off, and other features are turned off in strange ways.


This is a 2 year old SATA disk

[root@smash ~]# dd if=/dev/zero of=/dev/sda2 bs=128k oflag=direct
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 20.8322 s, 103 MB/s

Write cache is enabled.  Turning write cache off (might not be so 
relevant for a RAID0),


[root@smash ~]# hdparm -W /dev/sda

/dev/sda:
 write-caching =  1 (on)
[root@smash ~]# hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

[root@smash ~]# dd if=/dev/zero of=/dev/sda2 bs=128k oflag=direct
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 155.636 s, 13.8 MB/s

See if you can do an

hdparm -W1 /dev/sda

and see if it has any impact on the write speed.  If you are using a 
RAID0, safety isn't so much on your mind anyway, so you can see if you 
can adjust your cache settings.  If this doesn't work, you might need to 
get to the console and tell it to allow caching.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Performance

2011-04-20 Thread Joe Landman

On 04/20/2011 06:49 PM, Mohit Anchlia wrote:

Numbers look very disappointing. I destroyed RAID0. Does it mean disks
on all the servers are bad?


[...]


[root@dslg1 ~]# dd if=/dev/zero of=/dev/sda bs=128k count=80k oflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 572.117 seconds, 18.8 MB/s


how about the read?



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Performance

2011-04-20 Thread Joe Landman

On 04/20/2011 05:43 PM, paul simpson wrote:

many thanks for sharing guys.  an informative read indeed!

i've 4x dells - each running 12 drives on PERC 600.  was disappointed to
hear they're so bad!  we never got round to doing intensive tests this
in depth.  12x2T WD RE4 (sata) is giving me ~600MB/s write on the bare
filesystem.  joe, does that tally with your expectations for 12 SATA
drives running RAID6?  (i'd put more faith in your gut reaction than our
last tests...)  ;)


Hmmm ... I always put faith in the measurements ...

Ok, 600MB/s for 12 drives seems low, but they are WD drives (which is 
another long subject for us).


This means you are getting about 60 MB/s per drive write on the bare 
file system on drives that are at least (in theory) able to get (nearly) 
double that.


This is in line with what I expect from these units, towards the higher 
end of the range (was this direct or cached IO, BTW?).  Most of our 
customers never see more than about 300-450 MB/s out of their PERCs with 
direct IO (actual performance measurement).


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Performance

2011-04-20 Thread Joe Landman

On 04/20/2011 03:43 PM, Mohit Anchlia wrote:

Thanks! Is there any recommended configuration you want me to use when
using mdadm?

I got this link:

http://tldp.org/HOWTO/Software-RAID-HOWTO-5.html#ss5.1


First things first, break the RAID0, and then let's measure performance 
per disk, to make sure nothing else bad is going on.


dd if=/dev/zero of=/dev/DISK bs=128k count=80k oflag=direct
dd of=/dev/null if=/dev/DISK bs=128k count=80k iflag=direct

for /dev/DISK being one of the drives in your existing RAID0.  Once we 
know the raw performance, I'd suggest something like this


# --level is required; 0 here gives a 4-way stripe, matching sw=4 below
mdadm --create /dev/md0 --metadata=1.2 --chunk=512 --level=0 \
--raid-devices=4 /dev/DISK1 /dev/DISK2 \
 /dev/DISK3 /dev/DISK4
mdadm --examine --scan | grep "md\/0" >> /etc/mdadm.conf

then

dd if=/dev/zero of=/dev/md0 bs=128k count=80k oflag=direct
dd of=/dev/null if=/dev/md0 bs=128k count=80k iflag=direct

and lets see how it behaves.  If these are good, then

mkfs.xfs -l version=2 -d su=512k,sw=4,agcount=32 /dev/md0

(yeah, I know, gluster folk have a preference for ext* ... we generally 
don't recommend ext* for anything other than OS drives ... you might 
need to install xfsprogs and the xfs kernel module ... which kernel are 
you using BTW?)


then

mount -o logbufs=4,logbsize=64k /dev/md0 /data
mkdir stress


dd if=/dev/zero of=/data/big.file bs=128k count=80k oflag=direct
dd of=/dev/null if=/data/big.file bs=128k count=80k iflag=direct

and see how it handles things.

When btrfs finally stabilizes enough to be used, it should be a 
reasonable replacement for xfs, but this is likely to be a few years.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

