Re: [Gluster-users] 3.1.1 crashing under moderate load

2010-12-06 Thread Lana Deere
OK, thanks.  If I have further follow-ups I'll attach them to that bug report.

.. Lana (lana.de...@gmail.com)






On Mon, Dec 6, 2010 at 9:46 PM, Raghavendra G  wrote:
> Hi Lana,
>
> We are able to reproduce the issue locally and are working on a fix for it.
> Progress of the bug can be tracked at:
> http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2197.
>
> Thanks for your inputs.
>
> regards,
> - Original Message -
> From: "Lana Deere" 
> To: "Raghavendra G" 
> Cc: gluster-users@gluster.org
> Sent: Tuesday, December 7, 2010 12:20:08 AM
> Subject: Re: [Gluster-users] 3.1.1 crashing under moderate load
>
> One other observation is that it seems to be genuinely related to the
> number of nodes involved.
>
> If I run, say, 50 instances of my script using 50 separate nodes, then
> they almost always generate some failures.
>
> If I run the same number of instances, or even a much greater number,
> but using only 10 separate nodes, then they seem always to work OK.
>
> Maybe this is due to some kind of caching behaviour?
>
> .. Lana (lana.de...@gmail.com)
>
>
>
>
>
>
> On Mon, Dec 6, 2010 at 11:05 AM, Lana Deere  wrote:
>> The gluster configuration is distribute; there are 4 server nodes.
>>
>> There are 53 physical client nodes in my setup, each with 8 cores; we
>> want to sometimes run more than 400 client processes simultaneously.
>> In practice we aren't yet trying that many.
>>
>> When I run the commands which break, I am running them on separate
>> clients simultaneously.
>>    for host in ; do ssh $host script& done  # Note the &
>> When I run on 25 clients simultaneously so far I have not seen it
>> fail.  But if I run on 40 or 50 simultaneously it often fails.
>>
>> Sometimes I have run more than one command on each client
>> simultaneously by listing all the hosts multiple times in the
>> for-loop,
>>   for host in   ; do ssh $host script& done
>> In the case of 3 at a time, I have noticed that when a host works, all
>> three on that client will work; but when it fails, all three fail in
>> exactly the same fashion.
>>
>> I've attached a tarfile containing two sets of logs.  In both cases I
>> had rotated all the log files and rebooted everything then run my
>> test.  In the first set of logs, I went directly to approx. 50
>> simultaneous sessions, and pretty much all of them just hung.  (When
>> the find hangs, even a kill -9 will not unhang it.)  So I rotated the
>> logs again and rebooted everything, but this time I gradually worked
>> my way up to higher loads.  This time the failures were mostly cases
>> with the wrong checksum but no error message, though some of them did
>> give me errors like
>>    find: lib/kbd/unimaps/cp865.uni: Invalid argument
>>
>> Thanks.  I may try downgrading to 3.1.0 just to see if I have the same
>> problem there.
>>
>>
>> .. Lana (lana.de...@gmail.com)
>>
>>
>>
>>
>>
>>
>> On Mon, Dec 6, 2010 at 12:30 AM, Raghavendra G  
>> wrote:
>>> Hi Lana,
>>>
>>> I need some clarifications about test setup:
>>>
>>> * Are you seeing the problem when there are more than 25 clients? If this is
>>> the case, are these clients on different physical nodes, or does more than
>>> one client share the same node? In other words, on how many physical nodes
>>> are the clients mounted in your test setup? Also, are you running the
>>> command on each of these clients simultaneously?
>>>
>>> * Or is it that there are more than 25 concurrent invocations of the
>>> script? If this is the case, how many clients are present in your test
>>> setup, and on how many physical nodes are these clients mounted?
>>>
>>> regards,
>>> - Original Message -
>>> From: "Lana Deere" 
>>> To: gluster-users@gluster.org
>>> Sent: Saturday, December 4, 2010 12:13:30 AM
>>> Subject: [Gluster-users] 3.1.1 crashing under moderate load
>>>
>>> I'm running GlusterFS 3.1.1, CentOS5.5 servers, CentOS5.4 clients, RDMA
>>> transport, native/fuse access.
>>>
>>> I have a directory which is shared on the gluster.  In fact, it is a clone
>>> of /lib from one of the clients, shared so all can see it.
>>>
>>> I have a script which does
>>>    find lib -type f -print0 | xargs -0 sum | md5sum
>>>
>>> If I run this on my clients one at a time, they all yield the same md5sum:
>>>    for host in <>; do ssh $host script; done
>>>
>>> If I run this on my clients concurrently, up to roughly 25 at a time they
>>> still yield the same md5sum.
>>>    for host in <>; do ssh $host script& done
>>>
>>> Beyond that the gluster share often, but not always, fails.  The errors 
>>> vary.
>>>    - sometimes I get "sum: xxx.so not found"
>>>    - sometimes I get the wrong checksum without any error message
>>>    - sometimes the job simply hangs until I kill it
>>>
>>>
>>> Some of the server logs show messages like these from the time of the
>>> failures (other servers show nothing from around that time):
>>>
>>> [2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
>>> rpc-transport/rdma: rdma.R

[Gluster-users] AFR Question

2010-12-06 Thread Joshua Saayman
Referring to 
http://www.gluster.com/community/documentation/index.php/Setting_up_AFR_on_two_servers_with_client_side_replication

Was this feature changed in 3.1? Is there a reason to prefer
server-side AFR to client-side?

Thanks
Joshua
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] Setting up the glusterFs on a local machine or a cluster of two machines

2010-12-06 Thread PradeepKumar Penchala
Hello,

Can someone please help me set up the Gluster file system on either a
cluster or a local machine, with the necessary details of the minimum system
requirements?

I am finding it hard to work this out from the Gluster site, as the details
are above my level.
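
For orientation, a minimal two-machine GlusterFS 3.1 setup of the kind being
asked about might look roughly like the sketch below; the hostnames (server1,
server2), brick paths and the volume name "testvol" are placeholders:

    # On both machines: install the glusterfs server packages and make sure
    # the glusterd management daemon is running.
    # On server1, add the second machine to the trusted storage pool:
    gluster peer probe server2

    # Create and start a two-brick replicated volume (brick paths are examples):
    gluster volume create testvol replica 2 transport tcp \
        server1:/data/brick1 server2:/data/brick1
    gluster volume start testvol

    # On any client (it can be one of the servers), mount the volume with
    # the native FUSE client:
    mount -t glusterfs server1:/testvol /mnt/testvol

For a single machine, the same commands work with a single brick and without
the replica option.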


I am really awaiting your kind guidance, which would be of immense help for
my work.


Thanks and Regards,
Pradeep
MT2009104
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] performance stops at 1Gb

2010-12-06 Thread P.Gotwalt
Good idea with the dd. Here are the results:

 

dd if=/dev/zero of=/gluster/file bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.53609 seconds, 113 MB/s

dd if=/gluster/file of=/dev/null bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.93758 seconds, 108 MB/s

 

My configuration:

]# mount
..
glusterfs#node70:/stripevol on /gluster type fuse
(rw,allow_other,default_permissions,max_read=131072)

]# ssh node70 gluster volume info stripevol
Volume Name: stripevol
Type: Stripe
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: node70.storage.surfnet.nl:/data1
Brick2: node80.storage.surfnet.nl:/data1
Brick3: node90.storage.surfnet.nl:/data1
Brick4: node100.storage.surfnet.nl:/data1
]#
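
(For reference, a 4-brick striped volume like the one shown above would
normally be created with the 3.1 CLI roughly as follows; this is a sketch
reconstructed from the volume info output, not the exact command used:)

    gluster volume create stripevol stripe 4 transport tcp \
        node70.storage.surfnet.nl:/data1 node80.storage.surfnet.nl:/data1 \
        node90.storage.surfnet.nl:/data1 node100.storage.surfnet.nl:/data1
    gluster volume start stripevol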

 

To be sure my 10Gb nic is working:

 

]# ethtool eth2
Settings for eth2:
Supported ports: [ FIBRE ]
Supported link modes:   10000baseT/Full
Supports auto-negotiation: No
Advertised link modes:  10000baseT/Full
Advertised auto-negotiation: No
Speed: 10000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: external
Auto-negotiation: off
Supports Wake-on: d
Wake-on: d
Current message level: 0x0007 (7)
Link detected: yes

 

 

I used bonnie++ for my tests and used a blocksize of 32KB

 

Peter

 

From: Anand Avati [mailto:anand.av...@gmail.com]
Sent: 03 December 2010 14:03
To: Gotwalt, P.
CC: gluster-users@gluster.org
Subject: Re: [Gluster-users] performance stops at 1Gb

 

Do both read and write throughput peak at 1Gbit/s? What is the block size
used for performing I/O? Can you get the output of -

 

1. dd if=/dev/zero of=/mnt/stripe/file bs=1M count=1K

 

2. dd if=/mnt/stripe/file of=/dev/null bs=1M count=1K

 

Just one instance of dd is enough as the client network interface (10Gbit/s)
has enough juice to saturate 4x1Gbit servers.

 

Avati

On Fri, Dec 3, 2010 at 6:06 PM, Gotwalt, P.  wrote:

Craig,

Using multiple parallel bonnie++ benchmarks (4, 8, 16) does use several
files. These files are 1GB each, and we take care there will be at least
32 of them. As we have multiple processes (4, 8, 16 bonnie++s) and each
uses several files, we spread the I/O over different storage bricks. I
can see this when monitoring network and disk activity on the bricks.
For example: when bonnie++ does block reads/writes on a striped (4
bricks) volume, I notice that the load of the client (network throughput)
is evenly spread over the 4 nodes. These nodes have enough CPU, memory,
network and disk resources left! The accumulated throughput doesn't get
over the 1 Gb.
The 10Gb nic at the client is set to fixed 10Gb, full duplex. All the
nics on the storage bricks are 1Gb, fixed, full duplex. The 10Gb client
(dual quad-core, 16GB) has plenty of resources to run 16 bonnie++s in
parallel. We should be able to get more than this 1Gb throughput,
especially with a striped volume.

What kind of benchmarks do you run? And with what kind of setup?

Peter




> Peter -
> Using Gluster the performance of any single file is going to be
> limited to the performance of the server on which it exists, or in the
> case of a striped volume of the server on which the segment of the file
> you are accessing exists. If you were able to start 4 processes,
> accessing different parts of the striped file, or lots of different
> files in a distribute cluster you would see your performance increase
> significantly.

> Thanks,

> Craig

> -->
> Craig Carl
> Senior Systems Engineer
> Gluster
>
>
> On 11/26/2010 07:57 AM, Gotwalt, P. wrote:
> > Hi All,
> >
> > I am doing some tests with gluster (3.1) and have a problem of not
> > getting higher throughput than 1 Gb (yes bit!) with 4 storage bricks.
> > My setup:
> >
> > 4 storage bricks (dualcore, 4GB mem) each with 3 sata 1Tb disks,
> > connected to a switch with 1 Gb nics.  In my tests I only use 1 SATA
> > disk as a volume, per brick.
> > 1 client (2xquad core, 16 GB mem) with a 10Gb nic to the same switch as
> > the bricks.
> >
> > When using striped or distributed configurations, with all 4 bricks
> > configured to act as a server, the performance will never be higher than
> > just below 1 Gb! I tested with 4, 8 and 16 parallel bonnie++ runs.
> >
> > The idea is that parallel bonnie's create enough files to get
> > distributed over the storage bricks. And all these bonnie's will deliver
> > enough throughput to fill up this 10Gb line. I expect the throughput to
> > be maximum 4Gb because that's the maximum the 4 storage bricks together
> > can produce.
> >
> > I also tested the throughput of the network with iperf3 and got:
> > - 5Gb to a second temporary client on another switch 200 Km from my
> > site, connected with a 5Gb fiber
> > - 908-920 Mb to the interfaces of the bricks.
> > So the network seems ok.
> >
> > Can someone advise me on why I 

Re: [Gluster-users] Gluster 3.1 bailout error

2010-12-06 Thread Samuel Hassine
I have the same errors sometimes.

I attach my entire client and server log files.

-----Original Message-----
From: gluster-users-boun...@gluster.org
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Raghavendra G
Sent: Monday, 6 December 2010 08:55
To: Matt Keating
Cc: gluster-users@gluster.org
Subject: Re: [Gluster-users] Gluster 3.1 bailout error

Hi Matt,

Can you attach entire client and server log files?

regards,
- Original Message -
From: "Matt Keating" 
To: gluster-users@gluster.org
Sent: Sunday, December 5, 2010 3:39:34 AM
Subject: [Gluster-users] Gluster 3.1 bailout error

Hi,

I've got a GlusterFS share serving web pages and I'm finding that imagecache
isn't always able to create new files on the mount.
Since upgrading to GlusterFS 3.1, I'm having a lot of these errors appearing
in the logs:

logs/EBS-drupal-shared-.log:[2010-11-29 10:29:18.141045] E
[rpc-clnt.c:199:call_bail] drupal-client-0: bailing out frame type(GlusterFS
3.1) op(FINODELK(30)) xid = 0xb69cc sent = 2010-11-29 09:59:10.112834.
timeout = 1800
logs/EBS-drupal-shared-.log:[2010-11-29 10:42:58.365735] E
[rpc-clnt.c:199:call_bail] drupal-client-0: bailing out frame type(GlusterFS
3.1) op(FINODELK(30)) xid = 0xb863e sent = 2010-11-29 10:12:54.584124.
timeout = 1800
logs/EBS-drupal-shared-.log:[2010-11-29 12:00:02.572679] E
[rpc-clnt.c:199:call_bail] drupal-client-0: bailing out frame type(GlusterFS
3.1) op(FINODELK(30)) xid = 0xbe36f sent = 2010-11-29 11:29:57.497653.
timeout = 1800
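
For context, the "timeout = 1800" in these messages is the RPC frame timeout
(default 1800 seconds): the client is giving up on replies that never arrived.
If one wanted to experiment with that timeout, which only changes when the
bail-out fires rather than the underlying cause, the volume option is
network.frame-timeout; "drupal-shared" below is only a placeholder volume name:

    gluster volume set drupal-shared network.frame-timeout 1800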


Could anyone shed any light on what's happening/wrong?

Thanks,
Matt

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] request for help on performance issue on using glusterfs via cifs

2010-12-06 Thread prometheus...@hotmail.com

Hi list,

At the end of this mail I'll post my configuration.

What's the environment?
My current purpose is to export 4 SATA disks as one big storage pool. All
disks are attached to the same host.
Since I want Windows users to be able to access the data as well, I created
a server.vol and a client.vol, and thus I have 2 glusterfs processes running:
one providing the glusterfs server that exports the disks, and one mounting
this export on localhost so I can re-export the mount via Samba.

Some numbers:
disk access via dd: write 17MB/s, read 50MB/s
file transfer from a client to a plain Samba share only, without glusterfs
in between but to the same destination disk: write 7MB/s, read 12MB/s
file transfer from a client via Samba and glusterfs: write 1.8MB/s, read 8MB/s

Although these numbers aren't really mind-blowing, I had really thought I
would get near the plain Samba values when using glusterfs. I expected
something like write 5MB/s, read 11MB/s or so; then I could tune other
stuff, but gluster seems to be priority one here.

I'm using glusterfs 3.0.2 because newer versions seem to have a memory leak
with my configuration. Anyway, I don't think the performance problem would
be solved by upgrading, since I already tried this and the numbers stayed
the same. I also tried using FTP via glusterfs, resulting in the same speed
as Samba via gluster.

So I think my configuration is just bad.

I also read about the possibility to provide server and client within a
single process, but didn't find any useful documentation on how I need to
start glusterfs or how I have to write the configuration. That's why I'm
using 2 glusterfs processes.

Because I don't think the issue can be found on the hardware side or in
Samba itself, I would be glad if someone could provide suggestions for a
better configuration AND also explain why the suggestion is better.

I also tried many performance translators, none of which really gave much
benefit, so I turned back to the most basic config to start with.

Hardware: VIA mainboard with a VIA C7 processor at 1800MHz and 1024MB RAM.

Thanks in advance,
jd


 client.vol
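# Overview: the four protocol/client volumes below all connect to the same
# local server (127.0.0.1:6996), one per exported disk; they are aggregated
# by the cluster/distribute volume and topped with an io-threads translator
# named "Data", which is the volume that is mounted and re-exported via Samba.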

volume node_ab8f19a5-c187-4b7e-bd2a-7781f646b3a8
 type protocol/client
# unix domain socket isnt faster for this config
# option transport-type unix
# option transport.socket.connect-path /tmp/.glusterfs.server
 option transport-type tcp # for TCP/IP transport
 option remote-host 127.0.0.1
 option remote-port 6996
 option remote-subvolume Data_ab8f19a5-c187-4b7e-bd2a-7781f646b3a8
end-volume


volume node_bf8adbcf-8c6d-48f1-a622-c26cb3792c49
 type protocol/client
# option transport-type unix
# option transport.socket.connect-path /tmp/.glusterfs.server
 option transport-type tcp # for TCP/IP transport
 option remote-host 127.0.0.1
 option remote-port 6996
 option remote-subvolume Data_bf8adbcf-8c6d-48f1-a622-c26cb3792c49
end-volume


volume node_839dc61d-c7df-4630-b375-b1f86ee0ace9
 type protocol/client
# option transport-type unix
# option transport.socket.connect-path /tmp/.glusterfs.server
 option transport-type tcp # for TCP/IP transport
 option remote-host 127.0.0.1
 option remote-port 6996
 option remote-subvolume Data_839dc61d-c7df-4630-b375-b1f86ee0ace9
end-volume


volume node_4c3c37f0-0ebc-46ff-9f37-0fa3dac56560
 type protocol/client
# option transport-type unix
# option transport.socket.connect-path /tmp/.glusterfs.server
 option transport-type tcp # for TCP/IP transport
 option remote-host 127.0.0.1
 option remote-port 6996
 option remote-subvolume Data_4c3c37f0-0ebc-46ff-9f37-0fa3dac56560
end-volume


volume distributeData
  type cluster/distribute
  subvolumes  node_ab8f19a5-c187-4b7e-bd2a-7781f646b3a8 
node_bf8adbcf-8c6d-48f1-a622-c26cb3792c49 
node_839dc61d-c7df-4630-b375-b1f86ee0ace9 
node_4c3c37f0-0ebc-46ff-9f37-0fa3dac56560

end-volume
volume Data
  type performance/io-threads
  option thread-count 16
  subvolumes distributeData
end-volume



   server.vol
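# Overview: one storage/posix volume plus a features/locks volume per disk;
# the features/locks volumes (Data_<uuid>) are the remote-subvolumes that
# the client.vol above connects to.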

volume posix_ab8f19a5-c187-4b7e-bd2a-7781f646b3a8
  type storage/posix
  option directory /media/ab8f19a5-c187-4b7e-bd2a-7781f646b3a8/storage/Data
end-volume
volume Data_ab8f19a5-c187-4b7e-bd2a-7781f646b3a8
  type features/locks
  subvolumes posix_ab8f19a5-c187-4b7e-bd2a-7781f646b3a8
end-volume


volume posix_bf8adbcf-8c6d-48f1-a622-c26cb3792c49
  type storage/posix
  option directory /media/bf8adbcf-8c6d-48f1-a622-c26cb3792c49/storage/Data
end-volume
volume Data_bf8adbcf-8c6d-48f1-a622-c26cb3792c49
  type features/locks
  subvolumes posix_bf8adbcf-8c6d-48f1-a622-c26cb3792c49
end-volume


volume posix_839dc61d-c7df-4630-b375-b1f86ee0ace9
  type storage/posix
  option directory /media/839dc61d-c7df-4630-b375-b1f86ee0ace9/storage/Data
end-volume
volume Data_839dc61d-c7df-4630-b375-b1f86ee0ace9
  type features/locks
  subvolumes posix_839dc61d-c7df-4630-b375-b1f86ee0ace9
end-volume


volume posix_4c3c37f0-0ebc-46ff-9f37-0fa3dac56560
  type storage/posix
  option directory /media/4c3c37f0-0ebc-46ff-9f37-0fa3dac56560/storage/Data
end-volume
volume Data_4c3c37f0-0ebc-46ff-9f37-0fa3dac56560
  type featur

Re: [Gluster-users] 3.1.1 crashing under moderate load

2010-12-06 Thread Raghavendra G
Hi Lana,

We are able to reproduce the issue locally and are working on a fix for it.
Progress of the bug can be tracked at:
http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2197.

Thanks for your inputs.

regards,
- Original Message -
From: "Lana Deere" 
To: "Raghavendra G" 
Cc: gluster-users@gluster.org
Sent: Tuesday, December 7, 2010 12:20:08 AM
Subject: Re: [Gluster-users] 3.1.1 crashing under moderate load

One other observation is that it seems to be genuinely related to the
number of nodes involved.

If I run, say, 50 instances of my script using 50 separate nodes, then
they almost always generate some failures.

If I run the same number of instances, or even a much greater number,
but using only 10 separate nodes, then they seem always to work OK.

Maybe this is due to some kind of caching behaviour?

.. Lana (lana.de...@gmail.com)






On Mon, Dec 6, 2010 at 11:05 AM, Lana Deere  wrote:
> The gluster configuration is distribute; there are 4 server nodes.
>
> There are 53 physical client nodes in my setup, each with 8 cores; we
> want to sometimes run more than 400 client processes simultaneously.
> In practice we aren't yet trying that many.
>
> When I run the commands which break, I am running them on separate
> clients simultaneously.
>    for host in ; do ssh $host script& done  # Note the &
> When I run on 25 clients simultaneously so far I have not seen it
> fail.  But if I run on 40 or 50 simultaneously it often fails.
>
> Sometimes I have run more than one command on each client
> simultaneously by listing all the hosts multiple times in the
> for-loop,
>   for host in   ; do ssh $host script& done
> In the case of 3 at a time, I have noticed that when a host works, all
> three on that client will work; but when it fails, all three fail in
> exactly the same fashion.
>
> I've attached a tarfile containing two sets of logs.  In both cases I
> had rotated all the log files and rebooted everything then run my
> test.  In the first set of logs, I went directly to approx. 50
> simultaneous sessions, and pretty much all of them just hung.  (When
> the find hangs, even a kill -9 will not unhang it.)  So I rotated the
> logs again and rebooted everything, but this time I gradually worked
> my way up to higher loads.  This time the failures were mostly cases
> with the wrong checksum but no error message, though some of them did
> give me errors like
>    find: lib/kbd/unimaps/cp865.uni: Invalid argument
>
> Thanks.  I may try downgrading to 3.1.0 just to see if I have the same
> problem there.
>
>
> .. Lana (lana.de...@gmail.com)
>
>
>
>
>
>
> On Mon, Dec 6, 2010 at 12:30 AM, Raghavendra G  
> wrote:
>> Hi Lana,
>>
>> I need some clarifications about test setup:
>>
>> * Are you seeing the problem when there are more than 25 clients? If this is
>> the case, are these clients on different physical nodes, or does more than
>> one client share the same node? In other words, on how many physical nodes
>> are the clients mounted in your test setup? Also, are you running the
>> command on each of these clients simultaneously?
>>
>> * Or is it that there are more than 25 concurrent invocations of the
>> script? If this is the case, how many clients are present in your test
>> setup, and on how many physical nodes are these clients mounted?
>>
>> regards,
>> - Original Message -
>> From: "Lana Deere" 
>> To: gluster-users@gluster.org
>> Sent: Saturday, December 4, 2010 12:13:30 AM
>> Subject: [Gluster-users] 3.1.1 crashing under moderate load
>>
>> I'm running GlusterFS 3.1.1, CentOS5.5 servers, CentOS5.4 clients, RDMA
>> transport, native/fuse access.
>>
>> I have a directory which is shared on the gluster.  In fact, it is a clone
>> of /lib from one of the clients, shared so all can see it.
>>
>> I have a script which does
>>    find lib -type f -print0 | xargs -0 sum | md5sum
>>
>> If I run this on my clients one at a time, they all yield the same md5sum:
>>    for host in <>; do ssh $host script; done
>>
>> If I run this on my clients concurrently, up to roughly 25 at a time they
>> still yield the same md5sum.
>>    for host in <>; do ssh $host script& done
>>
>> Beyond that the gluster share often, but not always, fails.  The errors vary.
>>    - sometimes I get "sum: xxx.so not found"
>>    - sometimes I get the wrong checksum without any error message
>>    - sometimes the job simply hangs until I kill it
>>
>>
>> Some of the server logs show messages like these from the time of the
>> failures (other servers show nothing from around that time):
>>
>> [2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
>> rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
>> socket (peer: 10.54.255.240:1022) after handshake is complete
>> [2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e82, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-serv

[Gluster-users] gluster rebalance taking multiple days

2010-12-06 Thread Michael Robbert
How long should a rebalance take? I know that it depends, so let's take this
example: 4 servers, 1 brick per server. Here is the df -i output from the
servers:

[r...@ra5 ~]# pdsh -g rack7 "df -i|grep brick"
iosrv-7-1:  366288896  2720139  363568757    1% /mnt/brick1
iosrv-7-4:  366288896  3240868  363048028    1% /mnt/brick4
iosrv-7-2:  366288896  2594165  363694731    1% /mnt/brick2
iosrv-7-3:  366288896  3267152  363021744    1% /mnt/brick3

So, it looks like there are roughly 10 million files. I have had a rebalance
running on one of the servers since last Friday, and this is what the status
looks like right now:

[r...@iosrv-7-2 ~]# gluster volume rebalance gluster-test status
rebalance step 1: layout fix in progress: fixed layout 149531740

As a side note I started this rebalance when I noticed that about half of my 
clients are missing a certain set of files. Upon further investigation I found 
that a different set of clients are missing different data. This problem 
happened after many problems getting an upgrade to 3.1.1 working. Unfortunately 
I don't remember which version was running when I was last able to write to 
this volume.

Any thoughts?

Mike

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] GlusterFS - Abysmal Speed on Fresh Install

2010-12-06 Thread Jacob Shucart
Chris,

Can you tell me more about your environment such as a description of the
hardware, network speed and topology, and the "gluster volume info"
output?  The speed you sent is much lower than I would expect in most
environments.  Since you're only copying a single file I don't know that
caching will really help much.  There are cache options you can set if you
want to try them.  You can find them at:

http://www.gluster.com/community/documentation/index.php/Gluster_3.1:_Setting_Volume_Options

Specifically, you can set the size of files that get cached (the file you
have there might be too big, so try setting a max file size that is big
enough).  Also, the cache could be expiring, so you might want to increase
the refresh timeout.
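
For example, these might be set with something like the following; the volume
name "testvol" and the values are placeholders, and the exact option names are
listed on the page above:

    gluster volume set testvol performance.cache-max-file-size 256MB
    gluster volume set testvol performance.cache-refresh-timeout 10
    gluster volume set testvol performance.cache-size 512MB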

-Jacob

-Original Message-
From: gluster-users-boun...@gluster.org
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Christopher
Michaelis
Sent: Friday, December 03, 2010 3:55 PM
To: gluster-users@gluster.org
Subject: [Gluster-users] GlusterFS - Abysmal Speed on Fresh Install

Hi there,

I'm new to the GlusterFS project - looks superb and very promising! I'm 
running into massive performance issues, however. I'm using the stock 
default configuration that GlusterFS put in place when I created the 
volume - it seems to reference io-cache, quick-read, etc in the volume 
files, which makes me think it's already pulling in these optimizations.

This is a replicate setup - I've tried with 2, 3, and 4 nodes, and 
performance remains awful on any of them - network communication seems 
fine, with average speeds at 9-15MB/sec.

# gluster --version
glusterfs 3.1.1 built on Nov 29 2010 10:07:45
Repository revision: v3.1.1

With no other activity on the filesystems on any of my nodes:
# time ls -al
total 10072
drwx--x--x 10 root   root   4096 Dec  3 16:50 ./
drwxr-xr-x  3 root   root   4096 Dec  3 14:19 ../
-rw-r--r--  1 root   root   1024 Dec  3 16:49 testfile

real    0m1.347s
user    0m0.000s
sys     0m0.000s

# time cp testfile testfile2

real    0m11.254s
user    0m0.000s
sys     0m0.000s
# time diff testfile*

real    0m5.792s
user    0m0.004s
sys     0m0.000s

Read speed is marginally faster than write speed, but still horrible - 
e.g. if Apache is serving content off of a glusterfs mountpoint, it 
times out 95% of the time before it can read the files. I'm using mount 
-t glusterfs, with default mount options.

Can anyone point me in the right direction to getting things nice and 
speedy here? I'd appreciate any feedback or help! I can provide any 
configuration files necessary, or even root login access to the box(es) 
via private e-mail if you want to poke around (these are just test boxes 
presently).

Thanks,
--Chris
chris.michae...@uk2group.com
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] 3.1.1 crashing under moderate load

2010-12-06 Thread Lana Deere
One other observation is that it seems to be genuinely related to the
number of nodes involved.

If I run, say, 50 instances of my script using 50 separate nodes, then
they almost always generate some failures.

If I run the same number of instances, or even a much greater number,
but using only 10 separate nodes, then they seem always to work OK.

Maybe this is due to some kind of caching behaviour?

.. Lana (lana.de...@gmail.com)






On Mon, Dec 6, 2010 at 11:05 AM, Lana Deere  wrote:
> The gluster configuration is distribute; there are 4 server nodes.
>
> There are 53 physical client nodes in my setup, each with 8 cores; we
> want to sometimes run more than 400 client processes simultaneously.
> In practice we aren't yet trying that many.
>
> When I run the commands which break, I am running them on separate
> clients simultaneously.
>    for host in ; do ssh $host script& done  # Note the &
> When I run on 25 clients simultaneously so far I have not seen it
> fail.  But if I run on 40 or 50 simultaneously it often fails.
>
> Sometimes I have run more than one command on each client
> simultaneously by listing all the hosts multiple times in the
> for-loop,
>   for host in   ; do ssh $host script& done
> In the case of 3 at a time, I have noticed that when a host works, all
> three on that client will work; but when it fails, all three fail in
> exactly the same fashion.
>
> I've attached a tarfile containing two sets of logs.  In both cases I
> had rotated all the log files and rebooted everything then run my
> test.  In the first set of logs, I went directly to approx. 50
> simultaneous sessions, and pretty much all of them just hung.  (When
> the find hangs, even a kill -9 will not unhang it.)  So I rotated the
> logs again and rebooted everything, but this time I gradually worked
> my way up to higher loads.  This time the failures were mostly cases
> with the wrong checksum but no error message, though some of them did
> give me errors like
>    find: lib/kbd/unimaps/cp865.uni: Invalid argument
>
> Thanks.  I may try downgrading to 3.1.0 just to see if I have the same
> problem there.
>
>
> .. Lana (lana.de...@gmail.com)
>
>
>
>
>
>
> On Mon, Dec 6, 2010 at 12:30 AM, Raghavendra G  
> wrote:
>> Hi Lana,
>>
>> I need some clarifications about test setup:
>>
>> * Are you seeing the problem when there are more than 25 clients? If this is
>> the case, are these clients on different physical nodes, or does more than
>> one client share the same node? In other words, on how many physical nodes
>> are the clients mounted in your test setup? Also, are you running the
>> command on each of these clients simultaneously?
>>
>> * Or is it that there are more than 25 concurrent invocations of the
>> script? If this is the case, how many clients are present in your test
>> setup, and on how many physical nodes are these clients mounted?
>>
>> regards,
>> - Original Message -
>> From: "Lana Deere" 
>> To: gluster-users@gluster.org
>> Sent: Saturday, December 4, 2010 12:13:30 AM
>> Subject: [Gluster-users] 3.1.1 crashing under moderate load
>>
>> I'm running GlusterFS 3.1.1, CentOS5.5 servers, CentOS5.4 clients, RDMA
>> transport, native/fuse access.
>>
>> I have a directory which is shared on the gluster.  In fact, it is a clone
>> of /lib from one of the clients, shared so all can see it.
>>
>> I have a script which does
>>    find lib -type f -print0 | xargs -0 sum | md5sum
>>
>> If I run this on my clients one at a time, they all yield the same md5sum:
>>    for host in <>; do ssh $host script; done
>>
>> If I run this on my clients concurrently, up to roughly 25 at a time they
>> still yield the same md5sum.
>>    for host in <>; do ssh $host script& done
>>
>> Beyond that the gluster share often, but not always, fails.  The errors vary.
>>    - sometimes I get "sum: xxx.so not found"
>>    - sometimes I get the wrong checksum without any error message
>>    - sometimes the job simply hangs until I kill it
>>
>>
>> Some of the server logs show messages like these from the time of the
>> failures (other servers show nothing from around that time):
>>
>> [2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
>> rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
>> socket (peer: 10.54.255.240:1022) after handshake is complete
>> [2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e82, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-server)
>> [2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
>> Reply submission failed
>> [2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e83, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-server)
>> [2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
>> Reply submission failed
>>
>>
>> On

[Gluster-users] GlusterFS pNFS support projected for the future?

2010-12-06 Thread Burnash, James
I was wondering if there are plans in the (unpublished?) road map for Gluster 
to support pNFS clients whenever they are finally deployed with Linux 
distributions?

RHEL 6 has internal support for NFSv4, so one hopes for a client at some 
not-too-distant time ... though there is no mention in the roadmap given here:
http://www.redhat.com/promo/summit/2010/presentations/summit/whats-next/wed/tburke-1020-rhel6-roadmap/tburke_rhel6_summit.pdf

Thanks,

James Burnash, Unix Engineering


DISCLAIMER:
This e-mail, and any attachments thereto, is intended only for use by the 
addressee(s) named herein and may contain legally privileged and/or 
confidential information. If you are not the intended recipient of this e-mail, 
you are hereby notified that any dissemination, distribution or copying of this 
e-mail, and any attachments thereto, is strictly prohibited. If you have 
received this in error, please immediately notify me and permanently delete the 
original and any copy of any e-mail and any printout thereof. E-mail 
transmission cannot be guaranteed to be secure or error-free. The sender 
therefore does not accept liability for any errors or omissions in the contents 
of this message which arise as a result of e-mail transmission.
NOTICE REGARDING PRIVACY AND CONFIDENTIALITY Knight Capital Group may, at its 
discretion, monitor and review the content of all e-mail communications. 
http://www.knight.com
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] 3.1.1 crashing under moderate load

2010-12-06 Thread Lana Deere
The gluster configuration is distribute; there are 4 server nodes.

There are 53 physical client nodes in my setup, each with 8 cores; we
want to sometimes run more than 400 client processes simultaneously.
In practice we aren't yet trying that many.

When I run the commands which break, I am running them on separate
clients simultaneously.
for host in ; do ssh $host script& done  # Note the &
When I run on 25 clients simultaneously so far I have not seen it
fail.  But if I run on 40 or 50 simultaneously it often fails.

Sometimes I have run more than one command on each client
simultaneously by listing all the hosts multiple times in the
for-loop,
   for host in   ; do ssh $host script& done
In the case of 3 at a time, I have noticed that when a host works, all
three on that client will work; but when it fails, all three fail in
exactly the same fashion.
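
For what it's worth, a minimal sketch of the kind of harness described above
might look like this; "hosts.txt" and "sumscript" are placeholders for the
real host list and the find/sum/md5sum script:

    #!/bin/bash
    # Run the checksum script on every listed client concurrently and
    # collect the results per host.
    OUT=$(mktemp -d)
    while read -r host; do
        ssh "$host" sumscript > "$OUT/$host.out" 2> "$OUT/$host.err" &
    done < hosts.txt
    wait
    # Every host should report the same md5sum; any other line is a failure.
    sort "$OUT"/*.out | uniq -c | sort -rn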

I've attached a tarfile containing two sets of logs.  In both cases I
had rotated all the log files and rebooted everything then run my
test.  In the first set of logs, I went directly to approx. 50
simultaneous sessions, and pretty much all of them just hung.  (When
the find hangs, even a kill -9 will not unhang it.)  So I rotated the
logs again and rebooted everything, but this time I gradually worked
my way up to higher loads.  This time the failures were mostly cases
with the wrong checksum but no error message, though some of them did
give me errors like
find: lib/kbd/unimaps/cp865.uni: Invalid argument

Thanks.  I may try downgrading to 3.1.0 just to see if I have the same
problem there.


.. Lana (lana.de...@gmail.com)






On Mon, Dec 6, 2010 at 12:30 AM, Raghavendra G  wrote:
> Hi Lana,
>
> I need some clarifications about test setup:
>
> * Are you seeing the problem when there are more than 25 clients? If this is
> the case, are these clients on different physical nodes, or does more than
> one client share the same node? In other words, on how many physical nodes
> are the clients mounted in your test setup? Also, are you running the
> command on each of these clients simultaneously?
>
> * Or is it that there are more than 25 concurrent invocations of the
> script? If this is the case, how many clients are present in your test
> setup, and on how many physical nodes are these clients mounted?
>
> regards,
> - Original Message -
> From: "Lana Deere" 
> To: gluster-users@gluster.org
> Sent: Saturday, December 4, 2010 12:13:30 AM
> Subject: [Gluster-users] 3.1.1 crashing under moderate load
>
> I'm running GlusterFS 3.1.1, CentOS5.5 servers, CentOS5.4 clients, RDMA
> transport, native/fuse access.
>
> I have a directory which is shared on the gluster.  In fact, it is a clone
> of /lib from one of the clients, shared so all can see it.
>
> I have a script which does
>    find lib -type f -print0 | xargs -0 sum | md5sum
>
> If I run this on my clients one at a time, they all yield the same md5sum:
>    for host in <>; do ssh $host script; done
>
> If I run this on my clients concurrently, up to roughly 25 at a time they
> still yield the same md5sum.
>    for host in <>; do ssh $host script& done
>
> Beyond that the gluster share often, but not always, fails.  The errors vary.
>    - sometimes I get "sum: xxx.so not found"
>    - sometimes I get the wrong checksum without any error message
>    - sometimes the job simply hangs until I kill it
>
>
> Some of the server logs show messages like these from the time of the
> failures (other servers show nothing from around that time):
>
> [2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
> rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
> socket (peer: 10.54.255.240:1022) after handshake is complete
> [2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
> rpc-service: failed to submit message (XID: 0x55e82, Program:
> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
> (rdma.RaidData-server)
> [2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
> Reply submission failed
> [2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
> rpc-service: failed to submit message (XID: 0x55e83, Program:
> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
> (rdma.RaidData-server)
> [2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
> Reply submission failed
>
>
> On a client which had a failure I see messages like:
>
> [2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler]
> rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket
> (peer: 10.54.50.101:24009) after handshake is complete
> [2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind]
> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
> op(READ(12)) called at 2010-12-03 10

[Gluster-users] diagnostics.latency-measurement

2010-12-06 Thread Samuel Hassine
Hi,

 

I want to know why my Gluster is a little slow when accessing many little
files such as MySQL databases. I set the option
diagnostics.latency-measurement to yes.

 

Options Reconfigured:
diagnostics.latency-measurement: yes
cluster.self-heal-window-size: 1024
performance.cache-refresh-timeout: 10
performance.cache-size: 4096MB

 

What is the log file or the command to view this diagnostic?

 

Regards.

Samuel Hassine

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] Many errors

2010-12-06 Thread Samuel Hassine
Hello there,

 

I have Gluster 3.1.1 and just 2 replicated nodes for one big partition. I
have many "operation not permitted" errors, like:

 

[2010-12-06 12:00:54.637585] W [fuse-bridge.c:648:fuse_setattr_cbk]
glusterfs-fuse: 12221497: SETATTR()
/com/olympe-network/var/lib/mysql/13052_surfyport/phpbb_topics_track.TMD =>
-1 (Operation not permitted)

[2010-12-06 12:00:55.2018] I [afr-common.c:716:afr_lookup_done]
dns-replicate-0: background  meta-data self-heal triggered. path:
/com/olympe-network/var/lib/mysql/13052_surfyport/phpbb_topics_watch.MYD

[2010-12-06 12:00:55.99546] I
[afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] dns-replicate-0:
background  meta-data self-heal completed on
/com/olympe-network/var/lib/mysql/13052_surfyport/phpbb_topics_watch.MYD

[2010-12-06 12:00:55.153282] W [fuse-bridge.c:648:fuse_setattr_cbk]
glusterfs-fuse: 12221747: SETATTR()
/com/olympe-network/var/lib/mysql/13052_surfyport/phpbb_topics_watch.TMD =>
-1 (Operation not permitted)

[2010-12-06 12:00:55.196424] I [afr-common.c:599:afr_lookup_self_heal_check]
dns-replicate-0: permissions differ for
/com/olympe-network/var/lib/mysql/23619_bvlad

[2010-12-06 12:00:55.196451] I [afr-common.c:607:afr_lookup_self_heal_check]
dns-replicate-0: ownership differs for
/com/olympe-network/var/lib/mysql/23619_bvlad

[2010-12-06 12:00:55.196462] I [afr-common.c:716:afr_lookup_done]
dns-replicate-0: background  meta-data self-heal triggered. path:
/com/olympe-network/var/lib/mysql/23619_bvlad

[2010-12-06 12:00:55.205414] I
[afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] dns-replicate-0:
background  meta-data self-heal completed on
/com/olympe-network/var/lib/mysql/23619_bvlad

[2010-12-06 12:00:55.538273] I [afr-common.c:716:afr_lookup_done]
dns-replicate-0: background  meta-data self-heal triggered. path:
/com/olympe-network/var/lib/mysql/13052_surfyport/phpbb_user_group.MYD

[2010-12-06 12:00:55.560411] I
[afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] dns-replicate-0:
background  meta-data self-heal completed on
/com/olympe-network/var/lib/mysql/13052_surfyport/phpbb_user_group.MYD

[2010-12-06 12:00:55.64] W [fuse-bridge.c:648:fuse_setattr_cbk]
glusterfs-fuse: 1077: SETATTR()
/com/olympe-network/var/lib/mysql/13052_surfyport/phpbb_user_group.TMD => -1
(Operation not permitted)

 

What are these errors?

 

Thanks for your answer.

 

Regards.

Sam

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users