Re: [Gluster-users] EC planning

2015-10-14 Thread Xavier Hernandez

Hi Serkan,

On 14/10/15 15:13, Serkan Çoban wrote:

Hi Xavier,


I'm not sure if I understand you. Are you saying you will create two separate 
gluster volumes or you will add both bricks to the same distributed-dispersed 
volume ?


Is adding more than one brick from the same host to a disperse gluster
volume recommended? I meant two different gluster volumes.
If I add two bricks from the same server to the same dispersed volume and let's
say it is an 8+1 configuration, then losing one host will bring down the
volume, right?


If you add two bricks from the same server to the *same* disperse set, 
then yes, a failure of the node will mean the failure of two bricks. 
However, this is not what I'm saying. You can have more than one brick on 
the same server but assign each one of them to a different disperse set 
of the same gluster volume. This way, if a server fails, only one brick 
of each disperse set is lost, causing no trouble.


For example, suppose you have 6 servers and you create 4 bricks in each 
server. You could create a volume like this:


  gluster volume create test disperse 6 redundancy 2 \
 server{1..6}:/bricks/test_1 \
 server{1..6}:/bricks/test_2 \
 server{1..6}:/bricks/test_3 \
 server{1..6}:/bricks/test_4

This way you will create a distributed-dispersed volume with 4 independent 
disperse sets, each formed with one brick from each server.


  Bricks of disperse set 1:
server1:/bricks/test_1
server2:/bricks/test_1
server3:/bricks/test_1
server4:/bricks/test_1
server5:/bricks/test_1
server6:/bricks/test_1

In this case, if server1 fails for example, you will lose 
server1:/bricks/test_{1..4}, but all disperse sets will continue to work 
without trouble.





One possibility is to get rid of the server RAID and use each disk as a single brick. 
This way you can create 26 bricks per server and assign each one to a different 
disperse set. A big distributed-dispersed volume balances I/O load between bricks 
better. Note that RAID configurations have a reduction in the available number of 
IOPS. For sequential writes, this is not so bad, but if you have many clients 
accessing the same bricks, you will see many random accesses even if clients are 
doing sequential writes. Caching can alleviate this, but if you want to sustain a 
throughput of 2-3 GB/s, caching effects are not so evident.


I can create 26 JBOD disks and use them as bricks, but is this
recommended? With 50 servers the brick count will be 1300; is this
not a problem?


I cannot tell that for sure as I haven't tested gluster installations 
with so many bricks. I have tested configurations with ~200 bricks and 
they work well. Basically this is a scalability issue related to the 
distribution part of gluster. Maybe someone from the DHT team could help 
you more on this.



Can you explain the configuration a bit more? For example, using
16+2, 26 bricks per server and 54 servers in total. In the end I only want
one gluster volume and protection against 2 host failures.


You can use a command like this:

  gluster volume create test disperse 18 redundancy 2 \
 server{1..54}:/bricks/test_1 \
 server{1..54}:/bricks/test_2 \
 ...
 server{1..54}:/bricks/test_26

This way you get a single gluster volume that can tolerate up to two 
full node failures without losing data.
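For what it's worth, the brick list doesn't need to be typed by hand; a 
small shell loop can expand it (just a sketch, assuming the /bricks/test_N 
naming used above):

  # List all 54 servers for test_1, then test_2, ... up to test_26, so that
  # every group of 18 consecutive bricks spans 18 different servers.
  bricks=""
  for n in $(seq 1 26); do
      for s in $(seq 1 54); do
          bricks="$bricks server$s:/bricks/test_$n"
      done
  done
  gluster volume create test disperse 18 redundancy 2 $bricks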



Also in this case disk failures will be handled by gluster; I hope this
doesn't bring more problems. But I will also test this configuration
when I get the servers.


Yes, in this case the recovery of a failed disk will be handled by 
gluster. It should work well.
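
For a single failed disk/brick, the usual approach at that point would be 
something along these lines (only a sketch, with hypothetical volume and 
brick names; please check it against the documentation for your gluster 
version):

  # Replace the dead disk, mount the new one at a brick path, then tell
  # gluster to swap the bricks; self-heal rebuilds the contents afterwards.
  gluster volume replace-brick bigvol \
      server3:/bricks/disk7 server3:/bricks/disk7_new commit force
  gluster volume heal bigvol full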


With RAID, the recovery of a disk is local to the server (no network 
communication) and thus reads/writes are faster. However, to do this on 
a 208TB RAID using 26 8TB disks, it will need to read 192TB and write 
8TB. Quite a lot. With gluster the recovery will use the network, but 
only the used data will be recovered (i.e. if the failed disk was only 
half filled, it will only recover 4TB of data). On a 16+2 configuration, 
this means that it will read less than 16*8=128 TB and write less than 8 TB.


A 10Gbit network is a must for these configurations.

Xavi



Serkan



On Wed, Oct 14, 2015 at 2:03 PM, Xavier Hernandez  wrote:

Hi Serkan,

On 13/10/15 15:53, Serkan Çoban wrote:


Hi Xavier and thanks for your answers.

Servers will have 26*8TB disks. I don't want to lose more than 2 disks
for raid, so my options are HW RAID6 24+2 or 2 * HW RAID5 12+1,



A RAID5 of more than 8-10 disks is normally considered unsafe because the
probability of a second drive failure while reconstructing another failed
drive is considerably high. The same happens with a RAID6 of more than 16-20
disks.


in both cases I can create 2 bricks per server using LVM and use one brick
per server to create two distributed-disperse volumes.

Re: [Gluster-users] MTU size issue?

2015-10-14 Thread Atin Mukherjee


On 10/14/2015 05:09 PM, Sander Zijlstra wrote:
> LS,
> 
> I recently reconfigured one of my gluster nodes and forgot to update the MTU 
> size on the switch while I did configure the host with jumbo frames.
> 
> The result was that the complete cluster had communication issues.
> 
> All systems are part of a distributed striped volume with a replica size of 2 
> but still the cluster was completely unusable until I updated the switch port 
> to accept jumbo frames rather than to discard them.
This is expected. When enabling TCP jumbo frames in a Gluster Trusted
Storage Pool, you need to ensure that all the network components, such as
switches and nodes, are configured consistently. I think with a mismatch
like this you'd fail to ping the other nodes in the pool with jumbo-sized
packets, so that could be a verification step before you set the cluster up.
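For example, something like this (assuming an end-to-end MTU of 9000) 
should succeed from every node to every other node before the pool is 
brought up:

  # 8972 = 9000 bytes MTU minus 28 bytes of IP/ICMP headers; -M do forbids
  # fragmentation, so the ping fails if any hop drops jumbo frames.
  ping -c 3 -M do -s 8972 <peer-hostname>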
> 
> The symptoms were:
> 
> - Gluster clients had a very hard time reading the volume information and 
> thus couldn’t do any filesystem ops on them.
> - The glusterfs servers could see each other (peer status) and a volume info 
> command was ok, but a volume status command would not return or would return 
> a “staging failed” error.
> 
> I know MTU size mixing and don't-fragment bits can screw up a lot, but why 
> wasn't that gluster peer just discarded from the cluster, so that clients 
> would stop communicating with it and causing all sorts of errors?
To answer this question: peer status & volume info are local operations
and don't incur network traffic, so in this very case you might see peer
status showing all the nodes as connected even though there is a
breakage. OTOH, in the status command the originator node communicates with
the other peers, and hence it fails there.

HTH,
Atin
> 
> I use glusterFS 3.6.2 at the moment…..
> 
> Kind regards
> Sander
> 
> 
> 
> 
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
> 
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] Geo-rep failing initial sync

2015-10-14 Thread Wade Fitzpatrick
I have twice now tried to configure geo-replication of our 
Stripe-Replicate volume to a remote Stripe volume but it always seems to 
have issues.


root@james:~# gluster volume info

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 5f446a10-651b-4ce0-a46b-69871f498dbc
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: james:/data/gluster1/geo-rep-meta/brick
Brick2: cupid:/data/gluster1/geo-rep-meta/brick
Brick3: hilton:/data/gluster1/geo-rep-meta/brick
Brick4: present:/data/gluster1/geo-rep-meta/brick
Options Reconfigured:
performance.readdir-ahead: on

Volume Name: static
Type: Striped-Replicate
Volume ID: 3f9f810d-a988-4914-a5ca-5bd7b251a273
Status: Started
Number of Bricks: 1 x 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: james:/data/gluster1/static/brick1
Brick2: cupid:/data/gluster1/static/brick2
Brick3: hilton:/data/gluster1/static/brick3
Brick4: present:/data/gluster1/static/brick4
Options Reconfigured:
auth.allow: 10.x.*
features.scrub: Active
features.bitrot: on
performance.readdir-ahead: on
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

root@palace:~# gluster volume info

Volume Name: static
Type: Stripe
Volume ID: 3de935db-329b-4876-9ca4-a0f8d5f184c3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: palace:/data/gluster1/static/brick1
Brick2: madonna:/data/gluster1/static/brick2
Options Reconfigured:
features.scrub: Active
features.bitrot: on
performance.readdir-ahead: on

root@james:~# gluster vol geo-rep static ssh://gluster-b1::static status detail

MASTER NODEMASTER VOLMASTER BRICKSLAVE USER
SLAVE   SLAVE NODESTATUS CRAWL STATUS   
LAST_SYNCEDENTRYDATAMETAFAILURESCHECKPOINT TIME
CHECKPOINT COMPLETEDCHECKPOINT COMPLETION TIME

james  static/data/gluster1/static/brick1root  
ssh://gluster-b1::static10.37.1.11Active Changelog Crawl
2015-10-13 14:23:2000   0   1952064 N/A
N/A N/A
hilton static/data/gluster1/static/brick3root  
ssh://gluster-b1::static10.37.1.11Active Changelog CrawlN/A 
   00   0   1008035 N/AN/A  
   N/A
presentstatic/data/gluster1/static/brick4root  
ssh://gluster-b1::static10.37.1.12PassiveN/AN/A 
   N/A  N/A N/A N/A N/AN/A  
   N/A
cupid  static/data/gluster1/static/brick2root  
ssh://gluster-b1::static10.37.1.12PassiveN/AN/A 
   N/A  N/A N/A N/A N/AN/A  
   N/A


So just to clarify, data is striped over bricks 1 and 3; bricks 2 and 4 
are the replica.


Can someone help me diagnose the problem and find a solution?

Thanks in advance,
Wade.
--
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Gluster 4.0 - upgrades & backward compatibility strategy

2015-10-14 Thread Atin Mukherjee


On 10/14/2015 05:50 PM, Roman wrote:
> Hi,
> 
> It's hard to comment on plans like these, but I expect everyone
> will be happy to have the possibility to upgrade from 3 to 4 without a new
> installation, and an offline upgrade (shut down volumes and
> upgrade) is also fine. And I'm fairly sure that this upgrade process should
> be flawless enough that no one under any circumstances would need any kind
> of rollback, so there should not be any IFs :)
Just to clarify, there will be (and has to be) an upgrade path. That's
what I mentioned in point 4 of my mail. The only limitation here is that
there will be no rolling upgrade support.
> 
> 2015-10-07 8:32 GMT+03:00 Atin Mukherjee:
> 
> Hi All,
> 
> Over the course of the design discussion, we got a chance to discuss
> about the upgrades and backward compatibility strategy for Gluster 4.0
> and here is what we came up with:
> 
> 1. 4.0 cluster would be separate from 3.x clusters. Heterogeneous
> support won't be available.
> 
> 2. All CLI interfaces exposed in 3.x would continue to work with 4.x.
> 
> 3. ReSTful APIs for all old & new management actions.
> 
> 4. Upgrade path from 3.x to 4.x would be necessary. We need not support
> rolling upgrades, however all data layouts from 3.x would need to be
> honored. Our upgrade path from 3.x to 4.x should not be cumbersome.
> 
> 
> Initiative wise upgrades strategy details:
> 
> GlusterD 2.0
> 
> 
> - No rolling upgrade, service disruption is expected
> - Smooth upgrade from 3.x to 4.x (migration script)
> - Rollback - If upgrade fails, revert back to 3.x, old configuration
> data shouldn't be wiped off.
> 
> 
> DHT 2.0
> ---
> - No in place upgrade to DHT2
> - Needs migration of data
> - Backward compat, hence does not exist
> 
> NSR
> ---
> - volume migration from AFR to NSR is possible with an offline upgrade
> 
> We would like to hear from the community about your opinion on this
> strategy.
> 
> Thanks,
> Atin
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org 
> http://www.gluster.org/mailman/listinfo/gluster-users
> 
> 
> 
> 
> -- 
> Best regards,
> Roman.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Glusterfs as a root file system on the same node

2015-10-14 Thread Atin Mukherjee
I don't think this is possible. I'd like to know why you want to use
a Gluster volume as the root file system; what's your use case?
Technically this is impossible (wrt GlusterD) as we would then have no way
to segregate the configuration data.

~Atin

On 10/15/2015 12:09 AM, satish kondapalli wrote:
> Does anyone have any thoughts on this?
> 
> Sateesh
> 
> On Tue, Oct 13, 2015 at 5:44 PM, satish kondapalli wrote:
> 
> Hi,
> 
> I want to mount  gluster volume as a root file system for my node. 
> 
> Node will boot from network( only kernel and initrd images) but my
> root file system has to be one of the gluster volume ( bricks of the
> volume are on the disks which are attached to the same node). 
> Gluster configuration files also part of the root file system. 
> 
> Here i am facing chicken and egg problem.  Initially i thought to
> keep glusterfs libs, binary in the initrd and start the gluster
> server as part of initrd execution. But for mounting root file
> system (which is a gluster volume), all the gluster configuration
> files are stored in the root file system. My assumption is, without
> gluster configuration files(/var/lib/glusterd/xx) gluster will not
> find any volumes. 
> 
> Can someone help me on this?
> 
> Sateesh
> 
> 
> 
> 
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Monitoring and solving split-brain

2015-10-14 Thread Ravishankar N



On 10/14/2015 11:08 PM, Игорь Бирюлин wrote:

Thanks for the detailed description.
Do you have plans to add resolution of GFID split-brain to 'gluster volume 
heal VOLNAME split-brain ...' ?

Not at the moment..
What is the main difference between GFID split-brain and data split-brain? 
On the nodes the file is completely different in content and size, or is it 
not 'data' in the glusterfs sense?


GFID is unique to a file (something akin to an inode number) and is 
assigned when a file is created. Data split-brain occurs when a file 
with the same gfid already exists on both bricks, but there is a difference 
in the file's content (e.g. one write succeeded only on brick1 and 
another write only on brick2).
GFID split-brain occurs when a file creation happens twice (say an 
application does an open() with O_CREAT) but succeeds only on one brick 
each time.

Best regards,
Igor



2015-10-14 20:13 GMT+03:00 Ravishankar N:




On 10/14/2015 10:05 PM, Игорь Бирюлин wrote:

Thanks for your replay.

If I do listing in mount point (/repo):
# ls /repo/xxx/keyrings/debian-keyring.gpg
ls: cannot access /repo/xxx/keyrings/debian-keyring.gpg:
Input/output error
#
In log /var/log/glusterfs/repo.log I see:
[2015-10-14 16:27:36.006815] W [MSGID: 108008]
[afr-self-heal-name.c:359:afr_selfheal_name_gfid_mismatch_check]
0-repofiles-replicate-0: GFID mismatch for
/debian-keyring.gpg
69aaeee6-624b-400a-aa46-b5c6166c014c on repofiles-client-1 and
b95ad06e-786a-44e5-ba71-af661982071f on repofiles-client-0


So the file has ended up in GFID split-brain (The trusted.gfid
value is different in both bricks as seen in your output below.),
which cannot be handled by the split-brain resolution commands.
These commands can only resolve data and metadata split-brain. I'm
afraid you'll manually need to delete one of the file and the
.glusterfs hardlink from the brick. Not sure why the
parent-directory was not listed in 'gluster v heal VOLNAME info
split-brain' output.


[2015-10-14 16:27:36.008996] W [fuse-bridge.c:451:fuse_entry_cbk]
0-glusterfs-fuse: 65961: LOOKUP()
/xxx/keyrings/debian-keyring.gpg => -1 (Input/output error)

On first node getfattr return:
# getfattr -d -m . -e hex
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
getfattr: Removing leading '/' from absolute path names
# file:
storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
trusted.afr.dirty=0x
trusted.afr.repofiles-client-1=0x00020001
trusted.bit-rot.version=0x020055fdf0910003b37b
trusted.gfid=0xb95ad06e786a44e5ba71af661982071f
# ls -l
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
-rw-r--r-- 2 root root 3456271 Oct 13 19:00
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
#

On second node getfattr return:
# getfattr -d -m . -e hex
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
getfattr: Removing leading '/' from absolute path names
# file:
storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
trusted.afr.dirty=0x
trusted.afr.repofiles-client-0=0x
trusted.bit-rot.version=0x020055f97b57000dc3c6
trusted.gfid=0x69aaeee6624b400aaa46b5c6166c014c
# ls -l
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
-rw-r--r-- 2 root root 3450346 Oct  9 16:22
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
#

Best regards,
Igor






2015-10-14 19:14 GMT+03:00 Ravishankar N:



On 10/14/2015 07:02 PM, Игорь Бирюлин wrote:

Hello,
today in my 2 nodes replica set I've found split-brain.
Command 'ls' start told 'Input/output error'.


What does the mount log
(/var/log/glusterfs/.log) say when you get
this  error?

Can you run getfattr as root for the file from *both* bricks
and share the result?
`getfattr -d -m . -e hex
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg`

Thanks.
Ravi



But command 'gluster v heal VOLNAME info split-brain' does
not show problem files:
# gluster v heal repofiles info split-brain
Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
Number of entries in split-brain: 0

Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
Number of entries in split-brain: 0
#
In output of 'gluster v heal VOLNAME info' I see problem
files (/xxx/keyrings/debian-keyring.gpg, /repos.json), but
without split-brain markers:
# gluster v heal repofiles info
Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
/xxx/keyrings/deb

Re: [Gluster-users] Importing bricks/datastore into new gluster

2015-10-14 Thread Pranith Kumar Karampuri



On 10/15/2015 12:46 AM, Lindsay Mathieson wrote:


On 14 October 2015 at 15:17, Pranith Kumar Karampuri wrote:


I didn't understand the reason for recreating the setup. Is
upgrading rpms/debs not enough?

Pranith


The distro I'm using (Proxmox/Debian) broke backward compatibility 
with their latest major upgrade, essentially you have to reinstall. 
I'm taking advantage of that to improve my installations as well - 
switch to ZFS root etc.
Okay, so re-installation is going to change the root partition, but the 
brick data is going to remain intact, am I correct? Are you going to 
stop the volume, re-install all the machines in the cluster and bring them 
back up, or do you want to do it one machine at a time, keeping the volume 
active?


Pranith


Not going to do it for weeks yet, still in planning stage.

thanks,

--
Lindsay


___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Speed up heal performance

2015-10-14 Thread Pranith Kumar Karampuri



On 10/14/2015 10:43 PM, Mohamed Pakkeer wrote:

Hi Pranith,

Will this patch improve the heal performance on distributed disperse 
volumes? Currently we are getting 10MB/s heal performance on a 10G 
backed network. The SHD daemon takes 5 days to complete the heal operation 
for a single 4TB (3.5TB of data) disk failure.
I will try to see if the patch for replication can be generalized for 
disperse volume as well. Keep watching the mailing list for updates :-)
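
Based on the option keys visible in the WIP patch (review.glusterd.org/10851),
tuning would presumably look something like the following once it is merged
(a sketch only; the option names and defaults may change before release):

  gluster volume set <volname> cluster.shd-max-threads 4
  gluster volume set <volname> cluster.shd-thread-batch-size 8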


Pranith


Regards,
Backer

On Wed, Oct 14, 2015 at 9:08 PM, Ben Turner wrote:


- Original Message -
> From: "Pranith Kumar Karampuri" mailto:pkara...@redhat.com>>
> To: "Ben Turner" mailto:btur...@redhat.com>>, "Humble Devassy Chirammal"
mailto:humble.deva...@gmail.com>>,
"Atin Mukherjee"
> mailto:atin.mukherje...@gmail.com>>
> Cc: "gluster-users" mailto:gluster-users@gluster.org>>
> Sent: Wednesday, October 14, 2015 1:39:14 AM
> Subject: Re: [Gluster-users] Speed up heal performance
>
>
>
> On 10/13/2015 07:11 PM, Ben Turner wrote:
> > - Original Message -
> >> From: "Humble Devassy Chirammal" mailto:humble.deva...@gmail.com>>
> >> To: "Atin Mukherjee" mailto:atin.mukherje...@gmail.com>>
> >> Cc: "Ben Turner" mailto:btur...@redhat.com>>, "gluster-users"
> >> mailto:gluster-users@gluster.org>>
> >> Sent: Tuesday, October 13, 2015 6:14:46 AM
> >> Subject: Re: [Gluster-users] Speed up heal performance
> >>
> >>> Good news is we already have a WIP patch
review.glusterd.org/10851  to
> >> introduce multi threaded shd. Credits to Richard/Shreyas from
facebook for
> >> this. IIRC, we also have a BZ for the same
> >> Isnt it the same bugzilla (
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1221737)
mentioned in the
> >> commit log?
> > @Lindsay - No need for a BZ, the above BZ should suffice.
> >
> > @Anyone - In the commit I see:
> >
> >  { .key= "cluster.shd-max-threads",
> >.voltype= "cluster/replicate",
> >.option = "shd-max-threads",
> >.op_version = 1,
> >.flags  = OPT_FLAG_CLIENT_OPT
> >  },
> >  { .key= "cluster.shd-thread-batch-size",
> >.voltype= "cluster/replicate",
> >.option = "shd-thread-batch-size",
> >.op_version = 1,
> >.flags  = OPT_FLAG_CLIENT_OPT
> >  },
> >
> > So we can tune max threads and thread batch size?  I
understand max
> > threads, but what is batch size?  In my testing on 10G NICs
with a backend
> > that will service 10G throughput I see about 1.5 GB per minute
of SH
> > throughput.  To Lindsay's other point, will this patch improve SH
> > throughput?  My systems can write at 1.5 GB / Sec and NICs can
to 1.2 GB /
> > sec but I only see ~1.5 GB per _minute_ of SH throughput.  If
we can not
> > only make SH multi threaded, but improve the performance of a
single
> > thread that would be awesome.  Super bonus points if we can
have some sort
> > of tunible that can limit the bandwidth each thread can
consume.  It would
> > be great to be able to crank things up when the systems aren't
busy and
> > slow things down when load increases.
> This patch is not merged because I thought we needed throttling
feature
> to go in before we can merge this for better control of the
self-heal
> speed. We are doing that for 3.8. So expect to see both of these
for 3.8.

Great news!  You da man Pranith, next time I am on your side of
the world beers are on me :)

-b

>
> Pranith
> >
> > -b
> >
> >
> >> --Humble
> >>
> >>
> >> On Tue, Oct 13, 2015 at 7:26 AM, Atin Mukherjee
> >> mailto:atin.mukherje...@gmail.com>>
> >> wrote:
> >>
> >>> -Atin
> >>> Sent from one plus one
> >>> On Oct 13, 2015 3:16 AM, "Ben Turner" mailto:btur...@redhat.com>> wrote:
>  - Original Message -
> > From: "Lindsay Mathieson" mailto:lindsay.mathie...@gmail.com>>
> > To: "gluster-users" mailto:gluster-users@gluster.org>>
> > Sent: Friday, October 9, 2015 9:18:11 AM
> > Subject: [Gluster-users] Speed up heal performance
> >
> > Is there any way to max out heal performance? My cluster
is unused
> >>> overnight,
> > and lightly used at lunchtimes, it would be handy to speed
up a heal.
> >
> > The only tuneable I found was
cluster.self-heal-window-size, which
> >>> doesn't
> > seem to make much difference.
>  I don't know of any way to speed this up, maybe someone
else could chime
> >>> in here that knows the heal daemon better than me.  Maybe
   

[Gluster-users] geo-replications invalid names when using rsyncd

2015-10-14 Thread Brian Ericson

Admittedly an odd case, but...

o I have a simple geo-replication setup: master -> slave.
o I've mounted the master's volume on the master host.
o I've also setup rsyncd server on the master:
  [master-volume]
 path = /mnt/master-volume
 read only = false
o I now rsync from a client to the master using the rsync protocol:
  rsync file rsync://master/master-volume

What I see is "file" when looking at the master volume, but that's not
what I see in the slave volume.  This is what is replicated to the slave:

  .file.6chars

where "6chars" is some random letters & numbers.

I'm pretty sure the .file.6chars version is due to my client's rsync
and represents the name rsync gives the file during transport, after
which it renames it to file.  Is this rename at such a low level
that glusterfs's geo-replication doesn't catch it and doesn't see
that it should be doing a rename?
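
(One idea to test, though untested here: rsync's --inplace option writes 
directly to the destination file name and skips the temporary dot-file plus 
rename step, e.g.

  rsync --inplace file rsync://master/master-volume

assuming the rsync daemon on the master isn't configured to refuse that 
option.)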
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Importing bricks/datastore into new gluster

2015-10-14 Thread Lindsay Mathieson
On 14 October 2015 at 15:17, Pranith Kumar Karampuri 
wrote:

> I didn't understand the reason for recreating the setup. Is upgrading
> rpms/debs not enough?
>
> Pranith
>

The distro I'm using (Proxmox/Debian) broke backward compatibility with
their latest major upgrade; essentially you have to reinstall. I'm taking
advantage of that to improve my installations as well - switching to ZFS root
etc.

Not going to do it for weeks yet, still in planning stage.

thanks,

-- 
Lindsay
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] [Gluster-devel] Glusterfs as a root file system on the same node

2015-10-14 Thread satish kondapalli
Does anyone have any thoughts on this?

Sateesh

On Tue, Oct 13, 2015 at 5:44 PM, satish kondapalli 
wrote:

> Hi,
>
> I want to mount  gluster volume as a root file system for my node.
>
> Node will boot from network( only kernel and initrd images) but my root
> file system has to be one of the gluster volume ( bricks of the volume are
> on the disks which are attached to the same node).  Gluster configuration
> files also part of the root file system.
>
> Here i am facing chicken and egg problem.  Initially i thought to keep
> glusterfs libs, binary in the initrd and start the gluster server as part
> of initrd execution. But for mounting root file system (which is a gluster
> volume), all the gluster configuration files are stored in the root file
> system. My assumption is, without gluster configuration
> files(/var/lib/glusterd/xx) gluster will not find any volumes.
>
> Can someone help me on this?
>
> Sateesh
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Monitoring and solving split-brain

2015-10-14 Thread Игорь Бирюлин
Thanks for the detailed description.
Do you have plans to add resolution of GFID split-brain to 'gluster volume heal
VOLNAME split-brain ...' ?
What is the main difference between GFID split-brain and data split-brain? On
the nodes the file is completely different in content and size, or is it not
'data' in the glusterfs sense?

Best regards,
Igor



2015-10-14 20:13 GMT+03:00 Ravishankar N :

>
>
> On 10/14/2015 10:05 PM, Игорь Бирюлин wrote:
>
> Thanks for your replay.
>
> If I do listing in mount point (/repo):
> # ls /repo/xxx/keyrings/debian-keyring.gpg
> ls: cannot access /repo/xxx/keyrings/debian-keyring.gpg: Input/output error
> #
> In log /var/log/glusterfs/repo.log I see:
> [2015-10-14 16:27:36.006815] W [MSGID: 108008]
> [afr-self-heal-name.c:359:afr_selfheal_name_gfid_mismatch_check]
> 0-repofiles-replicate-0: GFID mismatch for
> /debian-keyring.gpg
> 69aaeee6-624b-400a-aa46-b5c6166c014c on repofiles-client-1 and
> b95ad06e-786a-44e5-ba71-af661982071f on repofiles-client-0
>
>
> So the file has ended up in GFID split-brain (The trusted.gfid value is
> different in both bricks as seen in your output below.), which cannot be
> handled by the split-brain resolution commands. These commands can only
> resolve data and metadata split-brain. I'm afraid you'll manually need to
> delete one of the file and the .glusterfs hardlink from the brick. Not sure
> why the parent-directory was not listed in 'gluster v heal VOLNAME info
> split-brain' output.
>
> [2015-10-14 16:27:36.008996] W [fuse-bridge.c:451:fuse_entry_cbk]
> 0-glusterfs-fuse: 65961: LOOKUP() /xxx/keyrings/debian-keyring.gpg => -1
> (Input/output error)
>
> On first node getfattr return:
> # getfattr -d -m . -e hex
> /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
> getfattr: Removing leading '/' from absolute path names
> # file: storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
> trusted.afr.dirty=0x
> trusted.afr.repofiles-client-1=0x00020001
> trusted.bit-rot.version=0x020055fdf0910003b37b
> trusted.gfid=0xb95ad06e786a44e5ba71af661982071f
> # ls -l /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
> -rw-r--r-- 2 root root 3456271 Oct 13 19:00
> /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
> #
>
> On second node getfattr return:
> # getfattr -d -m . -e hex
> /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
> getfattr: Removing leading '/' from absolute path names
> # file: storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
> trusted.afr.dirty=0x
> trusted.afr.repofiles-client-0=0x
> trusted.bit-rot.version=0x020055f97b57000dc3c6
> trusted.gfid=0x69aaeee6624b400aaa46b5c6166c014c
> # ls -l /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
> -rw-r--r-- 2 root root 3450346 Oct  9 16:22
> /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
> #
>
> Best regards,
> Igor
>
>
>
>
>
>
> 2015-10-14 19:14 GMT+03:00 Ravishankar N :
>
>>
>>
>> On 10/14/2015 07:02 PM, Игорь Бирюлин wrote:
>>
>> Hello,
>> today in my 2 nodes replica set I've found split-brain. Command 'ls'
>> start told 'Input/output error'.
>>
>>
>> What does the mount log (/var/log/glusterfs/.log) say when
>> you get this  error?
>>
>> Can you run getfattr as root for the file from *both* bricks and share
>> the result?
>> `getfattr -d -m . -e hex
>> /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg`
>>
>> Thanks.
>> Ravi
>>
>>
>> But command 'gluster v heal VOLNAME info split-brain' does not show
>> problem files:
>> # gluster v heal repofiles info split-brain
>> Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
>> Number of entries in split-brain: 0
>>
>> Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
>> Number of entries in split-brain: 0
>> #
>> In output of 'gluster v heal VOLNAME info' I see problem files
>> (/xxx/keyrings/debian-keyring.gpg, /repos.json), but without split-brain
>> markers:
>> # gluster v heal repofiles info
>> Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
>> /xxx/keyrings/debian-keyring.gpg
>> 
>> 
>> /repos.json
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Number of entries: 11
>>
>> Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
>> Number of entries: 0
>> #
>>
>> I couldn't solve split-brain by new standard command:
>> # gluster v heal repofiles  split-brain bigger-file /repos.json
>> Lookup failed on /repos.json:Input/output error
>> Volume heal failed.
>> #
>>
>> Additional info:
>> # gluster v info
>>  Volume Name: repofiles
>>  Type: Replicate
>>  Volume ID: 4b0e2a74-f1ca-4fe7-8518-23919e1b5fa0
>>  Status: Started
>>  Number of Bricks: 1 x 2 = 2
>>  Transport-type: tcp
>>  Bricks:
>>  Brick1: dist-int-master03.xxx:/storage/gluster_brick_repofiles
>>  Brick2: dist-int-master04.xxx:/storage/gluster_brick_repofiles
>>  Options Reconfigured:
>>  performance.readdir-ahead: on
>>  client.event

Re: [Gluster-users] Monitoring and solving split-brain

2015-10-14 Thread Ravishankar N



On 10/14/2015 10:05 PM, Игорь Бирюлин wrote:

Thanks for your replay.

If I do listing in mount point (/repo):
# ls /repo/xxx/keyrings/debian-keyring.gpg
ls: cannot access /repo/xxx/keyrings/debian-keyring.gpg: Input/output 
error

#
In log /var/log/glusterfs/repo.log I see:
[2015-10-14 16:27:36.006815] W [MSGID: 108008] 
[afr-self-heal-name.c:359:afr_selfheal_name_gfid_mismatch_check] 
0-repofiles-replicate-0: GFID mismatch for 
/debian-keyring.gpg 
69aaeee6-624b-400a-aa46-b5c6166c014c on repofiles-client-1 and 
b95ad06e-786a-44e5-ba71-af661982071f on repofiles-client-0


So the file has ended up in GFID split-brain (the trusted.gfid value is 
different on the two bricks, as seen in your output below), which cannot be 
handled by the split-brain resolution commands. These commands can only 
resolve data and metadata split-brain. I'm afraid you'll manually need 
to delete one of the files and its .glusterfs hardlink from the brick. 
Not sure why the parent directory was not listed in the 'gluster v heal 
VOLNAME info split-brain' output.
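
For reference, a minimal sketch of that manual procedure, assuming (purely 
for illustration) that the copy with gfid b95ad06e-... is the one to 
discard; the .glusterfs hardlink path is derived from the first four hex 
characters of the gfid:

  # On the node whose brick holds the unwanted copy:
  cd /storage/gluster_brick_repofiles
  rm xxx/keyrings/debian-keyring.gpg
  rm .glusterfs/b9/5a/b95ad06e-786a-44e5-ba71-af661982071f
  # Then trigger a heal so the surviving copy is recreated on this brick.
  gluster volume heal repofiles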


[2015-10-14 16:27:36.008996] W [fuse-bridge.c:451:fuse_entry_cbk] 
0-glusterfs-fuse: 65961: LOOKUP() /xxx/keyrings/debian-keyring.gpg => 
-1 (Input/output error)


On first node getfattr return:
# getfattr -d -m . -e hex 
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg

getfattr: Removing leading '/' from absolute path names
# file: storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
trusted.afr.dirty=0x
trusted.afr.repofiles-client-1=0x00020001
trusted.bit-rot.version=0x020055fdf0910003b37b
trusted.gfid=0xb95ad06e786a44e5ba71af661982071f
# ls -l /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
-rw-r--r-- 2 root root 3456271 Oct 13 19:00 
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg

#

On second node getfattr return:
# getfattr -d -m . -e hex 
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg

getfattr: Removing leading '/' from absolute path names
# file: storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
trusted.afr.dirty=0x
trusted.afr.repofiles-client-0=0x
trusted.bit-rot.version=0x020055f97b57000dc3c6
trusted.gfid=0x69aaeee6624b400aaa46b5c6166c014c
# ls -l /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
-rw-r--r-- 2 root root 3450346 Oct  9 16:22 
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg

#

Best regards,
Igor






2015-10-14 19:14 GMT+03:00 Ravishankar N:




On 10/14/2015 07:02 PM, Игорь Бирюлин wrote:

Hello,
today in my 2 nodes replica set I've found split-brain. Command
'ls' start told 'Input/output error'.


What does the mount log (/var/log/glusterfs/.log)
say when you get this  error?

Can you run getfattr as root for the file from *both* bricks and
share the result?
`getfattr -d -m . -e hex
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg`

Thanks.
Ravi



But command 'gluster v heal VOLNAME info split-brain' does not
show problem files:
# gluster v heal repofiles info split-brain
Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
Number of entries in split-brain: 0

Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
Number of entries in split-brain: 0
#
In output of 'gluster v heal VOLNAME info' I see problem files
(/xxx/keyrings/debian-keyring.gpg, /repos.json), but without
split-brain markers:
# gluster v heal repofiles info
Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
/xxx/keyrings/debian-keyring.gpg


/repos.json







Number of entries: 11

Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
Number of entries: 0
#

I couldn't solve split-brain by new standard command:
# gluster v heal repofiles  split-brain bigger-file /repos.json
Lookup failed on /repos.json:Input/output error
Volume heal failed.
#

Additional info:
# gluster v info
 Volume Name: repofiles
 Type: Replicate
 Volume ID: 4b0e2a74-f1ca-4fe7-8518-23919e1b5fa0
 Status: Started
 Number of Bricks: 1 x 2 = 2
 Transport-type: tcp
 Bricks:
 Brick1: dist-int-master03.xxx:/storage/gluster_brick_repofiles
 Brick2: dist-int-master04.xxx:/storage/gluster_brick_repofiles
 Options Reconfigured:
 performance.readdir-ahead: on
 client.event-threads: 4
 server.event-threads: 4
 cluster.lookup-optimize: on
# cat /etc/issue
Ubuntu 14.04.3 LTS \n \l
# dpkg -l | grep glusterfs
ii  glusterfs-client 3.7.5-ubuntu1~trusty1amd64
clustered file-system (client package)
ii  glusterfs-common 3.7.5-ubuntu1~trusty1amd64
GlusterFS common libraries and translator modules
ii  glusterfs-se

Re: [Gluster-users] Speed up heal performance

2015-10-14 Thread Mohamed Pakkeer
Hi Pranith,

Will this patch improve the heal performance on distributed disperse
volumes? Currently we are getting 10MB/s heal performance on a 10G backed
network. The SHD daemon takes 5 days to complete the heal operation for a
single 4TB (3.5TB of data) disk failure.

Regards,
Backer

On Wed, Oct 14, 2015 at 9:08 PM, Ben Turner  wrote:

> - Original Message -
> > From: "Pranith Kumar Karampuri" 
> > To: "Ben Turner" , "Humble Devassy Chirammal" <
> humble.deva...@gmail.com>, "Atin Mukherjee"
> > 
> > Cc: "gluster-users" 
> > Sent: Wednesday, October 14, 2015 1:39:14 AM
> > Subject: Re: [Gluster-users] Speed up heal performance
> >
> >
> >
> > On 10/13/2015 07:11 PM, Ben Turner wrote:
> > > - Original Message -
> > >> From: "Humble Devassy Chirammal" 
> > >> To: "Atin Mukherjee" 
> > >> Cc: "Ben Turner" , "gluster-users"
> > >> 
> > >> Sent: Tuesday, October 13, 2015 6:14:46 AM
> > >> Subject: Re: [Gluster-users] Speed up heal performance
> > >>
> > >>> Good news is we already have a WIP patch review.glusterd.org/10851
> to
> > >> introduce multi threaded shd. Credits to Richard/Shreyas from
> facebook for
> > >> this. IIRC, we also have a BZ for the same
> > >> Isnt it the same bugzilla (
> > >> https://bugzilla.redhat.com/show_bug.cgi?id=1221737) mentioned in the
> > >> commit log?
> > > @Lindsay - No need for a BZ, the above BZ should suffice.
> > >
> > > @Anyone - In the commit I see:
> > >
> > >  { .key= "cluster.shd-max-threads",
> > >.voltype= "cluster/replicate",
> > >.option = "shd-max-threads",
> > >.op_version = 1,
> > >.flags  = OPT_FLAG_CLIENT_OPT
> > >  },
> > >  { .key= "cluster.shd-thread-batch-size",
> > >.voltype= "cluster/replicate",
> > >.option = "shd-thread-batch-size",
> > >.op_version = 1,
> > >.flags  = OPT_FLAG_CLIENT_OPT
> > >  },
> > >
> > > So we can tune max threads and thread batch size?  I understand max
> > > threads, but what is batch size?  In my testing on 10G NICs with a
> backend
> > > that will service 10G throughput I see about 1.5 GB per minute of SH
> > > throughput.  To Lindsay's other point, will this patch improve SH
> > > throughput?  My systems can write at 1.5 GB / Sec and NICs can to 1.2
> GB /
> > > sec but I only see ~1.5 GB per _minute_ of SH throughput.  If we can
> not
> > > only make SH multi threaded, but improve the performance of a single
> > > thread that would be awesome.  Super bonus points if we can have some
> sort
> > > of tunible that can limit the bandwidth each thread can consume.  It
> would
> > > be great to be able to crank things up when the systems aren't busy and
> > > slow things down when load increases.
> > This patch is not merged because I thought we needed throttling feature
> > to go in before we can merge this for better control of the self-heal
> > speed. We are doing that for 3.8. So expect to see both of these for 3.8.
>
> Great news!  You da man Pranith, next time I am on your side of the world
> beers are on me :)
>
> -b
>
> >
> > Pranith
> > >
> > > -b
> > >
> > >
> > >> --Humble
> > >>
> > >>
> > >> On Tue, Oct 13, 2015 at 7:26 AM, Atin Mukherjee
> > >> 
> > >> wrote:
> > >>
> > >>> -Atin
> > >>> Sent from one plus one
> > >>> On Oct 13, 2015 3:16 AM, "Ben Turner"  wrote:
> >  - Original Message -
> > > From: "Lindsay Mathieson" 
> > > To: "gluster-users" 
> > > Sent: Friday, October 9, 2015 9:18:11 AM
> > > Subject: [Gluster-users] Speed up heal performance
> > >
> > > Is there any way to max out heal performance? My cluster is unused
> > >>> overnight,
> > > and lightly used at lunchtimes, it would be handy to speed up a
> heal.
> > >
> > > The only tuneable I found was cluster.self-heal-window-size, which
> > >>> doesn't
> > > seem to make much difference.
> >  I don't know of any way to speed this up, maybe someone else could
> chime
> > >>> in here that knows the heal daemon better than me.  Maybe you could
> open
> > >>> an
> > >>> RFE on this?  In my testing I only see 2 files getting healed at a
> time
> > >>> per
> > >>> replica pair.  I would like to see this be multi threaded(if its not
> > >>> already) with the ability to tune it to control resource
> usage(similar to
> > >>> what we did in the rebalance refactoring done recently).  If you let
> me
> > >>> know the BZ # I'll add my data + suggestions, I have been testing
> this
> > >>> pretty extensively in recent weeks and good data + some ideas on how
> to
> > >>> speed things up.
> > >>> Good news is we already have a WIP patch review.glusterd.org/10851
> to
> > >>> introduce multi threaded shd. Credits to Richard/Shreyas from
> facebook
> > >>> for
> > >>> this. IIRC, we also have a BZ for the same but the patch is in rfc
> as of
> > >>> now. AFAIK, this is a candidate to land in 3.8 as well, Vijay can
> correct
> > >>> me oth

Re: [Gluster-users] Monitoring and solving split-brain

2015-10-14 Thread Игорь Бирюлин
Thanks for your reply.

If I do listing in mount point (/repo):
# ls /repo/xxx/keyrings/debian-keyring.gpg
ls: cannot access /repo/xxx/keyrings/debian-keyring.gpg: Input/output error
#
In log /var/log/glusterfs/repo.log I see:
[2015-10-14 16:27:36.006815] W [MSGID: 108008]
[afr-self-heal-name.c:359:afr_selfheal_name_gfid_mismatch_check]
0-repofiles-replicate-0: GFID mismatch for
/debian-keyring.gpg
69aaeee6-624b-400a-aa46-b5c6166c014c on repofiles-client-1 and
b95ad06e-786a-44e5-ba71-af661982071f on repofiles-client-0
[2015-10-14 16:27:36.008996] W [fuse-bridge.c:451:fuse_entry_cbk]
0-glusterfs-fuse: 65961: LOOKUP() /xxx/keyrings/debian-keyring.gpg => -1
(Input/output error)

On first node getfattr return:
# getfattr -d -m . -e hex
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
getfattr: Removing leading '/' from absolute path names
# file: storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
trusted.afr.dirty=0x
trusted.afr.repofiles-client-1=0x00020001
trusted.bit-rot.version=0x020055fdf0910003b37b
trusted.gfid=0xb95ad06e786a44e5ba71af661982071f
# ls -l /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
-rw-r--r-- 2 root root 3456271 Oct 13 19:00
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
#

On second node getfattr return:
# getfattr -d -m . -e hex
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
getfattr: Removing leading '/' from absolute path names
# file: storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
trusted.afr.dirty=0x
trusted.afr.repofiles-client-0=0x
trusted.bit-rot.version=0x020055f97b57000dc3c6
trusted.gfid=0x69aaeee6624b400aaa46b5c6166c014c
# ls -l /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
-rw-r--r-- 2 root root 3450346 Oct  9 16:22
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg
#

Best regards,
Igor






2015-10-14 19:14 GMT+03:00 Ravishankar N :

>
>
> On 10/14/2015 07:02 PM, Игорь Бирюлин wrote:
>
> Hello,
> today in my 2 nodes replica set I've found split-brain. Command 'ls' start
> told 'Input/output error'.
>
>
> What does the mount log (/var/log/glusterfs/.log) say when
> you get this  error?
>
> Can you run getfattr as root for the file from *both* bricks and share the
> result?
> `getfattr -d -m . -e hex
> /storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg`
>
> Thanks.
> Ravi
>
>
> But command 'gluster v heal VOLNAME info split-brain' does not show
> problem files:
> # gluster v heal repofiles info split-brain
> Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
> Number of entries in split-brain: 0
>
> Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
> Number of entries in split-brain: 0
> #
> In output of 'gluster v heal VOLNAME info' I see problem files
> (/xxx/keyrings/debian-keyring.gpg, /repos.json), but without split-brain
> markers:
> # gluster v heal repofiles info
> Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
> /xxx/keyrings/debian-keyring.gpg
> 
> 
> /repos.json
> 
> 
> 
> 
> 
> 
> 
> Number of entries: 11
>
> Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
> Number of entries: 0
> #
>
> I couldn't solve split-brain by new standard command:
> # gluster v heal repofiles  split-brain bigger-file /repos.json
> Lookup failed on /repos.json:Input/output error
> Volume heal failed.
> #
>
> Additional info:
> # gluster v info
>  Volume Name: repofiles
>  Type: Replicate
>  Volume ID: 4b0e2a74-f1ca-4fe7-8518-23919e1b5fa0
>  Status: Started
>  Number of Bricks: 1 x 2 = 2
>  Transport-type: tcp
>  Bricks:
>  Brick1: dist-int-master03.xxx:/storage/gluster_brick_repofiles
>  Brick2: dist-int-master04.xxx:/storage/gluster_brick_repofiles
>  Options Reconfigured:
>  performance.readdir-ahead: on
>  client.event-threads: 4
>  server.event-threads: 4
>  cluster.lookup-optimize: on
> # cat /etc/issue
> Ubuntu 14.04.3 LTS \n \l
> # dpkg -l | grep glusterfs
> ii  glusterfs-client
> 3.7.5-ubuntu1~trusty1amd64clustered file-system
> (client package)
> ii  glusterfs-common
> 3.7.5-ubuntu1~trusty1amd64GlusterFS common
> libraries and translator modules
> ii  glusterfs-server
> 3.7.5-ubuntu1~trusty1amd64clustered file-system
> (server package)
> #
>
> I have 2 questions:
> 1. Why 'gluster v heal VOLNAME info split-brain' doesn't show actual
> split-brain? Why in 'gluster v heal VOLNAME info' I doesn't see markers
> like 'possible in split-brain'?
> How I can monitor my gluster installation if these commands doesn't show
> problems?
> 2. Why 'gluster volume heal VOLNAME split-brain bigger-file FILE' doesn't
> solve split-brain? I understand that I can solve split-brain remove files
> from brick but I thought to use this killer feature.
>
> Best regards,
> Igor
>
>
> ___

Re: [Gluster-users] Monitoring and solving split-brain

2015-10-14 Thread Ravishankar N



On 10/14/2015 07:02 PM, Игорь Бирюлин wrote:

Hello,
today in my 2 nodes replica set I've found split-brain. Command 'ls' 
start told 'Input/output error'.


What does the mount log (/var/log/glusterfs/.log) say 
when you get this  error?


Can you run getfattr as root for the file from *both* bricks and share 
the result?
`getfattr -d -m . -e hex 
/storage/gluster_brick_repofiles/xxx/keyrings/debian-keyring.gpg`


Thanks.
Ravi


But command 'gluster v heal VOLNAME info split-brain' does not show 
problem files:

# gluster v heal repofiles info split-brain
Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
Number of entries in split-brain: 0

Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
Number of entries in split-brain: 0
#
In output of 'gluster v heal VOLNAME info' I see problem files 
(/xxx/keyrings/debian-keyring.gpg, /repos.json), but without 
split-brain markers:

# gluster v heal repofiles info
Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
/xxx/keyrings/debian-keyring.gpg


/repos.json







Number of entries: 11

Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
Number of entries: 0
#

I couldn't solve split-brain by new standard command:
# gluster v heal repofiles  split-brain bigger-file /repos.json
Lookup failed on /repos.json:Input/output error
Volume heal failed.
#

Additional info:
# gluster v info
 Volume Name: repofiles
 Type: Replicate
 Volume ID: 4b0e2a74-f1ca-4fe7-8518-23919e1b5fa0
 Status: Started
 Number of Bricks: 1 x 2 = 2
 Transport-type: tcp
 Bricks:
 Brick1: dist-int-master03.xxx:/storage/gluster_brick_repofiles
 Brick2: dist-int-master04.xxx:/storage/gluster_brick_repofiles
 Options Reconfigured:
 performance.readdir-ahead: on
 client.event-threads: 4
 server.event-threads: 4
 cluster.lookup-optimize: on
# cat /etc/issue
Ubuntu 14.04.3 LTS \n \l
# dpkg -l | grep glusterfs
ii  glusterfs-client 3.7.5-ubuntu1~trusty1amd64
clustered file-system (client package)
ii  glusterfs-common 3.7.5-ubuntu1~trusty1amd64
GlusterFS common libraries and translator modules
ii  glusterfs-server 3.7.5-ubuntu1~trusty1amd64
clustered file-system (server package)

#

I have 2 questions:
1. Why 'gluster v heal VOLNAME info split-brain' doesn't show actual 
split-brain? Why in 'gluster v heal VOLNAME info' I doesn't see 
markers like 'possible in split-brain'?
How I can monitor my gluster installation if these commands doesn't 
show problems?
2. Why 'gluster volume heal VOLNAME split-brain bigger-file FILE' 
doesn't solve split-brain? I understand that I can solve split-brain 
remove files from brick but I thought to use this killer feature.


Best regards,
Igor


___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Speed up heal performance

2015-10-14 Thread Ben Turner
- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Ben Turner" , "Humble Devassy Chirammal" 
> , "Atin Mukherjee"
> 
> Cc: "gluster-users" 
> Sent: Wednesday, October 14, 2015 1:39:14 AM
> Subject: Re: [Gluster-users] Speed up heal performance
> 
> 
> 
> On 10/13/2015 07:11 PM, Ben Turner wrote:
> > - Original Message -
> >> From: "Humble Devassy Chirammal" 
> >> To: "Atin Mukherjee" 
> >> Cc: "Ben Turner" , "gluster-users"
> >> 
> >> Sent: Tuesday, October 13, 2015 6:14:46 AM
> >> Subject: Re: [Gluster-users] Speed up heal performance
> >>
> >>> Good news is we already have a WIP patch review.glusterd.org/10851 to
> >> introduce multi threaded shd. Credits to Richard/Shreyas from facebook for
> >> this. IIRC, we also have a BZ for the same
> >> Isnt it the same bugzilla (
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1221737) mentioned in the
> >> commit log?
> > @Lindsay - No need for a BZ, the above BZ should suffice.
> >
> > @Anyone - In the commit I see:
> >
> >  { .key= "cluster.shd-max-threads",
> >.voltype= "cluster/replicate",
> >.option = "shd-max-threads",
> >.op_version = 1,
> >.flags  = OPT_FLAG_CLIENT_OPT
> >  },
> >  { .key= "cluster.shd-thread-batch-size",
> >.voltype= "cluster/replicate",
> >.option = "shd-thread-batch-size",
> >.op_version = 1,
> >.flags  = OPT_FLAG_CLIENT_OPT
> >  },
> >
> > So we can tune max threads and thread batch size?  I understand max
> > threads, but what is batch size?  In my testing on 10G NICs with a backend
> > that will service 10G throughput I see about 1.5 GB per minute of SH
> > throughput.  To Lindsay's other point, will this patch improve SH
> > throughput?  My systems can write at 1.5 GB / Sec and NICs can to 1.2 GB /
> > sec but I only see ~1.5 GB per _minute_ of SH throughput.  If we can not
> > only make SH multi threaded, but improve the performance of a single
> > thread that would be awesome.  Super bonus points if we can have some sort
> > of tunible that can limit the bandwidth each thread can consume.  It would
> > be great to be able to crank things up when the systems aren't busy and
> > slow things down when load increases.
> This patch is not merged because I thought we needed throttling feature
> to go in before we can merge this for better control of the self-heal
> speed. We are doing that for 3.8. So expect to see both of these for 3.8.

Great news!  You da man Pranith, next time I am on your side of the world beers 
are on me :)

-b

> 
> Pranith
> >
> > -b
> >
> >
> >> --Humble
> >>
> >>
> >> On Tue, Oct 13, 2015 at 7:26 AM, Atin Mukherjee
> >> 
> >> wrote:
> >>
> >>> -Atin
> >>> Sent from one plus one
> >>> On Oct 13, 2015 3:16 AM, "Ben Turner"  wrote:
>  - Original Message -
> > From: "Lindsay Mathieson" 
> > To: "gluster-users" 
> > Sent: Friday, October 9, 2015 9:18:11 AM
> > Subject: [Gluster-users] Speed up heal performance
> >
> > Is there any way to max out heal performance? My cluster is unused
> >>> overnight,
> > and lightly used at lunchtimes, it would be handy to speed up a heal.
> >
> > The only tuneable I found was cluster.self-heal-window-size, which
> >>> doesn't
> > seem to make much difference.
>  I don't know of any way to speed this up, maybe someone else could chime
> >>> in here that knows the heal daemon better than me.  Maybe you could open
> >>> an
> >>> RFE on this?  In my testing I only see 2 files getting healed at a time
> >>> per
> >>> replica pair.  I would like to see this be multi threaded(if its not
> >>> already) with the ability to tune it to control resource usage(similar to
> >>> what we did in the rebalance refactoring done recently).  If you let me
> >>> know the BZ # I'll add my data + suggestions, I have been testing this
> >>> pretty extensively in recent weeks and good data + some ideas on how to
> >>> speed things up.
> >>> Good news is we already have a WIP patch review.glusterd.org/10851 to
> >>> introduce multi threaded shd. Credits to Richard/Shreyas from facebook
> >>> for
> >>> this. IIRC, we also have a BZ for the same but the patch is in rfc as of
> >>> now. AFAIK, this is a candidate to land in 3.8 as well, Vijay can correct
> >>> me otherwise.
>  -b
> 
> > thanks,
> > --
> > Lindsay
> >
> > ___
> > Gluster-users mailing list
> > Gluster-users@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-users
>  ___
>  Gluster-users mailing list
>  Gluster-users@gluster.org
>  http://www.gluster.org/mailman/listinfo/gluster-users
> >>> ___
> >>> Gluster-users mailing list
> >>> Gluster-users@gluster.org
> >>> http://www.gluster.org/mailman/li

[Gluster-users] Geo-Replication "FILES SKIPPED"

2015-10-14 Thread Logan Barfield
We had a connectivity issue on a "tar+ssh" geo-rep link yesterday that
caused a lot of issues.  When the link came back up it immediately went
into a "faulty" state, and the logs were showing "Operation not permitted"
and "File Exists" errors in a loop.

We were finally able to get things back on track by shutting down the
geo-rep link, killing the hung tar processes on the slave, and bringing the
link back up in "rsync" mode.

The master is now back in a "Changelog Crawl" status, and I have confirmed
new files are being copied to the slave correctly.

The status on the master is currently showing 100k+ "FILES SKIPPED."

My question is: Where can I see which files were skipped, and how can I
force them to replicate/update to the slave?
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Monitoring and solving split-brain

2015-10-14 Thread Игорь Бирюлин
Hello,
today in my 2-node replica set I've found a split-brain. The 'ls' command
started returning 'Input/output error'.
But command 'gluster v heal VOLNAME info split-brain' does not show problem
files:
# gluster v heal repofiles info split-brain
Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
Number of entries in split-brain: 0

Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
Number of entries in split-brain: 0
#
In output of 'gluster v heal VOLNAME info' I see problem files
(/xxx/keyrings/debian-keyring.gpg, /repos.json), but without split-brain
markers:
# gluster v heal repofiles info
Brick dist-int-master03.xxx:/storage/gluster_brick_repofiles
/xxx/keyrings/debian-keyring.gpg


/repos.json







Number of entries: 11

Brick dist-int-master04.xxx:/storage/gluster_brick_repofiles
Number of entries: 0
#

I couldn't solve split-brain by new standard command:
# gluster v heal repofiles  split-brain bigger-file /repos.json
Lookup failed on /repos.json:Input/output error
Volume heal failed.
#

Additional info:
# gluster v info
 Volume Name: repofiles
 Type: Replicate
 Volume ID: 4b0e2a74-f1ca-4fe7-8518-23919e1b5fa0
 Status: Started
 Number of Bricks: 1 x 2 = 2
 Transport-type: tcp
 Bricks:
 Brick1: dist-int-master03.xxx:/storage/gluster_brick_repofiles
 Brick2: dist-int-master04.xxx:/storage/gluster_brick_repofiles
 Options Reconfigured:
 performance.readdir-ahead: on
 client.event-threads: 4
 server.event-threads: 4
 cluster.lookup-optimize: on
# cat /etc/issue
Ubuntu 14.04.3 LTS \n \l
# dpkg -l | grep glusterfs
ii  glusterfs-client
3.7.5-ubuntu1~trusty1amd64clustered file-system
(client package)
ii  glusterfs-common
3.7.5-ubuntu1~trusty1amd64GlusterFS common
libraries and translator modules
ii  glusterfs-server
3.7.5-ubuntu1~trusty1amd64clustered file-system
(server package)
#

I have 2 questions:
1. Why doesn't 'gluster v heal VOLNAME info split-brain' show the actual
split-brain? Why don't I see markers like 'possible in split-brain' in
'gluster v heal VOLNAME info'?
How can I monitor my gluster installation if these commands don't show
problems?
2. Why doesn't 'gluster volume heal VOLNAME split-brain bigger-file FILE'
solve the split-brain? I understand that I can resolve split-brain by removing
files from a brick, but I wanted to use this killer feature.

Best regards,
Igor
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] EC planning

2015-10-14 Thread Serkan Çoban
Hi Xavier,

>I'm not sure if I understand you. Are you saying you will create two separate 
>gluster volumes or you will add both bricks to the same distributed-dispersed 
>volume ?

Is adding more than one brick from the same host to a dispersed gluster
volume recommended? I meant two different gluster volumes.
If I add two bricks from the same server to the same dispersed volume and,
let's say, it is an 8+1 configuration, then losing one host will bring down
the volume, right?

>One possibility is to get rid of the server RAID and use each disk as a single
>brick. This way you can create 26 bricks per server and assign each one to a
>different disperse set. A big distributed-dispersed volume balances I/O load
>between bricks better. Note that RAID configurations have a reduction in the
>available number of IOPS. For sequential writes, this is not so bad, but if
>you have many clients accessing the same bricks, you will see many random
>accesses even if clients are doing sequential writes. Caching can alleviate
>this, but if you want to sustain a throughput of 2-3 GB/s, caching effects are
>not so evident.

I can create 26 JBOD disks and use them as bricks, but is this recommended?
With 50 servers the brick count will be 1300; is this not a problem?
Can you explain the configuration a bit more? For example, using 16+2, 26
bricks per server and 54 servers in total. In the end I only want one gluster
volume and protection against 2 host failures.
Also in this case disk failures will be handled by gluster; I hope this
doesn't bring more problems. But I will also test this configuration when I
get the servers.
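For what it's worth, a hedged sketch of how a failed JBOD disk would be
handled in the no-RAID layout (volume, host and brick paths are made up):
the dead brick is swapped for an empty one and self-heal rebuilds it.

  # replace the failed brick with a new, empty one on the same server
  gluster volume replace-brick bigvol \
      server12:/bricks/disk07/brick server12:/bricks/disk07new/brick \
      commit force
  # self-heal repopulates the new brick; watch the pending counters drain
  gluster volume heal bigvol info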

Serkan



On Wed, Oct 14, 2015 at 2:03 PM, Xavier Hernandez  wrote:
> Hi Serkan,
>
> On 13/10/15 15:53, Serkan Çoban wrote:
>>
>> Hi Xavier and thanks for your answers.
>>
>> Servers will have 26 * 8 TB disks. I don't want to lose more than 2 disks
>> for RAID,
>> so my options are HW RAID6 24+2 or 2 * HW RAID5 12+1,
>
>
> A RAID5 of more than 8-10 disks is normally considered unsafe because the
> probability of a second drive failure while reconstructing another failed
> drive is considerably high. The same happens with a RAID6 of more than 16-20
> disks.
>
>> in both cases I can create 2 bricks per server using LVM and use one brick
>> per server to create two distributed-disperse volumes. I will test those
>> configurations when servers arrive.
>
>
> I'm not sure if I understand you. Are you saying you will create two
> separate gluster volumes or you will add both bricks to the same
> distributed-dispersed volume ?
>
>>
>> I can go with 8+1 or 16+2, and will run tests when the servers arrive. But
>> 8+2 will be too much; I lose nearly 25% of the space in this case.
>>
>> For the client count, this cluster will get backups from hadoop nodes,
>> so there will be at least 750-1000 clients sending data at the same time.
>> Can 16+2 * 3 = 54 gluster nodes handle this, or should I increase the node
>> count?
>
>
> In this case I think it would be better to increase the number of bricks,
> otherwise you may have some performance hit to serve all these clients.
>
> One possibility is to get rid of the server RAID and use each disk as a
> single brick. This way you can create 26 bricks per server and assign each
> one to a different disperse set. A big distributed-dispersed volume balances
> I/O load between bricks better. Note that RAID configurations have a
> reduction in the available number of IOPS. For sequential writes, this is
> not so bad, but if you have many clients accessing the same bricks, you will
> see many random accesses even if clients are doing sequential writes.
> Caching can alleviate this, but if you want to sustain a throughput of 2-3
> GB/s, caching effects are not so evident.
>
> Without RAID you could use a 16+2 or even a 16+3 dispersed volume. This
> gives you a good protection and increased storage.
>
> Xavi
>
>>
>> I will check the parameters you mentioned.
>>
>> Serkan
>>
>> On Tue, Oct 13, 2015 at 1:43 PM, Xavier Hernandez wrote:
>>
>> +gluster-users
>>
>>
>> On 13/10/15 12:34, Xavier Hernandez wrote:
>>
>> Hi Serkan,
>>
>> On 12/10/15 16:52, Serkan Çoban wrote:
>>
>> Hi,
>>
>> I am planning to use GlusterFS for backup purposes. I write
>> big files
>> (>100MB) with a throughput of 2-3GB/sn. In order to gain
>> from space we
>> plan to use erasure coding. I have some questions for EC and
>> brick
>> planning:
>> - I am planning to use 200TB XFS/ZFS RAID6 volume to hold
>> one brick per
>> server. Should I increase brick count? is increasing brick
>> count also
>> increases performance?
>>
>>
>> Using a distributed-dispersed volume increases performance. You
>> can
>> split each RAID6 volume into multiple bricks to create such a
>> volume.
>> This is because a single 

Re: [Gluster-users] Gluster 4.0 - upgrades & backward compatibility strategy

2015-10-14 Thread Roman
Hi,

It's hard to comment on plans like these, but I expect everyone will be happy
to have the possibility to upgrade from 3 to 4 without a fresh installation;
an offline upgrade (shut down the volumes and upgrade) would also be OK.
And I'm fairly sure that this upgrade process should be flawless enough that
no one, under any circumstances, would need any kind of rollback, so there
should not be any IFs :)

2015-10-07 8:32 GMT+03:00 Atin Mukherjee :

> Hi All,
>
> Over the course of the design discussion, we got a chance to discuss
> about the upgrades and backward compatibility strategy for Gluster 4.0
> and here is what we came up with:
>
> 1. 4.0 cluster would be separate from 3.x clusters. Heterogeneous
> support won't be available.
>
> 2. All CLI interfaces exposed in 3.x would continue to work with 4.x.
>
> 3. ReSTful APIs for all old & new management actions.
>
> 4. Upgrade path from 3.x to 4.x would be necessary. We need not support
> rolling upgrades, however all data layouts from 3.x would need to be
> honored. Our upgrade path from 3.x to 4.x should not be cumbersome.
>
>
> Initiative wise upgrades strategy details:
>
> GlusterD 2.0
> ------------
>
> - No rolling upgrade, service disruption is expected
> - Smooth upgrade from 3.x to 4.x (migration script)
> - Rollback - If upgrade fails, revert back to 3.x, old configuration
> data shouldn't be wiped off.
>
>
> DHT 2.0
> ---
> - No in place upgrade to DHT2
> - Needs migration of data
> - Hence, backward compatibility does not exist
>
> NSR
> ---
> - volume migration from AFR to NSR is possible with an offline upgrade
>
> We would like to hear from the community about your opinion on this
> strategy.
>
> Thanks,
> Atin
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>



-- 
Best regards,
Roman.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Can we run VMs from Gluster volume

2015-10-14 Thread Roman
I'd recommend Proxmox as the virtualization platform. In the new version (4),
HA works with a few clicks and there is no need for external fencing devices
(it is all handled by the watchdog now). It also runs fine with GlusterFS as
VM storage (I'm running about 20 KVM VMs on gluster and I'm thinking of moving
10 more there from OpenVZ).
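As a rough illustration, the two usual ways to point KVM at a gluster volume
(host, volume and image names are only examples): a plain FUSE mount that any
hypervisor can consume as a directory, or qemu's native libgfapi access.

  # option 1: FUSE-mount the volume and store images on it like a local dir
  mount -t glusterfs gluster1:/vmvol /var/lib/libvirt/images/vmvol
  # option 2: let qemu talk to the volume directly over libgfapi
  qemu-img create -f qcow2 gluster://gluster1/vmvol/disks/vm01.qcow2 40G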

2015-10-01 21:25 GMT+03:00 Steve Dainard :

> If you have enough Linux background to think of implementing gluster
> storage, why not virtualize on Linux as well?
>
> If you're using the standard Hyper-V free version you don't get
> clustering support anyway, so standalone KVM gives you the same basic
> capabilities and you can use virt-manager to manage standalone
> hypervisors with a GUI. You might have to use NFS mounts instead of
> glusterfs; I'm not sure where the support is yet.
>
> If you want clustering (HA, centralized storage, migration etc) then
> take a look at Ovirt which has native gluster storage support. Its
> also significantly more complex than KVM/virt-manager, especially when
> troubleshooting.
>
> FYI Hyper-V only supports SMB storage on 2012+, if that makes any
> difference. Technically you could export glusterfs over Samba, but I
> can envision a world of hurt here.
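For the SMB route, a hedged sketch of exporting a gluster volume through
Samba's vfs_glusterfs module rather than re-sharing a FUSE mount (share and
volume names are assumptions; reload smbd after editing):

  # excerpt for /etc/samba/smb.conf
  [vmstore]
      path = /
      vfs objects = glusterfs
      glusterfs:volume = vmvol
      glusterfs:volfile_server = localhost
      kernel share modes = no
      read only = no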
>
> Lastly if you do implement gluster in any way/shape/form, make sure
> you've got a solid amount of time in your testing phase to work
> through any issues, and figure out how to recover from disasters and
> what you're using for backup.
>
> On Thu, Sep 17, 2015 at 5:47 PM, Nashid Farhad 
> wrote:
> > Hi,
> >
> >
> >
> > We’re looking into rolling out gluster for our network storage.
> >
> >
> >
> > What we want to do is run VMs from the gluster volume using Hyper-V.
> >
> >
> >
> > Now, I’ve read in the documentation that gluster does not support live
> > data (e.g. a live SQL database).
> >
> >
> >
> > Does it also imply that we can’t have live VMs? And what if we want to
> > run an SQL database in the VM?
> >
> >
> >
> > Regards,
> >
> > Nashid
> >
> >
> > ___
> > Gluster-users mailing list
> > Gluster-users@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-users
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>



-- 
Best regards,
Roman.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] MTU size issue?

2015-10-14 Thread Sander Zijlstra
LS,

I recently reconfigured one of my gluster nodes and forgot to update the MTU
size on the switch, even though I had configured the host with jumbo frames.

The result was that the complete cluster had communication issues.

All systems are part of a distributed striped volume with a replica count of 2,
but still the cluster was completely unusable until I updated the switch port
to accept jumbo frames rather than discard them.

The symptoms were:

- Gluster clients had a very hard time reading the volume information and thus 
couldn’t do any filesystem ops on them.
- The glusterfs servers could see each other (peer status) and a volume info 
command was ok, but a volume status command would not return or would return a 
“staging failed” error.

I know that MTU size mismatches and don’t-fragment bits can screw up a lot,
but why wasn’t that gluster peer simply discarded from the cluster, so that
clients would stop communicating with it and causing all sorts of errors?
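For reference, a hedged way to verify jumbo frames end to end before relying
on them (interface and peer names are examples): 8972 bytes of ICMP payload
plus 28 bytes of headers corresponds to a 9000-byte MTU, and -M do sets the
don’t-fragment bit.

  ip link show eth0 | grep -o 'mtu [0-9]*'   # confirm the local interface MTU
  ping -c 3 -M do -s 8972 peer-node          # must succeed between every pair of nodes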

I use GlusterFS 3.6.2 at the moment.

Kind regards
Sander





___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] REMINDER: Weekly gluster community meeting to start in 30 minutes

2015-10-14 Thread Raghavendra Bhat
Hi All,

In 30 minutes from now we will have the regular weekly Gluster
Community meeting.

Meeting details:
- location: #gluster-meeting on Freenode IRC
- date: every Wednesday
- time: 12:00 UTC, 14:00 CEST, 17:30 IST
(in your terminal, run: date -d "12:00 UTC")
- agenda: https://public.pad.fsfe.org/p/gluster-community-meetings

Currently the following items are listed:
* Roll Call
* Status of last week's action items
* Gluster 3.7
* Gluster 3.8
* Gluster 3.6
* Gluster 3.5
* Gluster 4.0
* Open Floor
- bring your own topic!

The last topic has space for additions. If you have a suitable topic to
discuss, please add it to the agenda.


Regards,
Raghavendra Bhat
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] EC planning

2015-10-14 Thread Xavier Hernandez

Hi Serkan,

On 13/10/15 15:53, Serkan Çoban wrote:

Hi Xavier and thanks for your answers.

Servers will have 26 * 8 TB disks. I don't want to lose more than 2 disks
for RAID,
so my options are HW RAID6 24+2 or 2 * HW RAID5 12+1,


A RAID5 of more than 8-10 disks is normally considered unsafe because 
the probability of a second drive failure while reconstructing another 
failed drive is considerably high. The same happens with a RAID6 of more 
than 16-20 disks.



in both cases I can create 2 bricks per server using LVM and use one brick
per server to create two distributed-disperse volumes. I will test those
configurations when servers arrive.


I'm not sure if I understand you. Are you saying you will create two 
separate gluster volumes or you will add both bricks to the same 
distributed-dispersed volume ?




I can go with 8+1 or 16+2, and will run tests when the servers arrive. But 8+2
will be too much; I lose nearly 25% of the space in this case.

For the client count, this cluster will get backups from hadoop nodes,
so there will be at least 750-1000 clients sending data at the same time.
Can 16+2 * 3 = 54 gluster nodes handle this, or should I increase the node count?


In this case I think it would be better to increase the number of 
bricks, otherwise you may have some performance hit to serve all these 
clients.


One possibility is to get rid of the server RAID and use each disk as a 
single brick. This way you can create 26 bricks per server and assign 
each one to a different disperse set. A big distributed-dispersed volume 
balances I/O load between bricks better. Note that RAID configurations 
have a reduction in the available number of IOPS. For sequential writes, 
this is not so bad, but if you have many clients accessing the same 
bricks, you will see many random accesses even if clients are doing 
sequential writes. Caching can alleviate this, but if you want to 
sustain a throughput of 2-3 GB/s, caching effects are not so evident.


Without RAID you could use a 16+2 or even a 16+3 dispersed volume. This 
gives you a good protection and increased storage.
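As a back-of-the-envelope comparison of usable space for the layouts being
discussed (assuming 54 servers with 26 x 8 TB disks each; integer TB,
filesystem overhead ignored):

  RAW=$((54 * 26 * 8))                        # 11232 TB of raw disk
  echo "16+2: $((RAW * 16 / 18)) TB usable"   # 9984 TB, survives 2 failed bricks per set
  echo "16+3: $((RAW * 16 / 19)) TB usable"   # 9458 TB, survives 3 failed bricks per set
  echo " 8+2: $((RAW *  8 / 10)) TB usable"   # 8985 TB, survives 2 failed bricks per set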


Xavi



I will check the parameters you mentioned.

Serkan

On Tue, Oct 13, 2015 at 1:43 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

+gluster-users


On 13/10/15 12:34, Xavier Hernandez wrote:

Hi Serkan,

On 12/10/15 16:52, Serkan Çoban wrote:

Hi,

I am planning to use GlusterFS for backup purposes. I write
big files
(>100MB) with a throughput of 2-3GB/sn. In order to gain
from space we
plan to use erasure coding. I have some questions for EC and
brick
planning:
- I am planning to use 200TB XFS/ZFS RAID6 volume to hold
one brick per
server. Should I increase brick count? is increasing brick
count also
increases performance?


Using a distributed-dispersed volume increases performance. You can
split each RAID6 volume into multiple bricks to create such a
volume.
This is because a single brick process cannot achieve the maximum
throughput of the disk, so creating multiple bricks improves this.
However having too many bricks could be worse because all
request will
go to the same filesystem and will compete between them in your
case.

Another thing to consider is the size of the RAID volume. A
200TB RAID
will require *a lot* of time to reconstruct in case of failure
of any
disk. Also, a 200 TB RAID means you need almost 30 8TB disks. A
RAID6 of
30 disks is quite fragile. Maybe it would be better to create
multiple
RAID6 volumes, each with 18 disks at most (16+2 is a good and
efficient
configuration, specially for XFS on non-hardware raids). Even in
this
configuration, you can create multiple bricks in each RAID6 volume.
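A hedged sketch of splitting one hardware RAID6 device into several bricks
with LVM (device name, sizes and mount points are assumptions; the XFS inode
size follows the usual gluster recommendation):

  pvcreate /dev/sdb
  vgcreate vg_bricks /dev/sdb
  for i in 1 2 3 4; do
      lvcreate -L 30T -n brick$i vg_bricks
      mkfs.xfs -i size=512 /dev/vg_bricks/brick$i
      mkdir -p /bricks/brick$i
      mount /dev/vg_bricks/brick$i /bricks/brick$i
  done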

- I plan to use 16+2 for EC. Is this a problem? Should I
decrease this
to 12+2 or 10+2? Or is it completely safe to use whatever we
want?


16+2 is a very big configuration. It requires much computation
power and
forces you to grow (if you need to grow the gluster volume at some
point) in multiples of 18 bricks.

Considering that you are already using a RAID6 in your servers,
what you
are really protecting with the disperse redundancy is the
failure of the
servers themselves. Maybe a 8+1 configuration could be enough
for your
needs and requires less computation. If you really need
redundancy 2,
8+2 should be ok.

Using values that are not a power of 2 has a theoretical impact
on the
performance of the disperse volume when applications write
blocks whose
size is a multiple of a power

[Gluster-users] glusterfs-3.7.5 released

2015-10-14 Thread Pranith Kumar Karampuri

Hi all,

I'm pleased to announce the release of GlusterFS-3.7.5. This release
includes 70 changes after 3.7.4. The list of fixed bugs is included
below.

Tarball and RPMs can be downloaded from
http://download.gluster.org/pub/gluster/glusterfs/3.7/3.7.5/

Ubuntu debs are available from
https://launchpad.net/~gluster/+archive/ubuntu/glusterfs-3.7

Debian Unstable (sid) packages have been updated and should be
available from default repos.

NetBSD has updated ports at
ftp://ftp.netbsd.org/pub/pkgsrc/current/pkgsrc/filesystems/glusterfs/README.html


Upgrade notes from 3.7.2 and earlier

GlusterFS uses insecure ports by default from release v3.7.3. This
causes problems when upgrading from release 3.7.2 and below to 3.7.3
and above. Performing the following steps before upgrading helps avoid
problems.

- Enable insecure ports for all volumes.

 ```
 gluster volume set  server.allow-insecure on
 gluster volume set  client.bind-insecure on
 ```

- Enable insecure ports for GlusterD. Set the following line in
`/etc/glusterfs/glusterd.vol`

 ```
 option rpc-auth-allow-insecure on
 ```

 This needs to be done on all the members in the cluster.


Fixed bugs
==
1258313 - Start self-heal and display correct heal info after replace brick
1268804 - Test tests/bugs/shard/bug-1245547.t failing consistently when run 
with patch http://review.gluster.org/#/c/11938/
1261234 - Possible memory leak during rebalance with large quantity of files
1259697 - Disperse volume: Huge memory leak of glusterfsd process
1267817 - No quota API to get real hard-limit value.
1267822 - Have a way to disable readdirp on dht from glusterd volume set command
1267823 - Perf: Getting bad performance while doing ls
1267532 - Data Tiering:CLI crashes with segmentation fault when user tries "gluster 
v tier" command
1267149 - Perf: Getting bad performance while doing ls
1266822 - Add more logs in failure code paths + port existing messages to the 
msg-id framework
1262335 - Fix invalid logic in tier.t
1251821 - /usr/lib/glusterfs/ganesha/ganesha_ha.sh is distro specific
1258338 - Data Tiering: Tiering related information is not displayed in gluster 
volume info xml output
1266872 - FOP handling during file migration is broken in the release-3.7 
branch.
1266882 - RFE: posix: xattrop 'GF_XATTROP_ADD_DEF_ARRAY' implementation
1246397 - POSIX ACLs as used by a FUSE mount can not use more than 32 groups
1265633 - AFR : "gluster volume heal  dest=:1.65 reply_serial=2"
1265890 - rm command fails with "Transport end point not connected" during add 
brick
1261444 - cli : volume start will create/overwrite ganesha export file
1258347 - Data Tiering: Tiering related information is not displayed in gluster 
volume status xml output
1258340 - Data Tiering:Volume task status showing as remove brick when detach 
tier is trigger
1260919 - Quota+Rebalance : While rebalance is in progress , quota list shows 
'Used Space' more than the Hard Limit set
1264738 - 'gluster v tier/attach-tier/detach-tier help' command shows the 
usage, and then throws 'Tier command failed' error message
1262700 - DHT + rebalance :- file permission got changed (sticky bit and setgid 
is set) after file migration failure
1263191 - Error not propagated correctly if selfheal layout lock fails
1258244 - Data Tieirng:Change error message as detach-tier error message throws as 
"remove-brick"
1263746 - Data Tiering:Setting only promote frequency and no demote frequency 
causes crash
1262408 - Data Tieirng:Detach tier status shows number of failures even when 
all files are migrated successfully
1262547 - `getfattr -n replica.split-brain-status ' command hung on the 
mount
1262344 - quota: numbers of warning messages in nfs.log a single file itself
1260858 - glusterd: volume status backward compatibility
1261742 - Tier: glusterd crash when trying to detach , when hot tier is having 
exactly one brick and cold tier is of replica type
1262197 - DHT: Few files are missing after remove-brick operation
1261008 - Do not expose internal sharding xattrs to the application.
1262341 - Database locking due to write contention between CTR sql connection 
and tier migrator sql connection
1261715 - [HC] Fuse mount crashes, when client-quorum is not met
1260511 - fuse client crashed during i/o
1261664 - Tiering status command is very cumbersome.
1259694 - Data Tiering:Regression:Commit of detach tier passes without directly 
without even issuing a detach tier start
1260859 - snapshot: from nfs-ganesha mount no content seen in 
.snaps/ directory
1260856 - xml output for volume status on tiered volume
1260593 - man or info page of gluster needs to be updated with self-heal 
commands.
1257394 - Provide more meaningful errors on peer probe and peer detach
1258769 - Porting log messages to new framework
1255110 - client is sending io to arbiter with replica 2
1259652 - quota test 'quota-nfs.t' f