[Bug 3099] Please parallelize filesystem scan

2015-07-17 Thread samba-bugs
https://bugzilla.samba.org/show_bug.cgi?id=3099

--- Comment #8 from Chip Schweiss c...@innovates.com ---
I would argue that optionally all directory scanning should be made parallel.  
Modern file systems perform best when request queues are kept full.  The
current mode of rsync scanning directories does nothing to take advantage of
this.   

I currently use scripts to split a couple dozen or so rsync jobs into
literally hundreds of jobs.  This reduces execution time from what would be days
to a couple of hours every night.  There are lots of scripts like this appearing
on the net because the current state of rsync is inadequate.

This ticket could reasonably be combined with 5124.
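
For readers who have not seen one of these splitting scripts, here is a minimal
sketch in Python. It is not the script used above: the one-job-per-subdirectory
split, the function names build_jobs and run_jobs, and the worker count are all
assumptions made for illustration. Only the rsync -a option is used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_jobs(src_root, dest):
    """Return one rsync command per immediate subdirectory of src_root.

    dest is a string such as "host:/path" or "/local/path".
    """
    jobs = []
    for sub in sorted(p for p in Path(src_root).iterdir() if p.is_dir()):
        # Trailing slashes: copy the contents of sub into the directory
        # of the same name under dest.
        jobs.append(["rsync", "-a", f"{sub}/", f"{dest.rstrip('/')}/{sub.name}/"])
    return jobs


def run_jobs(jobs, workers=8):
    """Run the commands concurrently and return their exit codes in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, jobs))
```

Note that this sketch skips files that sit directly in the top-level directory,
so a real script would need one extra non-recursive pass for those. How many
workers actually help depends on the storage at both ends.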

-- 
You are receiving this mail because:
You are the QA Contact for the bug.

-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


Re: [Bug 3099] Please parallelize filesystem scan

2015-07-17 Thread Ken Chase
I don't understand - scanning metadata is sped up by thrashing the head
all over the disk instead of mostly-sequentially scanning through?

How does that work out?

/kc



-- 
Ken Chase - k...@heavycomputing.ca skype:kenchase23 Toronto Canada
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.



[Bug 3099] Please parallelize filesystem scan

2015-07-17 Thread samba-bugs
https://bugzilla.samba.org/show_bug.cgi?id=3099

--- Comment #7 from Rainer rai...@voigt-home.net ---
Hi,

I'm experiencing the very same problem: I'm trying to sync a set of VMware disk
files (about 2.5TB) with not too many changes, and direct copying is still
faster than checksumming by quite a large margin, because the sequential
checksumming on source and target just doubles the time needed.

I think the point is that the GigE link between the PC and the NAS achieves
about 80MB/s, and the HDD read rate is not much higher (approx. 130MB/s). 

When doing the checksumming on source and target in parallel we could ideally
(if nothing changed) reach the read rate of the HDDs as 'transfer' bandwidth,
because this is the speed at which we can verify that the data is the same on
source and target. The sequential approach as it is now reduces the initial
check to half the HDD read rate, so transferring unchanged files will only yield
about 65MB/s in my case, which is slower than simple copying.
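
The arithmetic behind these figures can be written out as a small sketch. It
assumes both disks read at about 130MB/s (the message only gives one HDD
figure), that 1TB is 1,000,000MB, and that nothing else limits throughput. The
function names are made up for this example.

```python
def sequential_check_rate(src_mb_s, dst_mb_s):
    """Verified MB/s when the source is checksummed first, then the target.

    The time per MB is the sum of both read times, so the rates combine
    as a * b / (a + b).
    """
    return src_mb_s * dst_mb_s / (src_mb_s + dst_mb_s)


def parallel_check_rate(src_mb_s, dst_mb_s):
    """Verified MB/s when both sides checksum at once: the slower disk wins."""
    return min(src_mb_s, dst_mb_s)


def copy_rate(link_mb_s, src_mb_s):
    """Plain copy is limited by the slower of the link and the source disk."""
    return min(link_mb_s, src_mb_s)


def hours(total_tb, rate_mb_s):
    """Hours needed to get through total_tb at rate_mb_s."""
    return total_tb * 1_000_000 / rate_mb_s / 3600


HDD, LINK, SIZE_TB = 130.0, 80.0, 2.5   # figures quoted in the message
for label, rate in [
    ("sequential checksum", sequential_check_rate(HDD, HDD)),
    ("plain copy", copy_rate(LINK, HDD)),
    ("parallel checksum", parallel_check_rate(HDD, HDD)),
]:
    print(f"{label}: {rate:.0f} MB/s, {hours(SIZE_TB, rate):.1f} h")
```

Under these assumptions the ordering matches the description above: the
sequential check is the slowest, plain copying sits in the middle, and the
parallel check is bounded only by the disk read rate.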

Is the patch you proposed some years ago something I can apply to a current
rsync version and try? If not, could you update it to the 3.1.x version so I
can benchmark the parallel checksumming in my situation?

Best Regards
Rainer



Re: [Bug 3099] Please parallelize filesystem scan

2015-07-17 Thread Ken Chase
Sounds to me like maintaining the metadata cache is important - and tuning the
filesystem to do so would be more beneficial than caching writes, especially
with a backup target where a write already written will likely never be read
again (and isn't a big deal if it is, since so few files are changed compared to
the total # of inodes to scan).

Your report of the minutes for the re-sync shows the unthrashed cache is highly
valuable. So all we need to do is tune the backup target (and even the
operational servers themselves) to maintain more metadata. I don't know how much
RAM is used per inode, but I'd throw in another 4-8GB just for metadata caching
per box, or even more, if it meant scanning was sped up.

(Really, actually, one only needs it in the backup target - if you can run all
the backups in parallel, and there are N servers to back up, they can all run at
1/N speed, as long as scanning metadata on the backup target is fast enough to
keep up with it all -- my total data written is only 20-30GB for example, which
at reasonable speed (20-30MB/s even, which is slow) is only 15 minutes total
writing. Even 200-300GB changed would be 150 minutes at that rate, and the rate
could easily be 4x faster.)
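
As a quick check of the write-time estimate, here is the same calculation as a
sketch. It assumes 1GB = 1000MB and a constant write rate; write_minutes is a
name invented for this example.

```python
def write_minutes(changed_gb, rate_mb_per_s):
    """Minutes needed to write changed_gb at rate_mb_per_s (1 GB = 1000 MB)."""
    return changed_gb * 1000 / rate_mb_per_s / 60


# Pair the low and high ends of the ranges given in the message.
for gb, rate in [(20, 20), (30, 30), (200, 20), (300, 30)]:
    print(f"{gb} GB at {rate} MB/s: {write_minutes(gb, rate):.0f} min")
```

With these pairings the results come out at about 17 and 167 minutes, so the
round figures of 15 and 150 minutes are the right order of magnitude.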

So, tuning caches to prefer metadata seems to be key. How?
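
One knob that exists on Linux for this is the vm.vfs_cache_pressure sysctl
(default 100): lowering it makes the kernel prefer to keep dentry and inode
caches over other cached data. The sketch below only reads and writes that
value. The helper names are made up, writing the real file needs root, and
whether a lower value actually helps a given backup target is an assumption to
be tested, since filesystems that manage their own cache may not be affected.

```python
from pathlib import Path

# Linux sysctl file; the kernel default is 100.
VFS_CACHE_PRESSURE = Path("/proc/sys/vm/vfs_cache_pressure")


def read_cache_pressure(path=VFS_CACHE_PRESSURE):
    """Return the current setting as an integer."""
    return int(Path(path).read_text().strip())


def set_cache_pressure(value, path=VFS_CACHE_PRESSURE):
    """Write a new setting; lower values favour keeping inode/dentry caches."""
    if value < 0:
        raise ValueError("vfs_cache_pressure must be non-negative")
    Path(path).write_text(f"{value}\n")
```

A value of 0 tells the kernel never to reclaim these caches under memory
pressure, which can run the box out of memory, so small steps down from 100 are
the safer experiment.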

As we've discussed before, letting the filesystem at it throws away precious
metadata cache, and so tracking your own changes (since the backup system will
never be used for anything else, right? :) would be beneficial. Of course the
danger is using the backup system for anything else and changing any of the
target info - inconsistencies would crop up and make the backup worthless very
quickly.

/kc

On Fri, Jul 17, 2015 at 03:18:02PM +, Schweiss, Chip said:
  Modern file systems have many internal queues, and service many clients
  simultaneously.  They arrange their work to maximize throughput in both read
  and write operations.  This is the norm on any enterprise file system, be it
  Hitachi, Oracle, Dell, HP, Isilon, etc.  You will get significantly higher
  throughput if you hit it with multiple threads.  These systems have elaborate
  predictive read-ahead caches and perform best when multiple threads hit them.
  
  Using the test case of a single server with a simple file system such as
  ext3/4 or xfs, no gains will be seen in multithreading rsync.  Use an
  enterprise file system with hundreds of TBs and the more threads you use the
  faster you will go.  Metadata and data on these systems end up across hundreds
  of disks.  Single threads end up severely bound by latency.  This is why
  multi-threading should be optional.  It doesn't help everyone.
  
  For example, one of my rsync jobs moving from a ZFS system in St. Louis,
  Missouri to a Hitachi HNAS in Minneapolis, Minnesota has over 100 million
  files.  Each day 50 to 100 thousand files get added or updated.  A single
  rsync job would take weeks to parse this job and send the changes.  I split it
  into 120 jobs and it typically completes in 2 hours when no humans are using
  the systems.  A re-sync immediately afterwards, again with 120 jobs, scans
  both ends in minutes.
  
  -Chip
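
To see why a latency-bound scan behaves this way, here is a rough model, not a
measurement. It assumes every file costs one fixed round trip and that jobs
scale perfectly. The 10 ms per-file latency is an invented figure chosen only
for illustration; the file count and job count come from the message above.

```python
def scan_hours(n_files, latency_s, jobs):
    """Hours to visit n_files when each lookup costs latency_s seconds and
    `jobs` lookups are in flight at once (ideal, linear scaling)."""
    return n_files * latency_s / jobs / 3600


N_FILES = 100_000_000   # "over 100 million files" from the message
LATENCY_S = 0.010       # ASSUMPTION: 10 ms per file, not a measured value

print(f"1 job:    {scan_hours(N_FILES, LATENCY_S, 1) / 24:.1f} days")
print(f"120 jobs: {scan_hours(N_FILES, LATENCY_S, 120):.1f} hours")
```

With that assumed latency a single job is in the range of days to weeks and 120
jobs land at a couple of hours. On a single spindle the assumption of linear
scaling does not hold, which is the other side of this discussion.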
  

Re: [Bug 3099] Please parallelize filesystem scan

2015-07-17 Thread ray vantassle
Ken, this just happens to be a special case where your configuration has a
huge number of spindles.  If you have multiple threads reading the same
spindle you'll just be thrashing the heads back & forth.  If there is one
thread reading at the front of the disk and another thread reading at the
end of the disk, it will be *slower* than if you have just one thread
reading first the front of the disk and then the end of the disk.  Two
threads will just have the head whipping back and forth.

  one of my rsync jobs moving from a ZFS system ... has over 100 million
  files
Spreads over how many spindles?

The problem is, the optimum way to access the disks depends on how the data
lies on the disks.  And that's something that a mere program cannot know.
Only the filesystem can know that information.  Whether it's ext4, md,
btrfs, zfs, or whatever -- a program like rsync cannot possibly know how
best to access the disk(s) and with how many simultaneous threads.
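
The single-spindle case can be illustrated with a toy model. It assumes one
head, block numbers standing in for physical position, and requests served
strictly in arrival order; real disks and I/O schedulers reorder requests, so
this is only a sketch of the worst case. head_travel and the block ranges are
made up for the example.

```python
def head_travel(order):
    """Total distance the head moves when serving blocks in the given order."""
    travel, pos = 0, order[0]
    for block in order:
        travel += abs(block - pos)
        pos = block
    return travel


front = list(range(0, 100))      # thread A reads the start of the disk
back = list(range(900, 1000))    # thread B reads the end of the disk

one_thread = front + back                                    # A, then B
two_threads = [b for pair in zip(front, back) for b in pair]  # A and B interleaved

print("one thread: ", head_travel(one_thread))
print("two threads:", head_travel(two_threads))
```

In this model the interleaved order moves the head well over a hundred times
further than the sequential order, which is the thrashing described above.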

[Bug 3099] Please parallelize filesystem scan

2013-02-09 Thread samba-bugs
https://bugzilla.samba.org/show_bug.cgi?id=3099

--- Comment #6 from Arie Skliarouk sklia...@gmail.com 2013-02-10 06:45:30 UTC ---
Any hope for the bug to be resolved? It is really inconvenient to have a
production database down for double the amount of time that is really
necessary.

-- 
Configure bugmail: https://bugzilla.samba.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the QA contact for the bug.


Re: [Bug 3099] Please parallelize filesystem scan

2005-09-16 Thread Chris Shoemaker
On Thu, Sep 15, 2005 at 09:32:44PM -0400, Chris Shoemaker wrote:
 IMHO, in general, optimizing for the few-changes (small delta) case
 is the right thing to do.  Rsync's utility diminishes anyway as delta
 increases, so there's no reason not to make efficiency increase with
 increasing delta.

err... I meant: make efficiency increase as delta *decreases*.
i.e. optimize for small-changes case.

 


[Bug 3099] Please parallelize filesystem scan

2005-09-16 Thread samba-bugs
https://bugzilla.samba.org/show_bug.cgi?id=3099


[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1448 is|0                           |1
           obsolete|                            |




--- Additional Comments From [EMAIL PROTECTED]  2005-09-16 09:47 ---
Created an attachment (id=1452)
 -- (https://bugzilla.samba.org/attachment.cgi?id=1452&action=view)
Improved patch for early checksums

This version of the patch fixes a few potential problems with the first one.



[Bug 3099] Please parallelize filesystem scan

2005-09-16 Thread samba-bugs
https://bugzilla.samba.org/show_bug.cgi?id=3099


[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|WONTFIX                     |




--- Additional Comments From [EMAIL PROTECTED]  2005-09-16 09:47 ---
I've reopened this suggestion to consider the attached patch.



[Bug 3099] Please parallelize filesystem scan

2005-09-15 Thread samba-bugs
https://bugzilla.samba.org/show_bug.cgi?id=3099





--- Additional Comments From [EMAIL PROTECTED]  2005-09-15 13:49 ---
Pardon me for being dense, but how could it possibly require a change to the
rsync protocol for the second host in the sequence to pre-scan its filesystem,
so that that data is available when needed?



[Bug 3099] Please parallelize filesystem scan

2005-09-15 Thread samba-bugs
https://bugzilla.samba.org/show_bug.cgi?id=3099





--- Additional Comments From [EMAIL PROTECTED]  2005-09-15 16:23 ---
Created an attachment (id=1448)
 -- (https://bugzilla.samba.org/attachment.cgi?id=1448&action=view)
One possible way to reorder the checksum computation.

 how could it possibly require a change to the rsync protocol for the
 second host in the sequence to pre-scan its filesystem, so that that
 data is available when needed?

The only way to know what to scan is to look at the file list from the sender
(since the receiver usually doesn't know anything other than the destination
directory, and options such as -R, --exclude, and --files-from can radically
limit what files need to be scanned).

I suppose it would be possible for the receiver to compute the full-file
checksums as the file list is arriving from the sender (yes, the sender sends
the list incrementally as it is created), but the code currently doesn't know
if the destination spec is a file or a directory until after it receives the
file list, so the code would need to be made to attempt a chdir to the
destination arg and to skip the pre-caching if that doesn't work.
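
The idea in the paragraph above can be sketched roughly as follows. This is not
the attached patch and not rsync's code: it is Python, it uses SHA-256 as a
stand-in for rsync's own file checksum, it uses a directory test where the text
talks about attempting a chdir, and the names prescan and file_checksum are
invented for the example.

```python
import hashlib
import os


def file_checksum(path, chunk_size=1 << 20):
    """Full-file SHA-256 digest, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def prescan(dest, incoming_names):
    """Checksum destination files as their names arrive from the sender.

    incoming_names may be any iterable, including a generator that yields
    names while the file list is still being received.  If dest is not a
    directory, pre-caching is skipped and an empty cache is returned.
    """
    if not os.path.isdir(dest):
        return {}
    cache = {}
    for name in incoming_names:
        path = os.path.join(dest, name)
        if os.path.isfile(path):   # new files have nothing to pre-compute
            cache[name] = file_checksum(path)
    return cache
```

Because the names come from the sender's list, options that limit the transfer
on the sending side also limit what gets scanned here.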

One bad thing about this solution is that we really should be making the
sending side not pre-compute the checksums before the start of the transfer
phase (to be like the generator, which computes the checksums while looking for
files to transfer). Computing them during the transfer makes it more likely
that the file's data in the disk cache will be able to be re-used when a file
needs to be updated. Thus, changing the receiving side to pre-compute the
checksums before starting the transfer seems to be going in the wrong direction
(though it might speed up a large transfer where few files were different, it
might also slow down a large transfer where many files were changed).

The attached patch implements a simple pre-scan that works with basic options.
It could be improved to handle things like --compare-dest better, but I think
it basically works.  If you'd care to run some speed tests, maybe you could
persuade me that this kluge would be worth looking at further (I'm not
considering it at the moment).



Re: [Bug 3099] Please parallelize filesystem scan

2005-09-15 Thread Chris Shoemaker
On Thu, Sep 15, 2005 at 04:23:24PM -0700, [EMAIL PROTECTED] wrote:
 One bad thing about this solution is that we really should be making the
 sending side not pre-compute the checksums before the start of the transfer
 phase (to be like the generator, which computes the checksums while looking
 for files to transfer). Computing them during the transfer makes it more likely
 that the file's data in the disk cache will be able to be re-used when a file
 needs to be updated. Thus, changing the receiving side to pre-compute the
 checksums before starting the transfer seems to be going in the wrong
 direction (though it might speed up a large transfer where few files were
 different, it might also slow down a large transfer where many files were
 changed).

IMHO, in general, optimizing for the few-changes (small delta) case
is the right thing to do.  Rsync's utility diminishes anyway as delta
increases, so there's no reason not to make efficiency increase with
increasing delta.

-chris

 