Re: [Puppet Users] Mirror folder with large files

Daniel Piddock Tue, 25 Jan 2011 06:40:26 -0800

On 25/01/11 12:45, Brice Figureau wrote:
> On Mon, 2011-01-24 at 17:14 +0000, Daniel Piddock wrote:
>> Dear list,
>>
>> I'm attempting to mirror a folder containing a few large files from an
>> NFS location to the local drive. Subsequent runs take a lot longer than
>> I'd have expected, after the first run.
>>
>> Using the following block a puppet apply run is currently taking 30 seconds:
>> file { '/usr/share/target':
>>         source   => 'file:///home/archive/source/',
>>         recurse  => true,
>>         backup   => false,
>>         checksum => mtime,
>> }
>>
>> There are 42 files taking up 870MB. I'd have thought stating the files
>> in the source and target, comparing to each other (or a cache internal
>> to puppet as it doesn't set the mtime on files) would be a lot faster
>> than it is.
> This is a naive view of the problem :)
> The puppet file type is certainly the most complex resource abstraction
> puppet embeds (just think about the fact that it handles dir, files,
> link, remote recursion, local recursion, etc...).


Yes, it's a shame that "checksum => mtime" doesn't do what it says on the
tin, and that the documentation barely mentions how the checksum types
differ or behave. Even so, md5summing every file twice when recursing
seems a bit broken.
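
For illustration, this is roughly the stat-only comparison I had expected
from "checksum => mtime". It is a minimal sketch, not Puppet code:
stale_files is a hypothetical helper, and it assumes a target file counts
as current when it exists, has the same size as the source, and is not
older than the source (Puppet doesn't set the mtime on the files it
copies, so the mtimes are never equal).

```ruby
require 'find'

# Hypothetical helper, not part of Puppet: list the files under source_dir
# (as paths relative to it) whose copy under target_dir is missing, differs
# in size, or is older than the source. Only stat() is used; no file
# contents are read.
def stale_files(source_dir, target_dir)
  source_dir = source_dir.chomp('/')
  stale = []
  Find.find(source_dir) do |src|
    next unless File.file?(src)

    rel = src[source_dir.length..-1].sub(%r{\A/+}, '')
    dst = File.join(target_dir, rel)
    if !File.exist?(dst)
      stale << rel
    else
      s = File.stat(src)
      d = File.stat(dst)
      stale << rel if s.size != d.size || s.mtime > d.mtime
    end
  end
  stale.sort
end
```

Calling stale_files('/home/archive/source/', '/usr/share/target') would
cost a couple of stat calls per file instead of reading 870MB.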

>> I was curious about what puppet was up to, so ran it in strace. It's
>> reading every file every run, multiple times! Reads the target twice,
>> then the source twice before reading the target again. Considering I
>> wasn't expecting it to open any of the files at all this is total overkill.
>>
>> Is this horribly bugged or have I got a magic incantation that's causing
>> this behaviour? strace is rather verbose and I haven't exactly read all
>> 80MB of the dump line by line.
>>
>> Is there a neater way of just mirroring a folder based on modification
>> time? I suppose the easiest route would be an exec of rsync, at least I
>> have control over that.
> Yes, I think rsync is the sanest way to do this.
>
> Recursive file resources (and especially sourced ones) are really tough
> for puppet to handle in the current way the code is working.
>
> Puppet manages individual file resources, and every resource it
> manages is held as an instance of this resource in memory.
>
> For deep/large file hierarchies, Puppet has to create/manage an
> individual resource per file/directory present in this hierarchy, which
> consumes both cpu and ram (due to the way the ruby GC is poorly
> implemented and the time it takes to create a ruby object).
> And that's without mentioning the scalability issues of generating and
> handling the billions of "change" events raised each time a file is
> changed (which happens for instance the first time puppet runs).
>
> I think I remember mtime is a checksum valid only for directories, and
> puppet automatically switches to md5 for files (I don't really know the
> reason, but I'm sure redmine knows it).
>
> (One of) the problems is that puppet reads the file once to compute the
> md5 sum, then reads it again to perform the copy when it detects
> a change. I don't exactly know why it would write multiple times, but
> I'm sure you can debug this by adding debug statements in
> puppet/type/file/content.rb where all the writes happen.

When recursing, the source file is read twice, then the target is tested;
if it doesn't exist the source is read again for the copy. If the target
does exist, it's read twice as well. It makes no difference whether the
checksum is specified as md5 or mtime. I put more detail on issue 6003:
http://projects.puppetlabs.com/issues/6003

Writing only happens once per changed file.
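
To show that the checksum and the copy don't each need their own read of
the source, here is a minimal sketch that does both in one pass. It is
not how Puppet's file type is written: copy_with_md5 and its chunk_size
parameter are hypothetical names used only for illustration.

```ruby
require 'digest/md5'

# Hypothetical helper, not Puppet's code: copy src to dst in fixed-size
# chunks, feeding each chunk to the MD5 digest as it is written, so the
# source is read exactly once. Returns the hex digest of the content.
def copy_with_md5(src, dst, chunk_size = 1024 * 1024)
  digest = Digest::MD5.new
  File.open(src, 'rb') do |input|
    File.open(dst, 'wb') do |output|
      # IO#read(n) returns nil at end of file, which ends the loop.
      while (chunk = input.read(chunk_size))
        digest.update(chunk)
        output.write(chunk)
      end
    end
  end
  digest.hexdigest
end
```

The returned digest could then be kept and compared on the next run
instead of being recomputed from both sides.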

>> I'm using Puppet 2.6.4.
>>
>> Dan
>> I especially like the way Ruby searches for and loads the md5 library
>> every time it's used. What a performant language.
> This certainly comes from this code in Puppet::Util::Checksums:
>   # Calculate a checksum of a file's content using Digest::MD5.
>   def md5_file(filename, lite = false)
>     require 'digest/md5'
>
>     digest = Digest::MD5.new
>     checksum_file(digest, filename,  lite)
>   end
>
> Notice how the "require" is in the function instead of being outside.
> I'd think that ruby would be smart enough to understand the file has
> already been "required" and not bother, but apparently it doesn't do
> that for you. Can you give us what ruby version and what platform you're
> using?

The client I'm using for testing is Fedora 14, ruby-1.8.7.330-1.fc14.x86_64
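
For comparison, a minimal sketch of the same kind of helper with the
require hoisted to file scope. ChecksumSketch is a hypothetical stand-in,
not the real Puppet::Util::Checksums, and I haven't confirmed with strace
that this removes the repeated library search on 1.8.7. What I am sure of
is that require returns false when the feature is already loaded.

```ruby
require 'digest/md5' # loaded once, when this file is first read

# Hypothetical stand-in for Puppet::Util::Checksums, for illustration only.
module ChecksumSketch
  # Calculate the MD5 hex digest of a file's content.
  def self.md5_file(filename)
    Digest::MD5.file(filename).hexdigest
  end
end

# Requiring an already loaded feature does not load it again; the call
# just reports that nothing was loaded.
puts "second require returned #{require('digest/md5').inspect}"
```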

Dan

