File store approaches 3 TB now - just about 290 GB free on a 3.1 TB partition.

The concern is that I've noticed a fair number of ISO files (and
potentially a lot of other files including zip and other archives, and
mpegs, etc.) that seem to be duplicates of each other.

I want to generate a report for the VP of engineering and let him
know how bad the situation is - I'm going to guess there's close to
1 TB of redundancy currently.
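To put a number on it, one approach - just a sketch, with C:\stuff standing in for the real share and C:\temp assumed writable - is to hash everything once, group identical hashes, and total the bytes consumed by every copy beyond the first:

```powershell
# Hash every file under the target (PowerShell 4+ for Get-FileHash),
# group identical hashes, and total the space taken by duplicates.
# C:\stuff is a placeholder for the real file store.
$dupes = Get-ChildItem C:\stuff -Recurse -File |
    ForEach-Object {
        $h = Get-FileHash -Algorithm MD5 -Path $_.FullName
        [pscustomobject]@{ Hash = $h.Hash; Length = $_.Length; Path = $_.FullName }
    } |
    Group-Object Hash |
    Where-Object { $_.Count -gt 1 }

# Every file past the first in each group is redundant space.
$wasted = ($dupes |
    ForEach-Object { ($_.Count - 1) * $_.Group[0].Length } |
    Measure-Object -Sum).Sum
"{0:N1} GB redundant across {1} duplicate sets" -f ($wasted / 1GB), @($dupes).Count
```

MD5 is plenty for duplicate detection here; the rare collision only means a false positive to confirm with a byte-for-byte compare.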

Yes, this will consume hours of time, but I can launch it over a
weekend and take a look on the Monday following.

I like your idea of restartability, though - it's worth looking at as
a secondary goal.
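A minimal restartable shape for the weekend run - a sketch, assuming the results CSV is the only state worth keeping; kill and rerun it, and it skips anything already recorded:

```powershell
# Where results accumulate; this file is the restart state.
$log = 'C:\temp\hashes.csv'

# Load the paths already hashed in a previous run (if any).
$done = @{}
if (Test-Path $log) {
    Import-Csv $log | ForEach-Object { $done[$_.Path] = $true }
}

Get-ChildItem C:\stuff -Recurse -File |
    Where-Object { -not $done.ContainsKey($_.FullName) } |
    ForEach-Object {
        $h = Get-FileHash -Algorithm MD5 -Path $_.FullName
        # Append each result immediately, so an interruption loses
        # at most the file in flight.
        [pscustomobject]@{
            Hash = $h.Hash; Length = $_.Length; Path = $_.FullName
        } | Export-Csv $log -NoTypeInformation -Append
    }
```

Export-Csv -Append needs PowerShell 3.0 or later; writing one row per file costs a little extra I/O, which is the price of restartability.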

Kurt

On Thu, Jul 30, 2015 at 2:19 PM, James Button
<jamesbut...@blueyonder.co.uk> wrote:
> Not experienced enough in powershell to suggest code,
> BUT
> I would advise that you make the process restartable, so that it can be 
> interrupted (if not by Escape or Ctrl+C, then by killing the task) and, 
> when restarted, will continue processing the list of files from the one 
> after the last file for which a result was recorded.
>
> Working on the basis that you have a 1TB file store and are working towards 
> a 3 or 6TB filestore: even assuming your filestore connection runs at 
> 8Gb/sec (about 60GB per minute), that's surely going to be an hour's 
> full-time use of the interface, and I'd really expect the hashing process 
> to take the better part of a day elapsed if the system is spinning media on 
> a more common interface connection, rather than a solid-state store on the 
> fastest possible multi-channel interface.
>
> You may also need to consider the system overhead in assembling the list of 
> files - the sheer volume of MFT entries to be processed. I know this from a 
> fair amount of the restructuring work I used to do for clients on a 4GB 
> memory system with caddy'd drives - such as renaming files that filled a 
> 1TB drive, for access as 'home drives', before you had all the maintenance 
> goodies in the admin facilities.
>
> (Having taken a complete list of the files, stuck them in Excel, sorted 
> them there, and generated a set of rename commands.)
>
> It took more time processing the MFT entries to "rename" the files in situ 
> than it did to copy them to another drive under the new names - simply 
> because of the thrashing on the MFT blocks in the OS-allocated disk read 
> cache.
>
> JimB
>
>
> -----Original Message-----
> From: listsadmin@lists.myITforum.com [mailto:listsadmin@lists.myITforum.com] 
> On Behalf Of Kurt Buff
> Sent: Thursday, July 30, 2015 8:45 PM
> To: powersh...@lists.myitforum.com
> Subject: [powershell] Need some pointers on an exercise I've set for myself
>
> I'm putting together what should be a simple little script, and failing.
>
> I am ultimately looking to run this against a directory, then sort the
> output on the hash field and then parse for duplicates. There are two
> conditions that concern me: 1) there are over 3 million files in the
> target directory, and 2) many of the files are quite large - over 1 GB.
>
> I'm more concerned about the effects of the script on memory than on
> processor - the data is fairly static, and I intend to run it once a
> month or even less, but I did choose MD5 as the hash algorithm for
> speed, rather than accept the default of SHA256.
>
> This is pretty simple stuff, I'm sure, but I'm using this as a
> learning exercise more than anything, as there are duplicate file
> finders out in the world already.
>
> There are several problems with what I have put together so far, which
> is this:
>
>      Get-ChildItem c:\stuff -Recurse | select length, fullname |
> export-csv -NoTypeInformation c:\temp\files.csv
>      Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash
> -algorithm md5 $_.FullName) }; Length | Sort hash
>
> Using Length (or $_.Length) anywhere in the foreach statement gives an
> error, or gives weird output.
>
> Sample Output when not using Length, and therefore getting reasonable
> output (extra spaces and hyphen delimiters elided):
>      Algorithm   Hash                               Path
>      MD5         592BE1AD0ED83C36D5E68CA7A014A510   C:\stuff\Tools\SomeFile.DOC
>
> What I'd like to see instead
>      Hash                               Length   Path
>      592BE1AD0ED83C36D5E68CA7A014A510   79872    C:\stuff\Tools\SomeFile.DOC
>
> If anyone can offer some instruction, I'd appreciate it.
>
> Kurt
>
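For what it's worth, the quoted pipeline can be collapsed into a single pass that carries all three columns - a sketch, assuming PowerShell 4+ (where Get-ChildItem output pipes straight into Get-FileHash) and using a calculated property to recover Length from the path:

```powershell
# One pass: hash, attach the file length, sort by hash so duplicates
# land on adjacent rows, and write the CSV for the report.
Get-ChildItem C:\stuff -Recurse -File |
    Get-FileHash -Algorithm MD5 |
    Select-Object Hash,
        @{ Name = 'Length'; Expression = { (Get-Item $_.Path).Length } },
        Path |
    Sort-Object Hash |
    Export-Csv C:\temp\files.csv -NoTypeInformation
```

Sorting on Hash and scanning adjacent rows (or piping through Group-Object Hash) then surfaces the duplicate sets.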
>
> ================================================
> Did you know you can also post and find answers on PowerShell in the forums?
> http://www.myitforum.com/forums/default.asp?catApp=1
>
>
>
>

