Outstanding. I worked really hard at getting option 3 to work, and with your help it now works for me.

But can you tell me whether any of these approaches puts significantly less pressure on memory than the other two? I'd like to break this down a bit further, though: I think that exporting the data with the hashes to another file, then using that file to do the sorting and duplicate detection, might be the way to go.

Kurt
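P.S. Here is a rough, untested sketch of the two-pass approach I have in mind (it assumes PowerShell 4+ for Get-FileHash; the second pass still holds one row per file in memory for the grouping, but those rows are tiny compared to the files themselves):

# Pass 1: stream Hash/Length/Path straight to a CSV, so nothing
# accumulates in memory while the slow hashing pass runs.
Get-ChildItem c:\stuff -Recurse -File | ForEach-Object {
    [pscustomobject]@{
        Hash   = (Get-FileHash -Algorithm MD5 $_.FullName).Hash
        Length = $_.Length
        Path   = $_.FullName
    }
} | Export-Csv -NoTypeInformation $env:TEMP\hashes.csv

# Pass 2: read the CSV back and keep only hashes that occur more than once.
Import-Csv $env:TEMP\hashes.csv |
    Group-Object Hash | Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group } |
    Export-Csv -NoTypeInformation $env:TEMP\duplicates.csv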
On Thu, Jul 30, 2015 at 3:33 PM, Bailey, Doug <dbai...@utexas.edu> wrote:
> In the sample code that you provided, you're only outputting the result of
> Get-FileHash to the pipeline, but it sounds like you want to add it to
> what's in the Import-CSV objects. Here are 3 methods of doing that:
>
> Get-ChildItem . -Recurse | select length, fullname | export-csv -NoTypeInformation $env:TEMP\files.csv
>
> # Method 1
> # Add-Member with -PassThru
> Import-CSV $env:TEMP\files.csv | ForEach-Object {
>     $_ | Add-Member -MemberType NoteProperty -Name Hash -Value (get-filehash -algorithm md5 $_.FullName).Hash -PassThru
> } | Sort hash
>
> # Method 2
> # Create a PSCustomObject with a hash table
> Import-CSV $env:TEMP\files.csv | ForEach-Object {
>     [pscustomObject] @{
>         Hash   = (get-filehash -algorithm md5 $_.FullName).Hash
>         Length = $_.length
>         Path   = $_.FullName
>     }
> } | Sort hash
>
> # Method 3
> # Select-Object with a Name/Expression hash table as the Property parameter
> Import-CSV $env:TEMP\files.csv | Select-Object -Property @{Name="Hash";Expression={(get-filehash -algorithm md5 $_.FullName).Hash}},Length,FullName | Sort hash
>
> -----Original Message-----
> From: listsadmin@lists.myITforum.com [mailto:listsadmin@lists.myITforum.com] On Behalf Of Kurt Buff
> Sent: Thursday, July 30, 2015 5:09 PM
> To: powersh...@lists.myitforum.com
> Subject: Re: [powershell] Need some pointers on an exercise I've set for myself
>
> File store approaches 3TB now - just about 290GB free on a 3.1TB partition.
>
> The concern is that I've noticed a fair number of ISO files (and
> potentially a lot of other files - zip and other archives, MPEGs, etc.)
> that seem to be duplicates of each other.
>
> I want to generate a report for the VP of engineering and let him know how
> bad the situation is - I'm going to guess there's close to 1TB of
> redundancy currently.
>
> Yes, this will consume hours of time, but I can launch it over a weekend
> and take a look on the Monday following.
>
> I like your idea of restartability, though - it's worth looking at as a
> secondary goal.
>
> Kurt
>
> On Thu, Jul 30, 2015 at 2:19 PM, James Button <jamesbut...@blueyonder.co.uk> wrote:
>> Not experienced enough in PowerShell to suggest code, BUT I would advise
>> that you make the process run as a restartable facility, such that it can
>> be interrupted (if not by Escape or Ctrl+C, then by task killing) and,
>> when restarted, will continue processing the list of files from the one
>> after the last one for which a result was recorded.
>>
>> Working on the basis that you have a 1TB file store and are working
>> towards a 3 or 6TB file store: even assuming your file-store connection
>> runs at 8Gb/sec - about 60GB per minute - that's surely going to be an
>> hour's full-time use of the interface, and I'd really expect the hashing
>> to take getting on for a day elapsed if the system is running spinning
>> media on a more common interface connection, rather than solid-state
>> storage on the fastest possible multi-channel interface.
>>
>> You may also need to consider the system overhead in assembling the list
>> of files - the sheer volume of the MFT to be processed. I know this from
>> a fair amount of the restructuring work I used to do for clients on a 4GB
>> memory system with caddy'd drives - such as renaming files that filled a
>> 1TB drive for access as 'home drives', before you had all the maintenance
>> goodies in the admin facilities.
>>
>> (Having taken a complete list of files, stuck them in Excel, sorted them
>> there, and generated a set of rename commands.)
>>
>> It took more time processing the MFT entries to "rename" the files in
>> situ than it did to copy them to another drive with the new names -
>> simply because of the thrashing on the MFT blocks in the OS-allocated
>> disk read cache.
>>
>> JimB
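(On JimB's restartability suggestion above - an untested sketch of how the hashing pass might pick up where it left off. It assumes PowerShell 4+ and treats the Path column of the results CSV as the restart key:)

# Results recorded so far, if any, from an earlier interrupted run.
$results = "$env:TEMP\hashes.csv"
$done = @{}
if (Test-Path $results) {
    Import-Csv $results | ForEach-Object { $done[$_.Path] = $true }
}

# Hash only files not yet recorded, appending each result as it completes,
# so killing the process loses at most the file currently being hashed.
Get-ChildItem c:\stuff -Recurse -File |
    Where-Object { -not $done.ContainsKey($_.FullName) } |
    ForEach-Object {
        [pscustomobject]@{
            Hash   = (Get-FileHash -Algorithm MD5 $_.FullName).Hash
            Length = $_.Length
            Path   = $_.FullName
        } | Export-Csv $results -NoTypeInformation -Append
    }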
>> -----Original Message-----
>> From: listsadmin@lists.myITforum.com [mailto:listsadmin@lists.myITforum.com] On Behalf Of Kurt Buff
>> Sent: Thursday, July 30, 2015 8:45 PM
>> To: powersh...@lists.myitforum.com
>> Subject: [powershell] Need some pointers on an exercise I've set for myself
>>
>> I'm putting together what should be a simple little script, and failing.
>>
>> I am ultimately looking to run this against a directory, then sort the
>> output on the hash field and parse for duplicates. There are two
>> conditions that concern me: 1) there are over 3M files in the target
>> directory, and 2) many of the files are quite large - over 1GB.
>>
>> I'm more concerned about the effects of the script on memory than on
>> processor - the data is fairly static, and I intend to run it once a
>> month or even less - but I did choose MD5 as the hash algorithm for
>> speed, rather than accept the default of SHA256.
>>
>> This is pretty simple stuff, I'm sure, but I'm using this as a learning
>> exercise more than anything, as there are duplicate-file finders out in
>> the world already.
>>
>> There are several problems with what I have put together so far, which is this:
>>
>> Get-ChildItem c:\stuff -Recurse | select length, fullname | export-csv -NoTypeInformation c:\temp\files.csv
>> Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash -algorithm md5 $_.FullName) }; Length | Sort hash
>>
>> Using Length (or $_.Length) anywhere in the foreach statement gives an
>> error, or gives weird output.
>>
>> Sample output when not using Length, and therefore getting reasonable
>> output (extra spaces and hyphen delimiters elided):
>>
>> Algorithm  Hash                              Path
>> MD5        592BE1AD0ED83C36D5E68CA7A014A510  C:\stuff\Tools\SomeFile.DOC
>>
>> What I'd like to see instead:
>>
>> Hash                              Length  Path
>> 592BE1AD0ED83C36D5E68CA7A014A510  79872   C:\stuff\Tools\SomeFile.DOC
>>
>> If anyone can offer some instruction, I'd appreciate it.
>>
>> Kurt
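(For the record, I now see why my original one-liner misbehaved: the "};" closes the ForEach-Object script block and also ends that pipeline statement, so the trailing "Length" gets parsed as a stand-alone command and fails; and inside the block, only Get-FileHash's output - Algorithm, Hash, and Path, with no Length - goes down the pipe. Building a single object that carries all three properties, as in Doug's methods above, is the fix. A minimal untested version:)

Import-Csv C:\temp\files.csv | ForEach-Object {
    # Build one object so Hash, Length, and Path all survive the pipeline.
    [pscustomobject]@{
        Hash   = (Get-FileHash -Algorithm MD5 $_.FullName).Hash
        Length = $_.Length
        Path   = $_.FullName
    }
} | Sort-Object Hash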
================================================
Did you know you can also post and find answers on PowerShell in the forums?
http://www.myitforum.com/forums/default.asp?catApp=1