Err, not sort unique by size; rather, sort by size and then discard the files whose sizes are unique.

Kurt
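A minimal sketch of that size-first approach, assuming the same C:\stuff root and MD5
choice used in the thread below (the exact pipeline is just one way to express it):

    # Bucket files by size, keep only sizes that occur more than once,
    # then hash just those candidates.
    Get-ChildItem C:\stuff -Recurse -File |
        Group-Object -Property Length |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Group } |
        ForEach-Object {
            [pscustomobject]@{
                Hash   = (Get-FileHash -Algorithm MD5 -Path $_.FullName).Hash
                Length = $_.Length
                Path   = $_.FullName
            }
        } |
        Sort-Object Hash

Note that Group-Object has to see every file object before it emits any groups, so with
3-million-plus files the grouping pass still holds the whole directory listing in memory;
what it saves is the hashing of files whose sizes are unique.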
On Thu, Jul 30, 2015 at 6:33 PM, Kurt Buff <kurt.b...@gmail.com> wrote:
> That's an interesting thought.
>
> Generate the list of files, sort unique by size, then hash only those
> that match each other.
>
> Excellent idea. That should speed things up quite a bit by eliminating
> processor time spent hashing.
>
> Thanks!
>
> Kurt
>
> On Thu, Jul 30, 2015 at 6:12 PM, Weber, Mark A <mark-a-we...@uiowa.edu> wrote:
>> If your (only) goal is to find duplicate files, you might want to add logic
>> to only hash files that are the same size - m
>>
>> -----Original Message-----
>> From: listsadmin@lists.myITforum.com [mailto:listsadmin@lists.myITforum.com]
>> On Behalf Of Kurt Buff
>> Sent: Thursday, July 30, 2015 6:29 PM
>> To: powersh...@lists.myitforum.com
>> Subject: Re: [powershell] Need some pointers on an exercise I've set for myself
>>
>> Outstanding.
>>
>> I worked really hard at getting option 3 to work, and with your help it
>> works for me now.
>>
>> But can you tell me whether any of these approaches will put significantly
>> less pressure on memory than the other two?
>>
>> I think I'll break this down a bit further, though: export the data with
>> the hashes to another file, then use that file to do the sorting and
>> duplicate detection.
>>
>> Kurt
>>
>> On Thu, Jul 30, 2015 at 3:33 PM, Bailey, Doug <dbai...@utexas.edu> wrote:
>>> In the sample code that you provided, you're only outputting the result of
>>> Get-FileHash to the pipeline, but it sounds like you want to add it to
>>> what's in the Import-CSV objects. Here are three methods of doing that:
>>>
>>> Get-ChildItem . -Recurse | select length, fullname |
>>>     export-csv -NoTypeInformation $env:TEMP\files.csv
>>>
>>> # Method 1
>>> # Add-Member with -PassThru
>>> Import-CSV $env:TEMP\files.csv | ForEach-Object {
>>>     $_ | Add-Member -MemberType NoteProperty -Name Hash -Value (get-filehash -algorithm md5 $_.FullName).Hash -PassThru
>>> } | Sort hash
>>>
>>> # Method 2
>>> # Create a PSCustomObject with a hash table
>>> Import-CSV $env:TEMP\files.csv | ForEach-Object {
>>>     [pscustomObject] @{
>>>         Hash   = (get-filehash -algorithm md5 $_.FullName).Hash
>>>         Length = $_.length
>>>         Path   = $_.FullName
>>>     }
>>> } | Sort hash
>>>
>>> # Method 3
>>> # Select-Object with a Name/Expression hash table as a Property parameter
>>> Import-CSV $env:TEMP\files.csv |
>>>     Select-Object -Property @{Name="Hash";Expression={(get-filehash -algorithm md5 $_.FullName).Hash}},Length,FullName |
>>>     Sort hash
>>>
>>> -----Original Message-----
>>> From: listsadmin@lists.myITforum.com
>>> [mailto:listsadmin@lists.myITforum.com] On Behalf Of Kurt Buff
>>> Sent: Thursday, July 30, 2015 5:09 PM
>>> To: powersh...@lists.myitforum.com
>>> Subject: Re: [powershell] Need some pointers on an exercise I've set for myself
>>>
>>> The file store approaches 3 TB now - just about 290 GB free on a 3.1 TB
>>> partition.
>>>
>>> The concern is that I've noticed a fair number of ISO files (and
>>> potentially a lot of other files, including zip and other archives,
>>> mpegs, etc.) that seem to be duplicates of each other.
>>>
>>> I want to generate a report for the VP of engineering and let him know
>>> how bad the situation is - I'm going to guess there's close to 1 TB of
>>> redundancy currently.
>>>
>>> Yes, this will consume hours of time, but I can launch it over a weekend
>>> and take a look on the Monday following.
>>>
>>> I like your idea of restartability, though - it's worth looking at as a
>>> secondary goal.
>>>
>>> Kurt
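Regarding the plan above to export the hashed data to another file and then do the
sorting and duplicate detection from that file, a minimal sketch (hashes.csv and
duplicates.csv are hypothetical names; it assumes the exported CSV has Hash, Length,
and Path columns, as Method 2 above would produce):

    # Read the previously exported hashes and keep only hashes that repeat.
    Import-Csv $env:TEMP\hashes.csv |
        Group-Object -Property Hash |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Group } |
        Sort-Object Hash, Path |
        Export-Csv -NoTypeInformation $env:TEMP\duplicates.csv

Splitting the expensive hashing pass from the reporting pass also means the report can
be re-sorted or re-run without touching the 3 TB store again.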
>>>
>>> On Thu, Jul 30, 2015 at 2:19 PM, James Button
>>> <jamesbut...@blueyonder.co.uk> wrote:
>>>> I'm not experienced enough in PowerShell to suggest code, BUT I would
>>>> advise making the process restartable, so that it can be interrupted
>>>> (if not by escape or Ctrl+C, then by killing the task) and, when
>>>> restarted, will continue processing the list of files from the one
>>>> after the last one for which a result was recorded.
>>>>
>>>> Working on the basis that you have a 1 TB file store and are heading
>>>> towards a 3 or 6 TB file store: even assuming your file store connection
>>>> runs at 8 Gb/sec (roughly 60 GB per minute), that's surely going to be
>>>> an hour's full-time use of the interface, and I'd really expect the
>>>> hashing process to take the better part of a day elapsed if the system
>>>> is running on spinning media over a more common interface, rather than
>>>> solid-state storage on the fastest possible multi-channel interface.
>>>>
>>>> You may also need to consider the system overhead in assembling the
>>>> list of files - the sheer volume of the MFT to be processed. I know
>>>> this from a fair amount of the restructuring work I used to do for
>>>> clients on a 4 GB memory system with caddy'd drives - such as renaming
>>>> files that filled a 1 TB drive, for access as 'home drives' - before
>>>> you had all the maintenance goodies in the admin facilities.
>>>>
>>>> (Having taken a complete list of files, stuck them in Excel, sorted
>>>> them there, and generated a set of rename commands.)
>>>>
>>>> It took more time processing the MFT entries to "rename" the files in
>>>> situ than it did to copy them to another drive with the new names,
>>>> simply because of the thrashing on the MFT blocks in the OS-allocated
>>>> disk read cache.
>>>>
>>>> JimB
>>>>
>>>> -----Original Message-----
>>>> From: listsadmin@lists.myITforum.com
>>>> [mailto:listsadmin@lists.myITforum.com] On Behalf Of Kurt Buff
>>>> Sent: Thursday, July 30, 2015 8:45 PM
>>>> To: powersh...@lists.myitforum.com
>>>> Subject: [powershell] Need some pointers on an exercise I've set for myself
>>>>
>>>> I'm putting together what should be a simple little script, and failing.
>>>>
>>>> I am ultimately looking to run this against a directory, then sort the
>>>> output on the hash field and parse for duplicates. There are two
>>>> conditions that concern me: 1) there are over 3 million files in the
>>>> target directory, and 2) many of the files are quite large, over 1 GB.
>>>>
>>>> I'm more concerned about the effects of the script on memory than on
>>>> processor - the data is fairly static, and I intend to run it once a
>>>> month or even less - but I did choose MD5 as the hash algorithm for
>>>> speed, rather than accept the default of SHA256.
>>>>
>>>> This is pretty simple stuff, I'm sure, but I'm using this as a
>>>> learning exercise more than anything, as there are duplicate file
>>>> finders out in the world already.
>>>>
>>>> There are several problems with what I have put together so far,
>>>> which is this:
>>>>
>>>> Get-ChildItem c:\stuff -Recurse | select length, fullname |
>>>>     export-csv -NoTypeInformation c:\temp\files.csv
>>>> Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash -algorithm md5 $_.FullName) }; Length | Sort hash
>>>>
>>>> Using Length (or $_.Length) anywhere in the foreach statement gives an
>>>> error, or gives weird output.
>>>>
>>>> Sample output when not using Length, and therefore getting reasonable
>>>> output (extra spaces and hyphen delimiters elided):
>>>>
>>>> Algorithm  Hash                              Path
>>>> MD5        592BE1AD0ED83C36D5E68CA7A014A510  C:\stuff\Tools\SomeFile.DOC
>>>>
>>>> What I'd like to see instead:
>>>>
>>>> Hash                              Length  Path
>>>> 592BE1AD0ED83C36D5E68CA7A014A510  79872   C:\stuff\Tools\SomeFile.DOC
>>>>
>>>> If anyone can offer some instruction, I'd appreciate it.
>>>>
>>>> Kurt
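On JimB's restartability suggestion (and the memory question above), one low-tech
sketch is to append each result to a CSV as it is produced and, on restart, skip any
path already recorded. results.csv is a hypothetical name, Export-Csv -Append needs
PowerShell 3.0 or later, and this has not been tried at 3-million-file scale:

    $resultFile = "$env:TEMP\results.csv"

    # Remember which paths were already hashed in a previous run.
    $done = @{}
    if (Test-Path $resultFile) {
        Import-Csv $resultFile | ForEach-Object { $done[$_.Path] = $true }
    }

    Get-ChildItem C:\stuff -Recurse -File |
        Where-Object { -not $done.ContainsKey($_.FullName) } |
        ForEach-Object {
            [pscustomobject]@{
                Hash   = (Get-FileHash -Algorithm MD5 -Path $_.FullName).Hash
                Length = $_.Length
                Path   = $_.FullName
            } | Export-Csv $resultFile -NoTypeInformation -Append
        }

Because each row is written as soon as it is computed, killing the run only loses the
file currently being hashed, and memory use stays at roughly the hashtable of
already-seen paths rather than the whole result set.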