Err, not sort unique by size - rather, sort by size and then discard the files whose sizes are unique (keeping only sizes shared by more than one file).
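
Something like this is what I have in mind - just a rough, untested sketch, and it assumes the Length/FullName columns from the files.csv example below:

    # only hash files whose size is shared with at least one other file
    Import-Csv $env:TEMP\files.csv |
        Group-Object Length |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Group } |
        ForEach-Object {
            [pscustomobject]@{
                Hash   = (Get-FileHash -Algorithm MD5 $_.FullName).Hash
                Length = $_.Length
                Path   = $_.FullName
            }
        } |
        Sort-Object Hash

That way only the duplicate-size candidates ever get hashed.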

Kurt

On Thu, Jul 30, 2015 at 6:33 PM, Kurt Buff <kurt.b...@gmail.com> wrote:
> That's an interesting thought.
>
> Generate the list of files, sort unique by size, then hash only those
> that match each other.
>
> Excellent idea. That should speed things up quite a bit by eliminating
> processor time for hashing.
>
> Thanks!
>
> Kurt
>
>
>
> On Thu, Jul 30, 2015 at 6:12 PM, Weber, Mark A <mark-a-we...@uiowa.edu> wrote:
>> If your (only) goal is to find duplicate files, you might want to add logic 
>> to only hash files that are the same size - m
>>
>> -----Original Message-----
>> From: listsadmin@lists.myITforum.com [mailto:listsadmin@lists.myITforum.com] 
>> On Behalf Of Kurt Buff
>> Sent: Thursday, July 30, 2015 6:29 PM
>> To: powersh...@lists.myitforum.com
>> Subject: Re: [powershell] Need some pointers on an exercise I've set for 
>> myself
>>
>> Outstanding.
>>
>> I worked really hard at getting option 3 to work, and with your help that 
>> works for me now.
>>
>> But, can you tell me if any of these approaches will put significantly less 
>> pressure on memory than the other two?
>>
>> I think I'll break this down a bit further, though: export the data with
>> the hashes to another file, then use that file to do the sorting and
>> duplicate detection.
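>>
>> Roughly like this - just a sketch, not tested, and the hashes.csv name is
>> only for illustration (columns assume your files.csv example):
>>
>>     # pass 1: hash everything and write the results to a second CSV
>>     Import-Csv $env:TEMP\files.csv | ForEach-Object {
>>         [pscustomobject]@{
>>             Hash   = (Get-FileHash -Algorithm MD5 $_.FullName).Hash
>>             Length = $_.Length
>>             Path   = $_.FullName
>>         }
>>     } | Export-Csv -NoTypeInformation $env:TEMP\hashes.csv
>>
>>     # pass 2: read the hashes back from disk and group to find duplicates
>>     Import-Csv $env:TEMP\hashes.csv | Group-Object Hash |
>>         Where-Object { $_.Count -gt 1 } | ForEach-Object { $_.Group }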
>>
>> Kurt
>>
>> On Thu, Jul 30, 2015 at 3:33 PM, Bailey, Doug <dbai...@utexas.edu> wrote:
>>> In the sample code that you provided, you're only outputting the result of 
>>> Get-FileHash to the pipeline but it sounds like you want to add it to 
>>> what's in the Import-CSV objects. Here are 3 methods of doing that:
>>>
>>> Get-ChildItem . -Recurse | select length, fullname | export-csv -NoTypeInformation $env:TEMP\files.csv
>>>
>>> # Method 1
>>> # Add-Member with -PassThru
>>> Import-CSV $env:TEMP\files.csv | ForEach-Object {
>>>     $_ | Add-Member -MemberType NoteProperty -Name Hash -Value (get-filehash -algorithm md5 $_.FullName).Hash -PassThru
>>> } | Sort hash
>>>
>>>
>>> # Method 2
>>> # Create a PSCustomObject with a hash table
>>> Import-CSV $env:TEMP\files.csv | ForEach-Object {
>>>     [pscustomObject] @{
>>>         Hash = (get-filehash -algorithm md5 $_.FullName).Hash
>>>         Length = $_.length
>>>         Path = $_.FullName
>>>     }
>>> } | Sort hash
>>>
>>> # Method 3
>>> # Select-Object with a Name/Expression hash table as a Property parameter
>>> Import-CSV $env:TEMP\files.csv |
>>>     Select-Object -Property @{Name="Hash";Expression={(get-filehash -algorithm md5 $_.FullName).Hash}},Length,FullName |
>>>     Sort hash
>>>
>>> -----Original Message-----
>>> From: listsadmin@lists.myITforum.com
>>> [mailto:listsadmin@lists.myITforum.com] On Behalf Of Kurt Buff
>>> Sent: Thursday, July 30, 2015 5:09 PM
>>> To: powersh...@lists.myitforum.com
>>> Subject: Re: [powershell] Need some pointers on an exercise I've set
>>> for myself
>>>
>>> File store approaches 3tb now - just about 290gb free on a 3.1tb partition.
>>>
>>> The concern is that I've noticed a fair number of ISO files (and 
>>> potentially a lot of other files including zip and other archives, and 
>>> mpegs, etc.) that seem to be duplicates of each other.
>>>
>>> I want to generate a report for the VP of engineering, and let him know how 
>>> bad the situation is - I'm going to guess there's close to 1tb of 
>>> redundancy currently.
>>>
>>> Yes, this will consume hours of time, but I can launch it over a weekend 
>>> and take a look on the Monday following.
>>>
>>> I like your idea of restartability, though - it's worth looking at as a 
>>> secondary goal.
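>>>
>>> If I do go after it, I'm picturing something along these lines - a rough,
>>> untested sketch, and the results.csv name is only for illustration:
>>>
>>>     # skip anything already hashed on a previous (interrupted) run
>>>     $done = @{}
>>>     if (Test-Path $env:TEMP\results.csv) {
>>>         Import-Csv $env:TEMP\results.csv | ForEach-Object { $done[$_.Path] = $true }
>>>     }
>>>     Import-Csv $env:TEMP\files.csv |
>>>         Where-Object { -not $done.ContainsKey($_.FullName) } |
>>>         ForEach-Object {
>>>             # appending one row at a time is what makes the run restartable
>>>             [pscustomobject]@{
>>>                 Hash   = (Get-FileHash -Algorithm MD5 $_.FullName).Hash
>>>                 Length = $_.Length
>>>                 Path   = $_.FullName
>>>             } | Export-Csv -NoTypeInformation -Append $env:TEMP\results.csv
>>>         }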
>>>
>>> Kurt
>>>
>>> On Thu, Jul 30, 2015 at 2:19 PM, James Button 
>>> <jamesbut...@blueyonder.co.uk> wrote:
>>>> Not experienced enough in PowerShell to suggest code, BUT I would
>>>> advise that you make the process restartable, so that it can be interrupted
>>>> (if not by escape or ctrl+c, then by killing the task) and, when restarted,
>>>> will continue processing the list of files from the one after the last one
>>>> for which a result was recorded.
>>>>
>>>> Working on the basis that you have a 1TB file store and are working
>>>> towards a 3 or 6TB filestore: even assuming your filestore connection
>>>> runs at 8Gb/sec - about 60GB per minute - that's surely going to be an
>>>> hour's full-time use of the interface, and I'd really expect the hashing
>>>> process to take getting on for a day elapsed if the system is running on
>>>> spinning media over a more common interface connection, rather than a
>>>> solid state store on the fastest possible multi-channel interface.
>>>>
>>>> You may also need to consider the system overhead in assembling the
>>>> list of files - the sheer volume of the MFT to be processed. I know this
>>>> from a fair amount of the restructuring work I used to do for clients on
>>>> a 4GB memory system with caddy'd drives - such as renaming files that
>>>> filled a 1TB drive, for access as 'home drives' - before you had all the
>>>> maintenance goodies in the admin facilities.
>>>>
>>>> (Having taken a complete list of files, stuck them in Excel, sorted
>>>> them there, and generated a set of rename commands.)
>>>>
>>>> It took more time processing the MFT entries to "rename" the files in
>>>> situ than it did to copy them to another drive with the new names, simply
>>>> because of the thrashing on the MFT blocks in the OS-allocated disk read
>>>> cache.
>>>>
>>>> JimB
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: listsadmin@lists.myITforum.com
>>>> [mailto:listsadmin@lists.myITforum.com] On Behalf Of Kurt Buff
>>>> Sent: Thursday, July 30, 2015 8:45 PM
>>>> To: powersh...@lists.myitforum.com
>>>> Subject: [powershell] Need some pointers on an exercise I've set for
>>>> myself
>>>>
>>>> I'm putting together what should be a simple little script, and failing.
>>>>
>>>> I am ultimately looking to run this against a directory, then sort
>>>> the output on the hash field and then parse for duplicates. There are
>>>> two conditions that concern me: 1) there are over 3m files in the
>>>> target directory, and 2) many of the files are quite large, over 1g.
>>>>
>>>> I'm more concerned about the effects of the script on memory than on
>>>> processor - the data is fairly static, and I intend to run it once a
>>>> month or even less, but I did choose MD5 as the hash algorithm for
>>>> speed, rather than accept the default of SHA256.
>>>>
>>>> This is pretty simple stuff, I'm sure, but I'm using this as a
>>>> learning exercise more than anything, as there are duplicate file
>>>> finders out in the world already.
>>>>
>>>> There are several problems with what I have put together so far,
>>>> which is this:
>>>>
>>>>      Get-ChildItem c:\stuff -Recurse | select length, fullname | export-csv -NoTypeInformation c:\temp\files.csv
>>>>      Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash -algorithm md5 $_.FullName) }; Length | Sort hash
>>>>
>>>> Using Length (or $_.Length) anywhere in the foreach statement gives
>>>> an error, or gives weird output.
>>>>
>>>> Sample output when not using Length, and therefore getting reasonable
>>>> output (extra spaces and hyphen delimiters elided):
>>>>
>>>>      Algorithm   Hash                                Path
>>>>      MD5         592BE1AD0ED83C36D5E68CA7A014A510    C:\stuff\Tools\SomeFile.DOC
>>>>
>>>> What I'd like to see instead:
>>>>
>>>>      Hash                                Length   Path
>>>>      592BE1AD0ED83C36D5E68CA7A014A510    79872    C:\stuff\Tools\SomeFile.DOC
>>>>
>>>> If anyone can offer some instruction, I'd appreciate it.
>>>>
>>>> Kurt
>>>>