 C1    C2     C3    C4
 -----------------------
 12345 efghij klmno pqrs
 34567 abnerv oiuuy uyrv
 94567 abnerv gtuuy hyrv
 12345 aswrfr rtyyt erer
 94567 abnerv gtuuy hyrv


Here rows 1 and 4 are duplicates (same C1 value); they need
to be removed or moved to another file.
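
One possible way to do this split, as a minimal Python sketch rather than a
definitive solution. It assumes C1 is the first whitespace-separated field,
that the header and dashed lines have already been stripped, and that the
first row seen for a key is kept while later rows with the same key go to the
other file. The name split_duplicates is made up for illustration.

```python
import io

def split_duplicates(lines, unique_out, duplicate_out):
    """Write the first row seen for each C1 value to unique_out and
    every later row with the same C1 value to duplicate_out."""
    seen = set()            # C1 values encountered so far
    for line in lines:
        fields = line.split()
        if not fields:      # skip blank lines
            continue
        key = fields[0]     # assumption: C1 is the first field
        if key in seen:
            duplicate_out.write(line)
        else:
            seen.add(key)
            unique_out.write(line)

# The five sample rows from above, fed in from memory instead of a file.
sample = io.StringIO(
    "12345 efghij klmno pqrs\n"
    "34567 abnerv oiuuy uyrv\n"
    "94567 abnerv gtuuy hyrv\n"
    "12345 aswrfr rtyyt erer\n"
    "94567 abnerv gtuuy hyrv\n"
)
unique, dupes = io.StringIO(), io.StringIO()
split_duplicates(sample, unique, dupes)
```

Only the keys are held in memory, not whole rows. Whether 22 million keys fit
depends on the machine; if they do not, sorting the file on C1 first and then
comparing adjacent rows would be an alternative.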




--- Madhu Reddy <[EMAIL PROTECTED]> wrote:
> Hi,
>    I want to find duplicate records in a large
> file.
> It contains around 22 million records.
> 
> Basically, the following is my file structure:
> 
>  C1     C2     C3   C4
> ------------------------
> 12345 efghij klmno pqrs
> 34567 abnerv oiuuy uyrv
> .......
> .......
> 
> ...........
> ............
> .............
> 
> It has 22 million records, and each record has
> 4 columns (C1, C2, C3 and C4).
> 
> C1 is the primary key.
> 
> Here I want to do some validation.
> The following are my validation steps:
> 
> 1. Validate record length
> 2. Check if first column is NULL
> 3. Separate duplicate records....
> 
> How do I separate duplicate records in such a huge
> file?
> 
> "Duplicate" refers only to the primary key column:
> if column 1 (C1) is duplicated, that row is a
> duplicate row and needs to be written to another
> file.
> 
> Does anybody have an efficient algorithm to find
> duplicate records in a large file?
> 
> Again, a duplicate does not mean the complete row is
> duplicated; if column 1 is duplicated, that row is a
> duplicate.
> 
> I appreciate your help.
> -Madhu
> 
> 
> 
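
For validation steps 1 and 2 in the quoted message, a rough Python sketch
follows. The record layout is not specified above, so everything here is an
assumption: records are taken to be fixed-width, expected_length and key_width
are hypothetical parameters, and "NULL" is read as "the C1 positions are
blank". The function name validate is likewise made up.

```python
def validate(line, expected_length, key_width):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    record = line.rstrip("\n")
    # Step 1: record length (assumed to be a fixed number of characters).
    if len(record) != expected_length:
        problems.append("bad length")
    # Step 2: first column empty (assumed: C1 occupies the first
    # key_width characters and is NULL when they are all blank).
    if record[:key_width].strip() == "":
        problems.append("empty C1")
    return problems

good = validate("12345 efghij klmno pqrs\n", 23, 5)
short = validate("12345 efghij\n", 23, 5)
no_key = validate("      efghij klmno pqrs\n", 23, 5)
```

Records that pass both checks could then go on to the duplicate check on C1.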

