-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The Inquirer posts the following algorithm for doing an MD5-based file comparison of Linux and SCO kernel sources:

http://www.theinquirer.net/?article=10061

Shutting down SCO's FUD machine
By Egan Orion: Wednesday 18 June 2003, 10:58

"Yesterday I realized how trivial it was to find matching code within two source trees.

"While working on this stuff, I realized that [the] SCO lawsuit is indeed pure FUD, and they will keep it like that till the end. So it seems like the best thing for the linux community now would be to find the matching code ourselves and figure out where it came from. SCO help is not needed. Otherwise Linux is so to speak a sitting duck. If Linux community knows what is very similar and why, that would fully protect Linux in press and leave IBM to annihilate SCO."

I don't know how "fully" this might be effective, because certain press elements are practically extensions of the Vole's propaganda office. It does sound interesting enough to look into closely, though. Our unnamed correspondent continues:

"Since I do not have access to System V code, I took Linux 2.4.20 and BSD-lite 4.4. I'll give the technical details later, but here are the findings:

"[Linux versus] 4.4BSD-Lite

" lines     Linux           BSD
  200- 260  ...amd7930.c    ...bsd_audio.c
  398- 519  ...slhc.c       ...slcompress.c
  739- 766  ...balloc.c     ...ffs_alloc.c
 2267-2299  ...bonding.c    ...inet_addr.c

[Note: We truncated the full paths for formatting purposes, but the original email is available containing all paths and other details.]

"On the left is the file in the Linux tree, on the right is the file in the 4.4BSD tree. Also the range of matching lines in Linux is given on the left. It is unlikely that I missed any other large matching fragments.

"Now, it seems to be quite likely that the matching Linux-System V code shown to the "experts" by SCO came from one of these files. And all because this is the original BSD code, which got copied everywhere."

As our reader intimates, he's found a clever way to compare Unix source code without viewing the code directly or violating copyrights. We will let him explain in further detail how it's possible to do this:

"Here is the procedure for finding the matching code....

"1. Each file within each source tree is "shredded" into 5-line pieces (1-5, 2-6, 3-7, etc.). An MD5 sum is computed for each block of lines. The output is 3 columns: MD5sum, source file, 1st line in the block.

"At this stage, 4.4BSD had [a] ~40Mb file, linux ~160Mb. Potentially, one could shred into smaller or larger pieces; however, with pieces too small there'll be a lot of noise, and with pieces too large some matches won't be seen. 5-liners seem to be a good compromise.

"2. Within each source tree the "shredded" file is sorted by MD5sum, and duplicate entries within the same tree are removed completely (these are either trivial 5-line sequences or licensing disclaimers). Unix sort here takes a couple of minutes on a 600MHz P3.

"3. A column indicating the origin of the file is inserted into the file (0 - BSD, 1 - linux). Both Linux and BSD "shredded" files are merged such that MD5sums stay sorted.

"4. At this point a given MD5sum will occur either once, or twice (i.e., in both source trees). Here, remove all the single lines, leaving only the 5-liners that match.

"5. Count for each file in the Linux tree the number of matches with the BSD tree using the file generated at step 4. Sort this list, and the largest counts will occur for the files with the largest number of matching lines. The range can be extracted from the file from step 4, since at step 1 we kept the address of the 1st line in the block. That is how the info above was generated.

"The beauty of this scheme is that anybody with System V code can inform the Linux community about what is identical without revealing any System V code.
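
For anyone who wants to try the idea, the five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the correspondent's actual tooling: the function names (`shred`, `unique_blocks`, `match_trees`) are made up here, each tree is passed in as a dict mapping a path to its list of lines rather than read from disk, blocks are hashed exactly as written (no whitespace normalization), and the sort-and-merge of steps 2 to 4 is replaced by an in-memory dictionary intersection, which should give the same set of matches.

```python
import hashlib
from collections import Counter, defaultdict

WINDOW = 5  # block size suggested in the quoted procedure


def shred(tree, window=WINDOW):
    """Step 1: yield (md5, path, first_line) for every window-line block.

    `tree` maps a file path to its list of lines (a stand-in for
    walking a real source tree on disk).
    """
    for path, lines in tree.items():
        for start in range(len(lines) - window + 1):
            block = "\n".join(lines[start:start + window])
            digest = hashlib.md5(block.encode("utf-8")).hexdigest()
            yield digest, path, start + 1  # 1-based line number


def unique_blocks(tree, window=WINDOW):
    """Step 2: keep only digests that occur exactly once in the tree."""
    records = list(shred(tree, window))
    counts = Counter(digest for digest, _, _ in records)
    return {d: (p, n) for d, p, n in records if counts[d] == 1}


def match_trees(tree_a, tree_b, window=WINDOW):
    """Steps 3-5: per file in tree_a, count blocks shared with tree_b.

    Returns (path_a, match_count, first_line, last_line, paths_b)
    tuples sorted by descending match count. The line range is the
    overall span of matching blocks and need not be contiguous.
    """
    blocks_a = unique_blocks(tree_a, window)
    blocks_b = unique_blocks(tree_b, window)
    hits = defaultdict(list)
    # A digest present in both trees is a matching block (step 4).
    for digest in blocks_a.keys() & blocks_b.keys():
        path_a, line_a = blocks_a[digest]
        hits[path_a].append((line_a, blocks_b[digest][0]))
    report = []
    for path_a, matches in hits.items():
        starts = [line for line, _ in matches]
        report.append((path_a, len(matches), min(starts),
                       max(starts) + window - 1,
                       sorted({p for _, p in matches})))
    return sorted(report, key=lambda r: r[1], reverse=True)
```

Only the digest, path and line number ever need to leave the machine holding a tree, which is the point of the scheme: someone could publish the output of `shred` for one tree without publishing the source itself.
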
And this might actually be legal, since I do not think that there are clauses in the contracts NOT to shred the code and compare it with other code. Also, it is quite easy to stay anonymous since the person who does the analysis need not reveal him/herself in any way."

Peace.

- --
Karsten M. Self <[EMAIL PROTECTED]>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
   "Life," said Marvin, "don't talk to me about life." -- HHGTG
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE+8Piv2MO5UukaubkRAprdAKCJVf9dGgS5Ly0eW62ZocL/EHeneACdGj4d
O10hlMZPa2wyTUu6UW1Xn70=
=3kWp
-----END PGP SIGNATURE-----
_______________________________________________ Linux-users mailing list [EMAIL PROTECTED] Unsubscribe/Suspend/Etc -> http://www.linux-sxs.org/mailman/listinfo/linux-users