-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The Inquirer posts the following algorithm for doing an MD5-based file
comparison of Linux and SCO kernel sources:

    http://www.theinquirer.net/?article=10061
    Shutting down SCO's FUD machine
    By Egan Orion: Wednesday 18 June 2003, 10:58

    "Yesterday I realized how trivial it was to find matching code within
    two source trees.

    "While working on this stuff, I realized that [the] SCO lawsuit is
    indeed pure FUD, and they will keep it like that till the end. So it
    seems like the best thing for the linux community now would be to find
    the matching code ourselves and figure out where it came from. SCO
    help is not needed. Otherwise Linux is so to speak a sitting duck. If
    Linux community knows what is very similar and why, that would fully
    protect Linux in press and leave IBM to annihilate SCO."

    I don't know how "fully" this might be effective, because certain
    press elements are practically extensions of the Vole's propaganda
    office. It does sound interesting enough to look into closely,
    though. Our unnamed correspondent continues:

    "Since I do not have access to System V code, I took Linux 2.4.20
    and BSD-lite 4.4. I'll give the technical details later, but here
    are the findings:

    "[Linux versus] 4.4BSD-Lite

    " lines     Linux           BSD

      200- 260  ...amd7930.c    ...bsd_audio.c
      398- 519  ...slhc.c       ...slcompress.c
      739- 766  ...balloc.c     ...ffs_alloc.c
     2267-2299  ...bonding.c    ...inet_addr.c

    [Note: We truncated the full paths for formatting purposes, but the
    original email is available containing all paths and other details.]

    "On the left is the file in the Linux tree, on the right is the
    file in the 4.4BSD tree. Also the range of matching lines in Linux
    is given on the left. It is unlikely that I missed any other large
    matching fragments.

    "Now, it seems to be quite likely that the matching Linux-System
    V code shown to the "experts" by SCO came from one of these
    files. And all because this is the original BSD code, which got
    copied everywhere."

    As our reader intimates, he's found a clever way to compare Unix
    source code without viewing the code directly or violating copyrights.
    We will let him explain in further detail how it's possible to
    do this:

    "Here is the procedure for finding the matching code....

    "1. Each file within each source tree is "shredded" into 5 line
    pieces (1-5,
    2-6, 3-7, etc.). MD5 sum is computed for each block of lines. The
    output is 3 columns: MD5sum, source file, 1st line in the block.

    "At this stage, 4.4BSD had [a] ~40Mb file, linux ~160Mb. Potentially,
    one could shred into smaller or larger pieces, however, with pieces
    too small there'll be a lot of noise, with pieces too large some
    matches won't be seen. 5 liners seem to be a good compromise.
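
A minimal sketch of step 1 in Python, for illustration only. The function
name shred and the block parameter are made up here and are not taken from
the correspondent's scripts. The sketch hashes lines exactly as read; the
article does not say whether whitespace was normalized first.

```python
import hashlib


def shred(path, lines, block=5):
    """Yield (md5_hex, path, first_line) for every window of `block`
    consecutive lines. Line numbers are 1-based, as in the article."""
    for start in range(len(lines) - block + 1):
        chunk = "".join(lines[start:start + block])
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        yield digest, path, start + 1


# Six lines give two overlapping 5-line blocks: lines 1-5 and lines 2-6.
sample = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n"]
rows = list(shred("demo.c", sample))
```

Each row mirrors the three columns described above: MD5 sum, source file,
and first line of the block.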

    "2. Within each source tree the "shredded" file is sorted by
    MD5sum, and duplicate entries within the same tree are removed
    completely (these are either trivial 5-line sequences or licensing
    disclaimers). Unix sort here takes a couple of minutes on a 600Mhz P3.

    "3. A column indicating the origin of the file is inserted into the
    file (0 - BSD, 1 - linux). Both Linux and BSD "shredded" files are
    merged such that MD5sums stay sorted.

    "4. At this point a given MD5sum will occur either once or twice,
    i.e., in both source trees. Here remove all the single lines, and
    have the 5 liners left that are matching.
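
Steps 2 through 4 can be sketched in Python as follows. This is an
assumption-laden illustration: the correspondent used Unix sort and a merge
with an origin column, whereas this sketch swaps in a dictionary
intersection, which should select the same blocks. The names unique_blocks
and matching_blocks are hypothetical.

```python
from collections import Counter


def unique_blocks(rows):
    """Step 2: drop every digest that occurs more than once within one
    tree, keeping digest -> (path, first_line) for the rest."""
    counts = Counter(digest for digest, _, _ in rows)
    return {d: (path, line) for d, path, line in rows if counts[d] == 1}


def matching_blocks(bsd_rows, linux_rows):
    """Steps 3-4: keep only digests present in both trees.
    Returns sorted (digest, linux_location, bsd_location) tuples."""
    bsd = unique_blocks(bsd_rows)
    linux = unique_blocks(linux_rows)
    return sorted((d, linux[d], bsd[d]) for d in bsd.keys() & linux.keys())
```

Both arguments are lists of (MD5 sum, source file, first line) rows, i.e.
the three-column output of step 1.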

    "5. Count for each file in Linux tree the number of matches with
    the BSD tree using the file generated at step 4. Sort this list,
    and the largest counts will occur for the files with the largest
    number of matching lines. The range can be extracted from the file
    from step 4, since at step 1 we kept the address of the 1st line in
    the block. That is how the info above was generated.
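
Step 5 might look like this in Python. Again a sketch under assumptions:
the function name summarize is hypothetical, and the reported range simply
runs from the first matching block to the end of the last one, so it can
include non-matching gaps if a file matches in several separate places.

```python
from collections import defaultdict


def summarize(matches, block=5):
    """matches: iterable of (linux_path, first_line) for blocks that also
    occur in the BSD tree. Returns (count, path, first, last) tuples,
    largest count first."""
    starts = defaultdict(list)
    for path, line in matches:
        starts[path].append(line)
    report = [
        (len(lines), path, min(lines), max(lines) + block - 1)
        for path, lines in starts.items()
    ]
    return sorted(report, reverse=True)
```

Files at the top of the returned list are the ones with the most matching
5-line blocks, which is how a table like the one above could be produced.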

    "The beauty of this scheme is that anybody with System V code can
    inform the Linux community about what is identical without revealing
    any System V code. And this might actually be legal, since I do
    not think that there are clauses in the contracts NOT to shred the
    code and compare it with other code. Also, it is quite easy to stay
    anonymous since the person who does the analysis need not reveal
    him/herself in any way."

Peace.

- --
Karsten M. Self <[EMAIL PROTECTED]>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    "Life," said Marvin, "don't talk to me about life."
    -- HHGTG


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE+8Piv2MO5UukaubkRAprdAKCJVf9dGgS5Ly0eW62ZocL/EHeneACdGj4d
O10hlMZPa2wyTUu6UW1Xn70=
=3kWp
-----END PGP SIGNATURE-----
