On Thu, Mar 2, 2017 at 6:31 AM, Mark D. Baushke <m...@juniper.net> wrote:
> Hi Folks,
>
> A question for SPDX-Tech folks concerning file equivalences.
>
> If I have a file with a RCS Keywords such as is available in RCS, CVS,
> SVN, git (svn:keywords) or changeset ID (Mercurial), is it desirable to
> be able to define a different FileChecksum attribute that explicitly
> canonicalizes these values so that equivalent implementations will show
> as being the same, even if they had been checked into another SCM system
> and had the hash overwritten locally?
>
> So, instead of the hash processing a file with this text
>
>   $Id: foo.c 123456 2015-01-31 12:34:56 mdb $
>
> as is found in a file, it would instead process the above text as if it
> were written $Id$
>
> This would allow two files that are identical other than RCS Keyword
> values to have the same 'hash' for an SPDX report.
>
> So if the only difference between two files was
>
> 2c2
> <  * $Id: foo.c 123456 2015-01-31 12:34:56 mdb $
> ---
>>  * $Id: foo.c 1.22 2010-02-20 11:56:00 jon $
>
> the 'RCSkeywordlessFileChecksum: SHA1: <hashvalue>'
> would be identical for the two files (or whatever attribute id you want
> to use).
>
> (Note: Where I see this difference most often is in different branches of
> the same software.)

Mark:
This is ultimately a problem with no simple answer. Luckily, it should
fade away over time since, as far as I know, git does not expand
keywords by default (IMHO for the better).

That said, there are various ways I have handled this practically:

1. You preprocess the source code and remove any keywords. I was never
really happy with this, and it led to this (ugly) Python snippet:
---------------------------

    import re

    # See
    # http://cvs-nserver.sourceforge.net/doc/unstable/manual/html_chapter/cvs_12.html#SEC101
    # for the list of keywords and the substitution rules.
    # Raw strings keep the backslashes intact for the regex engine.
    vcs_replacement_regex = re.compile(
        r'\$(Author|Date|Header|Id|Name|Locker|Log|RCSfile|Revision|Source|State)'
        r'(?:\:[^\$\n]*)?\$',
        re.MULTILINE | re.UNICODE | re.DOTALL)

Once you have preprocessed your code with this, you can use any
checksum, as you suggested above.
IMHO this does not warrant a new SPDX tag; rather, you can give the
checksum algorithm a new name, such as

FileChecksum: SHA1NOCVS:21ada346713283327

or anything that is not too ugly, and then document it for the
recipients of your SPDX document.
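
To make the preprocessing step concrete, here is a minimal sketch of
how the regex could feed a checksum. The function names
(collapse_keywords, keywordless_sha1) and the two sample file contents
are made up for illustration, and "SHA1NOCVS" is only the ad-hoc
algorithm name suggested above, not an SPDX-defined identifier:

```python
import hashlib
import re

# Same keyword list as the snippet above, as a bytes pattern so that files
# can be hashed without guessing their text encoding.
VCS_KEYWORD_RE = re.compile(
    rb'\$(Author|Date|Header|Id|Name|Locker|Log|RCSfile|Revision|Source|State)'
    rb'(?::[^$\n]*)?\$')


def collapse_keywords(data: bytes) -> bytes:
    """Rewrite each expanded keyword, e.g. b'$Id: foo.c 1.22 ... $', as b'$Id$'."""
    return VCS_KEYWORD_RE.sub(lambda m: b'$' + m.group(1) + b'$', data)


def keywordless_sha1(data: bytes) -> str:
    """SHA-1 hex digest of the content after collapsing keywords."""
    return hashlib.sha1(collapse_keywords(data)).hexdigest()


# Hypothetical file contents that differ only in the expanded $Id$ line.
foo1 = (b"/*\n * $Id: foo.c 123456 2015-01-31 12:34:56 mdb $\n */\n"
        b"int main(void) { return 0; }\n")
foo2 = (b"/*\n * $Id: foo.c 1.22 2010-02-20 11:56:00 jon $\n */\n"
        b"int main(void) { return 0; }\n")

# The plain SHA-1 values differ, the keyword-collapsed ones agree.
print(hashlib.sha1(foo1).hexdigest() == hashlib.sha1(foo2).hexdigest())
print(keywordless_sha1(foo1) == keywordless_sha1(foo2))
print("FileChecksum: SHA1NOCVS:" + keywordless_sha1(foo1))
```

Collapsing to the bare $Id$ form (rather than deleting the keyword)
matches what Mark described, and it means a file that was never
expanded hashes the same as one that was.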


2. You do not expand keywords at all, if you control the checkout. This
may not always be practical.
---------------------------

3. You use a non-cryptographic, "locality sensitive" hash for
approximate file comparison.
---------------------------

You can then name this and use it instead of, or in addition to, a SHA1.
For instance, you could use "ssdeep" or tlsh. Say you use tlsh and name
it "TLSH" as the algorithm identifier.
You could then use this SPDX tag/value snippet:

FileChecksum: TLSH:21ada346713283327


To illustrate this, consider these two files where the only difference is
>
> 2c2
> <  * $Id: foo.c 123456 2015-01-31 12:34:56 mdb $
> ---
>>  * $Id: foo.c 1.22 2010-02-20 11:56:00 jon $
>


I have attached these to this gist:
https://gist.github.com/pombredanne/4f767acb1ab480ad3b93287f6cdb1de9

If I compute a tlsh for these using the head of
https://github.com/trendmicro/tlsh, I can see right away that
the checksums are very similar, just like the files:

$ bin/tlsh -r foo
0051625E224447B306E647A6691BA8CCA00DD01D3767AE05240FB16C4E2B27EC6FFC94  foo/foo1.c
7251725E224457B306E647A66A1FA8CC600DD01D37A7AE04240FB16C4E2B37EC6FFC94  foo/foo2.c

They are still not exactly the same, but you can see how the LSH
approach differs from a crypto checksum.

And the tlsh "distance" between the two files is very small (here, 9),
which is the property you want to exploit to determine that two files
are essentially the same except for very minor changes:

$ bin/tlsh -f foo/foo1.c -c foo/foo2.c
   9 foo/foo1.c
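
For readers who want to see the "similar input, similar fingerprint"
property without building tlsh, here is a toy sketch in pure Python.
Note that this is NOT the TLSH algorithm: it is a simple 64-bit SimHash
that I am swapping in for illustration only, and the file contents and
function names are made up. The distance here is a Hamming distance
between fingerprints, not a tlsh score:

```python
import hashlib


def simhash64(text: str) -> int:
    """Toy 64-bit SimHash over whitespace-separated tokens (not TLSH)."""
    counts = [0] * 64
    for token in text.split():
        # Use the first 64 bits of an MD5 digest as a per-token hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # Each fingerprint bit is the majority vote of all tokens for that bit.
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical file body shared by both variants.
body = "\n".join(f"int var_{i} = compute_{i}(arg_{i});" for i in range(100))
foo1 = "/*\n * $Id: foo.c 123456 2015-01-31 12:34:56 mdb $\n */\n" + body
foo2 = "/*\n * $Id: foo.c 1.22 2010-02-20 11:56:00 jon $\n */\n" + body
other = " ".join(f"unrelated_{i}" for i in range(300))

print(hamming(simhash64(foo1), simhash64(foo2)))   # near-duplicate pair
print(hamming(simhash64(foo1), simhash64(other)))  # unrelated pair
```

You would expect the first distance to be much smaller than the second,
whereas a crypto checksum tells you nothing about how close two
different files are.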



As a side note, I have built several alternative LSH fingerprints, and
I will push them as FOSS to a https://github.com/nexB/ repo at some
point.


I hope this helps!
-- 
Cordially
Philippe Ombredanne
_______________________________________________
Spdx-tech mailing list
Spdx-tech@lists.spdx.org
https://lists.spdx.org/mailman/listinfo/spdx-tech
