I think there has been a misunderstanding.


The "encoding" item on the agenda simply means that there is a proposal to 
standardize on UTF-8 as the encoding of the files in which the XML versions of 
the licenses (in the SPDX master license repo) are stored.



As to what you should be looking for in order to extract copyright notices, 
the list is longer than what you include. For example, when reading an HTML 
file, the copyright symbol might be encoded as the characters "&#169;" or 
"&#xA9;" (besides the "&copy;" that you have). And strings in C or Python code 
might use "\u00A9" or u"\u00A9", although these are probably not a 
copyright notice for the file itself.
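
To illustrate the HTML case, here is a minimal Python sketch, assuming the file has already been read into a string. It relies on the standard library's html.unescape, which resolves named, decimal, and hexadecimal character references alike. The helper name find_html_copyright_signs is made up for this example and is not part of any SPDX tool.

```python
import html


def find_html_copyright_signs(markup: str) -> int:
    """Count copyright signs in HTML text after resolving character references.

    Hypothetical helper, for illustration only.
    """
    # html.unescape turns &copy;, &#169; and &#xA9; into the single
    # character U+00A9, so one check covers all three spellings.
    decoded = html.unescape(markup)
    return decoded.count("\u00A9")
```

A real extractor would probably also want the position of each match, not just a count, so that the text following the sign can be captured as the notice.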





-- zvr -



-----Original Message-----
From: spdx-legal-boun...@lists.spdx.org 
[mailto:spdx-legal-boun...@lists.spdx.org] On Behalf Of Mark D. Baushke
Sent: Friday, 21 October, 2016 18:16
To: J Lovejoy <opensou...@jilayne.com>
Cc: SPDX-legal <spdx-legal@lists.spdx.org>
Subject: Re: Joint Call: Tuesday, Oct 25th w/Tech Team



Hi Jilayne & Paul,



- Encoding (propose UTF-8)



I have no problem with this. I do think that some folks may not completely 
understand the implications.



I would like to see a table of all of the representations of various copyright 
signs that we need to consider when we extract from a file.



To date I have observed the following:



  (c)    - 0x28 0x63 0x29
           (U+0028 U+0063 U+0029)
  (C)    - 0x28 0x43 0x29
           (U+0028 U+0043 U+0029)
  ©      - 0xc2 0xa9 (U+00A9) 'COPYRIGHT SIGN'
  Ⓒ      - U+24B8 'circled latin capital letter c'
  &copy; - 0x26 0x63 0x6f 0x70 0x79 0x3b
           (U+0026 U+0063 U+006f U+0070 U+0079 U+003b)



Although I have only seen the graphic for the 'SOUND RECORDING COPYRIGHT' on 
labels, I thought it may also be worth mentioning:



  (P)    - 0x28 0x50 0x29 (U+0028 U+0050 U+0029)
  ℗      - 0xe2 0x84 0x97 (U+2117) 'SOUND RECORDING COPYRIGHT'
  Ⓟ      - 0xe2 0x93 0x85 (U+24C5) 'circled latin capital letter p'



Note that I have also seen a bare 0xa9 in a file without the preceding 
0xc2 byte. Technically, that is not a valid UTF-8 representation. So, we may 
also need to consider how to handle those kinds of situations.
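
One way to cope with such files, sketched here in Python: try strict UTF-8 first and fall back to a single-byte encoding only when decoding fails. The choice of ISO-8859-1 as the fallback is an assumption (0xa9 is the copyright sign there, and in Windows-1252 as well), and the function name decode_lenient is hypothetical.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to ISO-8859-1 on invalid input.

    Hypothetical helper; the fallback encoding is an assumption.
    """
    try:
        # A bare 0xa9 is a continuation byte with no lead byte,
        # so strict UTF-8 decoding rejects it.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this never fails;
        # 0xa9 becomes U+00A9.
        return raw.decode("latin-1")
```

The fallback applies to the whole file, so a file that mixes valid UTF-8 sequences with stray legacy bytes would have its multi-byte characters misread. Handling that case would need a per-sequence strategy, which this sketch does not attempt.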



There are other interesting multiple representations in licenses such as:



  - ''as is'' (uses U+0027) and

  - "as is"   (uses quotation mark U+0022) and

  - &ldquo;as is&rdquo; and

  - <U+201C>as is<U+201D>

  - <U+201F>as is<U+201F>



there may be a few others as well.



I guess the point I am trying to make is that it may be desirable to transcode 
some UTF-8 into a canonical and recommended encoding form when doing things 
like license extraction.
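
As a rough illustration of what such transcoding could look like, here is a small Python sketch. The canonical targets (U+00A9 for copyright markers, U+0022 for double quotes) are assumptions chosen for the example, not forms that anyone has agreed on, and the name canonicalize is hypothetical.

```python
import html
import re

# Assumed canonical targets; these are illustrative choices only.
_COPYRIGHT = re.compile(r"\([cC]\)|\u24B8")
_DOUBLE_QUOTES = re.compile(r"[\u201C\u201D\u201F]|''")


def canonicalize(text: str) -> str:
    """Map the variant spellings listed above onto one form each."""
    # Resolve &copy;, &ldquo;, &rdquo; and similar references first.
    text = html.unescape(text)
    # (c), (C) and the circled capital C all become U+00A9.
    text = _COPYRIGHT.sub("\u00A9", text)
    # Curly quotes and doubled apostrophes become U+0022.
    text = _DOUBLE_QUOTES.sub('"', text)
    return text
```

Note that blindly rewriting "(c)" will also hit enumerations such as "(a), (b), (c)", and doubled apostrophes can be an empty string in source code, so a real tool would need more context before substituting.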



--

Mark D. Baushke

m...@juniper.net

_______________________________________________

Spdx-legal mailing list

Spdx-legal@lists.spdx.org

https://lists.spdx.org/mailman/listinfo/spdx-legal