Re: Joint Call: Tuesday, Oct 25th w/Tech Team

2016-10-21 Thread Mark D. Baushke
Hi Alexios,

Zavras, Alexios  writes:

> I think there has been a misunderstanding.

Yes, that is very likely. I regret that I seem to be having trouble
understanding the topic. I will endeavor to make my point with more
clarity.

> The "encoding" item on the agenda simply means that there is a
> proposal to standardize on UTF-8 for the file format in which the XML
> version of the licenses (in the SPDX master license repo) are stored.

Yes. My question seems to have been unclear. I regret this.

The difficulty is in the word standardize. UTF-8 allows for many
possible expressions of the same token. In particular, the text
expected in a standard license in XML will contain a number of
different characters which have multiple representations.

One meaning of the term standardize would be to come up with a single
canonical representation for the template.

Will this meeting take up which of those many representations should be
used as the canonical representation in the SPDX XML master license
repository?

Items we see in a copyright and license file may include multiple
representations of:

Double Quote, Single quote, Copyright Sign, Registered Sign, Trade
Mark Sign, etc.

Will there be an SPDX specification of what to put into the template
even if it may also be needful to look for the alternatives when doing
an extraction? Or, will there be an SPDX XML token that specifies the
class of representations that may be present?

fwiw: I would also hope that a full set of DTDs is to be generated for
the SPDX dialect of XML.

> As to what you should be looking for, in order to extract copyright
> notices, the list is longer than what you include. For example, when
> reading an HTML file, the copyright symbol might be encoded as the
> characters "&#169;" or "&#xA9;" (besides the "&copy;" that you have).
> And strings in C or Python code might use "\u00A9" or u"\u00A9",
> although these are probably not a copyright notice for the file
> itself.

True. However, looking at the XML prototype license, what canonical
form should be used to represent all of the other possible forms?

My original question was not clear.

I am asking if we are going to see something like  as
the SPDX XML template to represent any of the various encodings that
could exist?

For example, in MIT.xml should I see

  Copyright (c) year copyright holder

or

  Copyright   

so that each element could be used as a processing token for pattern
matching?

Also, in that file we have the text

  (the "Software")

which uses U+0022 for the double quote. I have seen some documents that
are using the multibyte 'LEFT DOUBLE QUOTATION MARK' (U+201C) Software
'RIGHT DOUBLE QUOTATION MARK' (U+201D). What canonical representation
will be used in the XML templates? My personal preference is U+201D.
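
To make the "class of representations" idea concrete, here is a minimal sketch of how a matcher might treat every double-quote variant as equivalent when comparing text against a template. It is only an illustration: the names `QUOTE_CLASS` and `template_to_regex` are hypothetical, and the set of quote characters is limited to the three mentioned above (U+0022, U+201C, U+201D).

```python
import re

# Assumption: only the three double-quote characters named in this
# message are treated as interchangeable.
QUOTE_CHARS = '"\u201c\u201d'
QUOTE_CLASS = '["\u201c\u201d]'

def template_to_regex(template):
    """Build a regex in which any double-quote variant in the template
    matches any other variant; all other characters match literally."""
    pattern = "".join(
        QUOTE_CLASS if ch in QUOTE_CHARS else re.escape(ch)
        for ch in template
    )
    return re.compile(pattern)

# Hypothetical usage with the fragment quoted from MIT.xml.
matcher = template_to_regex('(the "Software")')
```

With this approach the template can store whichever form is chosen as canonical, while extraction still accepts the alternatives.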

I hope this helps with the understanding of my question as it relates to
UTF-8 selection for XML templates.

Please pardon the length of this message; I only endeavor to make my
question clearer.
-- 
Mark D. Baushke
m...@juniper.net
___
Spdx-legal mailing list
Spdx-legal@lists.spdx.org
https://lists.spdx.org/mailman/listinfo/spdx-legal


RE: Joint Call: Tuesday, Oct 25th w/Tech Team

2016-10-21 Thread Zavras, Alexios
I think there has been a misunderstanding.



The "encoding" item on the agenda simply means that there is a proposal to 
standardize on UTF-8 for the file format in which the XML version of the 
licenses (in the SPDX master license repo) are stored.



As to what you should be looking for, in order to extract copyright notices, 
the list is longer than what you include. For example, when reading an HTML 
file, the copyright symbol might be encoded as the characters "&#169;" or 
"&#xA9;" (besides the "&copy;" that you have). And strings in C or Python code 
might use "\u00A9" or u"\u00A9", although these are probably not a 
copyright notice for the file itself.
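
As a rough sketch of how an extractor could fold the HTML encodings back into a single character before searching, the following uses `html.unescape` from the Python standard library, which resolves named, decimal, and hexadecimal character references. The helper name `decode_html_copyright` and the sample strings are illustrative only.

```python
import html

def decode_html_copyright(text):
    """Resolve HTML character references so that the named, decimal and
    hexadecimal forms of the copyright sign all become U+00A9."""
    return html.unescape(text)

# Hypothetical sample inputs, one per HTML encoding.
samples = [
    "&copy; 2016 Example",
    "&#169; 2016 Example",
    "&#xA9; 2016 Example",
]
decoded = [decode_html_copyright(s) for s in samples]
```

Escape sequences inside source-code string literals, such as `\u00A9`, are deliberately left untouched by this step, since they are usually program data rather than a notice for the file itself.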





-- zvr -



-Original Message-
From: spdx-legal-boun...@lists.spdx.org 
[mailto:spdx-legal-boun...@lists.spdx.org] On Behalf Of Mark D. Baushke
Sent: Friday, 21 October, 2016 18:16
To: J Lovejoy 
Cc: SPDX-legal 
Subject: Re: Joint Call: Tuesday, Oct 25th w/Tech Team





Re: Joint Call: Tuesday, Oct 25th w/Tech Team

2016-10-21 Thread Mark D. Baushke
Hi Jilayne & Paul,

 - Encoding (propose UTF-8)

I have no problem with this. I do think that some folks may not
completely understand the implications.

I would like to see a table of all of the representations of various
copyright signs that we need to consider when we extract from a file.

To date I have observed the following:

  (c)    - 0x28 0x63 0x29
           (U+0028 U+0063 U+0029)
  (C)    - 0x28 0x43 0x29
           (U+0028 U+0043 U+0029)
  ©      - 0xc2 0xa9 (U+00A9) - 'COPYRIGHT SIGN'
  Ⓒ      - U+24B8 'circled latin capital letter c'
  &copy; - 0x26 0x63 0x6f 0x70 0x79 0x3b
           (U+0026 U+0063 U+006f U+0070 U+0079 U+003b)

Although I have only seen the graphic for the 'SOUND RECORDING
COPYRIGHT' on labels, I thought it may also be worth mentioning:

  (P)    - 0x28 0x50 0x29 (U+0028 U+0050 U+0029)
  ℗      - 0xe2 0x84 0x97 (U+2117) 'SOUND RECORDING COPYRIGHT'
  Ⓟ      - 0xe2 0x93 0x85 (U+24C5) 'circled latin capital letter p'
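
For illustration, the representations listed so far can be kept in a small table and turned into a single search pattern. This is a sketch, not a complete list: `COPYRIGHT_FORMS`, `utf8_hex`, and `find_marks` are hypothetical names, and the table contains only the code points given above.

```python
import re

# (text form, description) pairs taken from the code points listed above.
COPYRIGHT_FORMS = [
    ("(c)", "parenthesized lowercase c"),
    ("(C)", "parenthesized uppercase C"),
    ("\u00a9", "COPYRIGHT SIGN"),
    ("\u24b8", "CIRCLED LATIN CAPITAL LETTER C"),
    ("&copy;", "HTML named character reference"),
    ("(P)", "parenthesized uppercase P"),
    ("\u2117", "SOUND RECORDING COPYRIGHT"),
    ("\u24c5", "CIRCLED LATIN CAPITAL LETTER P"),
]

def utf8_hex(text):
    """Return the UTF-8 bytes of text in the 0x.. notation used above."""
    return " ".join(f"0x{byte:02x}" for byte in text.encode("utf-8"))

# One alternation that matches any listed form literally.
MARK_RE = re.compile("|".join(re.escape(form) for form, _ in COPYRIGHT_FORMS))

def find_marks(text):
    """Return every listed mark found in text, in order of appearance."""
    return MARK_RE.findall(text)

if __name__ == "__main__":
    # Print the table so the byte sequences can be checked by eye.
    for form, description in COPYRIGHT_FORMS:
        print(f"{form!r:10} {utf8_hex(form):40} {description}")
```

Matching "(c)" literally will also hit enumerated list items such as "(a), (b), (c)", so a real extractor would need surrounding context (for example the word "Copyright") before treating a hit as a notice.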

Note that I have also seen a bare 0xa9 in a file without the preceding
0xc2 byte. Technically that is not a valid UTF-8 file representation. So,
we may need to also consider how to handle those kinds of situations.
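
One possible way to handle the bare 0xa9 case, sketched under the assumption that a file which fails UTF-8 decoding is a legacy single-byte file in ISO-8859-1 (where 0xA9 is the copyright sign). The function name `decode_with_fallback` is hypothetical, and the fallback encoding is a guess that will not be right for every file.

```python
def decode_with_fallback(raw):
    """Decode bytes as UTF-8; if that fails, assume ISO-8859-1.

    Assumption: invalid UTF-8 means a legacy Latin-1 file. ISO-8859-1
    maps every byte to a code point, so this second step never raises.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

# Hypothetical inputs: valid UTF-8 and a bare 0xa9 byte.
valid = decode_with_fallback(b"Copyright \xc2\xa9 2016")
legacy = decode_with_fallback(b"Copyright \xa9 2016")
```

A file that mixes valid UTF-8 sequences with stray Latin-1 bytes would be decoded entirely as Latin-1 here, turning the valid sequences into mojibake, so mixed files need a more careful, byte-by-byte strategy.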

There are other interesting multiple representations in licenses such as:

  - ''as is'' (uses U+0027) and
  - "as is"   (uses quotation mark U+0022) and
  - as is and
  - as is
  - as is

there may be a few others as well.

I guess the point I am trying to make is that it may be desirable to
transcode some UTF-8 into a canonical and recommended encoding form
when doing things like license extraction.
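
A minimal sketch of such a transcoding step follows. The choice of canonical forms here (U+0022 for double quotes, U+0027 for single quotes, U+00A9 for the copyright sign) is purely an assumption for the example, not something the thread has decided, and `canonicalize` is a hypothetical helper name.

```python
import re

# Assumed canonical targets; the thread has not settled on any of these.
CANONICAL = str.maketrans({
    "\u201c": '"',       # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',       # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",       # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",       # RIGHT SINGLE QUOTATION MARK
    "\u24b8": "\u00a9",  # CIRCLED LATIN CAPITAL LETTER C
})

# "(c)" or "(C)" only when it directly follows the word "copyright",
# so enumerated items such as "(a), (b), (c)" are left alone.
PAREN_C_RE = re.compile(r"(?<=copyright )\(c\)", re.IGNORECASE)

def canonicalize(text):
    """Fold the quote and copyright variants into one assumed form."""
    text = text.translate(CANONICAL)
    text = text.replace("''", '"')  # two apostrophes used as a double quote
    return PAREN_C_RE.sub("\u00a9", text)
```

Applying the same function to both the template and the extracted text before comparison would let either side use any of the variants.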

-- 
Mark D. Baushke
m...@juniper.net


Re: Joint Call: Tuesday, Oct 25th w/Tech Team

2016-10-21 Thread Brad Edmondson
Works for me; thanks Jilayne and Gary.

Best,
Brad

--
Brad Edmondson, *Esq.*
512-673-8782 | brad.edmond...@gmail.com

On Fri, Oct 21, 2016 at 12:29 AM, J Lovejoy  wrote:

> We will have a joint call with tech team, joining their regular call time
> on *Tuesday, Oct 25th @ 18:00 GMT (10:00AM PT, 11:00 MT, 12:00PM CT,
> 1:00PM ET)*.  Please mark your calendars.
>
> Dial-in (same as we use): http://uberconference.com/SPDXTeam or  Call:
> +1-857-216-2871
> PIN # 38633
>
> Agenda:
>
> Close on the terms and discuss any next steps related to the following
> items:
>
> -  Encoding (propose UTF-8)
> -  The high level element name
> -  Paragraph tag or p or some other term
> -  Use of the  tags
>
> All of the proposals except encoding are on the Google docs page:
>  https://docs.google.com/document/d/1z9n44xLH2MxT576KS_AbTOBtecyl5cw6RsrrQHibQtg/edit
>
>
> Thanks,
> Jilayne & Paul
> SPDX Legal Team co-leads