Hi Alexios,

Zavras, Alexios <alexios.zav...@intel.com> writes:

> I think there has been a misunderstanding.

Yes, that is very likely. I regret that I seem to be having trouble
understanding the topic. I will endeavor to make my point with more
clarity.

> The "encoding" item on the agenda simply means that there is a
> proposal to standardize on UTF-8 for the file format in which the XML
> version of the licenses (in the SPDX master license repo) are stored.

Yes. My question seems to have been unclear. I regret this.

The difficulty is in the word "standardize". UTF-8 allows for many
possible expressions of the same token. In particular, the text expected
in a standard license in XML will contain a number of different
characters which have multiple representations.

One meaning of the term "standardize" would be to come up with a single
canonical representation for the template.

Will this meeting take up which of those many representations should be
used as the canonical representation in the SPDX XML master license
repository?

Items we see in a copyright and license file may include multiple
representations of: Double Quote, Single Quote, Copyright Sign,
Registered Sign, Trade Mark Sign, etc.

Will there be an SPDX specification of what to put into the template,
even if it may also be needful to look for the alternatives when doing
an extraction?

Or, will there be an SPDX XML token that specifies the class of
representations that may be present?

fwiw: I would also hope that a full set of DTDs is to be generated for
the SPDX dialect of XML.

> As to what you should be looking for, in order to extract copyright
> notices, the list is longer than what you include. For example, when
> reading an HTML file, the copyright symbol might be encoded as the
> characters "&copy;" or "&#169;" (besides the "©" that you have).
> And strings in C or Python code might use ""\u00A9"" or "u"\u00A9"",
> although these are probably not a copyright notice for the file
> itself.

True.
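
To illustrate the extraction side, here is a minimal Python sketch that
folds the kinds of variants mentioned above into one form before pattern
matching. It is only an illustration: the mapping table, the choice of
U+0022, U+0027 and U+00A9 as the folded forms, and the function name
`canonicalize` are all assumptions, and nothing here is specified by
SPDX.

```python
import html
import re
import unicodedata

# Hypothetical mapping from variant characters to one folded form.
# The target characters are assumptions, not SPDX decisions.
VARIANTS = str.maketrans({
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
})

# ASCII stand-in for the copyright sign. Caution: "(c)" can also be an
# ordinary list label, so a real extractor would need more context.
ASCII_COPYRIGHT = re.compile(r"\(c\)", re.IGNORECASE)


def canonicalize(text: str) -> str:
    """Fold common variants of quotes and the copyright sign."""
    text = html.unescape(text)                 # decode HTML character references
    text = unicodedata.normalize("NFC", text)  # compose combining sequences
    text = text.translate(VARIANTS)            # fold curly quotes to ASCII quotes
    return ASCII_COPYRIGHT.sub("\u00a9", text)  # fold "(c)" to U+00A9


if __name__ == "__main__":
    print(canonicalize("Copyright &copy; 2016 (the \u201cSoftware\u201d)"))
```

Folding at match time leaves the template free to store whichever single
form is eventually chosen; the table above would simply be adjusted to
point at that form.
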
However, looking at the XML prototype license, what canonical form
should be used to represent all of the other possible forms?

My original question was not clear. I am asking whether we are going to
see something like <copyright-sign/> as the SPDX XML template to
represent any of the various encodings that could exist.

For example, in MIT.xml should I see

  <p>Copyright (c) <year> <copyright holder></p>

or

  <p>Copyright <copyright-sign/> <year-range/> <copyright-holder/></p>

so that each element could be used as a processing token for pattern
matching?

Also, in that file we have the text (the "Software") which uses U+0022
for the double quote. I have seen some documents that are using the
multibyte 'LEFT DOUBLE QUOTATION MARK' (U+201C) Software 'RIGHT DOUBLE
QUOTATION MARK' (U+201D).

What canonical representation will be used in the XML templates? My
personal preference is U+201D.

I hope this helps with the understanding of my question as it relates to
UTF-8 selection for XML templates.

Please pardon the length of this message; I only endeavor to make my
question more clear.

-- 
Mark D. Baushke
m...@juniper.net
_______________________________________________
Spdx-legal mailing list
Spdx-legal@lists.spdx.org
https://lists.spdx.org/mailman/listinfo/spdx-legal