On 05/04/2015 02:11 PM, Laszlo Ersek wrote:
> (I couldn't find a better point to insert my two cents in the thread, so
> I'll just follow up here.)
>
> On 05/04/15 20:06, Jordan Justen wrote:
>> On 2015-05-04 10:48:05, Andrew Fish wrote:
>>> On May 4, 2015, at 10:33 AM, Jordan Justen <[email protected]>
>>> wrote:
>>> On 2015-05-04 08:57:29, Kinney, Michael D wrote:
>>>
>>> Jordan,
>>>
>>> Some source control systems provide support for a file type of UTF-16LE,
>>> so the use of 'binary' should be avoided. What source control
>>> systems require the use of 'binary'?
>>>
>>> Svn seems to require it so the data doesn't get munged.
>>>
>>> Git seems to auto-detect UTF-16 .uni files as binary.
>
> UTF-16 files are *by definition* binary, because they are byte order
> dependent. UTF-8 in comparison is byte order agnostic.
>
>>>
>>> What diff utilities are having issues with UTF-16LE files? Can you
>>> provide some examples?
>>>
>>> I don't think it is possible to create a .uni patch file for EDK II
>>> that actually works.
>>>
>>> Normally, for .uni files we just see something about the diff being
>>> omitted on a binary file.
>>>
>>> With git, I know there are ways to force a diff to be produced, but I
>>> don't believe it would actually work to apply a patch to a UTF-16
>>> file.
>>>
>>> This stackoverflow article seems to imply you can make git-merge
>>> work with a .gitattributes file setting:
>>> http://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text
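[For reference, a minimal sketch of the .gitattributes approach, assuming
the .uni files are UTF-16LE and iconv is available. Note that a textconv
driver only affects *displayed* diffs (git diff/log/show); it does not
make git-format-patch or git-apply work on UTF-16 files.]

```shell
# Mark *.uni files as using a custom "utf16" diff driver.
printf '*.uni diff=utf16\n' >> .gitattributes

# Define that driver: convert UTF-16LE to UTF-8 before diffing, so
# "git diff" shows readable text instead of "Binary files differ".
git config diff.utf16.textconv 'iconv -f UTF-16LE -t UTF-8'
```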
>
> I had posted a more or less "complete" solution for working with UNI
> files under git, in Feb 2014, and Brian J. Johnson added a bunch of
> valuable info to it (see all messages in the thread):
>
> http://thread.gmane.org/gmane.comp.bios.tianocore.devel/6351
>
> This approach solves diffing (+ merging, rebasing etc) patches for UNI
> files. What it doesn't solve, unfortunately, are the following issues:
>
FWIW, I vote for adding UTF-8 input file support. It should make
working with string files on Linux a lot simpler.
> (a) UTF-8 is a standard *external* text representation, while UTF-16 /
> UCS-2 is at best a standard *internal* text representation (ie. maybe
> for objects living inside C programs).
>
> On most Linux computers, locales will have been set up for handling
> UTF-8, allowing terminal emulators and various text-oriented utilities
> to work with (incl. "display") UTF-8 files out of the box.
>
> (In fact I had to write a small script called "uni-grep" because
> otherwise I can't git-grep the UNI files of the edk2 tree for patterns.)
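[The "uni-grep" wrapper Laszlo mentions could look roughly like this --
a hypothetical sketch, not his actual script. It assumes the .uni files
are UTF-16LE and that GNU grep (for --label) and iconv are installed.]

```shell
#!/bin/sh
# uni-grep PATTERN FILE... -- grep UTF-16LE files as if they were text.
# Converts each file to UTF-8 on the fly, then greps the result,
# labeling matches with the original file name.
pattern=$1
shift
for f in "$@"; do
    iconv -f UTF-16LE -t UTF-8 "$f" | grep -H --label="$f" -- "$pattern"
done
```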
>
> Additionally, UTF-16 capable editors are presumably a rare species in
> comparison to UTF-8 capable ones. If we consider that the edk2 tree
> mostly contains English translations (which fit in ASCII), then choosing
> UTF-16 excludes "old" (ie. 8-bit-only) editors for no good practical reason.
>
> IMO, choosing UTF-16LE for external text encoding is an inconvenient
> Windows-ism that breaks standard (=POSIX) text utilities, and
> non-standard yet widespread tools, in most Linux distros.
>
All good points. I agree.
> (b) The git hacks mentioned thus far do not cover git-format-patch. Git
> will always think that UTF-16LE encoded files are binary (because they
> are), and therefore it will format patches for them as binary deltas
> (encoded with base64 or something similar). The hacks referenced above
> exist on a higher (more abstract) level only.
>
> Binary delta patches are unreviewable on a mailing list.
>
> For patches that touch UTF-8 text files, git-format-patch generates
> plaintext emails, with correct headers such as:
>
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
>> Is there a concern with supporting UTF-8?
>>
>> It seems like in general UTF-8 is better supported, and requires no
>> configuration tweaks.
>>
>> I think the situation is that UTF-8 has become the most commonly used
>> format, and therefore it is the most likely format to work well with
>> tools.
>>
>> For EDK II's needs, I can't see a downside to supporting UTF-8, and it
>> did not require a huge amount of effort.
>
Seconded.
> I'd support the idea of going UTF-8-only. The files can be converted all
> at once (it would take a 10-20 line shell script approximately, and
> conversion errors could be caught), in one big commit, or else we could
> move forward gradually (same as with the nasm conversion).
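[The one-shot conversion Laszlo describes could be sketched as below --
an assumption of what such a script might look like, not an agreed
implementation. It assumes every tracked .uni file is valid UTF-16LE;
iconv exits non-zero on malformed input, so a bad file aborts the run
before its converted copy replaces the original.]

```shell
#!/bin/sh
# Convert every tracked .uni file from UTF-16LE to UTF-8, in place.
set -e
for f in $(git ls-files '*.uni'); do
    iconv -f UTF-16LE -t UTF-8 "$f" > "$f.tmp"
    mv "$f.tmp" "$f"
done
```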
>
No opinion on whether the open source files should be converted at once,
or over time, or at all. But closed-source vendors have their own
UTF-16 files, so we shouldn't remove support for UTF-16.
> I do think such files should be distinguished with a separate filename
> suffix.
Yes. Otherwise developers will get confused why some ".uni" files work
with their tools, and some do not.
Nice work, Jordan!
--
Brian J. Johnson
--------------------------------------------------------------------
My statements are my own, are not authorized by SGI, and do not
necessarily represent SGI’s positions.
_______________________________________________
edk2-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/edk2-devel