Hi Shigio

I am thinking about tackling this feature in a reasonably useful and robust
way. I am not concerned about Python 2.x, but wouldn't want to break
compatibility either. As it stands, ‘latin1’ encoding is used for
implementing something like “binary but with newlines”.

   - The current implementation of pygments_parser.py is incomplete wrt I/O
   encoding and will probably break when challenged with characters outside
   the ASCII range.
   - Input in any encoding that is not ASCII-compatible is probably not
   going to work at all.
   - Many OS-facing functions, such as ‘os.getenv’, but also the low-level
   parts of ‘subprocess.Popen()’, use ‘sys.getfilesystemencoding()’ for
   determining the desired encoding. Most current Unix-like systems are
   configured with UTF-8 based locales, and even Python on Windows defaults
   to UTF-8 for OS-facing encoding (since 2016, Python 3.6+, PEP 529).
   - Any non-ASCII content of gtags.conf is most likely going to break
   pygments_parser.py in one way or another. I'd propose to rely on
   ‘sys.getfilesystemencoding()’ as well for reading.
   - Source code must be presented to Pygments' lexers as a string.
   Programming languages that allow non-ASCII source code would normally use
   UTF-8 (e.g. Python), which I'd recommend for ‘read_file()’, possibly with
   an appropriate error handler. Depending on how a Lexer implements string
   handling, exotic encodings might even be less broken than before if bytes
   are preserved via ‘surrogateescape’ or ‘backslashreplace’.
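To illustrate the last point, here is a minimal Python 3 sketch. It is not
the current pygments_parser.py code: the signature of read_file() and its
default arguments are assumptions made for this example. It decodes as UTF-8
and keeps undecodable bytes as lone surrogates, so they can be restored on
output.

```python
import os
import tempfile


def read_file(path, encoding="utf-8", errors="surrogateescape"):
    """Hypothetical reader: decode as UTF-8, keep undecodable bytes.

    With 'surrogateescape', each byte that is not valid UTF-8 becomes a
    lone surrogate (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
    """
    # newline="" disables newline translation, so line endings survive as-is.
    with open(path, "r", encoding=encoding, errors=errors, newline="") as f:
        return f.read()


# Demo input: one stray Latin-1 byte (0xE9) and one valid UTF-8 sequence.
raw = b"int caf\xe9; /* latin1 byte */\nint \xc3\xbc;\n"

fd, tmp_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)
try:
    text = read_file(tmp_path)
finally:
    os.remove(tmp_path)

# Encoding with the same error handler restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")
```

Whether a given lexer copes with lone surrogates in its input is a separate
question that would need testing. With ‘backslashreplace’ the text stays
printable, but the original bytes cannot be restored automatically.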

IMHO, relying on the respective system default encoding in most places and
an explicit UTF-8 in read_file() is going to improve compatibility and, as a
side effect, help unify code paths between Python 2 and 3.
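
As a rough sketch of the "system default encoding" part (Python 3 only,
since os.fsdecode() does not exist in Python 2): the helper name to_text()
is hypothetical and not part of pygments_parser.py. It normalizes an
OS-facing value to str before any comparison, which would also avoid the
b'' versus '' comparison from the quoted thread below.

```python
import os
import sys

# Encoding Python uses for OS-facing data (file names, environment, argv).
fs_encoding = sys.getfilesystemencoding()


def to_text(value):
    """Hypothetical helper: return an OS-facing value as str.

    os.fsdecode() accepts str or bytes; bytes are decoded with the
    filesystem encoding and its error handler, str is returned unchanged.
    """
    return os.fsdecode(value)


# In Python 3, bytes and str never compare equal, even when both are empty.
mixed_compare = (b"" == "")

# After normalization, "is ctagscom empty?" behaves the same for both types.
ctagscom_is_unset = to_text(b"") == ""
```

A Python 2 code path would need its own equivalent of to_text(); this sketch
does not attempt that.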

Best regards,
Marcus

On Thu, May 16, 2024 at 12:42 AM Marcus Harnisch <
[email protected]> wrote:

> Hi Shigio
>
> Glad to hear that it didn't work :-) Thank you for adding this to the
> known bugs list.
>
> Best regards,
> Marcus
>
> On Tue, May 14, 2024 at 8:16 AM Shigio YAMAGUCHI <[email protected]> wrote:
>
>> Hi Marcus,
>> I confirmed that the problem is reproduced.
>> I have made a new entry to the 'Known bugs' list.
>> Thank you for the report.
>>
>> [https://www.gnu.org/software/global/bugs.html]
>> o Pygments plug-in parser with python3 does not work, if 'ctagscom' is
>> not set.
>>   If it is not set, default path obtained by configure script should be
>> used.
>>
>> $ cat > gtags.conf
>> default:\
>>         :ctagscom=:\
>>         :langmap=C\:.c.h:\
>>         :gtags_parser=C\:/usr/local/lib/gtags/pygments-parser.la:
>> $ gtags
>> $ global -x '.*'
>> $ _                             # no tags
>>
>> Regards,
>> Shigio
>>
>> On Mon, May 13, 2024 at 5:04 PM Marcus Harnisch
>> <[email protected]> wrote:
>> >
>> > Hi Shigio
>> >
>> > On Sat, May 11, 2024 at 5:35 AM Shigio YAMAGUCHI <[email protected]>
>> wrote:
>> >>
>> >> $ cat gtags.conf
>> >> default:\
>> >> :ctagscom=/opt/local/bin/uctags:\
>> >> :langmap=C\:.c.h:\
>> >> :gtags_parser=C\:/usr/local/lib/gtags/pygments-parser.la:
>> >
>> >
>> > The important difference, which exposes the bug, is your explicit
>> configuration of ctagscom. Leave it undefined and rely on whatever
>> UNIVERSAL_CTAGS has been configured to. Only if ctagscom is empty, you will
>> see a comparison between b'' (empty bytes object) and '' (empty string).
>> >
>> > Best regards,
>> > Marcus
>>
>>
>>
>> --
>> Shigio YAMAGUCHI <[email protected]>
>> PGP fingerprint:
>> 26F6 31B4 3D62 4A92 7E6F  1C33 969C 3BE3 89DD A6EB
>>
>
