Hi,

I found that if a file contains a specific sequence of CJK characters, the parser 
seems to fail to continue parsing the file.

See the following example source file, say `test.c`, encoded in Shift-JIS 
(cp932).

    extern void printf(char * msg, ...);

    void Foo() {
        char msg[] = "機能";
        printf(msg);
    }

    void Hello() {
        return;
    }

(In case the kanji are garbled by encoding issues, screenshots are also 
provided below.)

What happened? (as is)
Now if you run the `gtags` command in the same folder, followed by 
`global -f test.c`, you only get one tag, `Foo`, although `Hello` should also 
be found.
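
One possible explanation, not confirmed against the gtags source, is the well-known Shift-JIS trail-byte issue: in cp932 the second byte of `能` is 0x5C, the same value as an ASCII backslash, so a byte-oriented parser may read it as escaping the closing quote and treat the rest of the file as an unterminated string. The sketch below only illustrates that idea by counting such characters in a buffer. The names `sjis_is_lead` and `count_backslash_trail_bytes` are hypothetical helpers, not part of Global.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the input is cp932 text, where lead bytes of double-byte
 * characters fall in 0x81-0x9F or 0xE0-0xFC. */
static int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Counts double-byte characters in buf[0..len) whose trail byte is 0x5C,
 * the same value as an ASCII backslash. */
size_t count_backslash_trail_bytes(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    size_t hits = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        if (sjis_is_lead(buf[i]) && i + 1 < len) {
            if (buf[i + 1] == 0x5C)
                hits++;
            i += 2;   /* consume lead and trail byte together */
        } else {
            i++;      /* single-byte character (or truncated lead byte) */
        }
    }
    return hits;
}
```

A nonzero count for a source file would suggest that the file is worth checking for this effect. A plain ASCII backslash is not counted, because it is not preceded by a lead byte.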

What did you expect from it?
I expected both `Foo` and `Hello` to be listed. Indeed, if I modify the source 
slightly, the tag `Hello` is found. See the variations I tried in the table 
below.



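Cases Table

Cases      | Source Code Screenshot            | global -f test.c
-----------+-----------------------------------+---------------------------------------------------
Bad Case   | [screenshot not preserved]        | Foo                 4 test.cpp         void Foo() {
           | (Encoding is cp932, or shift-jis) |
-----------+-----------------------------------+---------------------------------------------------
Good Cases | <image001.png>                    | Foo                 4 test.cpp         void Foo() {
           | (Encoding is utf8)                | Hello               9 test.cpp         void Hello() {
           |                                   |
           | [screenshot not preserved]        |
           | (Encoding is cp932, or shift-jis) |
           |                                   |
           | [screenshot not preserved]        |
           | (Encoding is cp932, or shift-jis) |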



My environment

OS:
    Windows 11 Enterprise 22H2 64bit Build 22621.2428

gtags --version:
    gtags (Global) 6.6.9
    Powered by Berkeley DB 1.85.
    Copyright (c) 1996-2022 Tama Communications Corporation
    License GPLv3+: GNU GPL version 3 or later http://www.gnu.org/licenses/gpl.html
    This is free software; you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.



Possible Solutions

- Add a command-line encoding option so that the file is read properly.
- Find out why such a file cannot be fully parsed, ignore that particular 
  error, and continue parsing.
- Also, if such a case happens, at least print an error message to inform the 
  user that some files were not fully parsed.
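
To make the first suggestion concrete, here is a minimal sketch of what encoding-aware scanning of a string literal could look like. It is not taken from Global, and it rests on the assumption that the failure comes from a cp932 trail byte equal to 0x5C (an ASCII backslash) being read as an escape character. The names `is_cp932_lead` and `find_string_end` and the `sjis_aware` flag are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: cp932 lead bytes fall in 0x81-0x9F or 0xE0-0xFC. */
static int is_cp932_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Given buf[start] == '"', returns the index of the closing quote,
 * or -1 if the literal is unterminated. When sjis_aware is nonzero,
 * double-byte characters are skipped as a unit, so a trail byte is
 * never mistaken for a backslash or a quote. */
long find_string_end(const unsigned char *buf, size_t len,
                     size_t start, int sjis_aware)
{
    size_t i;

    assert(buf != NULL && start < len && buf[start] == '"');
    i = start + 1;
    while (i < len) {
        unsigned char c = buf[i];

        if (sjis_aware && is_cp932_lead(c) && i + 1 < len) {
            i += 2;             /* skip lead and trail byte together */
        } else if (c == '\\') {
            i += 2;             /* escape sequence: skip the escaped byte */
        } else if (c == '"') {
            return (long)i;     /* closing quote found */
        } else {
            i++;
        }
    }
    return -1;                  /* ran off the end: unterminated literal */
}
```

With `sjis_aware` set to 0, a literal whose last character ends in 0x5C is reported as unterminated, which would match the symptom of every tag after `Foo` going missing. With `sjis_aware` set to 1, the closing quote is found. A real option would also need a way to select the encoding per file, which this sketch leaves out.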

 

 

Johnny Cheng
