> after the patch it still goes through strlen characters and computes the hash 
> (at least up to maxlen)

The key is that it's at most maxLen and not potentially the whole file 
length. Assume, for instance, that you have a 1 MB file and this problem 
happens at the beginning of the file. Also assume that maxLen (which 
corresponds to the longest keyword) is, say, 30. With this patch, the code 
will go through 30 * 1 000 000 characters (30 in each iteration). Without 
this patch, it will go through 1 character in the first iteration, 2 
characters in the second, ..., and 1 000 000 in the 1 000 000th iteration, 
so roughly 1 000 000 * 1 000 000 / 2 in total.
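
Just to put numbers on it, a trivial C program checking that arithmetic 
(the constants are the illustrative ones from above, nothing from the real 
code):

```c
#include <stdio.h>

int main(void)
{
    unsigned long long n = 1000000;  /* characters in the file       */
    unsigned long long maxLen = 30;  /* longest keyword in the table */

    /* without the patch: 1 + 2 + ... + n character visits */
    unsigned long long without_patch = n * (n + 1) / 2;
    /* with the patch: at most maxLen visits per iteration */
    unsigned long long with_patch = n * maxLen;

    printf("without patch: %llu visits\n", without_patch); /* 500000500000 */
    printf("with patch:    %llu visits\n", with_patch);    /* 30000000     */
    return 0;
}
```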

> so computing the hash is unlikely to be the issue since it still happens with 
> the patch,

Computing the hash isn't the problem - the problem is how many characters it 
is unnecessarily computed over when we know in advance that the string cannot 
be in the hash table because it's too long. But instead of doing a separate 
`strlen() > maxLen` check before computing the hash, I compute the length as 
part of the hashing function, so it reuses the existing loop and can stop 
early instead of walking the whole string first.
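
As a minimal sketch of that idea (the names and hash constants are mine for 
illustration; the actual code in ctags/main/keyword.c may differ):

```c
#include <stddef.h>

/* Hash the string while counting its length in the same loop.  As soon as
 * the length exceeds maxLen we can stop: no keyword is that long, so the
 * caller can reject the string without finishing the hash. */
static int hashWithLimit(const char *s, size_t maxLen, unsigned int *hashOut)
{
    unsigned int h = 5381;  /* djb2-style seed, just for illustration */
    size_t len = 0;

    for (; s[len] != '\0'; len++) {
        if (len >= maxLen)
            return 0;       /* too long: cannot be in the table */
        h = h * 33 + (unsigned char)s[len];
    }

    *hashOut = h;
    return 1;               /* length <= maxLen, hash is valid */
}
```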

> I am not sure about the characteristics of the hash implementation with 
> unruly input, does it limit the chain lengths, or can they extend up to the 
> maximum load factor?

The patch doesn't try to do anything about the hashing itself. It just 
implements this simple logic: is the string we are checking longer than the 
longest string in the hash table? If so, we can be sure it's not in the hash 
table and say no quickly, without doing the full hashing and table lookup. 
Otherwise, hashing and lookup work as before.
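
In terms of the `hashWithLimit` sketch above, the lookup side could look 
like this (`maxKeywordLen`, `keywordEntry` and `findInBucket` are 
hypothetical stand-ins for whatever the table actually tracks):

```c
/* Assumed context: the table remembers the length of the longest keyword
 * ever inserted, and findInBucket() is the usual hash-chain walk. */
extern size_t maxKeywordLen;
extern const struct keywordEntry *findInBucket(unsigned int hash,
                                               const char *name);

static const struct keywordEntry *lookupSketch(const char *name)
{
    unsigned int h;

    /* Longer than any keyword: answer "no" immediately, without
     * finishing the hash or touching the table at all. */
    if (!hashWithLimit(name, maxKeywordLen, &h))
        return NULL;

    return findInBucket(h, name);  /* normal lookup, unchanged */
}
```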

> if the chain lengths can grow long then the cost of 
> [strcasecmp](https://github.com/geany/geany/blob/dea43baf477ab71650cdfc54fa976714def9c170/ctags/main/keyword.c#L171)
>  will dominate since it is slow, having to apply tolower on both strings 
> before comparing, and IIRC SQL is case insensitive.

That line won't be reached because of this check right before it: 
https://github.com/geany/geany/blob/dea43baf477ab71650cdfc54fa976714def9c170/ctags/main/keyword.c#L162-L163

> why the **** is the problematic code searching the hash each character in the 
> first place? Why not scan input until the next end of word and search for 
> that word once?

This happened in the SQL parser because of the various dialects of SQL:
1. In PostgreSQL you can have `$$dollar quoted strings delimited with 
dollars$$`
2. In PL/SQL you can have `$$SOME_KEYWORDS_STARTING_WITH_DOUBLE_DOLLAR`

When reaching the `$$`, the parser was checking against the keyword table 
whether the growing string was one of the PL/SQL keywords. Of course, in 
this case one could check for spaces, and if a space was found, the string 
couldn't be a keyword. It's just that such problems aren't so obvious when 
writing parsers (I gave the example code above as an illustration only; the 
whole thing is much more hidden in the SQL parser), there may be similar 
problems in other parsers too, and it's better to prevent complete parser 
freezes at the hash table level for all parsers than to rely on the parser 
logic alone.
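
To make the scenario concrete, here is a hypothetical sketch of the 
problematic shape (`isPlSqlKeyword` and the buffering are made up; the real 
SQL parser is much more indirect):

```c
#include <stddef.h>

/* Hypothetical keyword-table query, standing in for the real lookup. */
extern int isPlSqlKeyword(const char *s);

/* After "$$" the parser cannot yet tell a PostgreSQL dollar-quoted string
 * from a PL/SQL $$KEYWORD, so it grows the candidate token and queries the
 * keyword table after every added character.  On a long dollar-quoted
 * string each query hashes an ever-longer candidate - quadratic work. */
static void scanAfterDollarDollar(const char *input, char *candidate,
                                  size_t cap)
{
    size_t len = 0;

    while (input[len] != '\0' && len + 1 < cap) {
        /* The parser-level fix mentioned above would bail out at the
         * first space, since no keyword contains one:
         * if (input[len] == ' ') break; */
        candidate[len] = input[len];
        len++;
        candidate[len] = '\0';

        if (isPlSqlKeyword(candidate))  /* lookup on every character */
            return;                     /* it really was a $$KEYWORD */
    }
    /* nothing matched: treat the input as a dollar-quoted string */
}
```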
