On Wed, Dec 26, 2018 at 11:35 PM venkatram.akkin...@gmail.com <
venkatram.akkin...@gmail.com> wrote:

> I have the following string which I am trying to read through
> LexerInput.read().
>
> quote
> టోకెన్
> quote
>
> ట ో క ె న ్
>
> 5 (quote) + 6 (టోకెన్) + 5 (quote) + 2 newline characters, a total
> of 18 characters. LexerInput.read() returns all the characters as expected.
> But it keeps going and returns null characters for each call to
> LexerInput.read().
>

LexerInput.read() returns a primitive int, so it cannot return null; at the
end of input it returns LexerInput.EOF (-1).
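To see why the totals have to line up, note that the input shown really is 18 UTF-16 chars. This stand-alone sketch (plain Java, no NetBeans jars; the EOF sentinel of -1 matches LexerInput.EOF, but ReadLoop itself is just an illustrative stand-in for LexerInput) mimics the read() loop:

```java
public class ReadLoop {
    // LexerInput.EOF is -1 in the NetBeans API; this plain-Java stand-in
    // mimics read() so the loop below runs without NetBeans on the classpath.
    static final int EOF = -1;

    private final String text;
    private int pos;

    ReadLoop(String text) { this.text = text; }

    int read() {
        return pos < text.length() ? text.charAt(pos++) : EOF;
    }

    public static void main(String[] args) {
        // The poster's input: quote + newline + టోకెన్ + newline + quote
        String input = "quote\n\u0C1F\u0C4B\u0C15\u0C46\u0C28\u0C4D\nquote";
        ReadLoop in = new ReadLoop(input);
        int count = 0;
        while (in.read() != EOF) {
            count++;
        }
        System.out.println(count); // prints 18: 5 + 1 + 6 + 1 + 5
    }
}
```

The lengths of the tokens your lexer returns must sum to exactly that count, or you get the error below.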


> I tried to stop at the first null character, back up one character and
> return a null. But this is the error I see.
>

The editor's lexer plumbing insists that tokens be returned for every
character in a file.  If there is a mismatch, it assumes that is a bug in
the lexer.  The sum of the lengths of all tokens returned while lexing must
match the number of characters actually in the input.  It looks like your
lexer is trying to bail out without consuming all the characters in the
file.

If you are backing up one character, that guarantees the editor
infrastructure does not think it is at the end of the file when you return
null, and you will get an exception.


> I guess it makes sense since LexerInputOperation doesn't allow resetting
> the offsets,
>
> > returned null token but lexerInput.readLength()=1
> > lexer-state: DEFAULT_STATE
> > tokenStartOffset=18, readOffset=19, lookaheadOffset=19
> > Chars: "
>

*Your lexer* is returning a null token - signalling EOF - before the actual
end of the file/document/input.

If you are using ANTLR, does your grammar read the entire file?  You need a
rule that includes EOF explicitly, or it is easy to have a grammar which
looks like it works most of the time, but for some files hands you an EOF
token without giving you tokens for the entire file.  ANTLR does what you
tell it to: if you didn't tell it that the content to parse ends only when
the end of the file is encountered, then once it has satisfied the rules you
gave it, it is "done" as far as it is concerned.  Frequently this manifests
as getting an EOF token from ANTLR which is not zero-length and represents
trailing whitespace - in which case, you need to return a token for that
from the call to nextToken(), and set a boolean (or however you want to do
it) to return null on the *next* call.  Or, make sure your grammar will
always, always return non-EOF tokens for every character of input it is
given.
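The boolean-flag idiom described above can be sketched like this. The token plumbing is simulated with an iterator of plain strings (FlagSketch and its String "tokens" are illustrative, not the real NetBeans or ANTLR types); only the control flow is the point:

```java
import java.util.Iterator;
import java.util.List;

public class FlagSketch {
    private final Iterator<String> antlrTokens; // stand-in for the ANTLR lexer
    private boolean returnedTrailingEof;

    FlagSketch(List<String> tokens) { this.antlrTokens = tokens.iterator(); }

    String nextToken() {
        if (returnedTrailingEof) {
            return null; // the call *after* the whitespace-bearing EOF token
        }
        if (antlrTokens.hasNext()) {
            String t = antlrTokens.next();
            if (!antlrTokens.hasNext()) {
                // Last token: in the scenario above this is ANTLR's non-empty
                // EOF token holding trailing whitespace.  Return it as a real
                // token and flag that the *next* call must return null.
                returnedTrailingEof = true;
            }
            return t;
        }
        return null;
    }
}
```

With tokens ["quote", "\n"], nextToken() yields "quote", then "\n", then null - so every character is covered before end-of-input is signalled.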

In a pinch, to debug this stuff (whether or not you're using ANTLR), add
some printlns in your lexer's nextToken() and see what you're really
getting.  In particular, you must have some code that *thinks* it knows it
is at EOF (prematurely) if your lexer is returning null.  If your lexer is
going to return null, then your LexerInput's read() method should return
-1.  If it doesn't, you are leaving some characters unprocessed and will
get the exception and message you posted.  So, when in that state, read the
remaining characters (if any) into a StringBuilder, log them to stdout, see
what they are, and modify your grammar (or whatever does the lexing) so that
they really get processed.
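That drain-and-log step looks something like this in plain Java (names here are illustrative; only the EOF sentinel of -1 matches LexerInput, and the IntSupplier stands in for a bound LexerInput.read()):

```java
import java.util.function.IntSupplier;

public class DrainDebug {
    static final int EOF = -1; // same sentinel LexerInput uses

    /** Collect everything a read()-style source still has, for logging. */
    static String drainRemaining(IntSupplier read) {
        StringBuilder leftover = new StringBuilder();
        int c;
        while ((c = read.getAsInt()) != EOF) {
            leftover.append((char) c);
        }
        return leftover.toString();
    }

    public static void main(String[] args) {
        // Simulate a lexer that stopped with " \n" still unconsumed.
        String unread = " \n";
        int[] pos = {0};
        IntSupplier fakeRead =
                () -> pos[0] < unread.length() ? unread.charAt(pos[0]++) : EOF;
        System.out.println("leftover: \"" + drainRemaining(fakeRead) + "\"");
    }
}
```

If the logged leftover is non-empty when your nextToken() is about to return null, those are exactly the characters your grammar never consumed.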

Then write some tests to drive your lexer with various horribly mangled
input (including 0-length) to ensure it never gets re-broken.

-Tim
