On Wed, Dec 26, 2018 at 11:35 PM venkatram.akkin...@gmail.com <venkatram.akkin...@gmail.com> wrote:
> I have the following string which I am trying to read through
> LexerInput.read():
>
> quote
> టోకెన్
> quote
>
> ట ో క ె న ్
>
> 5 (quote) + 6 (టోకెన్) + 5 (quote) + 2 newline characters, that is a total
> of 18 characters. LexerInput.read() returns all the characters as
> expected. But it keeps going and returns null characters for each call to
> LexerInput.read().

LexerInput.read() returns a primitive int. It cannot return null.

> I tried to stop at the first null character, back up one character and
> return a null. But this is the error I see.

The editor's lexer plumbing insists that tokens be returned for every character in a file; if there is a mismatch, it assumes that is a bug in the lexer. The sum of the lengths of all tokens returned while lexing must match the number of characters actually in the input. It looks like your lexer is trying to bail out without consuming all the characters in the file. If you back up one character, that guarantees the editor infrastructure does not think it is at the end of the file when you return null, and you will get an exception.

> I guess it makes sense since LexerInputOperation doesn't allow resetting
> the offsets.
>
> returned null token but lexerInput.readLength()=1
> lexer-state: DEFAULT_STATE
> tokenStartOffset=18, readOffset=19, lookaheadOffset=19
> Chars: "

*Your lexer* is returning a null token - signalling EOF - before the actual end of the file/document/input.

If you are using ANTLR, does your grammar read the entire file? You need a rule that includes EOF explicitly, or it is easy to have a grammar which looks like it works most of the time, but for some files hands you an EOF token without giving you tokens for the entire file. ANTLR does what you tell it to: if you didn't tell it that the content to parse ends only when the end of the file is encountered, then once it has satisfied the rules you gave it, it is "done" as far as it is concerned.
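For example (this is a hypothetical grammar, not yours - the rule names are made up for illustration), the fix is usually to anchor the top-level parser rule on the EOF token so ANTLR is forced to consume the whole input:

```
// Hypothetical ANTLR 4 grammar fragment. Without the trailing EOF,
// ANTLR considers itself "done" as soon as statement* is satisfied,
// even if unconsumed characters remain in the file.
compilationUnit
    : statement* EOF
    ;
```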
Frequently this manifests as getting an EOF token from ANTLR which is not zero-length and represents trailing whitespace - in which case, you need to return a token for that from the call to nextToken(), and set a boolean (or however you want to do it) to return null on the *next* call. Or, make sure your grammar will always, always return non-EOF tokens for every byte of input it is given.

In a pinch, to debug this stuff (whether or not you're using ANTLR), add some printlns in your lexer's nextToken() and see what you're really getting. In particular, you must have some code that *thinks* it knows it is at EOF (prematurely) if your lexer is returning null. If your lexer is going to return null, then your LexerInput's read() method should return -1. If it doesn't, you are leaving some characters unprocessed and will get the exception and message you posted. So, when in that state, read the remaining characters (if any) into a StringBuilder, log them to stdout, see what they are, and modify your grammar or whatever does the lexing to ensure they really get processed.

Then write some tests that drive your lexer with various horribly mangled input (including zero-length input) to ensure it never gets re-broken.

-Tim
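P.S. Here is a minimal sketch of the "return a token for the trailing run, then null on the *next* call" pattern described above. The class and method names are invented for illustration; a real NetBeans lexer would implement org.netbeans.spi.lexer.Lexer and go through LexerInput/TokenFactory rather than a String, but the bookkeeping is the same: every character must be covered by some token before you return null.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleLexer {
    private final String input;
    private int pos = 0;
    private boolean eofReported = false; // the boolean flag from the text above

    public SimpleLexer(String input) {
        this.input = input;
    }

    /**
     * Returns the next token's text, or null exactly once, only after every
     * character of the input has been covered by some token.
     * (A real lexer would return a token object with an id, not a String.)
     */
    public String nextToken() {
        if (eofReported || pos >= input.length()) {
            eofReported = true;
            return null; // safe: nothing left unconsumed
        }
        int start = pos;
        boolean ws = Character.isWhitespace(input.charAt(pos));
        // Consume a maximal run of whitespace or non-whitespace characters.
        while (pos < input.length()
                && Character.isWhitespace(input.charAt(pos)) == ws) {
            pos++;
        }
        if (pos >= input.length()) {
            // This token (possibly trailing whitespace) reaches the end of
            // the input: return it NOW and remember to return null next time.
            eofReported = true;
        }
        return input.substring(start, pos);
    }

    public static void main(String[] args) {
        String text = "quote hello  "; // note the trailing whitespace
        SimpleLexer lexer = new SimpleLexer(text);
        List<String> tokens = new ArrayList<>();
        int total = 0;
        String t;
        while ((t = lexer.nextToken()) != null) {
            tokens.add(t);
            total += t.length();
        }
        // The token lengths must sum to the input length, or the editor
        // infrastructure throws the exception quoted in the original post.
        System.out.println(tokens);
        System.out.println(total == text.length());
    }
}
```

The trailing "  " run comes back as its own token, and only the call after that returns null, so the sum of token lengths equals the input length.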