Github user markobean commented on a diff in the pull request: https://github.com/apache/nifi/pull/135#discussion_r50990975 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -143,26 +165,12 @@ protected void init(final ProcessorInitializationContext context) { return properties; } - private int readLines(final InputStream in, final int maxNumLines, final OutputStream out, final boolean keepAllNewLines) throws IOException { - int numLines = 0; - for (int i = 0; i < maxNumLines; i++) { - final long bytes = countBytesToSplitPoint(in, out, keepAllNewLines || (i != maxNumLines - 1)); - if (bytes <= 0) { - return numLines; - } - - numLines++; - } - - return numLines; - } - - private long countBytesToSplitPoint(final InputStream in, final OutputStream out, final boolean includeLineDelimiter) throws IOException { + private int readLine(final InputStream in, final OutputStream out, + final boolean includeLineDelimiter) throws IOException { int lastByte = -1; - long bytesRead = 0L; + int bytesRead = 0; while (true) { - in.mark(1); --- End diff -- This "in.mark(1)" was removed because marking and resetting the input stream has changed. Previously, this mark/reset was used to rollback the reading of the character after a \r. (More below.) Now, mark/reset is used at a higher level to potentially rollback an entire line read from the input flowfile if that line exceeds the size limit imposed by the FRAGMENT_MAX_SIZE property. Removing this in.mark() does not generate an IOException in the below (line 206) in.reset() because an in.mark() was previously made prior to the call to readLine() (line 239 or line 323.) Nonetheless, the overall logic is still incorrect. There are two logical in.reset() conditions: character-based and line-based. This must be corrected and made consistent (line-based). The special consideration of the \r character confuses me ('if' block beginning on Line 205.) In the original code, the byte after the \r is rolled back. Why? If the byte after \r is \n, the special consideration for \r is not required as the 'if' on line 194 captures the end of line. Is it valid and intended to have the \r indicate an end of line even when a subsequent \n is not present? (Note: testing of present SplitText processor shows incorrect behavior for only \r without \n; it duplicates the first character of the next "line".) If special consideration for \r is not required and the 'if' block beginning on line 205 is removed, both the in.mark and in.reset within the readLine() method go away, and I believe all will be correct.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---