Github user markobean commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/135#discussion_r50990975
  
    --- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
    @@ -143,26 +165,12 @@ protected void init(final 
ProcessorInitializationContext context) {
             return properties;
         }
     
    -    private int readLines(final InputStream in, final int maxNumLines, 
final OutputStream out, final boolean keepAllNewLines) throws IOException {
    -        int numLines = 0;
    -        for (int i = 0; i < maxNumLines; i++) {
    -            final long bytes = countBytesToSplitPoint(in, out, 
keepAllNewLines || (i != maxNumLines - 1));
    -            if (bytes <= 0) {
    -                return numLines;
    -            }
    -
    -            numLines++;
    -        }
    -
    -        return numLines;
    -    }
    -
    -    private long countBytesToSplitPoint(final InputStream in, final 
OutputStream out, final boolean includeLineDelimiter) throws IOException {
    +    private int readLine(final InputStream in, final OutputStream out,
    +                          final boolean includeLineDelimiter) throws 
IOException {
             int lastByte = -1;
    -        long bytesRead = 0L;
    +        int bytesRead = 0;
     
             while (true) {
    -            in.mark(1);
    --- End diff --
    
    This "in.mark(1)" was removed because marking and resetting the input 
stream has changed. Previously, this mark/reset was used to rollback the 
reading of the character after a \r. (More below.) Now, mark/reset is used at a 
higher level to potentially rollback an entire line read from the input 
flowfile if that line exceeds the size limit imposed by the FRAGMENT_MAX_SIZE 
property.
    Removing this in.mark() does not generate an IOException in the below (line 
206) in.reset() because an in.mark() was previously made prior to the call to 
readLine() (line 239 or line 323.) Nonetheless, the overall logic is still 
incorrect. There are two logical in.reset() conditions: character-based and 
line-based. This must be corrected and made consistent (line-based).
    The special consideration of the \r character confuses me ('if' block 
beginning on Line 205.) In the original code, the byte after the \r is rolled 
back. Why? If the byte after \r is \n, the special consideration for \r is not 
required as the 'if' on line 194 captures the end of line. Is it valid and 
intended to have the \r indicate an end of line even when a subsequent \n is 
not present? (Note: testing of present SplitText processor shows incorrect 
behavior for only \r without \n; it duplicates the first character of the next 
"line".)
    If special consideration for \r is not required and the 'if' block 
beginning on line 205 is removed, both the in.mark and in.reset within the 
readLine() method go away, and I believe all will be correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

Reply via email to