Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2024-02-01 Thread via GitHub


mattyb149 commented on PR #7952:
URL: https://github.com/apache/nifi/pull/7952#issuecomment-1922577711

   That's a good point. If the CSV file has gone through a record-based 
processor, you can also use RouteText to skip the last N lines using the 
`${record.count}` attribute. Closing this and the Jira.
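
The idea of dropping trailing lines once a total count is known can be sketched in plain Java. This is not RouteText configuration: `SkipLastLines`, `dropLast`, `totalLines`, and `skipLast` are hypothetical names, and the sketch assumes the count passed in equals the number of lines (a `record.count` value may need adjusting, for example for a header line).

```java
import java.util.ArrayList;
import java.util.List;

final class SkipLastLines {
    /**
     * Keeps every line whose 1-based line number is at most (totalLines - skipLast).
     * totalLines stands in for a count that is known before the lines are routed.
     */
    static List<String> dropLast(List<String> lines, int totalLines, int skipLast) {
        List<String> kept = new ArrayList<>();
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            if (lineNo <= totalLines - skipLast) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("id,name", "1,a", "2,b", "-- end of export --");
        // Drops the single trailing non-CSV line.
        System.out.println(dropLast(lines, lines.size(), 1));
    }
}
```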


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2024-02-01 Thread via GitHub


mattyb149 closed pull request #7952: NIFI-8932: Add capability to skip first N 
rows in CSVReader
URL: https://github.com/apache/nifi/pull/7952





Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2024-01-31 Thread via GitHub


mattyb149 commented on code in PR #7952:
URL: https://github.com/apache/nifi/pull/7952#discussion_r1473639064


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/AbstractCSVRecordReader.java:
##
@@ -158,4 +180,46 @@ protected String trim(String value) {
     public RecordSchema getSchema() {
         return schema;
     }
+
+    /**
+     * This method searches using the specified Reader character-by-character until the
+     * record separator is found.
+     * @param reader the Reader providing the input
+     * @param recordSeparator the String specifying the end of a record in the input
+     * @throws IOException if an error occurs during reading, including not finding the record separator in the input
+     */
+    protected void readNextRecord(Reader reader, String recordSeparator) throws IOException {
+        int indexIntoSeparator = 0;
+        int recordSeparatorLength = recordSeparator.length();
+        int code = reader.read();
+        while (code != -1) {
+            char nextChar = (char)code;
+            if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
+                if (++indexIntoSeparator == recordSeparatorLength) {
+                    // We have matched the separator, return the string built so far
+                    return;
+                }

Review Comment:
   Actually, come to think of it, this capability is for skipping lines that 
AREN'T valid CSV; otherwise you can use SampleRecord. Because we aren't 
assuming the skipped rows are valid CSV, we should be able to just match on the 
newline character. I'll put in some code anyway, but what are your thoughts?
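
A minimal sketch of the newline-only approach described here, assuming the skipped lines are arbitrary text and that `\n` ends each line. `SkipLeadingLines` and `skipLines` are hypothetical names, not code from the PR.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class SkipLeadingLines {
    /**
     * Consumes characters up to and including the n-th '\n'.
     * No CSV rules (quotes, escapes) are applied to the skipped text.
     * Returns the number of complete lines actually skipped.
     */
    static int skipLines(Reader reader, int n) throws IOException {
        int skipped = 0;
        while (skipped < n) {
            int code = reader.read();
            if (code == -1) {
                break; // input ended before n lines were seen
            }
            if (code == '\n') {
                skipped++;
            }
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        // The first two lines are not valid CSV (note the unbalanced quote).
        Reader reader = new StringReader("report 2024\n\"unbalanced quote\nid,name\n1,a\n");
        skipLines(reader, 2);
        BufferedReader rest = new BufferedReader(reader);
        System.out.println(rest.readLine()); // first line handed to the CSV parser
    }
}
```

Because the same `Reader` is reused afterwards, the CSV parser would start exactly at the first character after the last skipped newline.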






Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2024-01-05 Thread via GitHub


mattyb149 commented on code in PR #7952:
URL: https://github.com/apache/nifi/pull/7952#discussion_r1443377852


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/AbstractCSVRecordReader.java:
##
@@ -158,4 +180,46 @@ protected String trim(String value) {
     public RecordSchema getSchema() {
         return schema;
     }
+
+    /**
+     * This method searches using the specified Reader character-by-character until the
+     * record separator is found.
+     * @param reader the Reader providing the input
+     * @param recordSeparator the String specifying the end of a record in the input
+     * @throws IOException if an error occurs during reading, including not finding the record separator in the input
+     */
+    protected void readNextRecord(Reader reader, String recordSeparator) throws IOException {
+        int indexIntoSeparator = 0;
+        int recordSeparatorLength = recordSeparator.length();
+        int code = reader.read();
+        while (code != -1) {
+            char nextChar = (char)code;
+            if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
+                if (++indexIntoSeparator == recordSeparatorLength) {
+                    // We have matched the separator, return the string built so far
+                    return;
+                }

Review Comment:
   I'll look at capturing escape characters, maybe some sort of flag indicating 
that we're in an escape sequence. Good catch!
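
A minimal sketch of the flag idea, assuming a single escape character that makes the following character literal. `EscapeAwareSkipper`, `skipRecord`, and the `escapeChar` parameter are hypothetical names, not part of the PR, and the sketch assumes the escape character does not occur inside the separator itself.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class EscapeAwareSkipper {
    /**
     * Reads until an unescaped record separator has been consumed.
     * Returns true if one was found, false if the input ended first.
     */
    static boolean skipRecord(Reader reader, String recordSeparator, char escapeChar) throws IOException {
        if (recordSeparator.isEmpty()) {
            throw new IllegalArgumentException("recordSeparator must not be empty");
        }
        int indexIntoSeparator = 0;
        boolean inEscape = false;
        int code;
        while ((code = reader.read()) != -1) {
            char nextChar = (char) code;
            if (inEscape) {
                // This character is escaped, so it can never start or continue a separator.
                inEscape = false;
                indexIntoSeparator = 0;
                continue;
            }
            if (nextChar == escapeChar) {
                inEscape = true;
                indexIntoSeparator = 0;
                continue;
            }
            if (recordSeparator.charAt(indexIntoSeparator) != nextChar) {
                indexIntoSeparator = 0; // mismatch: restart and test against the first separator char
            }
            if (recordSeparator.charAt(indexIntoSeparator) == nextChar
                    && ++indexIntoSeparator == recordSeparator.length()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // The first newline is escaped with a backslash; the second one ends the record.
        Reader reader = new StringReader("a\\\nb\nc");
        System.out.println(skipRecord(reader, "\n", '\\'));
        System.out.println((char) reader.read());
    }
}
```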






Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2023-12-05 Thread via GitHub


dan-s1 commented on code in PR #7952:
URL: https://github.com/apache/nifi/pull/7952#discussion_r1416200289


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/AbstractCSVRecordReader.java:
##
@@ -158,4 +180,46 @@ protected String trim(String value) {
     public RecordSchema getSchema() {
         return schema;
     }
+
+    /**
+     * This method searches using the specified Reader character-by-character until the
+     * record separator is found.
+     * @param reader the Reader providing the input
+     * @param recordSeparator the String specifying the end of a record in the input
+     * @throws IOException if an error occurs during reading, including not finding the record separator in the input
+     */
+    protected void readNextRecord(Reader reader, String recordSeparator) throws IOException {
+        int indexIntoSeparator = 0;
+        int recordSeparatorLength = recordSeparator.length();
+        int code = reader.read();
+        while (code != -1) {
+            char nextChar = (char)code;
+            if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
+                if (++indexIntoSeparator == recordSeparatorLength) {
+                    // We have matched the separator, return the string built so far
+                    return;
+                }

Review Comment:
   @mattyb149 @exceptionfactory A concern I have with this logic is what 
happens when the record separator is escaped in the data. How will you 
distinguish an end of record from an escaped record separator? I took a 
quick look online and found that a newline character, which is usually the end 
of a CSV record, can be embedded in the data.
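
To illustrate the concern, here is a small sketch assuming RFC 4180-style quoting, where a newline inside a double-quoted field is data rather than a record boundary. `QuoteAwareRecordCounter` and its methods are hypothetical names used only for this illustration.

```java
final class QuoteAwareRecordCounter {
    /** Counts every '\n' as a record boundary, ignoring quoting. */
    static int countNaive(String csv) {
        int records = 0;
        for (int i = 0; i < csv.length(); i++) {
            if (csv.charAt(i) == '\n') {
                records++;
            }
        }
        return records;
    }

    /**
     * Counts a '\n' as a record boundary only when it is outside double quotes.
     * A doubled quote ("") toggles the flag twice, so the state is unchanged.
     */
    static int countQuoteAware(String csv) {
        int records = 0;
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == '\n' && !inQuotes) {
                records++;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,note\n1,\"line one\nline two\"\n2,plain\n";
        System.out.println(countNaive(csv));      // counts the embedded newline too
        System.out.println(countQuoteAware(csv)); // counts only real record ends
    }
}
```

The two counts differ on this input, which is the ambiguity a plain separator match cannot resolve on its own.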






Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2023-12-04 Thread via GitHub


dan-s1 commented on code in PR #7952:
URL: https://github.com/apache/nifi/pull/7952#discussion_r1414512781


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/AbstractCSVRecordReader.java:
##
@@ -158,4 +180,46 @@ protected String trim(String value) {
     public RecordSchema getSchema() {
         return schema;
     }
+
+    /**
+     * This method searches using the specified Reader character-by-character until the
+     * record separator is found.
+     * @param reader the Reader providing the input
+     * @param recordSeparator the String specifying the end of a record in the input
+     * @throws IOException if an error occurs during reading, including not finding the record separator in the input
+     */
+    protected void readNextRecord(Reader reader, String recordSeparator) throws IOException {
+        int indexIntoSeparator = 0;
+        int recordSeparatorLength = recordSeparator.length();
+        int code = reader.read();
+        while (code != -1) {
+            char nextChar = (char)code;
+            if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
+                if (++indexIntoSeparator == recordSeparatorLength) {
+                    // We have matched the separator, return the string built so far
+                    return;
+                }
+            } else {
+                // The character didn't match the expected one in the record separator, reset the separator matcher
+                // and check if it is the first character of the separator.
+                indexIntoSeparator = 0;
+                if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
+                    // This character is the beginning of the record separator, keep it
+                    if (++indexIntoSeparator == recordSeparatorLength) {
+                        // We have matched the separator, return the string built so far
+                        return;
+                    }
+                }
+            }

Review Comment:
   Instead of all the comments regarding reaching the end of a record, make a 
method which conveys that intention.
   
   
   ```
   char nextChar = (char)code;
   if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
       ++indexIntoSeparator;
       if (hasReachedEndOfRecord(indexIntoSeparator, recordSeparatorLength)) {
           return;
       }
   } else {
       // The character didn't match the expected one in the record separator, reset the separator matcher
       // and check if it is the first character of the separator.
       indexIntoSeparator = 0;
       if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
           ++indexIntoSeparator;
           if (hasReachedEndOfRecord(indexIntoSeparator, recordSeparatorLength)) {
               return;
           }
       }
   }
   
   private boolean hasReachedEndOfRecord(int indexIntoSeparator, int recordSeparatorLength) {
       return indexIntoSeparator == recordSeparatorLength;
   }
   ```
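
For reference, the suggestion can be turned into a self-contained, runnable sketch. This is an illustration rather than the PR's final code: `RecordSeparatorMatcher` is a hypothetical class name, the two branches are folded into one, and the method returns a boolean where the PR's Javadoc describes throwing an `IOException` when the separator is missing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class RecordSeparatorMatcher {
    /** Consumes input up to and including the first record separator; false if none was found. */
    static boolean readNextRecord(Reader reader, String recordSeparator) throws IOException {
        int indexIntoSeparator = 0;
        int recordSeparatorLength = recordSeparator.length();
        int code = reader.read();
        while (code != -1) {
            char nextChar = (char) code;
            if (recordSeparator.charAt(indexIntoSeparator) != nextChar) {
                // Mismatch: reset the matcher, then test this character against the first separator char.
                indexIntoSeparator = 0;
            }
            if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
                ++indexIntoSeparator;
                if (hasReachedEndOfRecord(indexIntoSeparator, recordSeparatorLength)) {
                    return true;
                }
            }
            code = reader.read();
        }
        return false;
    }

    private static boolean hasReachedEndOfRecord(int indexIntoSeparator, int recordSeparatorLength) {
        return indexIntoSeparator == recordSeparatorLength;
    }

    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("a,b\r\nc,d");
        System.out.println(readNextRecord(reader, "\r\n"));
        System.out.println((char) reader.read()); // first character of the next record
    }
}
```

One limitation of reset-to-zero matching, shared by this sketch: a separator whose prefix repeats (for example `aab` against the input `aaab`) can be missed, because the reset discards characters that could still begin a match. Separators such as `\n` or `\r\n` are not affected.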
   






Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2023-12-04 Thread via GitHub


dan-s1 commented on code in PR #7952:
URL: https://github.com/apache/nifi/pull/7952#discussion_r1414467199


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/AbstractCSVRecordReader.java:
##
@@ -158,4 +180,46 @@ protected String trim(String value) {
     public RecordSchema getSchema() {
         return schema;
     }
+
+    /**
+     * This method searches using the specified Reader character-by-character until the
+     * record separator is found.
+     * @param reader the Reader providing the input
+     * @param recordSeparator the String specifying the end of a record in the input
+     * @throws IOException if an error occurs during reading, including not finding the record separator in the input
+     */
+    protected void readNextRecord(Reader reader, String recordSeparator) throws IOException {
+        int indexIntoSeparator = 0;
+        int recordSeparatorLength = recordSeparator.length();
+        int code = reader.read();
+        while (code != -1) {
+            char nextChar = (char)code;
+            if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
+                if (++indexIntoSeparator == recordSeparatorLength) {
+                    // We have matched the separator, return the string built so far

Review Comment:
   ```suggestion
                       // Short circuit as the matched separator indicates a record has been read
   ```
   
   Ditto for line 209






Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2023-12-04 Thread via GitHub


dan-s1 commented on code in PR #7952:
URL: https://github.com/apache/nifi/pull/7952#discussion_r1414338830


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/AbstractCSVRecordReader.java:
##
@@ -74,6 +88,14 @@ abstract public class AbstractCSVRecordReader implements RecordReader {
             this.timestampFormat = timestampFormat;
             LAZY_TIMESTAMP_FORMAT = () -> DataTypeUtils.getDateFormat(timestampFormat);
         }
+
+        final InputStream bomInputStream = BOMInputStream.builder().setInputStream(in).get();
+        inputStreamReader = new InputStreamReader(bomInputStream, encoding);

Review Comment:
   ```suggestion
           this.inputStreamReader = new InputStreamReader(bomInputStream, encoding);
   ```






Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2023-11-20 Thread via GitHub


exceptionfactory commented on code in PR #7952:
URL: https://github.com/apache/nifi/pull/7952#discussion_r139938


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/CSVRecordReader.java:
##
@@ -155,4 +159,59 @@ private List getRecordFields() {
     public void close() throws IOException {
         csvParser.close();
     }
+
+    /**
+     * This method builds a text representation of the CSV record by searching character-by-character until the
+     * record separator is found. Because we never want to consume input we don't use, the method attempts to match
+     * the separator separately, and as it is not matched, the characters are added to the returned string.
+     * @param reader the Reader providing the input
+     * @param recordSeparator the String specifying the end of a record in the input
+     * @return a String created from the input until the record separator is reached.
+     * @throws IOException if an error occurs during reading
+     */
+    protected String readNextRecord(Reader reader, String recordSeparator) throws IOException {
+        int indexIntoSeparator = 0;
+        int recordSeparatorLength = recordSeparator.length();
+        StringBuilder lineBuilder = new StringBuilder();
+        StringBuilder separatorBuilder = new StringBuilder();
+        int code = reader.read();
+        while (code != -1) {
+            char nextChar = (char)code;
+            if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
+                separatorBuilder.append(nextChar);
+                if (++indexIntoSeparator == recordSeparatorLength) {
+                    // We have matched the separator, return the string built so far
+                    lineBuilder.append(separatorBuilder);
+                    return lineBuilder.toString();
+                }
+            } else {
+                // The character didn't match the expected one in the record separator, reset the separator matcher
+                // and check if it is the first character of the separator.
+                indexIntoSeparator = 0;
+                if (recordSeparator.charAt(indexIntoSeparator) == nextChar) {
+                    // This character is the beginning of the record separator, keep it
+                    separatorBuilder = new StringBuilder();
+                    separatorBuilder.append(nextChar);
+                    if (++indexIntoSeparator == recordSeparatorLength) {
+                        // We have matched the separator, return the string built so far
+                        return lineBuilder.toString();
+                    }
+                } else {
+                    // This character is not the beginning of the record separator, add it to the return string
+                    lineBuilder.append(nextChar);
+                }
+            }
+            // This defensive check limits a record size to 2GB, this prevents out-of-memory errors if the record separator
+            // is not present in the input (or at least in the first 2GB)
+            if (indexIntoSeparator == Integer.MAX_VALUE) {
+                throw new IOException("2GB input threshold reached, the record is either larger than 2GB or the separator "
+                        + "is not found in the first 2GB of input. Ensure the Record Separator is correct for this FlowFile.");
+            }
+            code = reader.read();
+        }
+
+        // The end of input has been reached without the record separator being found, throw an exception with the string so far

Review Comment:
   This comment does not appear to match the implementation.



##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/CSVRecordReader.java:
##
@@ -155,4 +159,59 @@ private List getRecordFields() {
     public void close() throws IOException {
         csvParser.close();
     }
+
+    /**
+     * This method builds a text representation of the CSV record by searching character-by-character until the
+     * record separator is found. Because we never want to consume input we don't use, the method attempts to match
+     * the separator separately, and as it is not matched, the characters are added to the returned string.
+     * @param reader the Reader providing the input
+     * @param recordSeparator the String specifying the end of a record in the input
+     * @return a String created from the input until the record separator is reached.
+     * @throws IOException if an error occurs during reading
+     */
+    protected String readNextRecord(Reader reader, String recordSeparator) throws IOException {
+        int indexIntoSeparator = 0;
+        int recordSeparatorLength = recordSeparator.length();
+        StringBuilder lineBuilder = new StringBuilder();
+

Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2023-11-17 Thread via GitHub


exceptionfactory commented on PR #7952:
URL: https://github.com/apache/nifi/pull/7952#issuecomment-1816807256

   > Reopening this as I'm actively working it. I realized I hadn't passed in 
the `Character Set` property value into the InputStreamReader so I'm trying to 
fix the code/tests using that first. If not (or if you still object to the 
Reader at all) I can try the PushbackInputStream. The only caveat there is that 
the record separator can be an arbitrary string so I need to create the 
pushback buffer of that size ("n") and push back only "n-1" bytes. I figured 
the reader would do something similar and it makes the code easier to read, so 
if it works with the Reader and the correct encoding I'd like to go with that.
   
   That's a good point about the arbitrary string separator. The Reader will do 
character decoding based on the configured Character Set, so the best approach 
probably depends on how to evaluate the string separator.
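
To illustrate why the character set matters when looking for a string separator, here is a small sketch, not taken from the PR. It shows that the same separator occupies a different number of bytes depending on the encoding, and that an `InputStreamReader` only yields the expected characters when it is given the matching `Charset`. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SeparatorEncodingSketch {

    /** Number of bytes the separator occupies in the given encoding. */
    static int separatorByteLength(String separator, Charset charset) {
        return separator.getBytes(charset).length;
    }

    /** Decodes the bytes through an InputStreamReader using the given charset. */
    static String decode(byte[] bytes, Charset charset) throws IOException {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        String csv = "skip me\r\nid,name\r\n";
        byte[] utf16 = csv.getBytes(StandardCharsets.UTF_16LE);

        // "\r\n" is 2 bytes in UTF-8 but 4 bytes in UTF-16LE.
        System.out.println(separatorByteLength("\r\n", StandardCharsets.UTF_8));
        System.out.println(separatorByteLength("\r\n", StandardCharsets.UTF_16LE));

        // Only the matching charset reproduces the original text, separators included.
        System.out.println(csv.equals(decode(utf16, StandardCharsets.UTF_16LE)));
        System.out.println(csv.equals(decode(utf16, StandardCharsets.UTF_8)));
    }
}
```

A byte-level search therefore has to compare against the separator as encoded in the configured character set, while a Reader constructed with that character set can compare against the separator string directly.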





Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2023-11-17 Thread via GitHub


mattyb149 commented on PR #7952:
URL: https://github.com/apache/nifi/pull/7952#issuecomment-1816797181

   Reopening this as I'm actively working on it. I realized I hadn't passed the 
`Character Set` property value into the InputStreamReader, so I'm trying to fix 
the code/tests using that first. If that doesn't work (or if you still object to 
the Reader at all) I can try the PushbackInputStream. The only caveat there is 
that the record separator can be an arbitrary string, so I need to create the 
pushback buffer of that size ("n") and push back only "n-1" bytes. I figured the 
reader would do something similar and it makes the code easier to read, so if it 
works with the Reader and the correct encoding I'd like to go with that.





Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2023-11-15 Thread via GitHub


exceptionfactory closed pull request #7952: NIFI-8932: Add capability to skip 
first N rows in CSVReader
URL: https://github.com/apache/nifi/pull/7952





Re: [PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2023-11-15 Thread via GitHub


exceptionfactory commented on PR #7952:
URL: https://github.com/apache/nifi/pull/7952#issuecomment-1813707051

   @mattyb149 Just to streamline the reviews, I am closing this pull request 
for now given the test failures across the board, and the implementation 
concerns. Feel free to reopen whenever you are ready. Thanks!





[PR] NIFI-8932: Add capability to skip first N rows in CSVReader [nifi]

2023-10-29 Thread via GitHub


mattyb149 opened a new pull request, #7952:
URL: https://github.com/apache/nifi/pull/7952

   # Summary
   
   [NIFI-8932](https://issues.apache.org/jira/browse/NIFI-8932) This PR adds to 
CSVReader the capability to skip the first N rows of an incoming file, for cases 
where headers or other invalid records appear at the top of the FlowFile.
   
   # Tracking
   
   Please complete the following tracking steps prior to pull request creation.
   
   ### Issue Tracking
   
   - [x] [Apache NiFi Jira](https://issues.apache.org/jira/browse/NIFI) issue 
created
   
   ### Pull Request Tracking
   
   - [x] Pull Request title starts with Apache NiFi Jira issue number, such as 
`NIFI-0`
   - [x] Pull Request commit message starts with Apache NiFi Jira issue number, 
such as `NIFI-0`
   
   ### Pull Request Formatting
   
   - [x] Pull Request based on current revision of the `main` branch
   - [x] Pull Request refers to a feature branch with one commit containing 
changes
   
   # Verification
   
   Please indicate the verification steps performed prior to pull request 
creation.
   
   ### Build
   
   - [x] Build completed using `mvn clean install -P contrib-check`
     - [x] JDK 21
   
   ### Licensing
   
   - [ ] New dependencies are compatible with the [Apache License 
2.0](https://apache.org/licenses/LICENSE-2.0) according to the [License 
Policy](https://www.apache.org/legal/resolved.html)
   - [ ] New dependencies are documented in applicable `LICENSE` and `NOTICE` 
files
   
   ### Documentation
   
   - [x] Documentation formatting appears as expected in rendered files
   

