[jira] [Commented] (CAMEL-12698) Unmarshaling a CSV file with the NEL (next line) character will cause Bindy to misread the entire file

ASF GitHub Bot (JIRA) Sun, 09 Sep 2018 23:52:32 -0700


    [ https://issues.apache.org/jira/browse/CAMEL-12698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608794#comment-16608794 ]


ASF GitHub Bot commented on CAMEL-12698:
----------------------------------------

onderson commented on a change in pull request #2454: CAMEL-12698: Use the Stream API to read files instead of Scanner
URL: https://github.com/apache/camel/pull/2454#discussion_r216213616
 
 

 ##########
 File path: components/camel-bindy/src/main/java/org/apache/camel/dataformat/bindy/kvp/BindyKeyValuePairDataFormat.java
 ##########
 @@ -88,55 +92,73 @@ public void marshal(Exchange exchange, Object body, OutputStream outputStream) t
     }
 
     public Object unmarshal(Exchange exchange, InputStream inputStream) throws Exception {
-        BindyKeyValuePairFactory factory = (BindyKeyValuePairFactory)getFactory();
+        BindyKeyValuePairFactory factory = (BindyKeyValuePairFactory) getFactory();
 
         // List of Pojos
         List<Map<String, Object>> models = new ArrayList<>();
 
-        // Pojos of the model
-        Map<String, Object> model;
-        
         // Map to hold the model @OneToMany classes while binding
         Map<String, List<Object>> lists = new HashMap<>();
 
         InputStreamReader in = new InputStreamReader(inputStream, IOHelper.getCharsetName(exchange));
 
-        // Scanner is used to read big file
-        Scanner scanner = new Scanner(in);
+        // Use a Stream to stream a file across
+        try (Stream<String> lines = new BufferedReader(in).lines()) {
+            // Retrieve the pair separator defined to split the record
+            ObjectHelper.notNull(factory.getPairSeparator(), "The pair separator property of the annotation @Message");
+            String separator = factory.getPairSeparator();
+            AtomicInteger count = new AtomicInteger(0);
+
+            try {
+                lines.forEachOrdered(line -> {
+                    consumeFile(factory, models, lists, separator, count, line);
+                });
+            } catch (WrappedException e) {
+                throw e.getWrappedException();
+            }
+
+            // BigIntegerFormatFactory if models list is empty or not
+            // If this is the case (correspond to an empty stream, ...)
+            if (models.size() == 0) {
+                throw new java.lang.IllegalArgumentException("No records have been defined in the CSV");
+            } else {
+                return extractUnmarshalResult(models);
+            }
 
-        // Retrieve the pair separator defined to split the record
-        ObjectHelper.notNull(factory.getPairSeparator(), "The pair separator property of the annotation @Message");
-        String separator = factory.getPairSeparator();
+        } finally {
+            IOHelper.close(in, "in", LOG);
+        }
+    }
 
-        int count = 0;
+    private void consumeFile(BindyKeyValuePairFactory factory, List<Map<String, Object>> models, Map<String, List<Object>> lists, String separator, AtomicInteger count, String line) {
         try {
-            while (scanner.hasNextLine()) {
-                // Read the line
-                String line = scanner.nextLine().trim();
-
-                if (ObjectHelper.isEmpty(line)) {
-                    // skip if line is empty
-                    continue;
-                }
+            // Trim the line coming in to remove any trailing whitespace
+            String trimmedLine = line.trim();
 
+            if (!ObjectHelper.isEmpty(trimmedLine)) {
                 // Increment counter
-                count++;
+                count.incrementAndGet();
+                // Pojos of the model
+                Map<String, Object> model;
 
                 // Create POJO
                 model = factory.factory();
 
                 // Split the message according to the pair separator defined in
                 // annotated class @Message
-                List<String> result = Arrays.asList(line.split(separator));
+                // Explicitly replace any occurrence of the Unicode new line character.
+                List<String> result = Arrays.stream(line.split(separator))
+                        .map(x -> x.replace("\u0085", ""))
 
 Review comment:
   could you explain this a little bit more?
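
For readers following the thread, here is a minimal sketch of what the flagged mapping step appears to do. The quoted hunk is cut off after the map call, so the collect step, the class name NelStripSketch, the method name splitAndStripNel, and the "," separator in the test are all assumptions made for illustration, not the actual code in the pull request.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; not part of camel-bindy.
class NelStripSketch {

    // Splits a record on the pair separator (treated as a regex, as
    // String.split does) and removes any U+0085 (NEL) characters left
    // inside the individual tokens.
    static List<String> splitAndStripNel(String line, String separator) {
        return Arrays.stream(line.split(separator))
                .map(token -> token.replace("\u0085", ""))
                // Assumed terminal step; the quoted diff ends before it.
                .collect(Collectors.toList());
    }
}
```

With BufferedReader.lines() the NEL character no longer ends a line, so it stays inside the field value; a replace like this would drop it before the key/value pairs are bound.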

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Unmarshaling a CSV file with the NEL (next line) character will cause Bindy 
> to misread the entire file
> ------------------------------------------------------------------------------------------------------
>
>                 Key: CAMEL-12698
>                 URL: https://issues.apache.org/jira/browse/CAMEL-12698
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-bindy
>    Affects Versions: 2.22.0
>            Reporter: Jason Black
>            Priority: Major
>
> I am using Apache Camel to process a lot of large CSV files, and relying on 
> Bindy to assist with unmarshalling them into POJOs.
> We have an upstream data bug which causes a record of ours to contain the 
> Unicode character 
> [NEL|http://www.fileformat.info/info/unicode/char/85/index.htm], but while 
> we're working through the cause of that, I found it curious as to what Bindy 
> is actually doing with it.  We rely on the unmarshal process to perform a 
> batch insert, and because our POJO is missing certain fields, we started 
> observing that the 
> Bindy relies on Scanner to read lines in a large file; however, Scanner 
> also treats the NEL character as a line separator when it reads a line.  The 
> modern Files API does not do this and splits only on a conventional line 
> terminator (e.g. \n, \r, or \r\n).
> There are two ways to fix this from what I've been able to smoke test:
>  * Change the Scanner implementation to use a delimiter of the more 
> traditional newline characters
>  * Use Java 8's Files API and stream the file in
> I would personally want to use the Files API to handle this since it's more 
> robust and capable of higher performance, but I'll explore both approaches 
> and see where I end up.
>  
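
The behaviour described in the issue, and the two proposed fixes, can be reproduced with a small sketch. The class name NelLineReading, its method names, and the sample record are hypothetical and chosen only for illustration; the Scanner, BufferedReader, and Stream calls are standard JDK APIs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of camel-bindy.
class NelLineReading {

    // Scanner.nextLine() uses Scanner's built-in line separators,
    // which include U+0085 (NEL) in addition to CR, LF and CRLF.
    static List<String> withScanner(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    // BufferedReader.lines() ends a line only at LF, CR or CRLF,
    // so a NEL character stays inside the record.
    static List<String> withBufferedReader(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // First option from the issue: keep Scanner, but give it an explicit
    // delimiter made of the traditional newline sequences only.
    static List<String> withScannerDelimiter(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text).useDelimiter("\r\n|[\r\n]")) {
            while (scanner.hasNext()) {
                lines.add(scanner.next());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Two records; the first contains a stray NEL inside a value.
        String text = "1=a,2=b\u0085c\n1=d,2=e\n";
        System.out.println("Scanner.nextLine:      " + withScanner(text).size());
        System.out.println("BufferedReader.lines:  " + withBufferedReader(text).size());
        System.out.println("Scanner + delimiter:   " + withScannerDelimiter(text).size());
    }
}
```

For the sample input, Scanner.nextLine() yields three lines because the NEL splits the first record, while the other two approaches yield the expected two records. Note that the explicit-delimiter variant can return empty tokens for blank lines, which a caller would still need to skip.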



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
