[ 
https://issues.apache.org/jira/browse/CAMEL-12698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569784#comment-16569784
 ] 

ASF GitHub Bot commented on CAMEL-12698:
----------------------------------------

MakotoTheKnight commented on a change in pull request #2454: CAMEL-12698: Use 
the Stream API to read files instead of Scanner
URL: https://github.com/apache/camel/pull/2454#discussion_r207784002
 
 

 ##########
 File path: 
components/camel-bindy/src/main/java/org/apache/camel/dataformat/bindy/csv/BindyCsvDataFormat.java
 ##########
 @@ -138,58 +144,74 @@ public Object unmarshal(Exchange exchange, InputStream 
inputStream) throws Excep
         // List of Pojos
         List<Map<String, Object>> models = new ArrayList<>();
 
-        // Pojos of the model
-        Map<String, Object> model;
         InputStreamReader in = null;
-        Scanner scanner = null;
         try {
             if (checkEmptyStream(factory, inputStream)) {
                 return models;
             }
     
             in = new InputStreamReader(inputStream, 
IOHelper.getCharsetName(exchange));
-    
-            // Scanner is used to read big file
-            scanner = new Scanner(in);
-    
+
             // Retrieve the separator defined to split the record
             String separator = factory.getSeparator();
             String quote = factory.getQuote();
             ObjectHelper.notNull(separator, "The separator has not been 
defined in the annotation @CsvRecord or not instantiated during initModel.");
-    
-            int count = 0;
-            
-            // If the first line of the CSV file contains columns name, then we
-            // skip this line
-            if (factory.getSkipFirstLine()) {
-                // Check if scanner is empty
-                if (scanner.hasNextLine()) {
-                    scanner.nextLine();
+            AtomicInteger count = new AtomicInteger(0);
+
+            // Use a Stream to stream a file across.
+            try (Stream<String> lines = new BufferedReader(in).lines()) {
 
 Review comment:
   Good call on the other classes; I'll take a look at those and follow up with 
another commit to address those deficiencies.  A quick skim suggests that they 
too would suffer from the same bug.
   
   As for adding a boolean option to address this - I'm not entirely convinced 
of the value of it.  The assumption the code is making is that an entire line 
from the file is being read, but it leaves the definition of a line to the 
implementation of `Scanner`, when it really should be the convention we all 
have come to expect and demand of lines from a file (carriage return, line 
feed, or both).  By and large, `Scanner` does catch almost every use case there 
is when it comes to lines, with the exception of this very peculiar instance.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Unmarshaling a CSV file with the NEL (next line) character will cause Bindy 
> to misread the entire file
> ------------------------------------------------------------------------------------------------------
>
>                 Key: CAMEL-12698
>                 URL: https://issues.apache.org/jira/browse/CAMEL-12698
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-bindy
>    Affects Versions: 2.22.0
>            Reporter: Jason Black
>            Priority: Major
>
> I am using Apache Camel to process a lot of large CSV files, and relying on 
> Bindy to assist with unmarshalling them into POJOs.
> We have an upstream data bug which causes a record of ours to contain the 
> Unicode character 
> [NEL|http://www.fileformat.info/info/unicode/char/85/index.htm], but while 
> we're working through the cause of that, I found it curious as to what Bindy 
> is actually doing with it.  We rely on the unmarshal process to perform a 
> batch insert, and because our POJO is missing certain fields, we started 
> observing that the 
> Bindy is relying on Scanner to read lines in a large file; however, Scanner 
> itself also does some parsing of the line with the assumption that, if it 
> sees the NEL character, it will regard it as a newline character.  The modern 
> Files API does not make this distinction and reads to a newline designation 
> only (e.g \n, \r, or \r\n).
> There are two ways to fix this from what I've been able to smoke test:
>  * Change the Scanner implementation to use a delimeter of the more 
> traditional newline characters
>  * Use Java 8's Files API and stream the file in
> I would personally want to use the Files API to handle this since it's more 
> robust and capable of higher performance, but I'll explore both approaches 
> and see where I end up.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to