JingsongLi opened a new pull request #9884: [FLINK-14266][table] Introduce 
RowCsvInputFormat to new CSV module
URL: https://github.com/apache/flink/pull/9884
 
 
   ## What is the purpose of the change
   
   We currently have an old CSV format, but it is not standard CSV support. We 
should support an RFC-compliant CSV format for Table/SQL.
   
   ## Brief change log
   
   Add RowCsvInputFormat, which uses Jackson's 
ObjectReader.readValues(InputStream). We need to deal with half-line reading 
when large files are divided into multiple splits. The difficulties are:
   1. ObjectReader does not know the current read offset, because it buffers 
ahead and caches more bytes than it has parsed. But we need to stop at the 
right place when reading a FileSplit, so we use BoundedInputStream.
   2. For half-line reading, in open we look for the next delimiter after the 
split start and discard the first half of that line; we also look for the next 
delimiter after the split end so that the last line is read whole.
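   
   The boundary handling described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class `SplitBoundarySketch`, its helper names, and the in-memory byte array standing in for a file are all hypothetical, and `BoundedStream` is a minimal stand-in for the BoundedInputStream mentioned in point 1. It also assumes `\n` is the only line delimiter and ignores newlines inside quoted fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and structure are assumptions, not the PR's code.
final class SplitBoundarySketch {

    /** Offset just past the first '\n' at or after {@code from}, or data.length if there is none. */
    static int afterNextDelimiter(byte[] data, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == '\n') {
                return i + 1;
            }
        }
        return data.length;
    }

    /** Reports end-of-stream after {@code limit} bytes, so a buffering parser cannot read past the split. */
    static final class BoundedStream extends InputStream {
        private final InputStream in;
        private long remaining;

        BoundedStream(InputStream in, long limit) {
            this.in = in;
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1;
            }
            int b = in.read();
            if (b >= 0) {
                remaining--;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            if (remaining <= 0) {
                return -1;
            }
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }
    }

    /** Returns the complete lines that the split [splitStart, splitStart + splitLength) is responsible for. */
    static String readSplit(byte[] file, int splitStart, int splitLength) {
        int splitEnd = Math.min(splitStart + splitLength, file.length);
        // Unless this is the first split, skip the (possibly partial) line at the start;
        // the previous split reads that line to its end.
        int start = splitStart == 0 ? 0 : afterNextDelimiter(file, splitStart);
        // Extend the end to the next delimiter so the last line is complete.
        int end = afterNextDelimiter(file, splitEnd);

        // Stand-in for seeking a file stream to the adjusted start offset.
        InputStream seeked = new ByteArrayInputStream(file, start, file.length - start);
        try (InputStream bounded = new BoundedStream(seeked, end - start)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4]; // tiny buffer to mimic a reader that pulls bytes in chunks
            int n;
            while ((n = bounded.read(buf, 0, buf.length)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

   In this sketch, a 12-byte file `"a,1\nb,2\nc,3\n"` cut into splits of 5 bytes yields the first two lines for the first split, the third line for the second split, and nothing for the last, so each line is read exactly once. In the real format the bounded stream would be handed to the CSV reader instead of being copied into a string.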
   
   ## Verifying this change
   
   RowCsvInputFormatTest and RowCsvInputFormatSplitTest
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes
     - If yes, how is the feature documented? JavaDocs
   
