[GitHub] [flink] JingsongLi opened a new pull request #14192: [FLINK-20295][table][fs-connector] Table File Source lost data when reading from directories with JSON format

GitBox Mon, 23 Nov 2020 22:35:01 -0800


JingsongLi opened a new pull request #14192:
URL: https://github.com/apache/flink/pull/14192



   ## What is the purpose of the change
   
   When testing the compaction functionality of the FileSystemTableSink, I 
found that when using json format, the produced directories could not be read 
correctly by the file source, namely only a part of records are read.
   
   By checking the produced directories, the number of the records in it is the 
same as expected, thus it seems to be the issue of the source side.
   
   The issue only exists for JSON format.
   
   The data is produced by FileCompactionTest and read by  
FileCompactionCheckTest . An example directories tar file of 8000 records are 
also attached.
   
   ## Brief change log
   
   It should be due to that `DeserializationSchemaAdapter` always returned the 
same cached iterator for different batches, thus it might be the data get 
override before it is fully emitted. Returns different iterators should be able 
to solve this issue. 
   
   `HiveBulkFormatAdapter` also has this problem.
   
   ## Verifying this change
   
   *(Please pick either of the following options)*
   
   `JsonBatchFileSystemITCase.bigDataTest`
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] JingsongLi opened a new pull request #14192: [FLINK-20295][table][fs-connector] Table File Source lost data when reading from directories with JSON format

Reply via email to