JingsongLi opened a new pull request #14192:
URL: https://github.com/apache/flink/pull/14192
## What is the purpose of the change
When testing the compaction functionality of the FileSystemTableSink, I
found that when using json format, the produced directories could not be read
correctly by the file source, namely only a part of records are read.
By checking the produced directories, the number of the records in it is the
same as expected, thus it seems to be the issue of the source side.
The issue only exists for JSON format.
The data is produced by FileCompactionTest and read by
FileCompactionCheckTest . An example directories tar file of 8000 records are
also attached.
## Brief change log
It should be due to that `DeserializationSchemaAdapter` always returned the
same cached iterator for different batches, thus it might be the data get
override before it is fully emitted. Returns different iterators should be able
to solve this issue.
`HiveBulkFormatAdapter` also has this problem.
## Verifying this change
*(Please pick either of the following options)*
`JsonBatchFileSystemITCase.bigDataTest`
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]