zhangjun0x01 commented on a change in pull request #1936:
URL: https://github.com/apache/iceberg/pull/1936#discussion_r561461670
##########
File path:
flink/src/test/java/org/apache/iceberg/flink/TestFlinkTableSource.java
##########
@@ -685,4 +782,60 @@ public void testSqlParseError() {
AssertHelpers.assertThrows("The NaN is not supported by flink now. ",
NumberFormatException.class, () -> sql(sqlParseErrorLTE));
}
+
+ /**
+ * The sql can be executed in both streaming and batch mode, in order to get
the parallelism, we convert the flink
+ * Table to flink DataStream, so we only use streaming mode here.
+ *
+ * @throws TableNotExistException table not exist exception
+ */
+ @Test
+ public void testInferedParallelism() throws TableNotExistException {
+ Assume.assumeTrue("The execute mode should be streaming mode",
isStreamingJob);
Review comment:
I think these are two different concepts.
In Flink, batch data can be read in either batch mode or streaming mode: Flink treats a batch job as a bounded streaming job, so there should be no problem reading batch data in either mode.
In addition, Flink is unifying batch and stream processing on `StreamExecutionEnvironment` (`DataStream`) ([the doc link](https://flink.apache.org/news/2020/12/10/release-1.12.0.html#batch-execution-mode-in-the-datastream-api)). Since the dedicated batch mode may eventually be phased out, I think we should also use `StreamExecutionEnvironment` (`DataStream`) for batch tasks as much as possible.
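As a sketch of what the Flink 1.12+ API allows (not code from this PR, class name is made up for illustration): a bounded `DataStream` program can be executed with batch-style scheduling simply by setting the runtime mode on the streaming environment, instead of using a separate batch environment.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedStreamSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Since Flink 1.12, BATCH mode applies batch scheduling and blocking
    // shuffles to bounded DataStream programs on the same environment.
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);
    env.fromElements(1, 2, 3).print();
    env.execute();
  }
}
```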
When we use `StreamExecutionEnvironment`, both the `if` and the `else` branches in the `FlinkSource.Builder#build` method produce streaming jobs: the `if` branch builds a bounded streaming job (batch), while the `else` branch builds a long-running streaming job. Maybe we can rename the `ScanContext#isStreaming` field to `ScanContext#isStreamingRead`, which would be easier to understand.
```
if (!context.isStreaming()) {
  int parallelism = inferParallelism(format, context);
  return env.createInput(format, typeInfo).setParallelism(parallelism);
} else {
  StreamingMonitorFunction function = new StreamingMonitorFunction(tableLoader, context);
  String monitorFunctionName = String.format("Iceberg table (%s) monitor", table);
  String readerOperatorName = String.format("Iceberg table (%s) reader", table);
  return env.addSource(function, monitorFunctionName)
      .transform(readerOperatorName, typeInfo, StreamingReaderOperator.factory(format));
}
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]