pvary commented on code in PR #9179:
URL: https://github.com/apache/iceberg/pull/9179#discussion_r1413777549
##########
docs/flink-queries.md:
##########
@@ -277,6 +277,66 @@ DataStream<Row> stream = env.fromSource(source,
     WatermarkStrategy.noWatermarks(),
     "Iceberg Source as Avro GenericRecord", new GenericRecordAvroTypeInfo(avroSchema));
 ```
+### Emitting watermarks
+Emitting watermarks from the source itself can be beneficial for several purposes, like harnessing
+[Flink Watermark Alignment](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment),
+or preventing [windows](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/windows/)
+from triggering too early when reading multiple data files concurrently.
+
+Enable watermark generation for an `IcebergSource` by setting the `watermarkColumn`.
+The supported column types are `timestamp`, `timestamptz` and `long`.
+Iceberg `timestamp` and `timestamptz` columns inherently carry their time precision, so there is no
+need to specify a time unit for them. A `long` column, however, carries no time unit information; use
+`watermarkTimeUnit` to configure the conversion for `long` columns.
+
+The watermarks are generated based on column metrics stored for data files and emitted once per split.
+The main purpose of watermark alignment is to reduce out-of-orderliness and excess data buffering in
+the Flink state. If multiple smaller files with different time ranges are combined into a single split,
+the out-of-orderliness and the data buffered in the Flink state increase. Hence it is recommended to
+set `read.split.open-file-cost` to a very large value to prevent combining multiple smaller files into
+a single split. Keep in mind the additional memory and CPU load caused by having multiple splits in
+this case.
+
+By default, column metrics are collected for the first 100 columns of the table.
+Use the [write properties](configuration.md#write-properties) starting with `write.metadata.metrics` to adjust this when needed.
+
+```java
+StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
+TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
+
+// Ordered data file reads with windowing, using a timestamp column
Review Comment:
Changed the examples part... could you please check?
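
For reviewers following along in the email view: the example in the diff is truncated at the comment anchor. A rough sketch of how the watermark settings described in this section might fit together — the `watermarkColumn` and `watermarkTimeUnit` builder methods are named in the doc text above, while the column name `event_ts` and the surrounding source setup are assumptions for illustration, not the actual continuation of the truncated example:

```java
// Sketch only: column name and surrounding setup are illustrative assumptions.
IcebergSource<RowData> source =
    IcebergSource.forRowData()
        .tableLoader(tableLoader)
        // Generate watermarks from the per-file column metrics of "event_ts"
        .watermarkColumn("event_ts")
        // Only needed for long columns; timestamp/timestamptz carry precision
        .watermarkTimeUnit(TimeUnit.MILLISECONDS)
        .build();

DataStream<RowData> stream =
    env.fromSource(
        source,
        // Watermarks come from the source itself, so no strategy is set here
        WatermarkStrategy.noWatermarks(),
        "Iceberg Source with watermarks",
        TypeInformation.of(RowData.class));
```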
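
On the `watermarkTimeUnit` point: since a `long` column carries no unit, the reader must be told how to interpret the raw values before they can become millisecond watermarks. A tiny self-contained illustration of that conversion, using plain `java.util.concurrent.TimeUnit` and independent of the Iceberg API:

```java
import java.util.concurrent.TimeUnit;

public class WatermarkTimeUnitDemo {
    // A long column value only becomes a usable event time once its unit is
    // known; this mirrors the decision a watermarkTimeUnit setting encodes.
    static long toEpochMillis(long value, TimeUnit unit) {
        return unit.toMillis(value);
    }

    public static void main(String[] args) {
        long micros = 1_700_000_000_000_000L;  // microseconds since the epoch
        long seconds = 1_700_000_000L;         // the same instant, in seconds

        // Both interpretations land on the same millisecond timestamp
        System.out.println(toEpochMillis(micros, TimeUnit.MICROSECONDS));
        System.out.println(toEpochMillis(seconds, TimeUnit.SECONDS));
    }
}
```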
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.