pvary commented on code in PR #9179:
URL: https://github.com/apache/iceberg/pull/9179#discussion_r1413777549
##########
docs/flink-queries.md:
##########
@@ -277,6 +277,66 @@ DataStream<Row> stream = env.fromSource(source,
     WatermarkStrategy.noWatermarks(),
     "Iceberg Source as Avro GenericRecord", new GenericRecordAvroTypeInfo(avroSchema));
 ```
+### Emitting watermarks
+Emitting watermarks from the source itself can be beneficial for several purposes, like harnessing
+[Flink Watermark Alignment](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment),
+or preventing [windows](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/windows/)
+from triggering too early when reading multiple data files concurrently.
+
+Enable watermark generation for an `IcebergSource` by setting the `watermarkColumn`.
+The supported column types are `timestamp`, `timestamptz` and `long`.
+Iceberg `timestamp` and `timestamptz` columns inherently carry their time precision, so there is no
+need to specify a time unit for them. A `long` column, however, carries no time unit information; use
+`watermarkTimeUnit` to configure the conversion for `long` columns.
+
+The watermarks are generated based on column metrics stored for data files and emitted once per split.
+The main purpose of watermark alignment is to reduce out-of-orderliness and excess data buffering in
+the Flink state. If multiple smaller files with different time ranges are combined into a single split,
+the out-of-orderliness and the data buffered in the Flink state increase. Hence it is recommended to
+set `read.split.open-file-cost` to a very large value to prevent combining multiple smaller files into
+a single split. Keep in mind the additional memory and CPU load caused by having multiple splits in
+this case.
+
+By default, column metrics are collected for the first 100 columns of the table.
+Use the [write properties](configuration.md#write-properties) starting with `write.metadata.metrics` to adjust this when needed.
+
+```java
+StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
+TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
+
+// Ordered data file reads with windowing, using a timestamp column
Review Comment:
Changed the examples part... could you please check?
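
For reviewers following along in the email view: the example in the diff is truncated at the comment anchor. A rough sketch of how the watermark settings described in this section might fit together — the `watermarkColumn` and `watermarkTimeUnit` builder methods are named in the doc text above, while the column name `event_ts` and the surrounding source setup are assumptions for illustration, not the actual continuation of the truncated example:

```java
// Sketch only: column name and surrounding setup are illustrative assumptions.
IcebergSource<RowData> source =
    IcebergSource.forRowData()
        .tableLoader(tableLoader)
        // Generate watermarks from the per-file column metrics of "event_ts"
        .watermarkColumn("event_ts")
        // Only needed for long columns; timestamp/timestamptz carry precision
        .watermarkTimeUnit(TimeUnit.MILLISECONDS)
        .build();

DataStream<RowData> stream =
    env.fromSource(
        source,
        // Watermarks come from the source itself, so no strategy is set here
        WatermarkStrategy.noWatermarks(),
        "Iceberg Source with watermarks",
        TypeInformation.of(RowData.class));
```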
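
On the `watermarkTimeUnit` point: since a `long` column carries no unit, the reader must be told how to interpret the raw values before they can become millisecond watermarks. A tiny self-contained illustration of that conversion, using plain `java.util.concurrent.TimeUnit` and independent of the Iceberg API:

```java
import java.util.concurrent.TimeUnit;

public class WatermarkTimeUnitDemo {
    // A long column value only becomes a usable event time once its unit is
    // known; this mirrors the decision a watermarkTimeUnit setting encodes.
    static long toEpochMillis(long value, TimeUnit unit) {
        return unit.toMillis(value);
    }

    public static void main(String[] args) {
        long micros = 1_700_000_000_000_000L;  // microseconds since the epoch
        long seconds = 1_700_000_000L;         // the same instant, in seconds

        // Both interpretations land on the same millisecond timestamp
        System.out.println(toEpochMillis(micros, TimeUnit.MICROSECONDS));
        System.out.println(toEpochMillis(seconds, TimeUnit.SECONDS));
    }
}
```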
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.