Thanks Jocean and Shammon.

I took a look at the Spark code. I think its abstraction would work for us too.

One big listener interface, PaimonListener (just like SparkListener), with
an implementation, MetricsPaimonListener, that reports metrics.

Or you can create another listener implementation. I don't know whether
you need one; if you do, please mention it in the PIP.
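
A minimal sketch of that shape, for illustration only. Apart from the names
PaimonListener and MetricsPaimonListener, everything here is an assumption:
the event classes, their fields, the callback names and the ListenerBus
dispatcher are hypothetical and are not taken from the PIP or from Paimon.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical event: base information plus one base metric.
final class CommitEvent {
    final String table;
    final long durationMillis;

    CommitEvent(String table, long durationMillis) {
        this.table = table;
        this.durationMillis = durationMillis;
    }
}

// Hypothetical event for a finished compaction.
final class CompactionEvent {
    final String table;
    final long durationMillis;

    CompactionEvent(String table, long durationMillis) {
        this.table = table;
        this.durationMillis = durationMillis;
    }
}

// One wide interface with a no-op default per event type, so an
// implementation overrides only the callbacks it cares about.
interface PaimonListener {
    default void onCommit(CommitEvent event) {}

    default void onCompaction(CompactionEvent event) {}
}

// Folds events into counters that a metric reporter could read on a schedule.
final class MetricsPaimonListener implements PaimonListener {
    private final AtomicLong commitCount = new AtomicLong();
    private final AtomicLong totalCommitMillis = new AtomicLong();
    private final AtomicLong compactionCount = new AtomicLong();

    @Override
    public void onCommit(CommitEvent event) {
        commitCount.incrementAndGet();
        totalCommitMillis.addAndGet(event.durationMillis);
    }

    @Override
    public void onCompaction(CompactionEvent event) {
        compactionCount.incrementAndGet();
    }

    long commitCount() { return commitCount.get(); }

    long totalCommitMillis() { return totalCommitMillis.get(); }

    long compactionCount() { return compactionCount.get(); }
}

// Hypothetical dispatcher that pushes each event to every registered listener.
final class ListenerBus {
    private final List<PaimonListener> listeners = new CopyOnWriteArrayList<>();

    void register(PaimonListener listener) { listeners.add(listener); }

    void post(CommitEvent event) {
        for (PaimonListener listener : listeners) {
            listener.onCommit(event);
        }
    }

    void post(CompactionEvent event) {
        for (PaimonListener listener : listeners) {
            listener.onCompaction(event);
        }
    }
}
```

With this shape, metrics are just one consumer of events: another listener
registered on the same bus could react to each commit immediately instead of
aggregating it.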

Best,
Jingsong

On Mon, Sep 4, 2023 at 10:10 PM Shammon FY <[email protected]> wrote:
>
> Thanks Jocean.
>
> So do we need to introduce trigger-based metrics for some special events
> such as commit/compaction? Maybe we can hear from others. cc @Caizhi Weng,
> @Jingsong Li, what do you think?
>
> Best,
> Shammon FY
>
> On Fri, Sep 1, 2023 at 11:29 AM Jocean shi <[email protected]> wrote:
>>
>> Hi Shammon FY
>>
>> Continuing from the previous discussion. I would like to discuss the
>> abstract model of "Listener," "Metric," and "MetricReport" further.
>> Firstly, it is important to clarify that the discussed compaction and
>> commit operations have clear boundaries, making them events. These
>> events, in addition to basic information, also include metrics
>> generated during the execution process, such as execution time and CPU
>> consumption. Therefore, an event consists of both base information and
>> base metrics. Users can obtain these events through Listeners, and
>> they can construct the desired new Metric using these events. They can
>> then report the Metric periodically via MetricReport. Thus, I believe
>> the Metric system is a use case of the Listener.
>>
>> Some similar implementations:
>>
>> SparkListener: The SparkListenerTaskEnd event includes TaskInfo and
>> TaskMetrics, and users can subscribe to SparkListenerTaskEnd to obtain
>> the desired metrics.
>> Iceberg: Iceberg's CommitMetric is also generated through the
>> CreateSnapshotEvent and reported via a reporter. The difference is
>> that Iceberg's reporting is trigger-based, while Paimon performs it on
>> a schedule.
>>
>> Best
>> shidayang
>>
>> On Thu, Aug 24, 2023 at 5:41 PM Jocean shi <[email protected]> wrote:
>> >
>> > Hi Shammon FY
>> >
>> > Thanks for your comment.
>> >
>> > 1. DDL events
>> > Many behaviors of the Table service are related to the options of
>> > tables, such as whether the table has enabled full-compaction and the
>> > triggering conditions for compaction. If the options of a table are
>> > changed, the Table service needs to perceive it in a timely manner and
>> > make corresponding adjustments to the behavior of the table. Without a
>> > listener mechanism, the Table service needs to constantly poll the
>> > table to determine if its configuration has changed, which increases
>> > the pressure on Hive and the Table service. If we can listen to the
>> > AlterTableEvent, we won't need to poll the options of the table.
>> >
>> > 2. Why not metric
>> > Metrics are mainly processed statistical indicators that are usually
>> > measured at regular intervals, and multiple reported values may be the
>> > same. This is quite different from events. For example, for 'commit',
>> > a metric usually measures the size, quantity, and duration of recently
>> > committed files, and the results obtained from multiple retrievals may
>> > be the same. One can imagine that replacing the existing
>> > CommitCallback with metrics would be very troublesome.
>> >
>> > Best
>> > shidayang
>> >
>> > On Wed, Aug 23, 2023 at 10:53 AM Shammon FY <[email protected]> wrote:
>> > >
>> > > Hi Jocean
>> > >
>> > > Thanks for your answer. I think there are two types of information you
>> > > want to report: the DDL events and the runtime events such as commit and
>> > > compaction.
>> > >
>> > > For the DDL events, I don't quite understand why you need to poll the
>> > > table information regularly. As we all know, Paimon is itself a storage
>> > > system that holds all of its meta information, and even when you poll
>> > > the information from Paimon, you need to store it somewhere. I think you
>> > > can just use Paimon as the storage itself. If the performance of
>> > > obtaining Paimon tables is relatively low, for example with the large
>> > > number of tables you mentioned, I think we should improve this, for
>> > > example by adding a table cache?
>> > >
>> > > For the runtime events, I understand that they are indeed necessary to
>> > > report to a system like `Table Service`. But my question is: can we do
>> > > this with the existing metrics mechanism? For example, reporting relevant
>> > > metrics to the `Table Service` instead of adding a new `listener`? If the
>> > > metrics information is not complete enough, we can continue to add
>> > > information to it.
>> > >
>> > > Best,
>> > > Shammon FY
>> > >
>> > > On Tue, Aug 22, 2023 at 2:20 PM Jocean shi <[email protected]> wrote:
>> > >
>> > > > Hi Shammon FY,
>> > > >
>> > > > I get your point, but the role of a Listener is more towards
>> > > > notification. For example, as you mentioned, we can query the relevant
>> > > > information through APIs for DDL and commit information. However, when
>> > > > we want to know if there have been any changes to the relevant
>> > > > information, we need to constantly poll the tables. This mechanism can
>> > > > be resource-intensive, especially when there are many tables. With a
>> > > > Listener, we can promptly detect changes in status. Consider a
>> > > > separate Table service that has a requirement to compact all tables,
>> > > > and the compact parameters are stored in the options. When there is a
>> > > > change in the options of a table, the Table Service needs to be
>> > > > notified promptly to determine whether to immediately compact the
>> > > > table. When there is new data committed to a table, it needs to be
>> > > > promptly detected to determine whether to compact it. Also, users need
>> > > > the assistance of CommitEvent to trigger downstream tasks based on the
>> > > > watermark of a table.
>> > > > Querying compact information through SQL or APIs is indeed a good way.
>> > > > It is relatively simple to query historical compact records. However,
>> > > > if you want to know the current compact status of a table, using a
>> > > > Listener may be simpler.
>> > > >
>> > > > Best
>> > > > Shidayang
>> > > >
>> > > > On Mon, Aug 21, 2023 at 11:24 PM Shammon FY <[email protected]> wrote:
>> > > > >
>> > > > > Hi Jocean,
>> > > > >
>> > > > > Thanks for your explanation. I still have some questions.
>> > > > >
>> > > > > 1. What are the DDL events for Paimon used for? If you need to show
>> > > > > tables for Paimon in your system, I think it's better to define
>> > > > > table-related interfaces, and then you can implement them for Paimon,
>> > > > > Iceberg and Hudi instead of adding a DDL listener to each of them.
>> > > > > It's more general and you can even manage other tables such as
>> > > > > databases, MongoDB and Hive.
>> > > > >
>> > > > > 2. If some system information in `CompactEvent` is currently missing,
>> > > > > or there's no information about `compact`, I think a better way is to
>> > > > > add this system information to Paimon, rather than adding a listener
>> > > > > and creating an event with the information. Then the external system
>> > > > > can get the information by SQL or API directly, which is a more
>> > > > > reasonable approach.
>> > > > >
>> > > > > 3. Also, what is the `CommitEvent` used for? Currently we have metrics
>> > > > > for `Commit` and jobs can report them. How about adding a customized
>> > > > > reporter for metrics instead of a listener for `CommitEvent`?
>> > > > >
>> > > > > Best,
>> > > > > Shammon FY
>> > > > >
>> > > > > On Mon, Aug 21, 2023 at 5:16 PM Jocean shi <[email protected]> wrote:
>> > > > >
>> > > > > > Hi Shammon FY,
>> > > > > >
>> > > > > > Thanks for your comments. I’d like to share my thoughts about your
>> > > > > > comments.
>> > > > > >
>> > > > > > 1. Public Interface
>> > > > > > Thank you for the reminder. I overlooked the correspondence between
>> > > > > > the Public Interface of PIP and the "@Public" annotation.
>> > > > > > My idea was that Event, Listener, and ListenerFactory are public,
>> > > > > > while the others are non-public.
>> > > > > >
>> > > > > > 2. Add `Factory` to create `Listener`
>> > > > > > Great suggestion, I have already added the ListenerFactory to the PIP.
>> > > > > >
>> > > > > > 3. Flink and Spark support meta-data listeners
>> > > > > > It will be very inconvenient for users to obtain DDL information
>> > > > > > through engines. Firstly, there are many engine implementations
>> > > > > > that would need to be connected. Secondly, apart from Flink and
>> > > > > > Spark, many engines do not support meta-data listeners. As a
>> > > > > > general data lake, Paimon should have its own mechanism for
>> > > > > > meta-data listeners.
>> > > > > >
>> > > > > > 4. Report events such as commit/compact to an external system
>> > > > > > CompactEvent: Currently, the compact state is a black box, and
>> > > > > > users cannot obtain the information through SQL or API.
>> > > > > > CommitEvent: Currently, the methods of querying through SQL or API
>> > > > > > are based on polling, which makes it difficult for users to
>> > > > > > perceive commit operations in a timely manner and consumes a lot
>> > > > > > of resources.
>> > > > > >
>> > > > > > Best
>> > > > > > Shidayang
>> > > > > >
>> > > > > > On Fri, Aug 18, 2023 at 2:07 PM Shammon FY <[email protected]> wrote:
>> > > > > > >
>> > > > > > > Thanks @Jocean for starting this discussion. I have some comments.
>> > > > > > >
>> > > > > > > 1. About the public interfaces in the PIP, we should add @Public
>> > > > > > > to them, such as `Event`, `Listener` and even `CommitEvent` and
>> > > > > > > other events. But for `Listeners`, I don't think it should be a
>> > > > > > > public interface. All fields in the public interface for users
>> > > > > > > should be `Public` too, but I found that information such as
>> > > > > > > `ManifestEntry` in `CommitEvent` is not a public interface. I
>> > > > > > > think you may need to reconsider which interfaces need to be
>> > > > > > > marked with @Public and which do not.
>> > > > > > >
>> > > > > > > 2. In general, it is better to provide a `Factory` to create the
>> > > > > > > `Listener`, and all of them should be marked as `@Public`; you
>> > > > > > > can see `CatalogFactory`->`Catalog` as an example.
>> > > > > > >
>> > > > > > > 3. Currently Flink and Spark support meta-data listeners and we
>> > > > > > > can support reporting DDL information there. Do we need to add
>> > > > > > > the same listener in Paimon?
>> > > > > > >
>> > > > > > > 4. Do we need to report events such as commit/compact to an
>> > > > > > > external system? Currently we have some system tables and users
>> > > > > > > can get this information by SQL or API. Should the external
>> > > > > > > system query this information regularly instead of having a
>> > > > > > > listener push it?
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Shammon FY
>> > > > > > >
>> > > > > > > On Tue, Aug 15, 2023 at 11:08 AM Jocean shi <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > Hi devs:
>> > > > > > > >
>> > > > > > > > We would like to start a discussion about PIP-8: Introduce
>> > > > > > > > listeners for Paimon[1].
>> > > > > > > >
>> > > > > > > > In production environments, users often need to perceive the
>> > > > > > > > state changes of a Paimon table, such as whether a new file
>> > > > > > > > has been committed to the table, which partitions the
>> > > > > > > > committed files are in, the size and number of the committed
>> > > > > > > > files, the status and type of compaction, and operations like
>> > > > > > > > table creation, deletion, and schema changes.
>> > > > > > > > So, we introduce a Listener system for Paimon.
>> > > > > > > > Looking forward to hearing from you.
>> > > > > > > >
>> > > > > > > > [1]
>> > > > > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-8%3A+Introduce+listeners+for+Paimon
>> > > > > > > >
>> > > > > > > > Best
>> > > > > > > > shidayang
>> > > > > > > >