Hi Jingsong,

Thanks for bringing up this discussion. This is exactly what we want for
Paimon, and we have real use cases for it internally at ByteDance.

Let me introduce our situation first. We have a search business partner
that needs to periodically join a large table with a small table. The
large table is around 100TB and unshuffled; the small table is around
100GB. Our users don't want to shuffle and sort the large table up front,
since that is very resource- and time-consuming. Meanwhile, the small
table is too large to be broadcast.

To solve this problem, we have launched a long-running Flink job as a
lookup service. In this job, each subtask builds a local LevelDB instance
from its partition of the small table files, registers its meta
information in ZooKeeper (ZK) for service discovery, and provides a gRPC
lookup service. A lookup client is also offered so that users can call
this RPC service. We then use a separate map-only job to scan the large
table and perform the lookup join through the client. In this way, our
users can finish the join operation in hours.
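
To make the flow above concrete, here is a rough, in-memory sketch. It is
not our production code: a plain dict stands in for each subtask's local
LevelDB, another dict stands in for the ZK registry, a direct method call
stands in for the gRPC call, and all names (LookupSubtask, LookupClient,
lookup_join, NUM_PARTITIONS) as well as the crc32-based partitioning are
assumptions made only for illustration.

```python
import zlib
from typing import Dict, Iterable, Iterator, Optional, Tuple

NUM_PARTITIONS = 4  # hypothetical: one partition per lookup subtask


def partition_of(key: str) -> int:
    """Map a key to a partition; crc32 is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


class LookupSubtask:
    """One subtask of the lookup job; a dict stands in for local LevelDB."""

    def __init__(self, partition: int) -> None:
        self.partition = partition
        self.store: Dict[str, str] = {}

    def load(self, rows: Iterable[Tuple[str, str]]) -> None:
        # Keep only the small-table rows that belong to this partition.
        for key, value in rows:
            if partition_of(key) == self.partition:
                self.store[key] = value

    def lookup(self, key: str) -> Optional[str]:
        return self.store.get(key)


class LookupClient:
    """Routes each key to the subtask that owns its partition."""

    def __init__(self, registry: Dict[int, LookupSubtask]) -> None:
        # The registry dict stands in for service discovery via ZK.
        self.registry = registry

    def lookup(self, key: str) -> Optional[str]:
        # In a real deployment this would be a remote call.
        return self.registry[partition_of(key)].lookup(key)


def lookup_join(
    large_rows: Iterable[Tuple[str, str]], client: LookupClient
) -> Iterator[Tuple[str, str, str]]:
    """Map-only scan of the large table; emits only rows with a match."""
    for key, payload in large_rows:
        match = client.lookup(key)
        if match is not None:
            yield key, payload, match


small_table = [("u1", "alice"), ("u2", "bob"), ("u3", "carol")]
registry = {p: LookupSubtask(p) for p in range(NUM_PARTITIONS)}
for subtask in registry.values():
    subtask.load(small_table)

client = LookupClient(registry)
large_table = [("u1", "click"), ("u9", "view"), ("u3", "buy"), ("u1", "buy")]
joined = list(lookup_join(large_table, client))
```

The point of the sketch is that the large table is only scanned, never
shuffled or sorted: each row is routed by key to the one subtask that
holds the matching slice of the small table.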

This architecture has worked well for our users for years, and recently
they have been trying to upgrade it to improve overall performance and
usability. With the QueryService provided by Paimon, I believe we can
solve this problem in a more general way.

So overall I'm a big +1 for this new feature. My colleagues and I are
also more than willing to participate in the development and help evolve
this feature in production.

For the design doc, I'm curious about how the Snapshot File Scanner will
be designed and implemented. It would be great if we could get more
information about this.

Regards,
Xiangyu

Jingsong Li <[email protected]> wrote on Sun, Oct 8, 2023 at 18:35:

> Hi all,
>
> I want to bring up a discussion about Paimon QueryService [1].
>
> The Paimon primary key table already provides an LSM file structure, so
> it is a pity that Paimon cannot provide a queryable service for lookup.
>
> A distributed service can download Paimon files locally and provide a
> Lookup service. It does not affect the write process and read process,
> it is a separate server. It can be used as:
>
> 1. Flink Lookup Join, reuse by multiple Flink Jobs.
> 2. Online Service Lookup, this requires high stability. (it may not be
> so stable in the first version)
>
> See more in PIP [1].
>
> This PIP is a high-level design for Paimon QueryService, not including
> all details.
>
> [1]
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-10%3A+Introduce+Paimon+QueryService
>
> Best,
> Jingsong
>
