Hi xiangyu,

Very glad to hear from you. You are very welcome to participate in the
development, and I believe we have a lot of technology to share with each
other.

The Snapshot File Scanner should be something like
`MultiTablesStreamingCompactorSourceFunction`. By using a StreamTableScan, it
can continuously read files from the table.

Best,
Jingsong

On Sun, Oct 8, 2023 at 9:20 PM xiangyu feng <[email protected]> wrote:
>
> Hi Jingsong,
>
> Thanks for bringing up this discussion. This is exactly what we want for
> Paimon, and we have met real user cases internally at ByteDance.
>
> Let me introduce our situation first. We have a search business partner that
> needs to periodically join large tables with small tables. The large table is
> around 100TB and unshuffled; the small table is around 100GB. Our users don't
> want to shuffle and sort the big table in the first place, since that is very
> resource- and time-consuming. Meanwhile, the small table is also too large to
> be broadcast.
>
> To solve this problem, we have launched a long-running Flink job as a lookup
> service. In this job, each subtask initializes a local LevelDB instance with
> its partitioned small-table files, registers its meta information to ZK for
> service discovery, and provides a lookup gRPC service. A lookup client is
> also offered for users to call this RPC service. Then we use a separate
> map-only job to scan the large table and perform the lookup join through the
> client. In this way, our users can finish the join operation in hours.
>
> This architecture has been working well for our users for years, and
> recently they have been trying to upgrade it to improve the overall
> performance and usability. With the QueryService provided by Paimon, I
> believe we can solve this problem in a more general way.
>
> So overall I'm a big +1 for this new feature. My colleagues and I are also
> more than willing to participate in the development and help evolve this
> feature in production.
>
> For the design doc, I'm curious about how the Snapshot File Scanner will be
> designed and implemented. It would be great if we could get more information
> about this.
>
> Regards,
> Xiangyu
>
> On Sun, Oct 8, 2023 at 18:35, Jingsong Li <[email protected]> wrote:
>>
>> Hi all,
>>
>> I want to bring up a discussion about Paimon QueryService [1].
>>
>> Paimon primary key tables already provide an LSM file structure, so it is
>> a pity that Paimon cannot provide a queryable service for lookup.
>>
>> A distributed service can download Paimon files locally and provide a
>> lookup service. It does not affect the write and read processes; it is a
>> separate server. It can be used for:
>>
>> 1. Flink Lookup Join, reused by multiple Flink jobs.
>> 2. Online Service Lookup, which requires high stability. (It may not be
>> so stable in the first version.)
>>
>> See more in the PIP [1].
>>
>> This PIP is a high-level design for Paimon QueryService and does not
>> include all details.
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-10%3A+Introduce+Paimon+QueryService
>>
>> Best,
>> Jingsong
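
For readers following the thread, the lookup architecture described in xiangyu's message (a small table partitioned across subtasks, each subtask holding a local store, a registry for discovery, and a map-only scan of the large table that calls a lookup client) can be sketched roughly as below. This is only an illustration under stated assumptions, not the ByteDance implementation or the Paimon QueryService design: a plain dict stands in for LevelDB, an in-process dict stands in for ZK service discovery, a direct method call stands in for the gRPC request, and all names (`shard_for`, `LookupShard`, `LookupClient`, `start_service`, `map_only_join`) are hypothetical. Returning `None` for a missing key (left-join behaviour) is likewise an assumption.

```python
import zlib
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Row = Tuple[str, str]


def shard_for(key: str, num_shards: int) -> int:
    """Stable hash so the loader and the client agree on which shard owns a key."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class LookupShard:
    """One subtask of the lookup service (hypothetical).

    The dict stands in for the subtask's local LevelDB instance.
    """

    def __init__(self, shard_id: int) -> None:
        self.shard_id = shard_id
        self._store: Dict[str, str] = {}

    def load(self, rows: Iterable[Row]) -> None:
        # Load this shard's partition of the small table into the local store.
        for key, value in rows:
            self._store[key] = value

    def lookup(self, key: str) -> Optional[str]:
        return self._store.get(key)


class LookupClient:
    """Routes each key to the shard that owns it (hypothetical).

    `registry` stands in for ZK-based discovery; the method call on the
    shard stands in for a gRPC request.
    """

    def __init__(self, registry: Dict[int, LookupShard]) -> None:
        self._registry = registry

    def lookup(self, key: str) -> Optional[str]:
        shard = self._registry[shard_for(key, len(self._registry))]
        return shard.lookup(key)


def start_service(small_table: Iterable[Row], num_shards: int) -> Dict[int, LookupShard]:
    """Partition the small table by key hash and load one shard per partition."""
    buckets: Dict[int, List[Row]] = {i: [] for i in range(num_shards)}
    for key, value in small_table:
        buckets[shard_for(key, num_shards)].append((key, value))
    registry = {i: LookupShard(i) for i in range(num_shards)}
    for i, shard in registry.items():
        shard.load(buckets[i])
    return registry


def map_only_join(
    large_table: Iterable[Row], client: LookupClient
) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Scan the large table once, with no shuffle or sort, and look up each key."""
    for key, payload in large_table:
        yield key, payload, client.lookup(key)


if __name__ == "__main__":
    registry = start_service([("u1", "Alice"), ("u2", "Bob")], num_shards=4)
    client = LookupClient(registry)
    for joined in map_only_join([("u1", "click"), ("u3", "view")], client):
        print(joined)
```

The point of the sketch is that the large table is only streamed through `map_only_join`, while the small table is split so that no single process has to hold all of it, which matches the constraint in the thread that the small table is too large to broadcast.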
