Re: [DISCUSS] PIP-10: Introduce Paimon QueryService

Jingsong Li Sun, 08 Oct 2023 18:42:31 -0700

Hi xiangyu,

Very glad to hear from you.
You are very welcome to participate in the development, and I believe
we have a lot of technology to share with each other.


The Snapshot File Scanner should be something like
`MultiTablesStreamingCompactorSourceFunction`: it can use a
StreamTableScan to continuously read files from the table.
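
To illustrate the idea of continuous reading, here is a toy sketch in
plain Java. It is not Paimon's API: `SnapshotLog` and
`StreamingFileScanner` are hypothetical names, and an in-memory list
stands in for the table's snapshots. It only shows the assumed pattern,
where the scanner remembers the next snapshot to read so that each
plan() call returns just the files added since the previous call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a continuously polling snapshot file scanner (not the Paimon API). */
final class SnapshotScanSketch {

    /** Hypothetical in-memory stand-in for a table's snapshot history. */
    static final class SnapshotLog {
        private final List<List<String>> snapshots = new ArrayList<>();

        /** Commits a new snapshot that adds the given data files. */
        synchronized void commit(String... addedFiles) {
            snapshots.add(new ArrayList<>(Arrays.asList(addedFiles)));
        }

        /** Returns a copy of all snapshots starting at fromIndex. */
        synchronized List<List<String>> snapshotsFrom(int fromIndex) {
            return new ArrayList<>(snapshots.subList(fromIndex, snapshots.size()));
        }
    }

    /** Remembers the next snapshot to read, so each plan() returns only new files. */
    static final class StreamingFileScanner {
        private final SnapshotLog log;
        private int nextSnapshot = 0;

        StreamingFileScanner(SnapshotLog log) {
            this.log = log;
        }

        /** First call returns every file so far; later calls return only the delta. */
        List<String> plan() {
            List<List<String>> fresh = log.snapshotsFrom(nextSnapshot);
            nextSnapshot += fresh.size(); // advance past everything just read
            List<String> files = new ArrayList<>();
            for (List<String> snapshot : fresh) {
                files.addAll(snapshot);
            }
            return files;
        }
    }

    public static void main(String[] args) {
        SnapshotLog log = new SnapshotLog();
        StreamingFileScanner scanner = new StreamingFileScanner(log);

        log.commit("data-1.orc", "data-2.orc");
        System.out.println(scanner.plan()); // files from the first snapshot
        System.out.println(scanner.plan()); // nothing new has been committed
        log.commit("data-3.orc");
        System.out.println(scanner.plan()); // only the newly added file
    }
}
```

In a real service the loop around plan() would download the returned
files and load them into the local lookup store; that part is left out
here.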

Best,
Jingsong

On Sun, Oct 8, 2023 at 9:20 PM xiangyu feng <[email protected]> wrote:
>
> Hi Jingsong,
>
> Thanks for bringing up this discussion. This is exactly what we want for
> Paimon, and we have run into real use cases internally at ByteDance.
>
> Let me introduce our situation first. We have a search business partner that
> needs to periodically join a large table with a small table.
> The large table is around 100TB and unshuffled; the small table is
> around 100GB. Our users don't want to shuffle and sort the big table in the
> first place, since that is very resource- and time-consuming. Meanwhile, the
> small table is too large to be broadcast.
>
> To solve this problem, we have launched a long-running Flink job as a lookup
> service. In this job, each subtask initializes a local LevelDB instance with
> its partition of the small table's files, registers its meta information in
> ZK for service discovery, and provides a lookup gRPC service. A lookup client
> is also offered for users to call this RPC service. We then use a separate
> map-only job to scan the large table and perform the lookup join through the
> client. In this way, our users can finish the join operation in hours.
>
> This architecture has worked well for our users for years, and recently they
> have been trying to upgrade it to improve overall performance
> and usability. With the QueryService provided by Paimon, I believe we can
> solve this problem in a more general way.
>
> So overall I'm a big +1 for this new feature. My colleagues and I are also
> more than willing to participate in the development and help evolve this
> feature in production.
>
> For the design doc, I'm curious about how the Snapshot File Scanner will be
> designed and implemented. It would be great if we could get more information
> about this.
>
> Regards,
> Xiangyu
>
> On Sun, Oct 8, 2023 at 18:35, Jingsong Li <[email protected]> wrote:
>>
>> Hi all,
>>
>> I want to bring up a discussion about Paimon QueryService [1].
>>
>> The Paimon primary key table already provides an LSM file structure, so it
>> is a pity that Paimon cannot provide a queryable service for lookup.
>>
>> A distributed service can download Paimon files locally and provide a
>> lookup service. It is a separate server, so it does not affect the write
>> process or the read process. It can be used for:
>>
>> 1. Flink Lookup Join, reusable by multiple Flink jobs.
>> 2. Online Service Lookup, which requires high stability (it may not be
>> so stable in the first version).
>>
>> See more in PIP [1].
>>
>> This PIP is a high-level design for Paimon QueryService, not including
>> all details.
>>
>> [1] 
>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-10%3A+Introduce+Paimon+QueryService
>>
>> Best,
>> Jingsong
