I just created https://github.com/apache/incubator-paimon/pull/2101
for file scanning.

Best,
Jingsong

On Mon, Oct 9, 2023 at 9:42 AM Jingsong Li <[email protected]> wrote:
>
> Hi xiangyu,
>
> I'm very glad to hear from you.
> You are very welcome to participate in the development, and I believe we
> will have a lot of technology to share.
>
> The Snapshot File Scanner should be something like
> `MultiTablesStreamingCompactorSourceFunction`. Using a StreamTableScan,
> it can continuously read files from the table.
>
> Best,
> Jingsong
>
> On Sun, Oct 8, 2023 at 9:20 PM xiangyu feng <[email protected]> wrote:
> >
> > Hi Jingsong,
> >
> > Thanks for bringing up this discussion. This is exactly what we want for
> > Paimon, and we have met real use cases internally at ByteDance.
> >
> > Let me introduce our situation first. We have a search business partner
> > that needs to periodically join large tables with small tables. The large
> > table is around 100TB and unshuffled; the small table is around 100GB.
> > Our users don't want to shuffle and sort the big table in the first place,
> > since that is very resource- and time-consuming. Meanwhile, the small
> > table is also too large to be broadcast.
> >
> > To solve this problem, we have launched a long-running Flink job as a
> > lookup service. In this job, each subtask initializes a local LevelDB
> > with its partition of the small table files, registers its meta
> > information in ZK for service discovery, and provides a lookup gRPC
> > service. A lookup client is also offered for users to call this RPC
> > service. Then we use a separate map-only job to scan the large table and
> > perform the lookup join through the client. In this way, our users can
> > finish the join operation in hours.
> >
> > This architecture has been working well for our users for years, and
> > recently they have been trying to upgrade it to improve overall
> > performance and usability. With the QueryService provided by Paimon, I
> > believe we can solve this problem in a more general way.
> >
> > So overall I'm a big +1 for this new feature. My colleagues and I are
> > also more than willing to participate in the development and help evolve
> > this feature in production.
> >
> > For the design doc, I'm curious about how the Snapshot File Scanner will
> > be designed and implemented. It would be great if we could get more
> > information about this.
> >
> > Regards,
> > Xiangyu
> >
> > On Sun, Oct 8, 2023 at 6:35 PM Jingsong Li <[email protected]> wrote:
> >>
> >> Hi all,
> >>
> >> I want to bring up a discussion about Paimon QueryService [1].
> >>
> >> The Paimon primary key table already provides an LSM file structure, so
> >> it is a pity that Paimon cannot provide a queryable service for lookup.
> >>
> >> A distributed service can download Paimon files locally and provide a
> >> Lookup service. It does not affect the write and read processes; it is
> >> a separate server. It can be used as:
> >>
> >> 1. Flink Lookup Join, reused by multiple Flink jobs.
> >> 2. Online Service Lookup, which requires high stability (it may not be
> >> so stable in the first version).
> >>
> >> See more in PIP [1].
> >>
> >> This PIP is a high-level design for Paimon QueryService, not including
> >> all details.
> >>
> >> [1] 
> >> https://cwiki.apache.org/confluence/display/PAIMON/PIP-10%3A+Introduce+Paimon+QueryService
> >>
> >> Best,
> >> Jingsong
