Re: [DISCUSS] PIP-10: Introduce Paimon QueryService

Ming Li Wed, 11 Oct 2023 01:07:45 -0700

>
> Yes, I think there must be a primary key, we can compute the bucket
> from the primary key, and find which executor to visit.
> This is the primary key Query Service.
>


Hi Jingsong, thank you for the explanation. It looks good to me that the
first version only supports lookups based on primary keys; most of our
scenarios also use primary-key lookups.
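
As a rough illustration of the primary-key routing described above, here is
a minimal Python sketch. The hash function (SHA-256 truncated to 64 bits),
the bucket count, and the executor addresses are all assumptions made for
illustration; this is not Paimon's actual bucket computation or API.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (a stand-in, not Paimon's real hash function)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


def bucket_for_key(primary_key: str, num_buckets: int) -> int:
    """Compute the bucket from the primary key."""
    return stable_hash(primary_key) % num_buckets


def executor_for_key(primary_key: str, bucket_to_executor: list) -> str:
    """Find the executor that serves the key's bucket.

    bucket_to_executor[i] is the (hypothetical) RPC address serving bucket i.
    """
    bucket = bucket_for_key(primary_key, len(bucket_to_executor))
    return bucket_to_executor[bucket]


# Hypothetical assignment: 4 buckets spread over 2 executors.
ASSIGNMENT = [
    "executor-0:9000",
    "executor-1:9000",
    "executor-0:9000",
    "executor-1:9000",
]
```

With a mapping like this, the client sends each lookup to exactly one
executor instead of asking all executors in turn.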

> But there are other problems, when the number of executors changes, the
> service needs to be restarted and the data needs to be loaded.
>

Hi Jufang, I think this is not a big problem. The first version of the
Query Service may be embedded in a Flink job, so high availability depends
on the implementation of Flink.
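
For reference, the quoted concern can be made concrete. If the executor is
chosen as hash(key) % N, then when N changes from 4 to 5 a key keeps its
executor only if hash % 20 is 0, 1, 2, or 3, so about 80% of keys would
move. A small Python sketch to check this (the hash function and key names
are assumptions for illustration, not taken from the ByteDance system or
from Paimon):

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (a stand-in for the shared hash rule)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


def executor_index(key: str, num_executors: int) -> int:
    """Executor number for a key under plain modulo hashing."""
    return stable_hash(key) % num_executors


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose executor changes when the executor count changes."""
    moved = sum(
        1
        for k in keys
        if executor_index(k, old_count) != executor_index(k, new_count)
    )
    return moved / len(keys)


# Hypothetical key set; going from 4 to 5 executors remaps most keys.
KEYS = [f"key-{i}" for i in range(10_000)]
FRACTION_MOVED = moved_fraction(KEYS, 4, 5)
```

Under this scheme most of the data has to be reloaded after a change in the
executor count, which matches the restart-and-reload behavior described in
the quote.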

Best,
Ming Li


jufang he <[email protected]> wrote on Tue, Oct 10, 2023 at 17:20:

> Hi, Ming.
>
> As Xiangyu mentioned, we encountered the same problem when implementing a
> similar solution in ByteDance, maybe I can share some experience.
>
> The small table has more than 100 GB of data, so it needs to be placed on
> separate nodes. To solve the problem of getting Executor addresses during
> RPC queries, the same hash rules are used in data generation, loading, and
> querying. When the data is generated, it is written to different
> directories according to the hash algorithm. When the data is loaded, the
> same hash algorithm is used together with a number of executors set in
> advance, so the data is distributed across different executors. Since a
> single Executor can still exceed the memory limit, we put the data into a
> local KV store. When dealing with large tables, the Executor number for
> the key being queried can be calculated with the same hash algorithm and
> the preset number of executors. Based on the Executor number, we can get
> the Executor RPC address from ZK.
>
>
> But there are other problems, when the number of executors changes, the
> service needs to be restarted and the data needs to be loaded.
>
> Best,
>
> Jufang
>
> Jingsong Li <[email protected]> wrote on Tue, Oct 10, 2023 at 16:40:
>
> > Hi Ming.
> >
> > Yes, I think there must be a primary key, we can compute the bucket
> > from the primary key, and find which executor to visit.
> > This is the primary key Query Service.
> >
> > Later, maybe we can introduce more Query Service types. For example,
> > another service could be a Secondary Indexed Query Service: it can be
> > queried by another field to get the primary key (maybe using RocksDB to
> > maintain the index), and then query the Primary Key Query Service to get
> > the whole value.
> >
> > The Secondary Indexed Query Service and the Primary Key Query Service
> > are independent and unrelated, but we can use the Snapshot Id to do some
> > consistency alignment work. This would be more complicated, though.
> >
> > These things can be imagined, but they need lots of work.
> >
> > I just created a POC for the first version, it is very rough:
> > https://github.com/apache/incubator-paimon/pull/2110
> >
> > Best,
> > Jingsong
> >
> > On Tue, Oct 10, 2023 at 3:36 PM Ming Li <[email protected]> wrote:
> > >
> > > Thanks for the proposal!
> > > It is a common scenario for multiple applications to share the same
> > > dimension table. As described in the design document, the TableQuery
> > > client will obtain the addresses of all Executors from the
> > > AddressServer and then request them through RPC. I have a question
> > > about this: how does the TableQuery client decide which Executor to
> > > request? Does it request all Executors in turn? Or is the lookup key
> > > required to contain the bucket key?
> > >
> > > Best,
> > > Ming Li
> > >
> > >
> > > Jingsong Li <[email protected]> wrote on Sun, Oct 8, 2023 at 18:35:
> > >
> > > > Hi all,
> > > >
> > > > I want to bring up a discussion about Paimon QueryService [1].
> > > >
> > > > The Paimon primary key table already provides an LSM file structure,
> > > > so it is a pity that Paimon cannot provide a queryable service for
> > > > lookups.
> > > >
> > > > A distributed service can download Paimon files locally and provide a
> > > > lookup service. It does not affect the write and read processes; it
> > > > is a separate server. It can be used for:
> > > >
> > > > 1. Flink Lookup Join, reused by multiple Flink jobs.
> > > > 2. Online Service Lookup, which requires high stability (it may not
> > > > be so stable in the first version).
> > > >
> > > > See more in PIP [1].
> > > >
> > > > This PIP is a high-level design for Paimon QueryService, not
> > > > including all details.
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-10%3A+Introduce+Paimon+QueryService
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> >
>
