Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-20 Thread Jingsong Li
Thanks Jing, Looks good~ Best, Jingsong On Fri, Jan 21, 2022 at 2:00 PM Lincoln Lee wrote: > > Hi, Jing > Glad to hear the agreement on the hint syntax, let's keep going! > > Best, > Lincoln Lee > > > Jing Zhang 于2022年1月20日周四 16:52写道: > > > Hi Jingsong, > > Thanks for the feedback. > > > > >

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-20 Thread Lincoln Lee
Hi, Jing Glad to hear the agreement on the hint syntax, let's keep going! Best, Lincoln Lee Jing Zhang 于2022年1月20日周四 16:52写道: > Hi Jingsong, > Thanks for the feedback. > > > Is there a conclusion about naming here? (Maybe I missed something?) > Use USE_HASH or some other names. Slightly conf

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-20 Thread Jing Zhang
Hi Jingsong, Thanks for the feedback. > Is there a conclusion about naming here? (Maybe I missed something?) Use USE_HASH or some other names. Slightly confusing in the FLIP. 'SHUFFLE_HASH' is final hint name, 'USE_HASH' is rejected. I've updated the FLIP. > And the problem of what to write insi

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-20 Thread Jing Zhang
Hi Lincoln, Thanks for the feedback. > The previous discussion for the hint syntax left a minor difference there: whether to use both sides of join table names or just one 'build' side table name only. I would prefer the later one. Users only need to pass the `build` side table(usually the smalle

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-19 Thread Jingsong Li
Hi Jing, About `SHUFFLE_HASH(left_table, right_table)`, one case can be shared: SELECT * FROM left_t JOIN right_1 ON ... JOIN right_2 ON ... JOIN right_3 ON ... What if we want to use shuffle_hash for all three joints? SELECT /*+ SHUFFLE_HASH('left_t', 'right_1', 'right_2', 'right_3') */

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-19 Thread Jingsong Li
Hi Jing, Sorry for the late reply! Is there a conclusion about naming here? (Maybe I missed something?) Use USE_HASH or some other names. Slightly confusing in the FLIP. And the problem of what to write inside the hint, as mentioned by lincoln. I think maybe we can list the grammars of other di

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-19 Thread Lincoln Lee
Hi, Jing, Sorry for the late reply! The previous discussion for the hint syntax left a minor difference there: whether to use both sides of join table names or just one 'build' side table name only. I would prefer the later one. Users only need to pass the `build` side table(usually the smalle

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-15 Thread Jing Zhang
Hi all, Thanks for all the feedback so far. If there is no more suggestions, I would like to drive a vote in Tuesday next week (18 Jan). Best, Jing Zhang Jing Zhang 于2022年1月5日周三 11:33写道: > Hi Francesco, > Thanks a lot for the feedback. > > > does it makes sense for a lookup join to use hash dis

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-04 Thread Jing Zhang
Hi Francesco, Thanks a lot for the feedback. > does it makes sense for a lookup join to use hash distribution whenever is possible by default? I prefer to enable the hash lookup join only find the hint in the query for the following reason: 1. Plan compatibility Add a hash shuffle by default w

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-03 Thread Francesco Guardiani
Hi Jing, Thanks for the FLIP. I'm not very knowledgeable about the topic, but going through both the FLIP and the discussion here, I wonder, does it makes sense for a lookup join to use hash distribution whenever is possible by default? The point you're explaining here: > Many Lookup table sourc

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-31 Thread Jing Zhang
Hi Lincoln, Thanks for the feedback. > 1. For the hint name, +1 for WenLong's proposal. I've added add 'SHUFFLE_HASH' to other alternatives in FLIP. Let's waiting for more voices here. > Regarding the `SKEW` hint, agree with you that it can be used widely, and I prefer to treat it as a metadata

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-31 Thread Jing Zhang
Hi Martijn, Thanks for the feedback. Glad to hear that we reached a consensus on the first and second point. About whether to use `use_hash` as a term, I think your concern makes sense. Although the hash lookup join is similar to Hash join in oracle that they all require hash distribution on inpu

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-29 Thread Lincoln Lee
Hi Jing, Thanks for your explanation! 1. For the hint name, +1 for WenLong's proposal. I think the `SHUFFLE` keyword is important in a classic distributed computing system, a hash-join usually means there's a shuffle stage(include shuffle hash-join, broadcast hash-join). Users only need to pas

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-29 Thread Martijn Visser
Hi Jing, Thanks for explaining this in more detail and also to others participating. > I think using query hints in this case is more natural for users, WDYT? Yes, I agree. As long as we properly explain in our documentation that we support both Query Hints and Table Hints, what's the difference

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-29 Thread Jing Zhang
Hi Jian gang, Thanks for the feedback. > When it comes to hive, how do you load partial data instead of the whole data? Any change related with hive? The question is same as Yuan mentioned before. I prefer to drive another FLIP on this topic to further discussion individually because this poin

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-29 Thread Jing Zhang
Hi Wenlong, Thanks for the feedback. I've checked similar syntax in other systems, they are all different from each other. It seems to be without consensus. As mentioned in FLIP-204, oracle uses a query hint, the hint name is 'use_hash' [1]. Spark also uses a query hint, its name is 'SHUFFLE_HASH'

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-29 Thread 刘建刚
Thank you for the proposal, Jing. I like the idea to partition data by some key to improve the cache hit. I have some questions: 1. When it comes to hive, how do you load partial data instead of the whole data? Any change related with hive? 2. How to define the cache configuration? For ex

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-29 Thread wenlong.lwl
Hi, Jing, thanks for driving the discussion. Have you made some investigation on the syntax of join hint? Why do you choose USE_HASH from oracle instead of the style of spark SHUFFLE_HASH, they are quite different. People in the big data world may be more familiar with spark/hive, if we need to ch

Re:Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-29 Thread zst...@163.com
Hi Jing, Thanks for your detail reply. 1) In the last suggestion, hash by primary key is not use for raising the cache hit, but handling with skew of left source. Now that you have 'skew' hint and other discussion about it, I'm looking forward to it. 2) I mean to support user defined partitio

Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-28 Thread Jing Zhang
Hi, Lincoln Thanks a lot for the feedback. > Regarding the hint name ‘USE_HASH’, could we consider more candidates? Things are a little different from RDBMS in the distributed world, and we also aim to solve the data skew problem, so all these incoming hints names should be considered together.

Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-28 Thread Jing Zhang
Hi Yuan and Lincoln, thanks a lot for the attention. I would answer the email one by one. To Yuan > How shall we deal with CDC data? If there is CDC data in the pipeline, IMHO, shuffle by join key will cause CDC data disorder. Will it be better to use primary key in this case? Good question. The

Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-28 Thread Lincoln Lee
Hi Jing, Thanks for bringing up this discussion! Agree that this join hints should benefit both bounded and unbounded cases as Martin mentioned. I also agree that implementing the query hint is the right way for a more general purpose since the dynamic table options has a limited scope. Som

Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-28 Thread zst...@163.com
Hi Jing, Thanks very much for your FLIP. I have some points: - How shall we deal with CDC data? If there is CDC data in the pipeline, IMHO, shuffle by join key will cause CDC data disorder. Will it be better to use primary key in this case? - If the shuffle keys can be customized when user

Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-28 Thread Jing Zhang
Hi Martijn, Thanks a lot for your attention. I'm sorry I didn't explain the motivation clearly. I would like to explain it in detail, and then give response on your questions. A lookup join is typically used to enrich a table with data that is queried from an external system. Many Lookup table sour

Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-28 Thread Martijn Visser
Hi Jing, Thanks a lot for the explanation and the FLIP. I definitely learned something when reading more about `use_hash`. My interpretation would be that the primary benefit of a hash lookup join would be improved performance by allowing the user to explicitly optimise the planner. I have a coup

[DISCUSS] Introduce Hash Lookup Join

2021-12-27 Thread Jing Zhang
Hi everyone, Look up join [1] is commonly used feature in Flink SQL. We have received many optimization requirements on look up join. For example: 1. Enforces left side of lookup join do a hash parti