Re: Flink SQL and data shuffling (keyBy)
Happy to help! Let us know if it helped in your use case.

On Tue, Apr 5, 2022 at 1:34 AM Yaroslav Tkachenko wrote:
> Hi Marios,
>
> Thank you, this looks very promising!

--
Marios
Re: Flink SQL and data shuffling (keyBy)
Hi Marios,

Thank you, this looks very promising!

On Mon, Apr 4, 2022 at 2:42 AM Marios Trivyzas wrote:
> Maybe you can use table.exec.sink.keyed-shuffle and set it to FORCE,
> which will use the primary key column(s) to partition and distribute
> the data.
Re: Flink SQL and data shuffling (keyBy)
Hi again,

Maybe you can use table.exec.sink.keyed-shuffle
(https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/config/#table-exec-sink-keyed-shuffle)
and set it to FORCE, which will use the primary key column(s) to
partition and distribute the data.

On Fri, Apr 1, 2022 at 6:52 PM Marios Trivyzas wrote:
> I don't think there is a way to achieve that without resorting to the
> DataStream API.

Best,
Marios
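To make that concrete, the option can be set per job, e.g. from the SQL client before the INSERT (the sink and source table names below are made up for illustration; only the option name and value come from the docs linked above):

```sql
-- Hash-partition rows on the sink's PRIMARY KEY before writing,
-- so all changes for the same key go to the same sink subtask.
SET 'table.exec.sink.keyed-shuffle' = 'FORCE';

INSERT INTO jdbc_sink SELECT id, name FROM source_table;
```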
Re: Flink SQL and data shuffling (keyBy)
Hi!

I don't think there is a way to achieve that without resorting to the
DataStream API. I'm not sure whether the PARTITIONED BY clause in the
table's CREATE statement can help to "balance" the data; see
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#partitioned-by.

On Thu, Mar 31, 2022 at 7:18 AM Yaroslav Tkachenko wrote:
> Is there a simpler way to do this? I understand that, for example, a
> GROUP BY statement will probably perform similar data shuffling, but
> what if I have a simple SELECT followed by INSERT?

--
Marios
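For reference, a hypothetical CREATE statement using the clause (note that PARTITIONED BY is mainly honored by partitioned connectors such as filesystem or Hive, so it may not help the JDBC case; all table, column, and path names are illustrative):

```sql
-- Rows are grouped into partitions by the value of the `dt` column.
CREATE TABLE partitioned_sink (
  id BIGINT,
  name STRING,
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/partitioned_sink',
  'format' = 'json'
);
```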
Flink SQL and data shuffling (keyBy)
Hey everyone,

I'm trying to use Flink SQL to construct a set of transformations for my
application. Let's say the topology has just three steps:

- SQL Source
- SQL SELECT statement
- SQL Sink (via INSERT)

The sink I'm using (JDBC) would really benefit from partitioning the data
by the PK ID, to avoid conflicting transactions and deadlocks. I can force
Flink to partition the data by the PK ID before the INSERT by resorting to
the DataStream API and leveraging the keyBy method, then transforming the
DataStream back into a Table again...

Is there a simpler way to do this? I understand that, for example, a GROUP
BY statement will probably perform similar data shuffling, but what if I
have a simple SELECT followed by INSERT?

Thank you!
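For what it's worth, the reason keyBy helps here can be sketched with plain hash routing: if every record with the same PK ID is sent to the same parallel sink instance, no two sink instances ever write the same row concurrently, so their JDBC transactions can't deadlock on it. A minimal illustration of that routing principle (this is not Flink's actual key-group hashing, just the idea; the class and method names are made up):

```java
public class KeyRouting {
    // Route a primary-key value to one of `parallelism` sink subtasks,
    // the way a hash-based keyBy would: same key -> same subtask.
    static int subtaskFor(long pkId, int parallelism) {
        return Math.floorMod(Long.hashCode(pkId), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // Two updates to the same row (pkId = 42) always land on the same
        // subtask, so they are written sequentially, never concurrently.
        System.out.println(
            subtaskFor(42L, parallelism) == subtaskFor(42L, parallelism)); // true
    }
}
```

In the pipeline this is what the Table -> DataStream -> keyBy(pk) -> Table round trip achieves by hand.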