[jira] [Created] (FLINK-29655) Split Flink CRD from flink-kubernetes-operator module

2022-10-17 Thread Zhenqiu Huang (Jira)
Zhenqiu Huang created FLINK-29655:
-

 Summary: Split Flink CRD from flink-kubernetes-operator module
 Key: FLINK-29655
 URL: https://issues.apache.org/jira/browse/FLINK-29655
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.2.0
Reporter: Zhenqiu Huang


To launch a Flink job managed by the Flink operator, a backend service has to
depend on the Flink CRD classes in order to create a CR. In the current module
layout, this dependency pulls in all of the Flink libs, while all that is actually
needed are the model classes in the org.apache.flink.kubernetes.operator.crd package.

It would be more user friendly to split org.apache.flink.kubernetes.operator.crd into
a separate module, e.g. flink-kubernetes-crd.
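For illustration, roughly this is all a backend service should need once the model
classes live in their own module (class, package and field names below are taken from
operator 1.2.0 as far as I recall; treat this as a sketch, not as the final API):

{code:java}
import io.fabric8.kubernetes.api.model.ObjectMetaBuilder;

import org.apache.flink.kubernetes.operator.crd.FlinkDeployment;
import org.apache.flink.kubernetes.operator.crd.spec.FlinkDeploymentSpec;
import org.apache.flink.kubernetes.operator.crd.spec.FlinkVersion;

/** Sketch: a backend service building a FlinkDeployment CR from the CRD model classes only. */
public class FlinkDeploymentCrExample {

    public static FlinkDeployment buildCr() {
        FlinkDeployment deployment = new FlinkDeployment();
        deployment.setMetadata(
                new ObjectMetaBuilder()
                        .withName("example-job")
                        .withNamespace("default")
                        .build());

        // Only classes under org.apache.flink.kubernetes.operator.crd are needed here;
        // the rest of the operator (and Flink) libraries are not required to build the CR.
        FlinkDeploymentSpec spec = new FlinkDeploymentSpec();
        spec.setImage("flink:1.15");
        spec.setFlinkVersion(FlinkVersion.v1_15);
        deployment.setSpec(spec);
        return deployment;
    }
}
{code}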



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29656) Modify all join-related sql tests to make having statistics as the default choice

2022-10-17 Thread Yunhong Zheng (Jira)
Yunhong Zheng created FLINK-29656:
-

 Summary: Modify all join-related sql tests to make having 
statistics as the default choice
 Key: FLINK-29656
 URL: https://issues.apache.org/jira/browse/FLINK-29656
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Affects Versions: 1.17.0
Reporter: Yunhong Zheng
 Fix For: 1.17.0


Modify all join-related SQL tests so that having statistics is the default
choice. This issue is related to FLINK-29559 and aims to make the tests stable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29657) Hive SQL

2022-10-17 Thread Runkang He (Jira)
Runkang He created FLINK-29657:
--

 Summary: Hive SQL
 Key: FLINK-29657
 URL: https://issues.apache.org/jira/browse/FLINK-29657
 Project: Flink
  Issue Type: Sub-task
Reporter: Runkang He






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Changing the minimal supported version of Hadoop to 2.10.2

2022-10-17 Thread Márton Balassi
Hi Martijn,

+1 for 2.10.2. Do you expect to have bandwidth in the near term to
implement the bump?

On Wed, Oct 5, 2022 at 5:00 PM Gabor Somogyi 
wrote:

> Hi Martijn,
>
> Thanks for bringing this up! Lately I was thinking about bumping the Hadoop
> version to at least 2.6.1 to clean up issues like this:
>
> https://github.com/apache/flink/blob/8d05393f5bcc0a917b2dab3fe81a58acaccabf13/flink-filesystems/flink-hadoop-fs/src/main/java/org/apache/flink/runtime/util/HadoopUtils.java#L157-L159
>
> All in all +1 from my perspective.
>
> Just a question here. Are we stating the minimum Hadoop version for users
> somewhere in the docs, or do they need to find it out from source code like
> this?
>
> https://github.com/apache/flink/blob/3a4c11371e6f2aacd641d86c1d5b4fd86435f802/tools/azure-pipelines/build-apache-repo.yml#L113
>
> BR,
> G
>
>
> On Wed, Oct 5, 2022 at 5:02 AM Martijn Visser 
> wrote:
>
> > Hi everyone,
> >
> > A little over a year ago a discussion thread was opened on changing the
> > minimal supported version of Hadoop and bringing that to 2.8.5. [1] In this
> > discussion thread, I would like to propose bringing that minimal supported
> > version of Hadoop to 2.10.2.
> >
> > Hadoop 2.8.5 is vulnerable to multiple CVEs which are classified as
> > Critical. [2] [3] While Flink is not directly impacted by those, we do see
> > vulnerability scanners flag Flink as being vulnerable. We could easily
> > mitigate that by bumping the minimal supported version of Hadoop to 2.10.2.
> >
> > I'm looking forward to your opinions on this topic.
> >
> > Best regards,
> >
> > Martijn
> > https://twitter.com/MartijnVisser82
> > https://github.com/MartijnVisser
> >
> > [1] https://lists.apache.org/thread/81fhnwfxomjhyy59f9bbofk9rxpdxjo5
> > [2] https://nvd.nist.gov/vuln/detail/CVE-2022-25168
> > [3] https://nvd.nist.gov/vuln/detail/CVE-2022-26612
> >
>


Re: [VOTE] Drop Gelly

2022-10-17 Thread Yun Tang
+1 for dropping Gelly as a prerequisite for dropping the deprecated DataSet API.

Apart from asking contributors to join the Flink ML development, I'm not sure
whether the old gelly-streaming repo [1], which is already based on Flink's
DataStream API, also wants to be included in the migration guide.

cc @Vasia

[1] https://github.com/vasia/gelly-streaming

Best
Yun Tang


From: Sergey Nuyanzin 
Sent: Monday, October 17, 2022 14:19
To: dev@flink.apache.org ; Yun Gao 
Subject: Re: [VOTE] Drop Gelly

+1 (non-binding)

On Mon, Oct 17, 2022 at 5:50 AM Yun Gao 
wrote:

> Hi Yu,
> Many thanks for the suggestions, and I also strongly
> agree with that. Perhaps we could update the Gelly
> documentation pages to also convey this information.
> Best,
> Yun Gao
> --
> From:Yu Li 
> Send Time:2022 Oct. 16 (Sun.) 18:30
> To:dev 
> Subject:Re: [VOTE] Drop Gelly
> +1
> Thanks for the clarification Martijn and Yun. Hopefully we can document
> somewhere that there is a plan (maybe long term) to introduce a replacement
> based on the new iteration framework introduced in flink-ml (based on
> DataStream API) after dropping the old Gelly library, and let our users
> know that they could either use the old one, or join the community efforts
> to build up the new one.
> Personally I hope that we are not delivering a message that Flink won't
> support graph processing anymore, unless this is exactly what we mean.
> Best Regards,
> Yu
> On Thu, 13 Oct 2022 at 21:42, Maximilian Michels  wrote:
> > +1
> >
> > On Thu, Oct 13, 2022 at 12:00 PM Konstantin Knauf 
> > wrote:
> >
> > > +1
> > >
> > > On Thu, 13 Oct 2022 at 10:56, Niels Basjes <
> ni...@basjes.nl
> > >:
> > >
> > > > +1
> > > >
> > > > On Wed, Oct 12, 2022 at 11:00 PM Martijn Visser <
> > > martijnvis...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I would like to open a vote for dropping Gelly, which was
> discussed a
> > > > long
> > > > > time ago but never put to a vote [1].
> > > > >
> > > > > Voting will be open for at least 72 hours.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > > https://twitter.com/MartijnVisser82
> > > > > https://github.com/MartijnVisser
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/2m6wtgjvxcogbf9d5q7mqt4ofqjf2ojc
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards / Met vriendelijke groeten,
> > > >
> > > > Niels Basjes
> > > >
> > >
> > >
> > > --
> > > https://twitter.com/snntrable
> > > https://github.com/knaufk
> > >
> >
>


--
Best regards,
Sergey


Re: [VOTE] Reverting sink metric name changes made in 1.15

2022-10-17 Thread Yun Tang
+1 and thanks to Qingsheng for driving this.

BTW, shall we consider creating a ticket to ensure backward compatibility of
the metric names?


Best,
Yun Tang

From: Jing Ge 
Sent: Friday, October 14, 2022 16:55
To: dev@flink.apache.org 
Cc: Chesnay Schepler 
Subject: Re: [VOTE] Reverting sink metric name changes made in 1.15

+1

The voting title might cause confusion. Technically speaking, it is a further
modification on top of the current status rather than a revert, after
considering backward compatibility.

Best regards,
Jing

On Fri, Oct 14, 2022 at 10:41 AM Qingsheng Ren  wrote:

> Thanks for the reply Chesnay!
>
> I made a POC [1] just now, and I created a draft PR [2] so that it's easier
> for everyone to leave comments on it.
>
> [1] https://github.com/PatrickRen/flink/tree/FLINK-29567-POC
> [2] https://github.com/apache/flink/pull/21065
>
> Best,
> Qingsheng
>
>
> On Fri, Oct 14, 2022 at 1:56 AM Chesnay Schepler 
> wrote:
>
> > Do we have a PoC that achieves this without re-introducing the bug where
> > the numRecordsOut was simply wrong because it counted both records
> > written to the external system and the downstream committer?
> > It's gonna be quite the dirty hack I assume.
> >
> > On 13/10/2022 19:24, Qingsheng Ren wrote:
> > > Hi devs,
> > >
> > > I'd like to start a vote about reverting sink metric name changes made
> in
> > > 1.15 considering compatibility issues. These metrics include:
> > >
> > > - numRecordsSend -> numRecordsOut
> > > - numRecordsSendPerSecond -> numRecordsOutPerSecond
> > > - numBytesSend -> numBytesOut
> > > - numBytesSendPerSecond -> numBytesOutPerSecond
> > > - numRecordsSendError -> numRecordsOutError
> > >
> > > which reflect the output of the sink to the external system. "send"
> > metric
> > > series will be kept with the same value as "out" metric series. This
> > change
> > > will be applied to 1.15 and 1.16. More details could be found in the
> > > discussion thread [1].
> > >
> > > The vote will open for at least 72 hours.
> > >
> > > Looking forward to your feedback!
> > >
> > > [1] https://lists.apache.org/thread/vxhty3q97s7pw2zn0jhkyd6sxwwodzbv
> > >
> > > Best,
> > > Qingsheng
> > >
> >
> >
>


[jira] [Created] (FLINK-29658) LocalTimeTypeInfo support in DataStream API in PyFlink

2022-10-17 Thread Juntao Hu (Jira)
Juntao Hu created FLINK-29658:
-

 Summary: LocalTimeTypeInfo support in DataStream API in PyFlink
 Key: FLINK-29658
 URL: https://issues.apache.org/jira/browse/FLINK-29658
 Project: Flink
  Issue Type: New Feature
  Components: API / Python
Affects Versions: 1.15.2
Reporter: Juntao Hu
 Fix For: 1.15.3


LocalTimeTypeInfo is needed when calling `to_data_stream` on tables containing 
DATE/TIME/TIMESTAMP fields in PyFlink.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29659) Deduplicate SavepointWriter factory method code

2022-10-17 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-29659:


 Summary: Deduplicate SavepointWriter factory method code
 Key: FLINK-29659
 URL: https://issues.apache.org/jira/browse/FLINK-29659
 Project: Flink
  Issue Type: Technical Debt
  Components: API / State Processor
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.17.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29660) Show all attempts of subtasks in WebUI

2022-10-17 Thread LI Mingkun (Jira)
LI Mingkun created FLINK-29660:
--

 Summary: Show all attempts of subtasks in WebUI
 Key: FLINK-29660
 URL: https://issues.apache.org/jira/browse/FLINK-29660
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Web Frontend
Reporter: LI Mingkun


The Web UI only shows the subtask metrics and the TM log now.

For batch jobs, it's very important to track the metrics of failed attempts and
jump into the log stream of every attempt.

Feature needed: expandable rows showing all attempts of each subtask.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29661) DatabaseCalciteSchema$getTable() cannot get statistics for partition table

2022-10-17 Thread Yunhong Zheng (Jira)
Yunhong Zheng created FLINK-29661:
-

 Summary: DatabaseCalciteSchema$getTable() cannot get statistics 
for partition table
 Key: FLINK-29661
 URL: https://issues.apache.org/jira/browse/FLINK-29661
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Affects Versions: 1.17.0
Reporter: Yunhong Zheng
 Fix For: 1.17.0


DatabaseCalciteSchema$getTable() cannot get statistics for a partitioned table.
DatabaseCalciteSchema$extractTableStats() doesn't consider the situation where the
table is a partitioned table, whose stats need to be collected via
catalog.getPartitionStatistics() and catalog.getPartitionColumnStatistics().
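A rough sketch of what the partitioned-table branch could look like (simplified to
table-level stats only; the column statistics from catalog.getPartitionColumnStatistics()
would need a similar, but more involved, merge):

{code:java}
import org.apache.flink.table.catalog.Catalog;
import org.apache.flink.table.catalog.CatalogPartitionSpec;
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.catalog.stats.CatalogTableStatistics;

import java.util.List;

/** Hypothetical helper: sum per-partition statistics into table-level statistics. */
public class PartitionStatsMergeExample {

    public static CatalogTableStatistics mergePartitionStats(Catalog catalog, ObjectPath tablePath)
            throws Exception {
        long rowCount = 0;
        int fileCount = 0;
        long totalSize = 0;
        long rawDataSize = 0;
        List<CatalogPartitionSpec> partitions = catalog.listPartitions(tablePath);
        for (CatalogPartitionSpec partition : partitions) {
            // For partitioned tables, the table-level statistics are not enough;
            // collect and aggregate the statistics partition by partition instead.
            CatalogTableStatistics stats = catalog.getPartitionStatistics(tablePath, partition);
            rowCount += stats.getRowCount();
            fileCount += stats.getFileCount();
            totalSize += stats.getTotalSize();
            rawDataSize += stats.getRawDataSize();
        }
        return new CatalogTableStatistics(rowCount, fileCount, totalSize, rawDataSize);
    }
}
{code}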



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


dev@flink.apache.org

2022-10-17 Thread Chesnay Schepler

The vote has passed unanimously.

+1 Votes:
- Danny (binding)
- Martijn (binding)
- Ferenc (non-binding)
- Thomas (binding)
- Ryan (non-binding)
- Jing (non-binding)
- Matthias (binding)

I will now document this in the wiki and start working on the release 
scripts.


On 12/10/2022 15:12, Chesnay Schepler wrote:
Since the discussion 
(https://lists.apache.org/thread/mpzzlpob9ymkjfybm96vz2y2m5fjyvfo) has 
stalled a bit but we need a conclusion to move forward I'm opening a 
vote.


Proposal summary:

1) Branch model
1.1) The default branch is called "main" and used for the next major 
iteration.

1.2) Remaining branches are called "vmajor.minor". (e.g., v3.2)
1.3) Branches are not specific to a Flink version. (i.e., no v3.2-1.15)

2) Versioning
2.1) Source releases: major.minor.patch
2.2) Jar artifacts: major.minor.patch-flink-major.flink-minor
(This may imply releasing the exact same connector jar multiple times 
under different versions)


3) Flink compatibility
3.1) The Flink versions supported by the project (last 2 major Flink 
versions) must be supported.
3.2) How this is achieved is left to the connector, as long as it 
conforms to the rest of the proposal.


4) Support
4.1) The last 2 major connector releases are supported with only the 
latter receiving additional features, with the following exceptions:
4.1.a) If the older major connector version does not support any 
currently supported Flink version, then it is no longer supported.
4.1.b) If the last 2 major versions do not cover all supported Flink 
versions, then the latest connector version that supports the older 
Flink version additionally gets patch support.
4.2) For a given major connector version only the latest minor version 
is supported.

(This means if 1.1.x is released there will be no more 1.0.x release)


I'd like to clarify that these won't be set in stone for eternity.
We should re-evaluate how well this model works over time and adjust 
it accordingly, consistently across all connectors.
I do believe that, as is, this strikes a good balance between 
maintainability for us and clarity to users.



Voting schema:

Consensus, committers have binding votes, open for at least 72 hours.





[jira] [Created] (FLINK-29662) PojoSerializerSnapshot is using incorrect ExecutionConfig when restoring serializer

2022-10-17 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29662:
--

 Summary: PojoSerializerSnapshot is using incorrect ExecutionConfig 
when restoring serializer
 Key: FLINK-29662
 URL: https://issues.apache.org/jira/browse/FLINK-29662
 Project: Flink
  Issue Type: Bug
  Components: API / Type Serialization System
Affects Versions: 1.17.0
Reporter: Piotr Nowojski


{{org.apache.flink.api.java.typeutils.runtime.PojoSerializerSnapshot#restoreSerializer}}
is using a freshly created {{new ExecutionConfig()}} for the restored serializer,
which overrides any configuration choices made by the user. Most likely this is a
dormant bug, since the restored serializer shouldn't be used for serializing any new
data, and the {{ExecutionConfig}} is only used for serializing subclasses that
haven't been cached. If this is indeed the case, I would propose changing its value
to {{null}} and safeguarding accesses to that field with an {{IllegalStateException}}.
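A minimal sketch of the proposed safeguard (standalone illustration, not the actual
PojoSerializerSnapshot code):

{code:java}
import org.apache.flink.api.common.ExecutionConfig;

import javax.annotation.Nullable;

/** Sketch: keep the config of a restored serializer null and fail loudly on access. */
class RestoredSerializerConfigHolder {

    // Proposed: null for restored serializers instead of a fresh "new ExecutionConfig()".
    @Nullable private final ExecutionConfig executionConfig;

    RestoredSerializerConfigHolder(@Nullable ExecutionConfig executionConfig) {
        this.executionConfig = executionConfig;
    }

    ExecutionConfig getExecutionConfig() {
        if (executionConfig == null) {
            throw new IllegalStateException(
                    "The ExecutionConfig is not available for a restored serializer; "
                            + "restored serializers should not be used to serialize new data.");
        }
        return executionConfig;
    }
}
{code}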



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29663) Further improvements of adaptive batch scheduler

2022-10-17 Thread Lijie Wang (Jira)
Lijie Wang created FLINK-29663:
--

 Summary: Further improvements of adaptive batch scheduler
 Key: FLINK-29663
 URL: https://issues.apache.org/jira/browse/FLINK-29663
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Lijie Wang
Assignee: Lijie Wang
 Fix For: 1.17.0


In Flink 1.15, we introduced the adaptive batch scheduler to automatically
decide the parallelism of job vertices for batch jobs. In this issue, we will
further optimize it by changing the subpartition range division algorithm:
instead of dividing by the number of subpartitions (so that each subpartition
range contains roughly the same number of subpartitions), divide by the amount
of data (so that each subpartition range contains roughly the same amount of data).
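A minimal, self-contained sketch of the data-amount-based division described above
(not the actual scheduler code):

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Sketch: group consecutive subpartitions so each range holds roughly the same amount of data. */
public class SubpartitionRangeDivisionExample {

    /** Returns ranges as {startIndex, endIndexInclusive} pairs over the subpartition sizes. */
    public static List<int[]> divideByDataSize(long[] subpartitionBytes, int numRanges) {
        long totalBytes = 0;
        for (long bytes : subpartitionBytes) {
            totalBytes += bytes;
        }
        long targetBytesPerRange = Math.max(1, totalBytes / numRanges);

        List<int[]> ranges = new ArrayList<>();
        int rangeStart = 0;
        long accumulatedBytes = 0;
        for (int i = 0; i < subpartitionBytes.length; i++) {
            accumulatedBytes += subpartitionBytes[i];
            boolean isLastSubpartition = i == subpartitionBytes.length - 1;
            boolean isLastRange = ranges.size() == numRanges - 1;
            if ((accumulatedBytes >= targetBytesPerRange && !isLastRange) || isLastSubpartition) {
                ranges.add(new int[] {rangeStart, i});
                rangeStart = i + 1;
                accumulatedBytes = 0;
            }
        }
        return ranges;
    }
}
{code}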

More details can be found in
[https://docs.google.com/document/d/1Qyq3qDkBCUNupajVJpFTKp3fHQJtwIu9luM7T52k1Oo]

 
This is the umbrella ticket for the improvements.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29664) Collect subpartition sizes of BLOCKING result partitions

2022-10-17 Thread Lijie Wang (Jira)
Lijie Wang created FLINK-29664:
--

 Summary: Collect subpartition sizes of BLOCKING result partitions
 Key: FLINK-29664
 URL: https://issues.apache.org/jira/browse/FLINK-29664
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Lijie Wang
Assignee: Lijie Wang
 Fix For: 1.17.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29665) Support flexible subpartition range divisions

2022-10-17 Thread Lijie Wang (Jira)
Lijie Wang created FLINK-29665:
--

 Summary: Support flexible subpartition range divisions
 Key: FLINK-29665
 URL: https://issues.apache.org/jira/browse/FLINK-29665
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Lijie Wang
Assignee: Lijie Wang
 Fix For: 1.17.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29666) Let adaptive batch scheduler divide subpartition range according to amount of data

2022-10-17 Thread Lijie Wang (Jira)
Lijie Wang created FLINK-29666:
--

 Summary: Let adaptive batch scheduler divide subpartition range 
according to amount of data
 Key: FLINK-29666
 URL: https://issues.apache.org/jira/browse/FLINK-29666
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Lijie Wang
Assignee: Lijie Wang
 Fix For: 1.17.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29667) Upgrade DynamoDB Connector dependencies from SNAPSHOT to release 1.16

2022-10-17 Thread Daren Wong (Jira)
Daren Wong created FLINK-29667:
--

 Summary: Upgrade DynamoDB Connector dependencies from SNAPSHOT to 
release 1.16
 Key: FLINK-29667
 URL: https://issues.apache.org/jira/browse/FLINK-29667
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / DynamoDB
Affects Versions: dynamodb-1.0.0
Reporter: Daren Wong
 Fix For: dynamodb-1.0.0


The DynamoDB connector currently depends on flink-ci-tools and flink-test-utils for
the licenseCheck and PackagingTest respectively. The working versions of both are
only available as SNAPSHOT and not as a release.

This task is to upgrade the dependencies to the 1.16 release version when it becomes
available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


dev@flink.apache.org

2022-10-17 Thread Chesnay Schepler

https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development

On 17/10/2022 13:13, Chesnay Schepler wrote:

The vote has passed unanimously.

+1 Votes:
- Danny (binding)
- Martijn (binding)
- Ferenc (non-binding)
- Thomas (binding)
- Ryan (non-binding)
- Jing (non-binding)
- Matthias (binding)

I will now document this in the wiki and start working on the release 
scripts.


On 12/10/2022 15:12, Chesnay Schepler wrote:
Since the discussion 
(https://lists.apache.org/thread/mpzzlpob9ymkjfybm96vz2y2m5fjyvfo) 
has stalled a bit but we need a conclusion to move forward I'm 
opening a vote.


Proposal summary:

1) Branch model
1.1) The default branch is called "main" and used for the next major 
iteration.

1.2) Remaining branches are called "vmajor.minor". (e.g., v3.2)
1.3) Branches are not specific to a Flink version. (i.e., no v3.2-1.15)

2) Versioning
2.1) Source releases: major.minor.patch
2.2) Jar artifacts: major.minor.patch-flink-major.flink-minor
(This may imply releasing the exact same connector jar multiple times 
under different versions)


3) Flink compatibility
3.1) The Flink versions supported by the project (last 2 major Flink 
versions) must be supported.
3.2) How this is achieved is left to the connector, as long as it 
conforms to the rest of the proposal.


4) Support
4.1) The last 2 major connector releases are supported with only the 
latter receiving additional features, with the following exceptions:
4.1.a) If the older major connector version does not support any 
currently supported Flink version, then it is no longer supported.
4.1.b) If the last 2 major versions do not cover all supported Flink 
versions, then the latest connector version that supports the older 
Flink version additionally gets patch support.
4.2) For a given major connector version only the latest minor 
version is supported.

(This means if 1.1.x is released there will be no more 1.0.x release)


I'd like to clarify that these won't be set in stone for eternity.
We should re-evaluate how well this model works over time and adjust 
it accordingly, consistently across all connectors.
I do believe that, as is, this strikes a good balance between 
maintainability for us and clarity to users.



Voting schema:

Consensus, committers have binding votes, open for at least 72 hours.







[RESULT][VOTE] Drop Gelly

2022-10-17 Thread Martijn Visser
Hi everyone,

Happy to announce that the vote to drop Gelly has passed [1].

+1 Votes:
- Yun Gao (binding)
- Gyula Fóra (binding)
- Niels Basjes (binding)
- Konstantin Knauf (binding)
- Maximilian Michels (binding)
- Yu Li (binding)
- Sergey Nuyanzin (non-binding)
- Yun Tang (binding)

Best regards,

Martijn
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser

[1] https://lists.apache.org/thread/to1nfxbyzfrgn4son6p4mxro7ns7343c


[jira] [Created] (FLINK-29668) Remove Gelly

2022-10-17 Thread Martijn Visser (Jira)
Martijn Visser created FLINK-29668:
--

 Summary: Remove Gelly
 Key: FLINK-29668
 URL: https://issues.apache.org/jira/browse/FLINK-29668
 Project: Flink
  Issue Type: Technical Debt
  Components: Library / Graph Processing (Gelly)
Reporter: Martijn Visser
Assignee: Martijn Visser


- Remove all Gelly code from {{master}}
- Remove the Gelly documentation
- Update the feature radar to include a remark that, as a successor of Gelly, the
community wants to deliver a replacement based on the new iteration framework
from Flink ML



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[RESULT][VOTE] Remove HCatalog

2022-10-17 Thread Martijn Visser
Hi everyone,

Happy to announce that the vote to remove HCatalog has passed [1]

+1 votes:
- Jingsong Li (binding)
- Hang Ruan (non-binding)
- yuxia (non-binding)
- Gyula Fóra (binding)
- Leonard Xu (binding)
- Qingsheng Ren (binding)
- Sergey Nuyanzin (non-binding)
- Chesnay Schepler (binding)
- Samrat Deb (non-binding)
- Roc Marshal (non-binding)

Best regards,

Martijn
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser

[1] https://lists.apache.org/thread/w3jfgdk6lq846oh356qnwczydm9oszp9


[jira] [Created] (FLINK-29669) Remove HCatalog

2022-10-17 Thread Martijn Visser (Jira)
Martijn Visser created FLINK-29669:
--

 Summary: Remove HCatalog
 Key: FLINK-29669
 URL: https://issues.apache.org/jira/browse/FLINK-29669
 Project: Flink
  Issue Type: Technical Debt
  Components: Connectors / Common
Reporter: Martijn Visser
Assignee: Martijn Visser


Remove HCatalog from the codebase as voted on in 
https://lists.apache.org/thread/w3jfgdk6lq846oh356qnwczydm9oszp9



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[VOTE] FLIP-265 Deprecate and remove Scala API support

2022-10-17 Thread Martijn Visser
Hi everyone,

I'm hereby opening a vote for FLIP-265 Deprecate and remove Scala API
support. The related discussion can be found here [1].

Voting will be open for at least 72 hours.

Best regards,

Martijn
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser

[1] https://lists.apache.org/thread/d3borhdzj496nnggohq42fyb6zkwob3h


Re: [VOTE] FLIP-265 Deprecate and remove Scala API support

2022-10-17 Thread Márton Balassi
+1 (binding)

On Mon, Oct 17, 2022 at 3:39 PM Martijn Visser 
wrote:

> Hi everyone,
>
> I'm hereby opening a vote for FLIP-265 Deprecate and remove Scala API
> support. The related discussion can be found here [1].
>
> Voting will be open for at least 72 hours.
>
> Best regards,
>
> Martijn
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
>
> [1] https://lists.apache.org/thread/d3borhdzj496nnggohq42fyb6zkwob3h
>


Re: [DISCUSS] FLIP-263: Improve resolving schema compatibility

2022-10-17 Thread Dawid Wysakowicz

Hi Han,

I think in principle your proposal makes sense and the compatibility 
check indeed should be done in the opposite direction.


However, I have two suggestions:

1. Should we introduce the new method to the TypeSerializerSnapshot 
instead? E.g.


    TypeSerializerSnapshot {
        TypeSerializerSchemaCompatibility resolveSchemaCompatibility(
                TypeSerializerSnapshot oldSnapshot);
    }

I know it has the downside that we'd need to create a snapshot for the
new serializer during a restore, but there is so much tooling and so many
subclasses in the *Snapshot stack that it won't be possible to migrate
them to the TypeSerializer stack. I have classes such as
CompositeTypeSerializerSnapshot and the like in mind.


2. I'd actually be fine with breaking some external serializers and not
providing a default implementation of the new method. If we don't do that,
we will have two methods with default implementations which call each
other. It also makes it a bit harder to figure out which methods are
mandatory to implement going forward.


What do you think?

Best,

Dawid

On 17/10/2022 04:43, Hangxiang Yu wrote:

Hi, Han.

Both the old method and the new method can access the previous and new inner
information.

The new serializer will decide it just like the old serializer did before.

The method just specifies the schema compatibility result, so the other
behaviours stay the same as before.

On Fri, Oct 14, 2022 at 11:40 AM Han Yin  wrote:


Hi Hangxiang,

Thanks for the proposal. It seems more reasonable to let the new
serializer claim the compatibility in the cases you mentioned.

I have but one question here. What happens in the case of
“compatibleAfterMigration” after we completely reverse the direction (in
step 3)?  To be specific, migration from an old schema calls for the
previous serializer to read bytes into state objects. How should a new
serializer decide whether the migration is possible?

Best,
Han

On 2022/10/12 12:41:07 Hangxiang Yu wrote:

Dear Flink developers,

I would like to start a discussion thread on FLIP-263[1] proposing to
improve the usability of resolving schema compatibility.

Currently, the place for compatibility checks is
TypeSerializerSnapshot#resolveSchemaCompatibility, which belongs to the old
serializer. There is no way for users to specify the compatibility with the
old serializer in the new customized serializer.

The FLIP hopes to reverse the direction of resolving schema compatibility
to improve its usability.

[1]


https://cwiki.apache.org/confluence/display/FLINK/FLIP-263%3A+Improve+resolving+schema+compatibility







SplitEnumerator for Bigquery Source.

2022-10-17 Thread Lavkesh Lahngir
Hi everybody,
We are trying to implement a Google BigQuery source for Flink. We were
thinking of taking the time partition and column information as config. I was
thinking about how to parallelize the source and how to generate splits. I
read the code of the Hive source, where Hadoop file splits are generated
based on partitions. There is no way to access file-level information on BQ.
What would be a good way to generate splits for a BQ source?

Currently, most of our tables are partitioned daily. Assume the columns
and time range are taken as config.
Some ideas from me to generate splits:
1. Calculate the approximate number of rows and size and divide them equally.
This will require some way to add a marker for the division.
2. Create one split for each daily partition.
3. We can take the time partition granularity of minute/hour/day as config
and make buckets. For example: with hour granularity and 7 days of data, this
makes 7*24 splits. In a custom split class we can save the start and end
timestamps for the reader to execute.
4. Scan all the data into a distributed file system like HDFS or GCS, then
just use the file splitter.

I am thinking of going with approach number three: because the split
calculation is purely config based, it doesn't require reading any data,
unlike option four for example.
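
To make option three a bit more concrete, here is a rough sketch of what the
time-bucket splits could look like (class names are made up, just to illustrate
the idea):

import org.apache.flink.api.connector.source.SourceSplit;

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical split: one time bucket of the configured range, e.g. one hour. */
class BigQueryTimeRangeSplit implements SourceSplit {
    final Instant start; // inclusive
    final Instant end;   // exclusive

    BigQueryTimeRangeSplit(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public String splitId() {
        return start + "_" + end;
    }

    /** E.g. 7 days of data with hour granularity -> 7 * 24 splits. */
    static List<BigQueryTimeRangeSplit> createSplits(
            Instant rangeStart, Instant rangeEnd, Duration bucket) {
        List<BigQueryTimeRangeSplit> splits = new ArrayList<>();
        for (Instant t = rangeStart; t.isBefore(rangeEnd); t = t.plus(bucket)) {
            Instant bucketEnd = t.plus(bucket).isBefore(rangeEnd) ? t.plus(bucket) : rangeEnd;
            splits.add(new BigQueryTimeRangeSplit(t, bucketEnd));
        }
        return splits;
    }
}

The enumerator would assign these splits to readers, and each reader would issue
one BigQuery query per split for its [start, end) window.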

Any suggestions are welcome.

Thank you!
~lav


[jira] [Created] (FLINK-29670) Add CRD compatibility check against 1.2.0

2022-10-17 Thread Gyula Fora (Jira)
Gyula Fora created FLINK-29670:
--

 Summary: Add CRD compatibility check against 1.2.0
 Key: FLINK-29670
 URL: https://issues.apache.org/jira/browse/FLINK-29670
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.3.0
Reporter: Gyula Fora
 Fix For: kubernetes-operator-1.3.0


We must extend the compatibility check in the flink-kubernetes-operator module
to also run against 1.2.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29671) Kubernetes e2e test fails during test initialization

2022-10-17 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-29671:
-

 Summary: Kubernetes e2e test fails during test initialization
 Key: FLINK-29671
 URL: https://issues.apache.org/jira/browse/FLINK-29671
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes, Test Infrastructure
Affects Versions: 1.15.2, 1.16.0
Reporter: Matthias Pohl
 Attachments: kubernetes_test_failure.log

There are two build failures ([branch based on 
release-1.16|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=42038&view=logs&j=bea52777-eaf8-5663-8482-18fbc3630e81&t=b2642e3a-5b86-574d-4c8a-f7e2842bfb14&l=5377]
 and [a release-1.15 
build|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=42073&view=logs&j=bea52777-eaf8-5663-8482-18fbc3630e81&t=b2642e3a-5b86-574d-4c8a-f7e2842bfb14&l=4684])
 that are (not exclusively) caused by the e2e test {{Kubernetes test}} failing 
in the initialization phase (i.e. when installing all artifacts):

See the attached logs from the {{release-1.15}} build's CI output. No logs are 
present as artifacts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [Discuss]- Donate Iceberg Flink Connector

2022-10-17 Thread Péter Váry
Hi all,

I think the main advantage of having the Flink-Iceberg connector in its own
repo is that we could release it independently of both the Iceberg and
Flink releases.
Previously having the Flink connector in the Iceberg repo made sense, since
the Iceberg project was rapidly moving forward which often required changes
in the Flink/Spark connectors.
Since Iceberg 1.0 is on the way, I think we can expect a more stable
Iceberg now, and we can start building on the released versions.

I can volunteer as one of the initial maintainers of the Flink/Iceberg
connector (there is still plenty I need to learn about Flink, but I have
valuable experience with Iceberg).

Thanks,
Peter

Jark Wu  wrote (on Fri, 14 Oct 2022, 5:37):

> Thank Abid for the discussion,
>
> I'm also fine with maintaining it under the Flink project.
> But I'm also interested in the response to Martijn's question.
>
> Besides, once the code is moved to the Flink project, are there any initial
> maintainers for the connector we can find?
> In addition, do we still maintain documentation under Iceberg
> https://iceberg.apache.org/docs/latest/flink/ ?
>
> Best,
> Jark
>
>
> On Thu, 13 Oct 2022 at 17:52, yuxia  wrote:
>
> > +1. Thanks for driving it. Hope I can find some chances to take part in
> > the future development of Iceberg Flink Connector.
> >
> > Best regards,
> > Yuxia
> >
> > - Original Message -
> > From: "Zheng Yu Chen" 
> > To: "dev" 
> > Sent: Thursday, October 13, 2022, 11:26:29 AM
> > Subject: Re: [Discuss]- Donate Iceberg Flink Connector
> >
> > +1, thanks to drive it
> >
> > Abid Mohammed  wrote on Monday, October 10, 2022 at 09:22:
> >
> > > Hi,
> > >
> > > I would like to start a discussion about contributing Iceberg Flink
> > > Connector to Flink.
> > >
> > > I created a doc <
> > >
> >
> https://docs.google.com/document/d/1WC8xkPiVdwtsKL2VSPAUgzm9EjrPs8ZRjEtcwv93ISI/edit?usp=sharing
> > >
> > > with all the details following the Flink Connector template as I don’t
> > have
> > > permissions to create a FLIP yet.
> > > High level details are captured below:
> > >
> > > Motivation:
> > >
> > > This FLIP aims to contribute the existing Apache Iceberg Flink
> Connector
> > > to Flink.
> > >
> > > Apache Iceberg is an open table format for huge analytic datasets.
> > Iceberg
> > > adds tables to compute engines including Spark, Trino, PrestoDB, Flink,
> > > Hive and Impala using a high-performance table format that works just
> > like
> > > a SQL table.
> > > Iceberg avoids unpleasant surprises. Schema evolution works and won’t
> > > inadvertently un-delete data. Users don’t need to know about
> partitioning
> > > to get fast queries. Iceberg was designed to solve correctness problems
> > in
> > > eventually-consistent cloud object stores.
> > >
> > > Iceberg supports both Flink’s DataStream API and Table API. Based on
> the
> > > guideline of the Flink community, only the latest 2 minor versions are
> > > actively maintained. See the Multi-Engine Support#apache-flink for
> > further
> > > details.
> > >
> > >
> > > Iceberg connector supports:
> > >
> > > • Source: detailed Source design <
> > >
> >
> https://docs.google.com/document/d/1q6xaBxUPFwYsW9aXWxYUh7die6O7rDeAPFQcTAMQ0GM/edit#
> > >,
> > > based on FLIP-27
> > > • Sink: detailed Sink design and interfaces used <
> > >
> >
> https://docs.google.com/document/d/1O-dPaFct59wUWQECXEEYIkl9_MOoG3zTbC2V-fZRwrg/edit#
> > > >
> > > • Usable in both DataStream and Table API/SQL
> > > • DataStream read/append/overwrite
> > > • SQL create/alter/drop table, select, insert into, insert
> > > overwrite
> > > • Streaming or batch read in Java API
> > > • Support for Flink’s Python API
> > >
> > > See Iceberg Flink <https://iceberg.apache.org/docs/latest/flink/#flink> for
> > > detailed usage instructions.
> > >
> > > Looking forward to the discussion!
> > >
> > > Thanks
> > > Abid
> >
>


RE: Re: [Discuss]- Donate Iceberg Flink Connector

2022-10-17 Thread abmo . work
Hi Martijn,

Yes, it is considered a connector in Flink terms.

We wanted to join the Flink connector externalization effort so that we can
bring the Iceberg connector closer to the Flink community. We are hoping any
issues with the APIs for the Iceberg connector will surface sooner and get more
attention from the Flink community when the connector is under the Flink umbrella
rather than in the Iceberg repo. We also hope to get better feedback from Flink
experts on what belongs in a connector vs. in Flink itself.

Thanks everyone for all your responses! Looking forward to the next steps. 

Thanks
Abid

On 2022/10/14 03:37:09 Jark Wu wrote:
> Thank Abid for the discussion,
> 
> I'm also fine with maintaining it under the Flink project.
> But I'm also interested in the response to Martijn's question.
> 
> Besides, once the code is moved to the Flink project, are there any initial
> maintainers for the connector we can find?
> In addition, do we still maintain documentation under Iceberg
> https://iceberg.apache.org/docs/latest/flink/ ?
> 
> Best,
> Jark
> 
> 
> On Thu, 13 Oct 2022 at 17:52, yuxia  wrote:
> 
> > +1. Thanks for driving it. Hope I can find some chances to take part in
> > the future development of Iceberg Flink Connector.
> >
> > Best regards,
> > Yuxia
> >
> > - Original Message -
> > From: "Zheng Yu Chen" 
> > To: "dev" 
> > Sent: Thursday, October 13, 2022, 11:26:29 AM
> > Subject: Re: [Discuss]- Donate Iceberg Flink Connector
> >
> > +1, thanks to drive it
> >
> > Abid Mohammed  wrote on Monday, October 10, 2022 at 09:22:
> >
> > > Hi,
> > >
> > > I would like to start a discussion about contributing Iceberg Flink
> > > Connector to Flink.
> > >
> > > I created a doc <
> > >
> > https://docs.google.com/document/d/1WC8xkPiVdwtsKL2VSPAUgzm9EjrPs8ZRjEtcwv93ISI/edit?usp=sharing
> > >
> > > with all the details following the Flink Connector template as I don’t
> > have
> > > permissions to create a FLIP yet.
> > > High level details are captured below:
> > >
> > > Motivation:
> > >
> > > This FLIP aims to contribute the existing Apache Iceberg Flink Connector
> > > to Flink.
> > >
> > > Apache Iceberg is an open table format for huge analytic datasets.
> > Iceberg
> > > adds tables to compute engines including Spark, Trino, PrestoDB, Flink,
> > > Hive and Impala using a high-performance table format that works just
> > like
> > > a SQL table.
> > > Iceberg avoids unpleasant surprises. Schema evolution works and won’t
> > > inadvertently un-delete data. Users don’t need to know about partitioning
> > > to get fast queries. Iceberg was designed to solve correctness problems
> > in
> > > eventually-consistent cloud object stores.
> > >
> > > Iceberg supports both Flink’s DataStream API and Table API. Based on the
> > > guideline of the Flink community, only the latest 2 minor versions are
> > > actively maintained. See the Multi-Engine Support#apache-flink for
> > further
> > > details.
> > >
> > >
> > > Iceberg connector supports:
> > >
> > > • Source: detailed Source design <
> > >
> > https://docs.google.com/document/d/1q6xaBxUPFwYsW9aXWxYUh7die6O7rDeAPFQcTAMQ0GM/edit#
> > >,
> > > based on FLIP-27
> > > • Sink: detailed Sink design and interfaces used <
> > >
> > https://docs.google.com/document/d/1O-dPaFct59wUWQECXEEYIkl9_MOoG3zTbC2V-fZRwrg/edit#
> > > >
> > > • Usable in both DataStream and Table API/SQL
> > > • DataStream read/append/overwrite
> > > • SQL create/alter/drop table, select, insert into, insert
> > > overwrite
> > > • Streaming or batch read in Java API
> > > • Support for Flink’s Python API
> > >
> > > See Iceberg Flink <https://iceberg.apache.org/docs/latest/flink/#flink> for
> > > detailed usage instructions.
> > >
> > > Looking forward to the discussion!
> > >
> > > Thanks
> > > Abid
> >
> 

Re: Re: [Discuss]- Donate Iceberg Flink Connector

2022-10-17 Thread Steven Wu
I was one of the maintainers for the Flink Iceberg connector in Iceberg
repo. I can volunteer as one of the initial maintainers if we decide to
move forward.

On Mon, Oct 17, 2022 at 3:26 PM  wrote:

> Hi Martijn,
>
> Yes, It is considered a connector in Flink terms.
>
> We wanted to join the Flink connector externalization effort so that we
> can bring the Iceberg connector closer to the Flink community. We are
> hoping any issues with the APIs for Iceberg connector will surface sooner
> and get more attention from the Flink community when the connector is
> within Flink umbrella rather than in Iceberg repo. Also to get better
> feedback from Flink experts when it comes to things related to adding
> things in a connector vs Flink itself.
>
> Thanks everyone for all your responses! Looking forward to the next steps.
>
> Thanks
> Abid
>
> On 2022/10/14 03:37:09 Jark Wu wrote:
> > Thank Abid for the discussion,
> >
> > I'm also fine with maintaining it under the Flink project.
> > But I'm also interested in the response to Martijn's question.
> >
> > Besides, once the code is moved to the Flink project, are there any
> initial
> > maintainers for the connector we can find?
> > In addition, do we still maintain documentation under Iceberg
> > https://iceberg.apache.org/docs/latest/flink/ ?
> >
> > Best,
> > Jark
> >
> >
> > On Thu, 13 Oct 2022 at 17:52, yuxia  wrote:
> >
> > > +1. Thanks for driving it. Hope I can find some chances to take part in
> > > the future development of Iceberg Flink Connector.
> > >
> > > Best regards,
> > > Yuxia
> > >
> > > - Original Message -
> > > From: "Zheng Yu Chen" 
> > > To: "dev" 
> > > Sent: Thursday, October 13, 2022, 11:26:29 AM
> > > Subject: Re: [Discuss]- Donate Iceberg Flink Connector
> > >
> > > +1, thanks to drive it
> > >
> > > Abid Mohammed  wrote on Monday, October 10, 2022 at 09:22:
> > >
> > > > Hi,
> > > >
> > > > I would like to start a discussion about contributing Iceberg Flink
> > > > Connector to Flink.
> > > >
> > > > I created a doc <
> > > >
> > >
> https://docs.google.com/document/d/1WC8xkPiVdwtsKL2VSPAUgzm9EjrPs8ZRjEtcwv93ISI/edit?usp=sharing
> > > >
> > > > with all the details following the Flink Connector template as I
> don’t
> > > have
> > > > permissions to create a FLIP yet.
> > > > High level details are captured below:
> > > >
> > > > Motivation:
> > > >
> > > > This FLIP aims to contribute the existing Apache Iceberg Flink
> Connector
> > > > to Flink.
> > > >
> > > > Apache Iceberg is an open table format for huge analytic datasets.
> > > Iceberg
> > > > adds tables to compute engines including Spark, Trino, PrestoDB,
> Flink,
> > > > Hive and Impala using a high-performance table format that works just
> > > like
> > > > a SQL table.
> > > > Iceberg avoids unpleasant surprises. Schema evolution works and won’t
> > > > inadvertently un-delete data. Users don’t need to know about
> partitioning
> > > > to get fast queries. Iceberg was designed to solve correctness
> problems
> > > in
> > > > eventually-consistent cloud object stores.
> > > >
> > > > Iceberg supports both Flink’s DataStream API and Table API. Based on
> the
> > > > guideline of the Flink community, only the latest 2 minor versions
> are
> > > > actively maintained. See the Multi-Engine Support#apache-flink for
> > > further
> > > > details.
> > > >
> > > >
> > > > Iceberg connector supports:
> > > >
> > > > • Source: detailed Source design <
> > > >
> > >
> https://docs.google.com/document/d/1q6xaBxUPFwYsW9aXWxYUh7die6O7rDeAPFQcTAMQ0GM/edit#
> > > >,
> > > > based on FLIP-27
> > > > • Sink: detailed Sink design and interfaces used <
> > > >
> > >
> https://docs.google.com/document/d/1O-dPaFct59wUWQECXEEYIkl9_MOoG3zTbC2V-fZRwrg/edit#
> > > > >
> > > > • Usable in both DataStream and Table API/SQL
> > > > • DataStream read/append/overwrite
> > > > • SQL create/alter/drop table, select, insert into, insert
> > > > overwrite
> > > > • Streaming or batch read in Java API
> > > > • Support for Flink’s Python API
> > > >
> > > > See Iceberg Flink <https://iceberg.apache.org/docs/latest/flink/#flink> for
> > > > detailed usage instructions.
> > > >
> > > > Looking forward to the discussion!
> > > >
> > > > Thanks
> > > > Abid
> > >
> >


[jira] [Created] (FLINK-29672) Support oracle catalog

2022-10-17 Thread waywtdcc (Jira)
waywtdcc created FLINK-29672:


 Summary: Support oracle catalog 
 Key: FLINK-29672
 URL: https://issues.apache.org/jira/browse/FLINK-29672
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / API
Affects Versions: 1.16.0
Reporter: waywtdcc
 Fix For: 1.17.0


Support oracle catalog 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29673) Support sqlserver catalog

2022-10-17 Thread waywtdcc (Jira)
waywtdcc created FLINK-29673:


 Summary: Support sqlserver catalog
 Key: FLINK-29673
 URL: https://issues.apache.org/jira/browse/FLINK-29673
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / API
Affects Versions: 1.16.0
Reporter: waywtdcc
 Fix For: 1.17.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: SplitEnumerator for Bigquery Source.

2022-10-17 Thread yuxia
I'm familiar with the Hive source but don't have much knowledge about BigQuery.
But from my side, approach number three sounds more reasonable.

Option 1 sounds a little complex and may be time-consuming during split
generation.
Option 2 seems not flexible and too coarse-grained.
Option 4 needs extra effort to write the data again.

Best regards,
Yuxia

- Original Message -
From: "Lavkesh Lahngir" 
To: "dev" 
Sent: Monday, October 17, 2022, 10:42:29 PM
Subject: SplitEnumerator for Bigquery Source.

Hi everybody,
We are trying to implement a Google BigQuery source for Flink. We were
thinking of taking the time partition and column information as config. I was
thinking about how to parallelize the source and how to generate splits. I
read the code of the Hive source, where Hadoop file splits are generated
based on partitions. There is no way to access file-level information on BQ.
What would be a good way to generate splits for a BQ source?

Currently, most of our tables are partitioned daily. Assume the columns
and time range are taken as config.
Some ideas from me to generate splits:
1. Calculate the approximate number of rows and size and divide them equally.
This will require some way to add a marker for the division.
2. Create one split for each daily partition.
3. We can take the time partition granularity of minute/hour/day as config
and make buckets. For example: with hour granularity and 7 days of data, this
makes 7*24 splits. In a custom split class we can save the start and end
timestamps for the reader to execute.
4. Scan all the data into a distributed file system like HDFS or GCS, then
just use the file splitter.

I am thinking of going with approach number three: because the split
calculation is purely config based, it doesn't require reading any data,
unlike option four for example.

Any suggestions are welcome.

Thank you!
~lav


[jira] [Created] (FLINK-29674) The Apache Kafka connector's "setBounded" does not take effect

2022-10-17 Thread jiangwei (Jira)
jiangwei created FLINK-29674:


 Summary: The Apache Kafka connector's "setBounded" does not take effect
 Key: FLINK-29674
 URL: https://issues.apache.org/jira/browse/FLINK-29674
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Reporter: jiangwei


When using the official Apache Kafka connector, I saw that it provides a method
(setBounded) to set a consumption boundary for Kafka. When my job runs normally
(without failures), the boundary I set works: the job finishes once it has consumed
up to the boundary. However, when my job fails and I restore it from the checkpoint
taken at the time of failure, the job can no longer finish and stays in RUNNING,
even though the logs show that the data has already been consumed up to the boundary
I set. I'm not sure whether my usage is wrong; part of my code is below:

 
{code:java}
// code placeholder
String topicName = "jw-test-kafka-w-offset-002";
Map<TopicPartition, Long> offsets = new HashMap<>();
offsets.put(new TopicPartition(topicName, 0), 6L);

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("xxx:9092")
        .setProperties(properties)
        .setTopics(topicName)
        .setGroupId("my-group")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .setBounded(OffsetsInitializer.offsets(offsets))
        .build(); {code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Drop Gelly

2022-10-17 Thread Dong Lin
+1

On Thu, Oct 13, 2022 at 4:59 AM Martijn Visser 
wrote:

> Hi everyone,
>
> I would like to open a vote for dropping Gelly, which was discussed a long
> time ago but never put to a vote [1].
>
> Voting will be open for at least 72 hours.
>
> Best regards,
>
> Martijn
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
>
> [1] https://lists.apache.org/thread/2m6wtgjvxcogbf9d5q7mqt4ofqjf2ojc
>


Bigquery source/connector on Flink

2022-10-17 Thread Lavkesh Lahngir
Hi,
We are trying to implement a BigQuery source for Flink. I see that there is an
existing JIRA  but there
is no progress on it. I see there is a PoC by Mat. We are also thinking of
using the DynamicTable interfaces for the implementation. We can use this mailing
thread to discuss ideas.

I had a few questions:
1. For ScanRuntimeProvider, should I implement an InputFormat like in the PoC,
or is the current recommendation to implement the Source interface itself, as in
the Kafka source or the file source? (See the sketch after this list.)
2. BigQuery has similar functionality to Hive, like partitioning based on a
timestamp, but file-system-level information is not available; we only have
BigQuery clients to read data. HiveSource implements AbstractFileSource. Would it
make sense to implement FileSource, or to create a new source and write it from
scratch, given that there is no filesystem information available on the BQ side?
3. How should we create splits in the source? I already asked about splits in
another email; any other suggestions are welcome. I guess it might be a little
different for bounded and unbounded.
For a bounded source:
We can take the time partition granularity of minute/hour/day as config and
make buckets. For example: with hour granularity and 7 days of data, this makes
7*24 splits. In a custom split class we can save the start and end timestamps
for the reader to execute.
4. What properties should the source take? Currently I am thinking of columns
and a time range. Maybe we can implement predicate pushdown too, at least
for filter and projection.
5. Should we work towards merging the BigQuery connector into the main repo?
What challenges do we see?
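
Regarding question 1, here is a rough sketch of what the Table API integration
could look like if we wrap a FLIP-27 style source rather than an InputFormat
(the BigQuery source itself is hypothetical and passed in from outside):

import org.apache.flink.api.connector.source.Source;
import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.connector.source.ScanTableSource;
import org.apache.flink.table.connector.source.SourceProvider;
import org.apache.flink.table.data.RowData;

/** Sketch: expose a FLIP-27 style source through the DynamicTable stack. */
class BigQueryDynamicTableSource implements ScanTableSource {

    // A hypothetical FLIP-27 BigQuery source; not implemented here.
    private final Source<RowData, ?, ?> source;

    BigQueryDynamicTableSource(Source<RowData, ?, ?> source) {
        this.source = source;
    }

    @Override
    public ChangelogMode getChangelogMode() {
        return ChangelogMode.insertOnly();
    }

    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {
        // New connectors are generally encouraged to use the Source interface
        // instead of InputFormat; SourceProvider bridges it into the Table API.
        return SourceProvider.of(source);
    }

    @Override
    public DynamicTableSource copy() {
        return new BigQueryDynamicTableSource(source);
    }

    @Override
    public String asSummaryString() {
        return "BigQuery table source (sketch)";
    }
}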

We will try to add more questions on this thread. Feel free to reply to any
of these. We can shift the conversation to jira too, if that will help :)

Thanks
~Lav