Re: [DISCUSS] Deprecate SourceFunction APIs

2023-07-04 Thread Flavio Pompermaier
Hi all,
I've tried to migrate my very simple Elasticsearch SourceFunction (which
uses the scroll API and produces batches of documents) to the new Source API,
but I gave up because it's too complicated. It should be much simpler to
migrate that function to a bounded or unbounded source.
Before completely removing SourceFunction and DataSet I think it would be
better to provide a more detailed migration guide.
At least simplify the creation of a bounded DataSet... I still haven't taken
a look at DataGeneratorSource though.
A review of the current online documentation is mandatory IMO.

Best,
Flavio


On Mon, Jul 3, 2023 at 5:58 PM Alexander Fedulov <
alexander.fedu...@gmail.com> wrote:

> I am happy to announce that the blocker has been resolved and
> SourceFunction
> is now marked as @Deprecated [1].
>
> The work continues to remove the dependencies on the SourceFunction API in
> Flink internals in order to prepare for dropping it completely in Flink
> 2.0.
>
> I'd like to get some opinions on an open question I currently have:
> StreamExecutionEnvironment#fromCollection() methods need to be modified to
> use
> the new FLIP-27 DataGeneratorSource [2]. This presents an issue because
> ITCases in DataGeneratorSource rely on StreamExecutionEnvironment, so we
> end
> up with a circular dependency.
>
> I see two main options here:
>
> 1. Split the tests from the DataGeneratorSource into a separate module
> called
>flink-connector-datagen-tests
>This is a rather straightforward solution that breaks the cycle, but so
> far
>we managed to avoid such workarounds and I'd like to know if anyone has
> a
>strong opinion against it
>
> 2. Move #fromCollection() methods into flink-connector-datagen, so
>StreamExecutionEnvironment#fromCollection() becomes
>DataGeneratorSource#fromCollection()
>While this deviates from the familiar pattern, it should be acceptable
> given
>the major version change. The key question here is whether we should also
>introduce a dependency from flink-connector-datagen to
> flink-streaming-java.
>This dependency does not exist in other connectors, but it would enhance
>usability. Without it, the user code would look somewhat like
>this:
>
>Collection data = ...;
>DataGeneratorSource collectionSource =
>DataGeneratorSource.fromCollection(data);
>DataStreamSource source = env.fromSource(collectionSource,
>WatermarkStrategy.forMonotonousTimestamps(), "Collection source")
>.forceNonParallel();
>
>Especially the necessity for the forceNonParallel()/setParallelism(1)
> call is
>   concerning because it is easy to forget.
>
>With the dependency, we can hide the internal details and achieve an API
>   closer to the current #fromCollection() implementation:
>
>Collection data = ...;
>DataStreamSource source =
>DataGeneratorSource.fromCollection(env, data);
>
> I would appreciate hearing your thoughts and suggestions on this matter.
>
> [1] https://github.com/apache/flink/pull/20049
> [2] https://github.com/apache/flink/pull/22850
>
> Best,
> Alex
>
>
>
>
> On Wed, 21 Jun 2023 at 19:27, Alexander Fedulov <
> alexander.fedu...@gmail.com>
> wrote:
>
> > I'd like to revive the efforts to deprecate the SourceFunction API.
> >
> > It would be great to get a review for this PR:
> > https://github.com/apache/flink/pull/21774
> >
> > It immediately unblocks marking the actual SourceFunction as deprecated.
> > https://github.com/apache/flink/pull/20049
> >
> > There is also this work thread related
> > to StreamExecutionEnvironment#fromCollection() methods.
> > The discussion seems to have stalled:
> > https://github.com/apache/flink/pull/21028
> >
> > Thanks,
> > Alex
> >
> > On 2022/06/15 19:30:31 Alexander Fedulov wrote:
> > > Thank you all for your valuable input and participation in the
> discussion
> > >
> > > The vote is open now [1]
> > >
> > > [1] https://lists.apache.org/thread/kv9rj3w2rmkb8jtss5bqffhw57or7v8v
> > >
> > > Best,
> > > Alexander Fedulov
> >
> >


Re: Change of focus

2022-02-28 Thread Flavio Pompermaier
Good luck for your new adventure Till!

On Mon, Feb 28, 2022 at 12:00 PM Till Rohrmann  wrote:

> Hi everyone,
>
> I wanted to let you know that I will be less active in the community
> because I’ve decided to start a new chapter in my life. Hence, please don’t
> wonder if I might no longer be very responsive on mails and JIRA issues.
>
> It is great being part of such a great community with so many amazing
> people. Over the past 7,5 years, I’ve learned a lot thanks to you and
> together we have shaped how people think about stream processing nowadays.
> This is something we can be very proud of. I am sure that the community
> will continue innovating and setting the pace for what is possible with
> real time processing. I wish you all godspeed!
>
> Cheers,
> Till
>


Re: [DISCUSS] Update Policy for old releases

2021-11-16 Thread Flavio Pompermaier
Hi to all,
I'd like to point out that the documentation of official downstream projects,
such as the CDC connectors, should also be updated to reflect compatibility
with new Flink releases [1].
The aforementioned link, for example, says nothing about which version
is compatible with Flink 1.14.

[1]
https://ververica.github.io/flink-cdc-connectors/master/content/about.html#deserialization

Best,
Flavio

On Mon, Nov 15, 2021 at 8:16 PM Thomas Weise  wrote:

> Nice to see this discussion!
>
> If we make Flink upgrades easy, why would users not want to upgrade? I
> think most hesitation today stems from the fact that X.Y.0 releases
> tend to break downstream in one way or the other due to unexpected
> upstream changes. If so, it would be nice if we could address that
> problem. For sure, if there was a choice between expending finite
> resources on supporting more releases vs. reducing compatibility
> issues in the next release, I would vote for the latter.
>
> There is another reason why some users are stuck on old releases that
> cannot be solved by the community: Heavily customized forks of Flink
> are hard to maintain and naturally defer upgrade as much as possible.
> I believe the solution to that is in user land (and not a community
> burden): contribute changes back to Flink, participate in the
> community to ensure robust extension hooks are provided etc.
>
> Cheers,
> Thomas
>
> On Mon, Nov 15, 2021 at 3:03 AM Till Rohrmann 
> wrote:
> >
> > Hi everyone,
> >
> > I want to second Stephan's points. I think that supporting LTS versions
> > rather fights symptoms than addresses the underlying problem, which is
> > incompatible/behaviour-changing changes between versions. I believe if we
> > can address this problem, then our Flink users will probably upgrade more
> > frequently.
> >
> > Making upgrades as smooth as possible probably won't help all our users,
> but
> > we as a community also have to think about the costs of maintaining more
> > Flink versions/supporting certain versions for a longer period of time.
> One
> > thing is that we would need more complex processes that ensure that older
> > Flink versions/LTS receives all the required bug fixes. I think this is
> > already hard enough when only supporting the last two releases. Moreover,
> > according to my experience, backporting fixes is usually not a problem if
> > there are no merge conflicts. However, the older the Flink code is, the
> > more likely it is that there are merge conflicts. In the worst case,
> issues
> > need to be fixed in a completely different way than they are solved in
> > master.
> >
> > Additionally, we have to consider that longer support periods will mean
> > that we need to maintain our CI infrastructure and tooling the same
> period
> > as well. To give you an example, we probably would still have to operate
> > Travis if we had a LTS version. Given our existing problems with CI this
> > would add a lot more problems to this heap. Moreover, supporting more
> > versions would add more load to our CI infrastructure.
> >
> > All in all, I would expect that longer support periods would slow us down
> > with other things. One of these things could be to make upgrades as
> smooth
> > as possible so that more Flink users upgrade more frequently.
> >
> > @Tison: Concerning the problem that an older version contains a bug fix
> > that is not contained yet in the latest release of a newer version, I
> think
> > this is a valid problem. Right now, our users will have to check via JIRA
> > whether their problems are solved in the latest version of the new
> release
> > (e.g. when upgrading from 1.13.y to 1.14.x). Faster and more frequent
> > releases could mitigate the problem. Faster and lower overhead releases
> > could also allow us to release all versions in sync.
> >
> > Cheers,
> > Till
> >
> > On Sat, Nov 13, 2021 at 1:40 AM tison  wrote:
> >
> > > Hi Stephan,
> > >
> > > Thanks for your email! I agree your points about compatibility and the
> > > upgrade experience should be smooth.
> > >
> > > The problem raised by Piotr is that, even if we officially support the two
> > > latest versions, many users stay on an early, end-of-support version.
> > > So the downside "no one ends up using the other versions" doesn't
> > > stand. I see that a number of companies like to test the latest version
> > > but not apply it in prod if it's not robust - 1.9 & 1.10 are less used.
> > >
> > > If a non-LTS version resolves users' (urgent) issues and the release
> itself
> > > is robust, they will upgrade to that version. We have tried to support
> > > Java 9 and so do other projects.
> > >
> > > I'm a fan of keeping the two latest supported versions and am willing to work
> on
> > > improving compatibility to help users migrate. But if I make a choice
> > > between
> > > 4 supported versions and the LTS option, I like the LTS option.
> > >
> > > Here are several issues I met when trying to support a series of
> versions:
> > >
> > > 1. cherry-pick overhead grows, o

Re: Is development in FlinkML still active?

2021-01-07 Thread Flavio Pompermaier
Or also https://github.com/alibaba/Alink, I don't know if the 2 are related
somehow..

On Thu, Jan 7, 2021 at 9:55 AM Flavio Pompermaier 
wrote:

> What about Flink-AI [1]? Would you suggest its adoption Till?
>
> [1] https://github.com/alibaba/flink-ai-extended
>
> On Thu, Jan 7, 2021 at 9:38 AM Till Rohrmann  wrote:
>
>> Hi Badrul,
>>
>> FlinkML is unfortunately no longer under active development. However,
>> there
>> is some new effort to add a machine learning library to Flink [1].
>>
>> [1]
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs
>>
>> Cheers,
>> Till
>>
>> On Wed, Jan 6, 2021 at 7:11 PM Badrul Chowdhury <
>> badrulchowdhur...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I see that the last update to the roadmap for FlinkML was some time ago
>> > (2016). Is development still active? If so, I would like to contribute
>> some
>> > unsupervised clustering algorithms like CLARANS. Would love some
>> pointers.
>> >
>> > Thanks,
>> > Badrul
>> >
>
>


Re: Is development in FlinkML still active?

2021-01-07 Thread Flavio Pompermaier
What about Flink-AI [1]? Would you suggest its adoption Till?

[1] https://github.com/alibaba/flink-ai-extended

On Thu, Jan 7, 2021 at 9:38 AM Till Rohrmann  wrote:

> Hi Badrul,
>
> FlinkML is unfortunately no longer under active development. However, there
> is some new effort to add a machine learning library to Flink [1].
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs
>
> Cheers,
> Till
>
> On Wed, Jan 6, 2021 at 7:11 PM Badrul Chowdhury <
> badrulchowdhur...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I see that the last update to the roadmap for FlinkML was some time ago
> > (2016). Is development still active? If so, I would like to contribute
> some
> > unsupervised clustering algorithms like CLARANS. Would love some
> pointers.
> >
> > Thanks,
> > Badrul
> >


Re: [ANNOUNCE] New formatting rules are now in effect

2020-12-29 Thread Flavio Pompermaier
Thanks Aljoscha and Chesnay for this small but important improvement!
In the new year writing new Flink features will be more fun than ever ;)

On Tue, Dec 29, 2020 at 9:58 AM Till Rohrmann  wrote:

> Thanks a lot for this effort Aljoscha and Chesnay! Finally we have a common
> code style :-)
>
> Cheers,
> Till
>
> On Tue, Dec 29, 2020 at 7:32 AM Matthias Pohl 
> wrote:
>
> > Yes, thanks for driving this, Aljoscha. ...and Chesnay as well for
> helping
> > to finalize it.
> > Good job.
> >
> > Matthias
> >
> > On Tue, Dec 29, 2020 at 5:23 AM Jark Wu  wrote:
> >
> > > Thanks Aljoscha and Chesnay for the great work!
> > >
> > > Best,
> > > Jark
> > >
> > > On Tue, 29 Dec 2020 at 11:11, Xintong Song 
> > wrote:
> > >
> > > > Great! Thanks Aljoscha and Chesnay for driving this.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Tue, Dec 29, 2020 at 1:28 AM Chesnay Schepler  >
> > > > wrote:
> > > >
> > > > > Hello everyone,
> > > > >
> > > > > I have just merged the commits for FLINK-20651
> > > > >  to master,
> > > > > release-1.12 and release-1.11, which enforce new formatting rules
> > using
> > > > > the spotless plugin, following the google-java-format.
> > > > >
> > > > > This change touched every single java file in the repository,
> > > > > predominantly due to switching from tabs to spaces. This implies
> that
> > > > > every PR and WIP branch will require a rebase.
> > > > >
> > > > >
> > > > > Most of the changes were done by a single commit, which you can
> > exclude
> > > > > from git blame by configuring git as follows (note that this
> requires
> > > > > git 2.23+, and also works for IntelliJ):
> > > > >
> > > > > git config blame.ignoreRevsFile .git-blame-ignore-revs
> > > > >
> > > > >
> > > > > You can setup the IDE to follow the new code style as follows:
> > > > >
> > > > > 1. Install the google-java-format plugin
> > > > >  and
> > > > > enable it for the project
> > > > > 2. In the plugin settings, change the code style to "AOSP" (4-space
> > > > > indents)
> > > > > 3. Install the Save Actions plugin
> > > > > 
> > > > > 4. Enable the plugin, along with "Optimize imports" and "Reformat
> > file"
> > > > >
> > > > > To manually apply the formatting, run:
> > > > >
> > > > > mvn com.diffplug.spotless:spotless-maven-plugin:apply
> > > > >
> > > > >
> > > > > Please be on the lookout for any suspicious formatting, outdated
> > > > > instructions or other inconveniences.
> > > > >
> > > > >
> > > > > Finally, a big thank you to Aljoscha for pushing this topic and
> > finally
> > > > > bringing it to an end.
> > > > >
> > > > >
> > > >
> >


Re: [DISCUSS] Programmatically submit Flink job jar to session cluster

2020-12-10 Thread Flavio Pompermaier
To me creating the PackagedProgram on the client side is very bad, for at
least two reasons:
   1. You must ensure you have almost the same classpath as the Flink
cluster, otherwise you can face problems deserializing the submitted job
graph (for example, Jackson automatically tries to create modules that can
be found on the client classpath if you use Spring but not on the job
manager... that's exactly what happened to me initially)
   2. Even if you manage to create the PackagedProgram correctly, job
listeners are not fired

So I ended up extending the RestClusterClient in order to use uploadJar +
runJob. You can look at the extended class at [1].
Unfortunately I still have to figure out whether dynamic classloading is
closed correctly by the job managers and task managers, because I suspect
that tasks are not finalized correctly, as detected for Python in [2].

[1]
https://github.com/fpompermaier/flink-job-server/blob/main/flink-rest-client/src/main/java/org/apache/flink/client/program/rest/RestClusterClientExtended.java
[2] https://issues.apache.org/jira/browse/FLINK-20333

Best,
Flavio
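
For reference, a minimal sketch of triggering a previously uploaded jar via the
REST API discussed further down in this thread (upload via /jars/upload, run via
the jar run endpoint), using the plain Java 11 HttpClient. The jar id, URL and
JSON field names (entryClass, parallelism) are illustrative assumptions and
should be checked against the REST API docs of the Flink version in use:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class JarRunExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // The jar must have been uploaded beforehand via POST /jars/upload;
            // the id below is a placeholder.
            String jarId = "d1b7a2c0-example_job.jar";
            String body = "{\"entryClass\":\"org.example.MyJob\",\"parallelism\":2}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8081/jars/" + jarId + "/run"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            // Prints the HTTP status and the returned job id (or error message).
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + ": " + response.body());
        }
    }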

On Wed, Dec 9, 2020 at 12:43 PM Yang Wang  wrote:

> Actually, I think the key point is that the Flink client is not friendly to
> the deployers.
> Most companies have their own deployers and I believe many of them depend
> on
> the CLI commands (e.g. "flink run/run-application").
>
> I am not sure whether using the rest cluster client is the best choice. But
> we could
> have an alternative as follows.
>
> # Set the configuration based on the deployment mode(session, perjob)
> Configuration flinkConfig = new Configuration();
> ... ...
> flinkConfig.set("execution.target", "kubernetes-session");
> flinkConfig.set("kubernetes.cluster-id", "my-flink-k8s-session");
> # Build a packaged program
> PackagedProgram program =
> PackagedProgram.newBuilder().setConfiguration(flinkConfig)...build();
> # Run the packaged program on the deployment. Maybe we also need to set the
> Context.
> program.invokeInteractiveModeForExecution();
>
> In my opinion, the PackagedProgram is more appropriate for jar submission.
>
>
> Best,
> Yang
>
>
> Arvid Heise  于2020年12月8日周二 下午9:39写道:
>
> > I'm surprised that this is not possible currently. Seems like a glaring
> > missing feature to me.
> >
> > I'd assume the best way would be to extend the REST API to
> /jar/:jarId/run
> > with an option to overwrite configuration values. I'm not sure how to map
> > json well to the yaml structure of the config, but I guess we mostly have
> > simple key/value pairs anyways.
> >
> > On Tue, Dec 8, 2020 at 1:31 PM Till Rohrmann 
> wrote:
> >
> > > Hi Fabian,
> > >
> > > thanks for starting this discussion. In general I would be a bit
> hesitant
> > > to build upon Flink's web UI submission because it suffers from a
> couple
> > of
> > > drawbacks.
> > >
> > > 1) The web UI submission only supports single job applications.
> > > 2) The JobGraph is generated from within the web UI Netty thread.
> Hence,
> > if
> > > the user code blocks, then this can make the web UI unresponsive.
> > > 3) Uploaded jars are not persisted. Hence, if a JobManager failover
> > occurs
> > > between uploading and running the job, then you might have lost the
> > > uploaded jars.
> > >
> > > The reason for some of these problems is that the feature was actually
> > > implemented for some conference and almost remained untouched ever
> since.
> > > Building more functionality on top of it will mean that it will be
> harder
> > > to remove in the future.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Dec 8, 2020 at 12:00 PM Fabian Paul <
> > fabianp...@data-artisans.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > Currently, the most convenient way of programmatically submitting a
> job
> > > to
> > > > a running session cluster is using Flink’s RestClusterClient.
> > > > Unfortunately, as of now it is only possible to submit a job
> > graph [1].
> > > > To construct a job graph from a jar file, additional Flink
> dependencies
> > > are
> > > > required, which is not ideal.
> > > >
> > > > It is also possible to directly use the Flink rest API and first
> upload
> > > > the jar file via /jars/upload[2] and then run it via
> > /jar/:jarId/run[3].
> > > It
> > > > has the downside that it is impossible to set a Flink execution
> > > > configuration, and it will always take the underlying session cluster
> > > > configuration.
> > > >
> > > > I know changing the ClusterClient has already been discussed. It
> would
> > > > involve a lot of effort, so what do you think of making the jar
> > execution
> > > > more prominent via the rest endpoint by adding the option to pass an
> > > > execution configuration?
> > > >
> > > > Best,
> > > > Fabian
> > > >
> > > > [1]
> > > >
> > >
> >
> https://github.com/apache/flink/blob/65cd385d7de504a946b17193aceea73b3c8e78a8/flink-clients/src/main/java/org/apache/flink/client/program/ClusterClient.java#L95
> > > > [2]
> > > >
> > >
> >
>

[jira] [Created] (FLINK-20004) UpperLimitExceptionParameter description is misleading

2020-11-05 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-20004:
--

 Summary: UpperLimitExceptionParameter description is misleading
 Key: FLINK-20004
 URL: https://issues.apache.org/jira/browse/FLINK-20004
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.11.2
Reporter: Flavio Pompermaier


The maxExceptions query parameter of the /jobs/:jobid/exceptions REST API is an
integer parameter, not a list of comma-separated values... this is probably a
cut-and-paste error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-19969) CliFrontendParser does not provide any help for run-application

2020-11-04 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-19969:
--

 Summary: CliFrontendParser does not provide any help for 
run-application
 Key: FLINK-19969
 URL: https://issues.apache.org/jira/browse/FLINK-19969
 Project: Flink
  Issue Type: Bug
  Components: Command Line Client
Affects Versions: 1.11.2
Reporter: Flavio Pompermaier


The flink CLI doesn't show any help for run-application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Enforce common opinionated coding style using Spotless

2020-10-06 Thread Flavio Pompermaier
Hi Aljoscha,
I think that having the style check directly in the IDE is a very good
feature, so +1 on my side as a contributor (I also asked once on the mailing
list if there was already something like that)... I never used Spotless so I
can't say if it is easy to integrate with the IDE, but the nice thing is that
it has plugins both for Eclipse and IntelliJ, so it was already on my
wish-to-try list ;)

Bye,
Flavio

On Tue, Oct 6, 2020 at 2:15 PM Aljoscha Krettek  wrote:

> Hi All,
>
> I know I know, but please keep reading because I recently learned about
> some new developments in the area of coding-style automation.
>
> The tool I would propose we use is Spotless
> (https://github.com/diffplug/spotless). This doesn't come with a
> formatter but allows using other popular formatters such as
> google-java-format. The nice thing about Spotless is that it serves as a
> verifier for CI but can also apply the configured style automatically.
> That is, for the programmer all she has to do is `mvn spotless:apply` to
> fix any style violations.
>
> An interesting feature, which was (somewhat) recently added is "ratchet"
> (
> https://github.com/diffplug/spotless/blob/main/plugin-maven/README.md#ratchet).
>
> With this, you can set up Spotless to only apply its rules to files
> that were changed after a configured commit. This would allow a gradual
> application of the new coding style instead of one big change.
>
> If we decide to use Spotless, we would of course also have to decide on
> a coding style. For this I would propose google-java-format, which the
> flink-statefun project uses. The main difference from our current
> "style" is that this uses spaces instead of tabs for indentation. By
> default it would be 2 spaces but it can be configured to use 4 spaces
> which would make code look more or less like our current style. There
> are no more configuration knobs, so using tabs is not an option.
>
> Finally, why should we do this? I think most engineers agree that having
> a common enforced style is good to have so I only want to highlight a
> few thoughts here about things we could improve:
>
>   - No more nits about coding style in reviews, this makes it easier for
> both the reviewer and developer
>
>   - No manual fixing of Checkstyle errors because Spotless can do that
> automatically
>
>   - Because Flink is such a big project little islands of coding style
> have formed between people that commonly work on components. It can be a
> nuisance when you work on a different component and then reviewers don't
> like your typical coding style. And you first have to get used to the
> slight differences in style when reading code.
>
> There are also downsides I see in this:
>
>   - We break the history, but both "git blame" and modern IntelliJ can
> ignore whitespace when attributing changes. So for files that are
> already "well" formatted not much would change.
>
>   - In the short-term it will be harder to apply changes both to master
> and one of the release-x branches because formatting will be different.
> I think this is not too hard though because Spotless can automatically
> apply the style.
>
> In summary, we would have some short-term pain with this but I think it
> would be good in the long run. What are your thoughts?
>
> Best,
> Aljoscha


Re: [DISCUSS] FLIP-146: Improve new TableSource and TableSink interfaces

2020-09-24 Thread Flavio Pompermaier
Hi Kurt, in the past we had a very interesting use case in this regard: our
customer's (Oracle) DB was quite big, and running too many queries in parallel
was too heavy and caused the queries to fail.
So we had to limit the source parallelism to 2 threads. After fetching the
data, the other operators could use the max parallelism as usual.

Best,
Flavio
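
For illustration, a minimal DataStream-style sketch of the pattern described
above: cap the source at parallelism 2 while downstream operators keep the
environment's default. The input format and the downstream function are
illustrative assumptions standing in for whatever JDBC source and operators
are actually used:

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(16);

    // Only two parallel instances issue queries against the database.
    DataStream<Row> rows = env
            .createInput(jdbcInputFormat)   // assumption: a configured JDBC InputFormat<Row, ?>
            .setParallelism(2);

    // Downstream work can fan out again to the environment's parallelism.
    rows.map(new EnrichRowFunction())       // hypothetical downstream operator
            .setParallelism(env.getParallelism());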

On Thu, Sep 24, 2020 at 9:59 AM Kurt Young  wrote:

> Thanks Jingsong for driving this, this is indeed a useful feature and lots
> of users are asking for it.
>
> For setting a fixed source parallelism, I'm wondering whether this is
> necessary. For kafka,
> I can imagine users would expect Flink will use the number of partitions as
> the parallelism. If it's too
> large, one can use the max parallelism to make it smaller.
> But for ES, which doesn't have ability to decide a reasonable parallelism
> on its own, it might make sense
> to introduce a user specified parallelism for such table source.
>
> So I think it would be better to reorganize the document a little bit, to
> explain the connectors one by one. Briefly
> introduce use cases and what kind of options are needed in your opinion.
>
> Regarding the interface `DataStreamScanProvider`, a concrete example would
> help the discussion. What kind
> of scenarios do you want to support? And what kind of connectors need such
> an interface?
>
> Best,
> Kurt
>
>
> On Wed, Sep 23, 2020 at 9:30 PM admin <17626017...@163.com> wrote:
>
> > +1,it’s a good news
> >
> > > 2020年9月23日 下午6:22,Jingsong Li  写道:
> > >
> > > Hi all,
> > >
> > > I'd like to start a discussion about improving the new TableSource and
> > > TableSink interfaces.
> > >
> > > Most connectors have been migrated to FLIP-95, but there are still the
> > > Filesystem and Hive that have not been migrated. They have some
> > > requirements on table connector API. And users also have some
> additional
> > > requirements:
> > > - Some connectors have the ability to infer parallelism, the
> parallelism
> > is
> > > good for most cases.
> > > - Users have customized parallelism configuration requirements for
> source
> > > and sink.
> > > - The connectors need to use topology to build their source/sink
> instead
> > of
> > > a single function. Like JIRA[1], Partition Commit feature and File
> > > Compaction feature.
> > >
> > > Details are in [2].
> > >
> > > [1]https://issues.apache.org/jira/browse/FLINK-18674
> > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-146%3A+Improve+new+TableSource+and+TableSink+interfaces
> > >
> > > Best,
> > > Jingsong
> >
> >


Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Flavio Pompermaier
We use runCustomOperation to group a set of operators into a single
functional unit, just to make the code more modular.
It's very convenient indeed.

On Thu, Jul 30, 2020 at 5:20 PM Aljoscha Krettek 
wrote:

> That is good input! I was not aware that anyone was actually using
> `runCustomOperation()`. Out of curiosity, what are you using that for?
>
> We have definitely thought about the first two points you mentioned,
> though. Especially processing-time will make it tricky to define unified
> execution semantics.
>
> Best,
> Aljoscha
>
> On 30.07.20 17:10, Flavio Pompermaier wrote:
> > I just wanted to be constructive about the missing APIs.. :D
> >
> > On Thu, Jul 30, 2020 at 4:29 PM Seth Wiesman 
> wrote:
> >
> >> +1 Its time to drop DataSet
> >>
> >> Flavio, those issues are expected. This FLIP isn't just to drop DataSet
> >> but to also add the necessary enhancements to DataStream such that it
> works
> >> well on bounded input.
> >>
> >> On Thu, Jul 30, 2020 at 8:49 AM Flavio Pompermaier <
> pomperma...@okkam.it>
> >> wrote:
> >>
> >>> Just to contribute to the discussion, when we tried to do the
> migration we
> >>> faced some problems that could make migration quite difficult.
> >>> 1 - It's difficult to test because of
> >>> https://issues.apache.org/jira/browse/FLINK-18647
> >>> 2 - missing mapPartition
> >>> 3 - missing   DataSet runOperation(CustomUnaryOperation
> >>> operation)
> >>>
> >>> On Thu, Jul 30, 2020 at 12:40 PM Arvid Heise 
> wrote:
> >>>
> >>>> +1 of getting rid of the DataSet API. Is DataStream#iterate already
> >>>> superseding DataSet iterations or would that also need to be accounted
> >>> for?
> >>>>
> >>>> In general, all surviving APIs should also offer a smooth experience
> for
> >>>> switching back and forth.
> >>>>
> >>>> On Thu, Jul 30, 2020 at 9:39 AM Márton Balassi <
> >>> balassi.mar...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi All,
> >>>>>
> >>>>> Thanks for the write up and starting the discussion. I am in favor of
> >>>>> unifying the APIs the way described in the FLIP and deprecating the
> >>>> DataSet
> >>>>> API. I am looking forward to the detailed discussion of the changes
> >>>>> necessary.
> >>>>>
> >>>>> Best,
> >>>>> Marton
> >>>>>
> >>>>> On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek <
> >>> aljos...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Everyone,
> >>>>>>
> >>>>>> my colleagues (in cc) and I would like to propose this FLIP for
> >>>>>> discussion. In short, we want to reduce the number of APIs that we
> >>> have
> >>>>>> by deprecating the DataSet API. This is a big step for Flink, that's
> >>> why
> >>>>>> I'm also cross-posting this to the User Mailing List.
> >>>>>>
> >>>>>> FLIP-131: http://s.apache.org/FLIP-131
> >>>>>>
> >>>>>> I'm posting the introduction of the FLIP below but please refer to
> >>> the
> >>>>>> document linked above for the full details:
> >>>>>>
> >>>>>> --
> >>>>>> Flink provides three main SDKs/APIs for writing Dataflow Programs:
> >>> Table
> >>>>>> API/SQL, the DataStream API, and the DataSet API. We believe that
> >>> this
> >>>>>> is one API too many and propose to deprecate the DataSet API in
> >>> favor of
> >>>>>> the Table API/SQL and the DataStream API. Of course, this is easier
> >>> said
> >>>>>> than done, so in the following, we will outline why we think that
> >>> having
> >>>>>> too many APIs is detrimental to the project and community. We will
> >>> then
> >>>>>> describe how we can enhance the Table API/SQL and the DataStream API
> >>> to
> >>>>>> subsume the DataSet API's functionality.
> >>>>>>
> >>>>>> In this FLIP, we will not describe all the technical details of how
> >>> the
> >>>>>> Table API/SQL and DataStream will be enhanced. The goal is to
> achieve
> >>>>>> consensus on the idea of deprecating the DataSet API. There will
> >>> have to
> >>>>>> be follow-up FLIPs that describe the necessary changes for the APIs
> >>> that
> >>>>>> we maintain.
> >>>>>> --
> >>>>>>
> >>>>>> Please let us know if you have any concerns or comments. Also,
> please
> >>>>>> keep discussion to this ML thread instead of commenting in the Wiki
> >>> so
> >>>>>> that we can have a consistent view of the discussion.
> >>>>>>
> >>>>>> Best,
> >>>>>> Aljoscha
> >>>>>>
> >>>>>
> >>>>
> >>>> --
> >>>>
> >>>> Arvid Heise | Senior Java Developer
> >>>>
> >>>> <https://www.ververica.com/>
> >>>>
> >>>> Follow us @VervericaData
> >>>>
> >>>> --
> >>>>
> >>>> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> >>>> Conference
> >>>>
> >>>> Stream Processing | Event Driven | Real Time
> >>>>
> >>>> --
> >>>>
> >>>> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> >>>>
> >>>> --
> >>>> Ververica GmbH
> >>>> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> >>>> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason,
> Ji
> >>>> (Toni) Cheng
> >>
> >>
> >
>
>


Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Flavio Pompermaier
I just wanted to be constructive about the missing APIs.. :D

On Thu, Jul 30, 2020 at 4:29 PM Seth Wiesman  wrote:

> +1 Its time to drop DataSet
>
> Flavio, those issues are expected. This FLIP isn't just to drop DataSet
> but to also add the necessary enhancements to DataStream such that it works
> well on bounded input.
>
> On Thu, Jul 30, 2020 at 8:49 AM Flavio Pompermaier 
> wrote:
>
>> Just to contribute to the discussion, when we tried to do the migration we
>> faced some problems that could make migration quite difficult.
>> 1 - It's difficult to test because of
>> https://issues.apache.org/jira/browse/FLINK-18647
>> 2 - missing mapPartition
>> 3 - missing   DataSet runOperation(CustomUnaryOperation
>> operation)
>>
>> On Thu, Jul 30, 2020 at 12:40 PM Arvid Heise  wrote:
>>
>> > +1 of getting rid of the DataSet API. Is DataStream#iterate already
>> > superseding DataSet iterations or would that also need to be accounted
>> for?
>> >
>> > In general, all surviving APIs should also offer a smooth experience for
>> > switching back and forth.
>> >
>> > On Thu, Jul 30, 2020 at 9:39 AM Márton Balassi <
>> balassi.mar...@gmail.com>
>> > wrote:
>> >
>> > > Hi All,
>> > >
>> > > Thanks for the write up and starting the discussion. I am in favor of
>> > > unifying the APIs the way described in the FLIP and deprecating the
>> > DataSet
>> > > API. I am looking forward to the detailed discussion of the changes
>> > > necessary.
>> > >
>> > > Best,
>> > > Marton
>> > >
>> > > On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek <
>> aljos...@apache.org>
>> > > wrote:
>> > >
>> > >> Hi Everyone,
>> > >>
>> > >> my colleagues (in cc) and I would like to propose this FLIP for
>> > >> discussion. In short, we want to reduce the number of APIs that we
>> have
>> > >> by deprecating the DataSet API. This is a big step for Flink, that's
>> why
>> > >> I'm also cross-posting this to the User Mailing List.
>> > >>
>> > >> FLIP-131: http://s.apache.org/FLIP-131
>> > >>
>> > >> I'm posting the introduction of the FLIP below but please refer to
>> the
>> > >> document linked above for the full details:
>> > >>
>> > >> --
>> > >> Flink provides three main SDKs/APIs for writing Dataflow Programs:
>> Table
>> > >> API/SQL, the DataStream API, and the DataSet API. We believe that
>> this
>> > >> is one API too many and propose to deprecate the DataSet API in
>> favor of
>> > >> the Table API/SQL and the DataStream API. Of course, this is easier
>> said
>> > >> than done, so in the following, we will outline why we think that
>> having
>> > >> too many APIs is detrimental to the project and community. We will
>> then
>> > >> describe how we can enhance the Table API/SQL and the DataStream API
>> to
>> > >> subsume the DataSet API's functionality.
>> > >>
>> > >> In this FLIP, we will not describe all the technical details of how
>> the
>> > >> Table API/SQL and DataStream will be enhanced. The goal is to achieve
>> > >> consensus on the idea of deprecating the DataSet API. There will
>> have to
>> > >> be follow-up FLIPs that describe the necessary changes for the APIs
>> that
>> > >> we maintain.
>> > >> --
>> > >>
>> > >> Please let us know if you have any concerns or comments. Also, please
>> > >> keep discussion to this ML thread instead of commenting in the Wiki
>> so
>> > >> that we can have a consistent view of the discussion.
>> > >>
>> > >> Best,
>> > >> Aljoscha
>> > >>
>> > >
>> >
>> > --
>> >
>> > Arvid Heise | Senior Java Developer
>> >
>> > <https://www.ververica.com/>
>> >
>> > Follow us @VervericaData
>> >
>> > --
>> >
>> > Join Flink Forward <https://flink-forward.org/> - The Apache Flink
>> > Conference
>> >
>> > Stream Processing | Event Driven | Real Time
>> >
>> > --
>> >
>> > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>> >
>> > --
>> > Ververica GmbH
>> > Registered at Amtsgericht Charlottenburg: HRB 158244 B
>> > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
>> > (Toni) Cheng
>
>


Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Flavio Pompermaier
Just to contribute to the discussion, when we tried to do the migration we
faced some problems that could make migration quite difficult.
1 - It's difficult to test because of
https://issues.apache.org/jira/browse/FLINK-18647
2 - missing mapPartition
3 - missing   DataSet runOperation(CustomUnaryOperation
operation)

On Thu, Jul 30, 2020 at 12:40 PM Arvid Heise  wrote:

> +1 of getting rid of the DataSet API. Is DataStream#iterate already
> superseding DataSet iterations or would that also need to be accounted for?
>
> In general, all surviving APIs should also offer a smooth experience for
> switching back and forth.
>
> On Thu, Jul 30, 2020 at 9:39 AM Márton Balassi 
> wrote:
>
> > Hi All,
> >
> > Thanks for the write up and starting the discussion. I am in favor of
> > unifying the APIs the way described in the FLIP and deprecating the
> DataSet
> > API. I am looking forward to the detailed discussion of the changes
> > necessary.
> >
> > Best,
> > Marton
> >
> > On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek 
> > wrote:
> >
> >> Hi Everyone,
> >>
> >> my colleagues (in cc) and I would like to propose this FLIP for
> >> discussion. In short, we want to reduce the number of APIs that we have
> >> by deprecating the DataSet API. This is a big step for Flink, that's why
> >> I'm also cross-posting this to the User Mailing List.
> >>
> >> FLIP-131: http://s.apache.org/FLIP-131
> >>
> >> I'm posting the introduction of the FLIP below but please refer to the
> >> document linked above for the full details:
> >>
> >> --
> >> Flink provides three main SDKs/APIs for writing Dataflow Programs: Table
> >> API/SQL, the DataStream API, and the DataSet API. We believe that this
> >> is one API too many and propose to deprecate the DataSet API in favor of
> >> the Table API/SQL and the DataStream API. Of course, this is easier said
> >> than done, so in the following, we will outline why we think that having
> >> too many APIs is detrimental to the project and community. We will then
> >> describe how we can enhance the Table API/SQL and the DataStream API to
> >> subsume the DataSet API's functionality.
> >>
> >> In this FLIP, we will not describe all the technical details of how the
> >> Table API/SQL and DataStream will be enhanced. The goal is to achieve
> >> consensus on the idea of deprecating the DataSet API. There will have to
> >> be follow-up FLIPs that describe the necessary changes for the APIs that
> >> we maintain.
> >> --
> >>
> >> Please let us know if you have any concerns or comments. Also, please
> >> keep discussion to this ML thread instead of commenting in the Wiki so
> >> that we can have a consistent view of the discussion.
> >>
> >> Best,
> >> Aljoscha
> >>
> >
>
> --
>
> Arvid Heise | Senior Java Developer
>
> 
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> (Toni) Cheng


Re: PackagedProgram and ProgramDescription

2020-07-15 Thread Flavio Pompermaier
Thanks Chesnay for the tip.
I'll try to investigate the usage of GlobalJobParameters.
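
For illustration, a minimal sketch of the GlobalJobParameters approach Chesnay
describes below (assuming a DataStream job; ParameterTool ships with Flink and
extends ExecutionConfig.GlobalJobParameters, so whatever it holds becomes
visible in the job's configuration view in the WebUI):

    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class JobWithMetadata {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Parse CLI args (e.g. --commit-id abc123 --job-name my-job) and attach
            // them to the job so they show up in the WebUI and in user functions.
            ParameterTool params = ParameterTool.fromArgs(args);
            env.getConfig().setGlobalJobParameters(params);

            env.fromElements(1, 2, 3).print();
            env.execute(params.get("job-name", "example-job"));
        }
    }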

On Wed, Jul 15, 2020 at 2:51 PM Chesnay Schepler  wrote:

> The more we strive towards a model where an application can submit
> multiple jobs, the more important it becomes to be able to attach
> metadata to a job/application to have any idea what is going on.
>
> But I don't think the PackagedProgram/ProgramDescription is the way to
> go; and I'd envision rather something like a meta data object that is
> attached to the environment/execute calls. But we have to figure out how
> to do this in a way that also works for the SQL APIs.
>
> What we have done internally is to encode such information in the
> GlobalJobParameters which are then available in the WebUI. We have
> things like commit IDs encoded into the jar manifest, that we extract at
> submission time and put them into the parameters.
> My guess would be that such approach can work sufficiently for all
> dataset/datastream/table API users.
>
> On 15/07/2020 14:05, Flavio Pompermaier wrote:
> > Ok, it's not a problem for me if the community is not interested in
> pushing
> > this thing forward.
> > When we develop a job it is super useful for us to have the job describing
> > itself somehow (what it does and which parameters it requires).
> > If this is not in Flink I have to implement it somewhere else, but I can't
> > believe this is not a common situation.
> > However I think I can live with it :D
> >
> > Best,
> > Flavio
> >
> > On Wed, Jul 15, 2020 at 12:01 PM Aljoscha Krettek 
> > wrote:
> >
> >> I think no-one is interested to push this personally right now. We would
> >> need a champion that is interested and pushes this forward.
> >>
> >> Best,
> >> Aljoscha
> >>
> >> On Mon, Mar 30, 2020, at 12:45, Flavio Pompermaier wrote:
> >>> I would personally like to see a way of describing a Flink job/pipeline
> >>> (including its parameters and types) in order to enable better UIs,
> then
> >>> the important thing is to make things consistent and aligned with the
> new
> >>> client developments and exploit this new dev sprint to fix such issues.
> >>>
> >>> On Mon, Mar 30, 2020 at 11:38 AM Aljoscha Krettek  >
> >>> wrote:
> >>>
> >>>> On 18.03.20 14:45, Flavio Pompermaier wrote:
> >>>>> what do you think if we exploit this job-submission sprint to address
> >>>> also
> >>>>> the problem discussed in
> >>>> https://issues.apache.org/jira/browse/FLINK-10862?
> >>>>
> >>>> That's a good idea! What should we do? It seems that most committers
> on
> >>>> the issue were in favour of deprecating/removing ProgramDescription.
> >>>>
>


[jira] [Created] (FLINK-18608) CustomizedConvertRule#convertCast drops nullability

2020-07-15 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-18608:
--

 Summary: CustomizedConvertRule#convertCast drops nullability
 Key: FLINK-18608
 URL: https://issues.apache.org/jira/browse/FLINK-18608
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.11.0
Reporter: Flavio Pompermaier


The following program shows that nullability is not respected during the
translation:


{code:java}
final TableEnvironment tableEnv =
    DatalinksExecutionEnvironment.getBatchTableEnv();
final Table inputTable = tableEnv.fromValues(//
    DataTypes.ROW(//
        DataTypes.FIELD("col1", DataTypes.STRING()), //
        DataTypes.FIELD("col2", DataTypes.STRING())//
    ), //
    Row.of(1L, "Hello"), //
    Row.of(2L, "Hello"), //
    Row.of(3L, ""), //
    Row.of(4L, "Ciao"));
tableEnv.createTemporaryView("ParquetDataset", inputTable);
tableEnv.executeSql(//
    "CREATE TABLE `out` (\n" + //
    "col1 STRING,\n" + //
    "col2 STRING\n" + //
    ") WITH (\n" + //
    " 'connector' = 'filesystem',\n" + //
    // " 'format' = 'parquet',\n" + //
    " 'update-mode' = 'append',\n" + //
    " 'path' = 'file://" + TEST_FOLDER + "',\n" + //
    " 'sink.shuffle-by-partition.enable' = 'true'\n" + //
    ")");

tableEnv.executeSql("INSERT INTO `out` SELECT * FROM ParquetDataset");
{code}


-

Exception in thread "main" java.lang.AssertionError: Conversion to relational 
algebra failed to preserve datatypes:
validated type:
RecordType(VARCHAR(2147483647) CHARACTER SET "UTF-16LE" col1, 
VARCHAR(2147483647) CHARACTER SET "UTF-16LE" col2) NOT NULL
converted type:
RecordType(VARCHAR(2147483647) CHARACTER SET "UTF-16LE" NOT NULL col1, 
VARCHAR(2147483647) CHARACTER SET "UTF-16LE" NOT NULL col2) NOT NULL
rel:
LogicalProject(col1=[$0], col2=[$1])
  LogicalUnion(all=[true])
LogicalProject(col1=[_UTF-16LE'1':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"], col2=[_UTF-16LE'Hello':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"])
  LogicalValues(tuples=[[{ 0 }]])
LogicalProject(col1=[_UTF-16LE'2':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"], col2=[_UTF-16LE'Hello':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"])
  LogicalValues(tuples=[[{ 0 }]])
LogicalProject(col1=[_UTF-16LE'3':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"], col2=[_UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"])
  LogicalValues(tuples=[[{ 0 }]])
LogicalProject(col1=[_UTF-16LE'4':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"], col2=[_UTF-16LE'Ciao':VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"])
  LogicalValues(tuples=[[{ 0 }]])

at 
org.apache.calcite.sql2rel.SqlToRelConverter.checkConvertedType(SqlToRelConverter.java:465)
at 
org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(SqlToRelConverter.java:580)
at 
org.apache.flink.table.planner.calcite.FlinkPlannerImpl.org$apache$flink$table$planner$calcite$FlinkPlannerImpl$$rel(FlinkPlannerImpl.scala:164)
at 
org.apache.flink.table.planner.calcite.FlinkPlannerImpl.rel(FlinkPlannerImpl.scala:151)
at 
org.apache.flink.table.planner.operations.SqlToOperationConverter.toQueryOperation(SqlToOperationConverter.java:773)
at 
org.apache.flink.table.planner.operations.SqlToOperationConverter.convertSqlQuery(SqlToOperationConverter.java:745)
at 
org.apache.flink.table.planner.operations.SqlToOperationConverter.convert(SqlToOperationConverter.java:238)
at 
org.apache.flink.table.planner.operations.SqlToOperationConverter.convertSqlInsert(SqlToOperationConverter.java:527)
at 
org.apache.flink.table.planner.operations.SqlToOperationConverter.convert(SqlToOperationConverter.java:204)
at 
org.apache.flink.table.planner.delegation.ParserImpl.parse(ParserImpl.java:78)
at 
org.apache.flink.table.api.internal.TableEnvironmentImpl.executeSql(TableEnvironmentImpl.java:678)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: PackagedProgram and ProgramDescription

2020-07-15 Thread Flavio Pompermaier
Ok, it's not a problem for me if the community is not interested in pushing
this thing forward.
When we develop a job it is super useful for us to have the job describing
itself somehow (what it does and which parameters it requires).
If this is not in Flink I have to implement it somewhere else, but I can't
believe this is not a common situation.
However I think I can live with it :D

Best,
Flavio

On Wed, Jul 15, 2020 at 12:01 PM Aljoscha Krettek 
wrote:

> I think no-one is interested to push this personally right now. We would
> need a champion that is interested and pushes this forward.
>
> Best,
> Aljoscha
>
> On Mon, Mar 30, 2020, at 12:45, Flavio Pompermaier wrote:
> > I would personally like to see a way of describing a Flink job/pipeline
> > (including its parameters and types) in order to enable better UIs, then
> > the important thing is to make things consistent and aligned with the new
> > client developments and exploit this new dev sprint to fix such issues.
> >
> > On Mon, Mar 30, 2020 at 11:38 AM Aljoscha Krettek 
> > wrote:
> >
> > > On 18.03.20 14:45, Flavio Pompermaier wrote:
> > > > what do you think if we exploit this job-submission sprint to address
> > > also
> > > > the problem discussed in
> > > https://issues.apache.org/jira/browse/FLINK-10862?
> > >
> > > That's a good idea! What should we do? It seems that most committers on
> > > the issue were in favour of deprecating/removing ProgramDescription.
> > >
> >


Re: [DISCUSS] Semantics of our JIRA fields

2020-05-23 Thread Flavio Pompermaier
In my experience it's quite complicated for a normal reporter to be able to
fill in all the fields correctly (especially for new users).
Usually you just want to report a problem, remember to add a new feature
or improve code/documentation, but you can't really assign a priority, pick
the correct label or decide which releases will benefit from the fix/new
feature. This is something that only core developers could decide (IMHO).

My experience says that it's better if normal users could just open tickets
with some defaults (or marked as new) and leave them in such a state
until an experienced user, one that can assign tickets, has the time to
read them and immediately reject the ticket or accept it and properly assign
priorities, fix versions, etc.

With respect to resolve/close, I think that a good practice could be to
automatically mark a ticket as resolved once a PR is detected for that
ticket, while marking it as closed should be done by the committer who merges
the PR.

This process would probably slightly increase the work of a committer, but
it would make things a lot cleaner and would bring the benefit of having a
reliable and semantically sound JIRA state.

Cheers,
Flavio

Il Dom 24 Mag 2020, 05:05 Israel Ekpo  ha scritto:

> +1 for the proposal
>
> This will bring some consistency to the process
>
> Regarding Closed vs Resolved, should we go back and switch all the Resolved
> issues to Closed so that it is not confusing to new people to the project
> in the future that may not have seen this discussion?
>
> I would recommend changing it to Closed just to be consistent since that is
> the majority and the new process will be using Closed going forward
>
> Those are my thoughts for now
>
> On Sat, May 23, 2020 at 7:48 AM Congxian Qiu 
> wrote:
>
> > +1 for the proposal. It's good to have a unified description and write it
> > down in the wiki, so that every contributor has the same understanding of
> > all the fields.
> >
> > Best,
> > Congxian
> >
> >
> > Till Rohrmann  于2020年5月23日周六 上午12:04写道:
> >
> > > Thanks for drafting this proposal Robert. +1 for the proposal.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, May 22, 2020 at 5:39 PM Leonard Xu  wrote:
> > >
> > > > Thanks bringing up this topic @Robert,  +1 to the proposal.
> > > >
> > > > It clarifies the JIRA fields well and should be a rule to follow.
> > > >
> > > > Best,
> > > > Leonard Xu
> > > > > 在 2020年5月22日,20:24,Aljoscha Krettek  写道:
> > > > >
> > > > > +1 That's also how I think of the semantics of the fields.
> > > > >
> > > > > Aljoscha
> > > > >
> > > > > On 22.05.20 08:07, Robert Metzger wrote:
> > > > >> Hi all,
> > > > >> I have the feeling that the semantics of some of our JIRA fields
> > > (mostly
> > > > >> "affects versions", "fix versions" and resolve / close) are not
> > > defined
> > > > in
> > > > >> the same way by all the core Flink contributors, which leads to
> > cases
> > > > where
> > > > >> I spend quite some time on filling the fields correctly (at least
> > > what I
> > > > >> consider correctly), and then others changing them again to match
> > > their
> > > > >> semantics.
> > > > >> In an effort to increase our efficiency, and since I'm creating a
> > lot
> > > of
> > > > >> (test instability-related) tickets these days, I would like to
> > discuss
> > > > the
> > > > >> semantics, come to a conclusion and document this in our Wiki.
> > > > >> *Proposal:*
> > > > >> *Priority:*
> > > > >> "Blocker": needs to be resolved before a release (matched based on
> > fix
> > > > >> versions)
> > > > >> "Critical": strongly considered before a release
> > > > >> other priorities have no practical meaning in Flink.
> > > > >> *Component/s:*
> > > > >> Primary component relevant for this feature / fix.
> > > > >> For test-related issues, add the component the test belongs to
> (for
> > > > example
> > > > >> "Connectors / Kafka" for Kafka test failures) + "Test".
> > > > >> The same applies for documentation tickets. For example, if
> there's
> > > > >> something wrong with the DataStream API, add it to the "API /
> > > > DataStream"
> > > > >> and "Documentation" components.
> > > > >> *Affects Version/s:* Only for Bug / Task-type tickets: We list all
> > > > currently
> > > > >> supported and unreleased Flink versions known to be affected by
> > this.
> > > > >> Example: If I see a test failure that happens on "master" and
> > > > >> "release-1.11", I set "affects version" to "1.12.0" and "1.11.0".
> > > > >> *Fix Version/s:*
> > > > >> For closed/resolved tickets, this field lists the released Flink
> > > > versions
> > > > >> that contain a fix or feature for the first time.
> > > > >> For open tickets, it indicates that a fix / feature should be
> > > contained
> > > > in
> > > > >> the listed versions. Only blocker issues can block a release, all
> > > other
> > > > >> tickets which have "fix version/s" set at the time of a release
> and
> > > are
> > > > >> unresolved will be moved to the next version.
> > > > 

[jira] [Created] (FLINK-17622) Remove useless switch for decimal in PostresCatalog

2020-05-11 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-17622:
--

 Summary: Remove useless switch for decimal in PostresCatalog
 Key: FLINK-17622
 URL: https://issues.apache.org/jira/browse/FLINK-17622
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / JDBC
Reporter: Flavio Pompermaier


Remove the useless switch for decimal fields. The Postgres JDBC connector
translates them to numeric.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17509) Support OracleDialect

2020-05-04 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-17509:
--

 Summary: Support OracleDialect
 Key: FLINK-17509
 URL: https://issues.apache.org/jira/browse/FLINK-17509
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / JDBC
Reporter: Flavio Pompermaier


Support OracleDialect in JDBCDialects



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17508) Develop OracleCatalog

2020-05-04 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-17508:
--

 Summary: Develop OracleCatalog
 Key: FLINK-17508
 URL: https://issues.apache.org/jira/browse/FLINK-17508
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / JDBC
Reporter: Flavio Pompermaier


Similarly to https://issues.apache.org/jira/browse/FLINK-16471



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS]Refactor flink-jdbc connector structure

2020-04-30 Thread Flavio Pompermaier
Very big +1 from me

Best,
Flavio

On Thu, Apr 30, 2020 at 4:47 PM David Anderson 
wrote:

> I'm very happy to see the jdbc connector being normalized in this way. +1
> from me.
>
> David
>
> On Thu, Apr 30, 2020 at 2:14 PM Timo Walther  wrote:
>
> > Hi Leonard,
> >
> > this sounds like a nice refactoring for consistency. +1 from my side.
> >
> > However, I'm not sure how much backwards compatibility is required.
> > Maybe others can comment on this.
> >
> > Thanks,
> > Timo
> >
> > On 30.04.20 14:09, Leonard Xu wrote:
> > > Hi, dear community
> > >
> > > Recently, I’ve been thinking about refactoring the flink-jdbc connector structure
> > before release 1.11.
> > > After the refactoring, in the future, we can easily introduce a unified
> > pluggable JDBC dialect for Table and DataStream, and we can have a better
> > module organization and implementation.
> > >
> > > So, I propose following changes:
> > > 1) Use `Jdbc` instead of `JDBC` in the new public API and interface
> > name. The DataStream API `JdbcSink`, which was introduced in this version, has
> > followed this standard.
> > >
> > > 2) Move all interfaces and classes from `org.apache.flink.java.io.jdbc`
> > (old package) to `org.apache.flink.connector.jdbc` (new package) to follow the
> > base connector path in FLIP-27.
> > I think we can move the JDBC TableSource, TableSink and factory from the old
> > package to the new package because TableEnvironment#registerTableSource and
> > TableEnvironment#registerTableSink will be removed in 1.11 and these
> > classes are not exposed to users [1].
> > We can move the DataStream API JdbcSink from the old package to the new package
> > because it was introduced in this version.
> > We will still keep `JDBCInputFormat` and `JDBCOutputFormat` in the old
> > package and deprecate them.
> > > Other classes/interfaces are internal used and we can move to new
> > package without breaking compatibility.
> > > 3) Rename `flink-jdbc` to `flink-connector-jdbc`. Well, this is a
> > compatibility-breaking change, but in order to comply with other connectors,
> > and since it's really a connector rather than a flink-jdbc-driver [2], we’d better
> > decide to do it ASAP.
> > >
> > >
> > > What do you think? Any feedback is appreciate.
> > >
> > >
> > > Best,
> > > Leonard Xu
> > >
> > > [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Remove-registration-of-TableSource-TableSink-in-Table-Env-and-ConnectTableDescriptor-td37270.html
> > > [2] https://github.com/ververica/flink-jdbc-driver
> > >
> > >
> >
> >
>


Re: [DISCUSS] Hierarchies in ConfigOption

2020-04-29 Thread Flavio Pompermaier
Personally I don't have any preference here. Compliance with a standard YAML
parser is probably more important.

On Wed, Apr 29, 2020 at 5:10 PM Jark Wu  wrote:

> From a user's perspective, I prefer the shorter one "format=json", because
> it's more concise and straightforward. The "kind" is redundant for users.
> Is there a real case that requires representing the configuration in JSON
> style? As far as I can see, there is no such requirement, and everything
> works fine for now.
>
> So I'm in favor of "format=json". But if the community insists on following
> the code style on this, I'm also fine with the longer one.
>
> Btw, I also CC the user mailing list to gather more user feedback, because I
> think this is relevant to usability.
>
> Best,
> Jark
>
> On Wed, 29 Apr 2020 at 22:09, Chesnay Schepler  wrote:
>
> >  > Therefore, should we advocate instead:
> >  >
> >  > 'format.kind' = 'json',
> >  > 'format.fail-on-missing-field' = 'false'
> >
> > Yes. That's pretty much it.
> >
> > This is reasonably important to nail down, as with such violations I
> > believe we could not actually switch to a standard YAML parser.
> >
> > On 29/04/2020 16:05, Timo Walther wrote:
> > > Hi everyone,
> > >
> > > discussions around ConfigOption seem to be very popular recently. So I
> > > would also like to get some opinions on a different topic.
> > >
> > > How do we represent hierarchies in ConfigOption? In FLIP-122, we
> > > agreed on the following DDL syntax:
> > >
> > > CREATE TABLE fs_table (
> > >  ...
> > > ) WITH (
> > >  'connector' = 'filesystem',
> > >  'path' = 'file:///path/to/whatever',
> > >  'format' = 'csv',
> > >  'format.allow-comments' = 'true',
> > >  'format.ignore-parse-errors' = 'true'
> > > );
> > >
> > > Of course this is slightly different from regular Flink core
> > > configuration but a connector still needs to be configured based on
> > > these options.
> > >
> > > However, I think this FLIP violates our code style guidelines because
> > >
> > > 'format' = 'json',
> > > 'format.fail-on-missing-field' = 'false'
> > >
> > > is an invalid hierarchy. `format` cannot be a string and a top-level
> > > object at the same time.
> > >
> > > We have similar problems in our runtime configuration:
> > >
> > > state.backend=
> > > state.backend.incremental=
> > > restart-strategy=
> > > restart-strategy.fixed-delay.delay=
> > > high-availability=
> > > high-availability.cluster-id=
> > >
> > > The code style guide states "Think of the configuration as nested
> > > objects (JSON style)". So such hierarchies cannot be represented in a
> > > nested JSON style.
> > >
> > > Therefore, should we advocate instead:
> > >
> > > 'format.kind' = 'json',
> > > 'format.fail-on-missing-field' = 'false'
> > >
> > > What do you think?
> > >
> > > Thanks,
> > > Timo
> > >
> > > [1] https://flink.apache.org/contributing/code-style-and-quality-components.html#configuration-changes
> > >
> >
> >
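
To make the nesting problem concrete: written as nested YAML (the "JSON style"
the code style guide refers to), the two variants discussed above would look
roughly like this. This is a sketch only, not an actual Flink configuration file:

    # 'format' = 'json' together with 'format.fail-on-missing-field' cannot be
    # nested, because 'format' would have to be a scalar and a map at once:
    #
    #   format: json                  # scalar value ...
    #   format:                       # ... and an object at the same time -> invalid
    #     fail-on-missing-field: false
    #
    # The 'format.kind' variant nests cleanly:
    format:
      kind: json
      fail-on-missing-field: false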


Re: Integration of DataSketches into Flink

2020-04-27 Thread Flavio Pompermaier
If this can encourage Lee: I'm one of the Flink users that already uses
DataSketches, and I find it an amazing library.
When I was trying it out (last year) I tried to stimulate some discussion[1],
but at that time it was probably too early.
I really hope that things are now mature enough for both communities!

[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html

Best,
Flavio

On Mon, Apr 27, 2020 at 7:37 PM leerho  wrote:

> Hi Arvid,
>
> Note: I am dual listing this thread on both dev lists for better tracking.
>
> 1. I'm curious on how you would estimate the effort to port datasketches
>    to Flink? It already has a Java API, but how difficult would it be to
>    subdivide the tasks into parallel chunks of work? Since it's already
>    ported on Pig, I think we could use this port as a baseline
>
>
> Most systems (including systems like Druid, Hive, Pig, Spark, PostgreSQL,
> Databases, Streaming Platforms, Map-Reduce Platforms, etc) have some sort
> of aggregation API, which allows users to plug in custom aggregation
> functions.  Typical API functions found in these APIs are Initialize(),
> Update() (or Add()), Merge(), and getResult().  How these are named and
> operate vary considerably from system to system.  These APIs are sometimes
> called User Defined Functions (UDFs) or User Defined Aggregation Functions
> (UDAFs).
>
> DataSketches is a library of Sketching (streaming) aggregation functions,
> each of which perform specific types of aggregation. For example, counting
> unique items, determining quantiles and histograms of unknown
> distributions, identifying most frequent items (heavy hitters) from a
> stream, etc.   The advantage of using DataSketches is that they are
> extremely fast, small in size, and have well defined error properties
> defined by published scientific papers that define the underlying
> mathematics.
>
> The task of porting DataSketches is usually developing a thin wrapping
> layer that translates the specific UDAF API of Flink to the equivalent API
> methods of the targeted sketches in the library.  This is best done by
> someone with deep knowledge of the UDAF code of the targeted system. We
> are certainly available to answer questions about the DataSketches APIs.
>  Although we did write the UDAF layers for Hive and Pig, we did that as a
> proof of concept and example on how to write such layers.  We are a small
> team and are not in a position to support these integration layers for
> every system out there.
>
> 2. Do you have any idea who is usually driving the adoptions?
>
>
> To start, you only need to write the UDAF layer for the sketches that you
> think would be in most demand by your users.  The big 4 categories are
> distinct (unique) counting, quantiles, frequent-items, and sampling.  This
> is a natural way of subdividing the task: choose the sketches you want to
> adapt and in what order.  Each sketch is independent so it can be adapted
> whenever it is needed.
>
> Please let us know if you have any further questions :)
>
> Lee.
>
>
>
>
> On Mon, Apr 27, 2020 at 2:11 AM Arvid Heise  wrote:
>
> > Hi Lee,
> >
> > I must admit that I also heard of data sketches for the first time (there
> > are really many Apache projects).
> >
> > Datasketches sounds really exciting. As a (former) data engineer, I can
> > 100% say that this is something that (end-)users want and need and it
> would
> > make so much sense to have it in Flink from the get-go.
> > Flink, however, is a quite old project already, which grew at a strong
> pace
> > leading to some 150 modules in the core. We are currently in the process
> to
> > restructure that and reduce the number of things in the core, such that
> > build times and stability improve.
> >
> > To counter that we created Flink packages [1], which includes everything
> > new that we deem to not be essential. I'd propose to incorporate a Flink
> > datasketch package there. If it seems like it's becoming essential, we
> can
> > still move it to core at a later point.
> >
> > As I have seen on the page, there are already plenty of adoptions. That
> > leaves a few questions to me.
> >
> >1. I'm curious on how you would estimate the effort to port
> datasketches
> >to Flink? It already has a Java API, but how difficult would it be to
> >subdivide the tasks into parallel chunks of work? Since it's already
> > ported
> >on Pig, I think we could use this port as a baseline.
> >2. Do you have any idea who is usually driving the adoptions?
> >
> >
> > [1] https://flink-packages.org/
> >
> > On Sun, Apr 26, 2020 at 8:07 AM leerho  wrote:
> >
> > > Hello All,
> > >
> > > I am a committer on DataSketches.apache.org
> > >  and just learning about Flink,
> Since
> > > Flink is designed for stateful stream processing I would think it would
> > > make sense to have the DataSketches library integrated into its core so
> > all
> > > users of Flin
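
As a rough illustration of the "thin wrapping layer" Lee describes above, a
Flink aggregate function wrapping a DataSketches Theta sketch for distinct
counting could look roughly like the following. This is only a sketch: the
Flink AggregateFunction interface is real, but the DataSketches calls are
written from memory (newer library versions use union() instead of
update(Sketch)), and accumulator serialization is glossed over.

    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.Union;
    import org.apache.datasketches.theta.UpdateSketch;

    // Distinct-count of String keys via a Theta sketch; the accumulator is a
    // theta Union so that partial aggregates can be merged.
    public class DistinctCountSketch implements AggregateFunction<String, Union, Double> {

        @Override
        public Union createAccumulator() {
            return SetOperation.builder().buildUnion();
        }

        @Override
        public Union add(String value, Union acc) {
            // Simplest possible update: wrap the element in a one-off sketch and
            // union it in. A real implementation would update a backing
            // UpdateSketch directly and only union on merge.
            UpdateSketch single = UpdateSketch.builder().build();
            single.update(value);
            acc.update(single);
            return acc;
        }

        @Override
        public Double getResult(Union acc) {
            // getResult() compacts the union; getEstimate() is the estimated
            // number of distinct values seen so far.
            return acc.getResult().getEstimate();
        }

        @Override
        public Union merge(Union a, Union b) {
            a.update(b.getResult());
            return a;
        }
    }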

Re: [DISCUSS] Should max/min be part of the hierarchy of config option?

2020-04-27 Thread Flavio Pompermaier
+1 for Chesnay's approach

On Mon, Apr 27, 2020 at 2:31 PM Chesnay Schepler  wrote:

> +1 for xyz.[min|max]; imo it becomes obvious if you think of it like a yaml
> file:
>
> xyz:
>  min:
>  max:
>
> opposed to
>
> min-xyz:
> max-xyz:
>
> IIRC this would also be more in-line with the hierarchical scheme for
> config options we decided on months ago.
>
> On 27/04/2020 13:25, Xintong Song wrote:
> > +1 for Robert's idea about adding tests/tools checking the pattern of new
> > configuration options, and migrate the old ones in release 2.0.
> >
> > Concerning the preferred pattern, I personally agree with Till's
> opinion. I
> > think 'xyz.[min|max]' somehow expresses that 'min' and 'max' are
> properties
> > of 'xyz', while 'xyz' may also have other properties. An example could be
> > 'taskmanager.memory.network.[min|max|fraction]'.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Mon, Apr 27, 2020 at 6:00 PM Till Rohrmann 
> wrote:
> >
> >> Hi everyone,
> >>
> >> as Robert said I think the problem is that we don't have strict
> guidelines
> >> and every committer follows his/her personal taste. I'm actually not
> sure
> >> whether we can define bullet-proof guidelines but we can definitely
> >> make them more concrete.
> >>
> >> In this case here, I have to admit that I have an opposing
> opinion/taste. I
> >> think it would be better to use xyz.min and xyz.max. The reason is that
> we
> >> configure a property of xyz which consists of the minimum and maximum
> >> value. Differently said, min and max belong semantically together and
> hence
> >> should be defined together. You can think of it as if the type of the
> xyz
> >> config option would be a tuple of two integers instead of two individual
> >> integers.
> >>
> >> A comment concerning the existing styles of config options: I think
> many of
> >> the config options which follow the max-xzy pattern are actually older
> >> configuration options.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Mon, Apr 27, 2020 at 10:34 AM Robert Metzger 
> >> wrote:
> >>
> >>> Thanks for starting this discussion.
> >>> I believe the different options are a lot about personal taste, there
> are
> >>> no objective arguments why one option is better than the other.
> >>>
> >>> I agree with your proposal to simply go with the "max-xyz" pattern, as
> >> this
> >>> is the style of the majority of the current configuration options in
> >> Flink
> >>> (maybe this also means it is the taste of the majority of Flink
> >>> developers?).
> >>>
> >>> I would propose to add a test or some tooling that checks that all new
> >>> configuration parameters follow this pattern, as well as tickets for
> >> Flink
> >>> 2.0 to migrate the "wrong" configuration options.
> >>>
> >>>
> >>>
> >>> On Wed, Apr 22, 2020 at 5:47 AM Yangze Guo  wrote:
> >>>
>  Hi, everyone,
> 
>  I'm working on FLINK-16605 Add max limitation to the total number of
>  slots[1]. In the PR, Gary, Xintong and I had a discussion[2] about the
>  config option of this limit.
>  The central question is whether the "max" should be part of the
>  hierarchy or part of the property itself.
> 
>  It means there could be two patterns:
>  - max-xyz
>  - xyz.max
> 
>  Currently, there is no clear consensus on which style is better and we
>  could find both patterns in the current Flink. Here, I'd like to first
>  sort out[3]:
> 
>  Config options follow the "max-xyz" pattern:
>  - restart-strategy.failure-rate.max-failures-per-interval
>  - yarn.maximum-failed-containers
>  - state.backend.rocksdb.compaction.level.max-size-level-base
>  - cluster.registration.max-timeout
>  - high-availability.zookeeper.client.max-retry-attempts
>  - rest.client.max-content-length
>  - rest.retry.max-attempts
>  - rest.server.max-content-length
>  - jobstore.max-capacity
>  - taskmanager.registration.max-backoff
>  - compiler.delimited-informat.max-line-samples
>  - compiler.delimited-informat.min-line-samples
>  - compiler.delimited-informat.max-sample-len
>  - taskmanager.runtime.max-fan
>  - pipeline.max-parallelism
>  - execution.checkpointing.max-concurrent-checkpoint
>  - execution.checkpointing.min-pause
> 
>  Config options follow the "xyz.max" pattern:
>  - taskmanager.memory.jvm-overhead.max
>  - taskmanager.memory.jvm-overhead.min
>  - taskmanager.memory.network.max
>  - taskmanager.memory.network.min
>  - taskmanager.network.request-backoff.max
>  - env.log.max
> 
>  Config options do not follow the above two patterns:
>  - akka.client-socket-worker-pool.pool-size-max
>  - akka.client-socket-worker-pool.pool-size-min
>  - akka.fork-join-executor.parallelism-max
>  - akka.fork-join-executor.parallelism-min
>  - akka.server-socket-worker-pool.pool-size-max
>  - akka.server-socket-worker-pool.pool-size-min
>  - containerized.heap-cutoff-
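
A check of the kind Robert suggests could start as a simple test that scans all
option keys and flags the ones that do not follow the agreed pattern. A rough
sketch, not tied to Flink's actual test utilities, flagging the "xyz.max" style
(the filter would simply be inverted if the community settles on the other
convention):

    import java.util.List;
    import java.util.stream.Collectors;

    public class ConfigOptionNamingCheck {

        // Returns the keys whose last segment is a bare "max" or "min",
        // i.e. the "xyz.max" style that would need to be migrated.
        static List<String> findViolations(List<String> allOptionKeys) {
            return allOptionKeys.stream()
                    .filter(key -> key.endsWith(".max") || key.endsWith(".min"))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<String> keys = List.of(
                    "taskmanager.memory.network.max",      // would be flagged
                    "cluster.registration.max-timeout");   // fine under "max-xyz"
            System.out.println("Violations: " + findViolations(keys));
        }
    }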

[jira] [Created] (FLINK-17366) Implement listViews

2020-04-24 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-17366:
--

 Summary: Implement listViews
 Key: FLINK-17366
 URL: https://issues.apache.org/jira/browse/FLINK-17366
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / JDBC
Reporter: Flavio Pompermaier


But how can it be set as a read-only Table?





[jira] [Created] (FLINK-17361) Support creating of a JDBC table using a custom query

2020-04-23 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-17361:
--

 Summary: Support creating of a JDBC table using a custom query
 Key: FLINK-17361
 URL: https://issues.apache.org/jira/browse/FLINK-17361
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / API
Reporter: Flavio Pompermaier


In a long discussion on the mailing list it emerged that it is not possible
to create a JDBC table that extracts data using a custom query.
A temporary workaround could be to assign the target query as 'connector.table',
but this is undesirable.

Moreover, in relation to https://issues.apache.org/jira/browse/FLINK-17360, a
query could actually be a statement that requires parameters to be filled in by
the custom parameter values provider.
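
For reference, the workaround mentioned above (passing the query itself where
the table name is expected) would look roughly like this with the pre-FLIP-122
'connector.*' properties; the option keys and values are illustrative and may
not match any released version exactly:

    -- Workaround sketch: the target query is supplied in place of the table name.
    CREATE TABLE people_over_30 (
      id   INT,
      name STRING
    ) WITH (
      'connector.type'     = 'jdbc',
      'connector.url'      = 'jdbc:postgresql://localhost:5432/mydb',
      'connector.table'    = '(SELECT id, name FROM person WHERE age > 30) AS t',
      'connector.username' = 'user',
      'connector.password' = 'secret'
    );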





[jira] [Created] (FLINK-17360) Support custom partitioners in JDBCReadOptions

2020-04-23 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-17360:
--

 Summary: Support custom partitioners in JDBCReadOptions 
 Key: FLINK-17360
 URL: https://issues.apache.org/jira/browse/FLINK-17360
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / API
Reporter: Flavio Pompermaier


Support custom ParameterValuesProvider. At the moment only
NumericBetweenParametersProvider is handled, and only if a partition column and
min/max values are specified.

Some discussion about this took place on the mailing list.

Me ([~f.pompermaier]):
Then we can add a *scan.parametervalues.provider.class* in order to customize 
the splitting of the query (we can also add a check that the query contains at 
least 1 question mark).
If we introduce a custom parameters provider we also need to specify
parameters, using something like:
'scan.parametervalues.0.name' = 'minDate', 
'scan.parametervalues.0.value'= '12/10/2019'
'scan.parametervalues.1.name' = 'maxDate', 
'scan.parametervalues.1.value'= '01/01/2020'

[~lzljs3620320]
Maybe we need to add something like "scan.parametervalues.provider.type"; it
could be "bound", "specify" or "custom":
- when *bound*, using old p*artitionLowerBound* and *partitionUpperBound*, 
*numPartitions*
- when *specify*, using specify parameters like your proposal
- when *custom*, need *scan.parametervalues.provider.class*

Actually I don't know if specify and custom can be separated, but this can be
discussed further.
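
For context, a custom provider of the kind discussed here only has to produce
the parameter sets used to split the query. A rough, self-contained sketch
modeled on the existing ParameterValuesProvider interface (in Flink 1.10 it
lives in the JDBC connector and exposes a single getParameterValues() method;
it is re-declared below only to keep the example standalone):

    import java.io.Serializable;
    import java.time.LocalDate;
    import java.util.ArrayList;
    import java.util.List;

    // Stand-in for Flink's ParameterValuesProvider interface (sketch only).
    interface ParameterValuesProvider {
        Serializable[][] getParameterValues();
    }

    // Produces one split per day between minDate and maxDate; each row fills the
    // two '?' placeholders of the query with [start, end) bounds.
    class DateRangeParametersProvider implements ParameterValuesProvider {

        private final LocalDate minDate;
        private final LocalDate maxDate;

        DateRangeParametersProvider(LocalDate minDate, LocalDate maxDate) {
            this.minDate = minDate;
            this.maxDate = maxDate;
        }

        @Override
        public Serializable[][] getParameterValues() {
            List<Serializable[]> splits = new ArrayList<>();
            for (LocalDate d = minDate; !d.isAfter(maxDate); d = d.plusDays(1)) {
                splits.add(new Serializable[] {
                        java.sql.Date.valueOf(d), java.sql.Date.valueOf(d.plusDays(1))});
            }
            return splits.toArray(new Serializable[0][]);
        }
    }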





[jira] [Created] (FLINK-17358) JDBCTableSource support FiltertableTableSource

2020-04-23 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-17358:
--

 Summary: JDBCTableSource support FiltertableTableSource
 Key: FLINK-17358
 URL: https://issues.apache.org/jira/browse/FLINK-17358
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / JDBC
Reporter: Flavio Pompermaier


Let JDBCTableSource implement FilterableTableSource





[jira] [Created] (FLINK-17357) add "DROP catalog" DDL to blink planner

2020-04-23 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-17357:
--

 Summary: add "DROP catalog" DDL to blink planner
 Key: FLINK-17357
 URL: https://issues.apache.org/jira/browse/FLINK-17357
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / API
Reporter: Flavio Pompermaier


https://cwiki.apache.org/confluence/display/FLINK/FLIP+69+-+Flink+SQL+DDL+Enhancement

Some customers who have an internal streaming platform requested this feature,
as it is currently not possible on such a platform to load catalogs dynamically
at runtime via the SQL client YAML. Catalog DDL would come into play here.

(Similarly to https://issues.apache.org/jira/browse/FLINK-15349)
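
The requested statement would presumably mirror the existing catalog DDL; a
sketch of what it might look like (exact syntax to be defined by the FLIP):

    DROP CATALOG my_catalog;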





[jira] [Created] (FLINK-17356) Properly set constraints (PK and UNIQUE)

2020-04-23 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-17356:
--

 Summary: Properly set constraints (PK and UNIQUE)
 Key: FLINK-17356
 URL: https://issues.apache.org/jira/browse/FLINK-17356
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / JDBC
Reporter: Flavio Pompermaier


At the moment the PostgresCatalog does not create field constraints (there are
only UNIQUE and PRIMARY_KEY in the TableSchema; could it be worth also adding
NOT_NULL?).





[jira] [Created] (FLINK-17284) Support serial field type in PostgresCatalog

2020-04-20 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-17284:
--

 Summary: Support serial field type in PostgresCatalog
 Key: FLINK-17284
 URL: https://issues.apache.org/jira/browse/FLINK-17284
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / JDBC
Affects Versions: 1.11.0
Reporter: Flavio Pompermaier


In the current version of the PostgresCatalog the serial type is not handled, 
while it can be safely mapped to INT.

See an example at  https://www.postgresqltutorial.com/postgresql-create-table/





Re: PackagedProgram and ProgramDescription

2020-03-30 Thread Flavio Pompermaier
I would personally like to see a way of describing a Flink job/pipeline
(including its parameters and types) in order to enable better UIs; the
important thing is to keep things consistent and aligned with the new client
developments and to use this new dev sprint to fix such issues.

On Mon, Mar 30, 2020 at 11:38 AM Aljoscha Krettek 
wrote:

> On 18.03.20 14:45, Flavio Pompermaier wrote:
> > what do you think if we exploit this job-submission sprint to address
> also
> > the problem discussed in
> https://issues.apache.org/jira/browse/FLINK-10862?
>
> That's a good idea! What should we do? It seems that most committers on
> the issue were in favour of deprecating/removing ProgramDescription.
>


PackagedProgram and ProgramDescription

2020-03-18 Thread Flavio Pompermaier
Hi all,
what do you think if we exploit this job-submission sprint to address also
the problem discussed in https://issues.apache.org/jira/browse/FLINK-10862?

Best,
Flavio


FLIP-117: HBase catalog

2020-03-13 Thread Flavio Pompermaier
Hello everybody,
I started a new FLIP to discuss an HBaseCatalog implementation[1]
after the related issue was opened by Bowen [2].
I drafted a very simple version of the FLIP just to discuss the
critical points (in red) in order to decide how to proceed.

Best,
Flavio

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-117%3A+HBase+catalog
[2] https://issues.apache.org/jira/browse/FLINK-16575


Re: [VOTE] [FLIP-85] Flink Application Mode

2020-03-12 Thread Flavio Pompermaier
+1 (non-binding).
There's also a related issue that I opened a long time ago,
https://issues.apache.org/jira/browse/FLINK-10879, that could be closed once
this FLIP is implemented (or closed immediately and referenced as a duplicate
by the new JIRA ticket that would be created).


On Thu, Mar 12, 2020 at 11:17 AM Aljoscha Krettek 
wrote:

> +1 (binding)
>
> Aljoscha
>


Re: [DISCUSS] Remove Eclipse-specific plugin configurations

2020-03-03 Thread Flavio Pompermaier
Yes, in my experience.. I always asked myself if I was the only one using
Eclipse.. :D

On Tue, Mar 3, 2020 at 2:33 PM Chesnay Schepler  wrote:

> To clarify, the whole lifecycle-mapping business is both unnecessary and
> actively harmful?
>
> On 03/03/2020 14:18, Flavio Pompermaier wrote:
> > Sorry for the late reply. I'm using the latest Eclipse (2019-12 R) and if I
> > create a project using the Flink 1.10 archetype, Eclipse doesn't reconstruct
> > the source folders correctly.
> > If I remove the lifecycle-mapping plugin from the build section
> everything
> > works as expected.
> >
> > About Flink development it's almost the same. I only need to install the
> > Scala plugin if I need to debug or edit the Scala code.
> >
> > On Sat, Feb 29, 2020 at 12:45 AM Stephan Ewen  wrote:
> >
> >> Flavio, do you load Flink source code into Eclipse, or develop Flink
> >> applications in Eclipse (based on the quickstart archetypes)?
> >>
> >> On Fri, Feb 28, 2020 at 4:10 PM Chesnay Schepler 
> >> wrote:
> >>
> >>> What do you have to change it to?
> >>>
> >>> What happens if you just remove it completely?
> >>>
> >>> On 28/02/2020 16:08, Flavio Pompermaier wrote:
> >>>> I use Eclipse but the stuff added in the pom.xml to improve the
> >>>> out-of-the-box experience is pretty useless, I always have to change
> it
> >>>>
> >>>> On Fri, Feb 28, 2020 at 4:01 PM Chesnay Schepler 
> >>> wrote:
> >>>>> Hello,
> >>>>>
> >>>>> in various maven pom.xml we have some plugin definitions exclusively
> to
> >>>>> increase support for the Eclipse IDE.
> >>>>>
> >>>>> From what I have heard developing Flink is not really possible with
> >>>>> Eclipse (we explicitly recommend IntelliJ in our documentation); I'm
> >>>>> not aware of any committer using it at least.
> >>>>>
> >>>>> Hence I wanted to ask here to find out whether anyone is using
> Eclipse.
> >>>>>
> >>>>> If not, then I would like to remove this stuff from the poms in an effort
> >>>>> to reduce noise, and to reduce issues when using other IDEs (like what
> >>>>> happened for vscode-java in FLINK-16150).
> >>>>>
> >>>
>


Re: [DISCUSS] Remove Eclipse-specific plugin configurations

2020-03-03 Thread Flavio Pompermaier
Sorry for the late reply. I'm using the latest Eclipse (2019-12 R) and if I
create a project using the Flink 1.10 archetype, Eclipse doesn't reconstruct
the source folders correctly.
If I remove the lifecycle-mapping plugin from the build section everything
works as expected.

About Flink development it's almost the same. I only need to install the
Scala plugin if I need to debug or edit the Scala code.

On Sat, Feb 29, 2020 at 12:45 AM Stephan Ewen  wrote:

> Flavio, do you load Flink source code into Eclipse, or develop Flink
> applications in Eclipse (based on the quickstart archetypes)?
>
> On Fri, Feb 28, 2020 at 4:10 PM Chesnay Schepler 
> wrote:
>
>> What do you have to change it to?
>>
>> What happens if you just remove it completely?
>>
>> On 28/02/2020 16:08, Flavio Pompermaier wrote:
>> > I use Eclipse but the stuff added in the pom.xml to improve the
>> > out-of-the-box experience is pretty useless, I always have to change it
>> >
>> > On Fri, Feb 28, 2020 at 4:01 PM Chesnay Schepler 
>> wrote:
>> >
>> >> Hello,
>> >>
>> >> in various maven pom.xml we have some plugin definitions exclusively to
>> >> increase support for the Eclipse IDE.
>> >>
>> >>   From what I have heard developing Flink is not really possible with
>> >> Eclipse (we explicitly recommend IntelliJ in our documentation); I'm
>> >> not aware of any committer using it at least.
>> >>
>> >> Hence I wanted to ask here to find out whether anyone is using Eclipse.
>> >>
>> >> If not, then I would like to remove this stuff from the poms in an effort
>> >> to reduce noise, and to reduce issues when using other IDE's (like what
>> >> happened for vscode-java in FLINK-16150).
>> >>
>>
>>


Re: Flink dev blog

2020-03-03 Thread Flavio Pompermaier
Big +1 from my side. I'd be very interested in what Jeff proposed, in
particular everything related to the client part (job submission, workflow
management, callbacks on submission/success/failure, etc.).
Something I can't find anywhere is also how to query Flink state. Would it
be possible to have something like the Presto UI [1]? Does Flink implement
some sort of query queuing? I heard about a query proxy server but I don't
know if there's a will to push in that direction.
For Stateful Functions it would be nice to compare in depth the taxi-driver
solution with a more common implementation (i.e. using a database to
persist the legal data; is it safe to keep it as Flink state?).

[1] https://www.tutorialspoint.com/apache_presto/images/web_interface.jpg

Best,
Flavio

On Tue, Mar 3, 2020 at 10:47 AM Jeff Zhang  wrote:

> +1 for this proposal.  I am preparing some articles for how to use Flink on
> Zeppelin, although it is not closely related with this topic, but should be
> helpful for users to get started with Flink.
>
> On Tue, Mar 3, 2020 at 5:39 PM Till Rohrmann  wrote:
>
> > I like the idea. +1 from my side.
> >
> > Potential topics:
> > - Scheduling
> > - Cluster partitions
> > - Memory configuration
> > - Recovery
> >
> > Cheers,
> > Till
> >
> > On Tue, Mar 3, 2020 at 3:56 AM Xintong Song 
> wrote:
> >
> > > Big +1. Thanks for the idea, Arvid.
> > >
> > > I'd be excited to read such blogs.
> > >
> > > And we would also be happy to contribute some contents on the newest
> > > efforts from our team.
> > > Potential topics:
> > > - Memory configuration
> > > - Active Kubernetes integration
> > > - GPU support
> > > - Pluggable (dynamic) slot allocation
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Tue, Mar 3, 2020 at 9:59 AM Benchao Li  wrote:
> > >
> > > > +1 for this proposal. As a contributor, it would be very helpful to
> > have
> > > > such blogs for us to understand status and future of Flink.
> > > >
> > > > > On Tue, Mar 3, 2020 at 6:00 AM Robert Metzger  wrote:
> > > >
> > > > > I would be excited to read such a blog (can I request topics? :) )
> > > > >
> > > > > We could start very low key by using our wiki's blog feature:
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=FLINK
> > > > >
> > > > > On Mon, Mar 2, 2020 at 8:26 PM Stephan Ewen 
> > wrote:
> > > > >
> > > > > > Great idea, but I also second Seth's comment to separate this in
> a
> > > > clear
> > > > > > way. It's easy to confuse new / potential users.
> > > > > >
> > > > > > On Mon, Mar 2, 2020 at 8:15 PM Seth Wiesman  >
> > > > wrote:
> > > > > >
> > > > > > > +1 on the idea.
> > > > > > >
> > > > > > > My only request would be they are clearly marked as being about
> > > > > > internals /
> > > > > > > for advanced users to not give typical users the wrong
> impression
> > > > about
> > > > > > how
> > > > > > > much they need to understand to use Flink. Nico's network stack
> > > blog
> > > > > post
> > > > > > > does this well[1].
> > > > > > >
> > > > > > > Seth
> > > > > > >
> > > > > > > [1]
> https://flink.apache.org/2019/06/05/flink-network-stack.html
> > > > > > >
> > > > > > > On Mon, Mar 2, 2020 at 10:39 AM Ufuk Celebi 
> > > wrote:
> > > > > > >
> > > > > > > > I'd be happy to read such a blog. Big +1 as a potential
> reader.
> > > ;-)
> > > > > > > >
> > > > > > > > – Ufuk
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Mar 2, 2020 at 11:53 AM Arvid Heise <
> > ar...@ververica.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Dear devs,
> > > > > > > > >
> > > > > > > > > development speed of Flink has steadily increased. Lots of
> > new
> > > > > > concepts
> > > > > > > > are
> > > > > > > > > introduced and technical debt removed. However, it's hard
> to
> > > keep
> > > > > > track
> > > > > > > > of
> > > > > > > > > these things if you are not directly involved. Especially
> for
> > > new
> > > > > > > > > contributors, it's often not easy to know what the best
> > > practices
> > > > > are
> > > > > > > or
> > > > > > > > if
> > > > > > > > > there are related work streams going on.
> > > > > > > > >
> > > > > > > > > In the runtime team, we had the idea to set up a dev blog
> > where
> > > > we
> > > > > > > could
> > > > > > > > > introduce newest developments. The scope should be expert
> > users
> > > > > that
> > > > > > > > > contribute to the project. Of course, some articles may
> have
> > a
> > > > > > broader
> > > > > > > > > scope and even be linked from release notes.
> > > > > > > > >
> > > > > > > > > Examples from our team to give a more specific idea:
> > > > > > > > > * Deprecated checkpoint lock and mailbox model
> > > > > > > > > * Revised interface for two phase commit sinks and new JDBC
> > > sink
> > > > > > > > > * N-ary input operators
> > > > > > > > > * Unaligned checkpoints
> > > > > > > > > * Operator factories
> > > > > > > > > * Plugins
> > > > > > > > >
> > > > > > > > > These articles would be l

Re: [DISCUSS] Remove Eclipse-specific plugin configurations

2020-02-28 Thread Flavio Pompermaier
I use Eclipse, but the stuff added in the pom.xml to improve the
out-of-the-box experience is pretty useless; I always have to change it.

On Fri, Feb 28, 2020 at 4:01 PM Chesnay Schepler  wrote:

> Hello,
>
> in various maven pom.xml we have some plugin definitions exclusively to
> increase support for the Eclipse IDE.
>
>  From what I have heard developing Flink is not really possible with
> Eclipse (we explicitly recommend IntelliJ in our documentation); I'm
> not aware of any committer using it at least.
>
> Hence I wanted to ask here to find out whether anyone is using Eclipse.
>
> If not, then I would like to remove this stuff from the poms in an effort
> to reduce noise, and to reduce issues when using other IDE's (like what
> happened for vscode-java in FLINK-16150).
>


Re: [DISCUSS] Drop connectors for Elasticsearch 2.x and 5.x

2020-02-10 Thread Flavio Pompermaier
+1 for dropping all Elasticsearch connectors < 6.x

On Mon, Feb 10, 2020 at 2:45 PM Dawid Wysakowicz 
wrote:

> Hi all,
>
> As described in this https://issues.apache.org/jira/browse/FLINK-11720
> ticket our elasticsearch 5.x connector does not work out of the box on
> some systems and requires a version bump. This also happens for our e2e.
> We cannot bump the version in es 5.x connector, because 5.x connector
> shares a common class with 2.x that uses an API that was replaced in 5.2.
>
> Both versions are already long eol: https://www.elastic.co/support/eol
>
> I suggest to drop both connectors 5.x and 2.x. If it is too much to drop
> both of them, I would strongly suggest dropping at least 2.x connector
> and update the 5.x line to a working es client module.
>
> What do you think? Should we drop both versions? Drop only the 2.x
> connector? Or keep them both?
>
> Best,
>
> Dawid
>
>


Re: [Discussion] Job generation / submission hooks & Atlas integration

2020-02-05 Thread Flavio Pompermaier
Hi Gyula,
thanks for taking care of integrating Flink with Atlas (and, ultimately, the
Egeria initiative), which is IMHO the most important part of the whole Hadoop
ecosystem and which, unfortunately, has been quite overlooked. I can confirm
that the integration with Atlas/Egeria is definitely of great interest.

On Wed, Feb 5, 2020, 17:12 Till Rohrmann  wrote:

> Hi Gyula,
>
> thanks for starting this discussion. Before diving in the details of how to
> implement this feature, I wanted to ask whether it is strictly required
> that the Atlas integration lives within Flink or not? Could it also work if
> you have tool which receives job submissions, extracts the required
> information, forwards the job submission to Flink, monitors the execution
> result and finally publishes some information to Atlas (modulo some other
> steps which are missing in my description)? Having a different layer being
> responsible for this would keep complexity out of Flink.
>
> Cheers,
> Till
>
> On Wed, Feb 5, 2020 at 12:48 PM Gyula Fóra  wrote:
>
> > Hi all!
> >
> > We have started some preliminary work on the Flink - Atlas integration at
> > Cloudera. It seems that the integration will require some new hook
> > interfaces at the jobgraph generation and submission phases, so I
> figured I
> > will open a discussion thread with my initial ideas to get some early
> > feedback.
> >
> > *Minimal background*
> > Very simply put Apache Atlas is a data governance framework that stores
> > metadata for our data and processing logic to track ownership, lineage
> etc.
> > It is already integrated with systems like HDFS, Kafka, Hive and many
> > others.
> >
> > Adding Flink integration would mean that we can track the input output
> data
> > of our Flink jobs, their owners and how different Flink jobs are
> connected
> > to each other through the data they produce (lineage). This seems to be a
> > very big deal for a lot of companies :)
> >
> > *Flink - Atlas integration in a nutshell*
> > In order to integrate with Atlas we basically need 2 things.
> >  - Flink entity definitions
> >  - Flink Atlas hook
> >
> > The entity definition is the easy part. It is a json that contains the
> > objects (entities) that we want to store for any given Flink job. As a
> > starter we could have a single FlinkApplication entity that has a set of
> > inputs and outputs. These inputs/outputs are other Atlas entities that are
> > already defined, such as a Kafka topic or an HBase table.
> >
> > The Flink atlas hook will be the logic that creates the entity instance
> and
> > uploads it to Atlas when we start a new Flink job. This is the part where
> > we implement the core logic.
> >
> > *Job submission hook*
> > In order to implement the Atlas hook we need a place where we can inspect
> > the pipeline, create and send the metadata when the job starts. When we
> > create the FlinkApplication entity we need to be able to easily determine
> > the sources and sinks (and their properties) of the pipeline.
> >
> > Unfortunately there is no JobSubmission hook in Flink that could execute
> > this logic and even if there was one there is a mismatch of abstraction
> > levels needed to implement the integration.
> > We could imagine a JobSubmission hook executed in the JobManager runner
> as
> > this:
> >
> > void onSuccessfulSubmission(JobGraph jobGraph, Configuration
> > configuration);
> >
> > This is nice but the JobGraph makes it super difficult to extract sources
> > and UDFs to create the metadata entity. The atlas entity however could be
> > easily created from the StreamGraph object (used to represent the logical
> > flow) before the JobGraph is generated. To go around this limitation we
> > could add a JobGraphGeneratorHook interface:
> >
> > void preProcess(StreamGraph streamGraph); void postProcess(JobGraph
> > jobGraph);
> >
> > We could then generate the Atlas entity in the preprocess step and add a
> > job submission hook in the postprocess step that will simply send the already
> > baked-in entity.
> >
> > *This kinda works but...*
> > The approach outlined above seems to work and we have built a POC using
> it.
> > Unfortunately it is far from nice as it exposes non-public APIs such as
> the
> > StreamGraph. Also it feels a bit weird to have 2 hooks instead of one.
> >
> > It would be much nicer if we could somehow go back from JobGraph to
> > StreamGraph or at least have an easy way to access source/sink UDFS.
> >
> > What do you think?
> >
> > Cheers,
> > Gyula
> >
>
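
Putting the two hooks proposed in the message above together, the interface
sketch would look roughly like this; the names come from the thread and do not
correspond to an existing Flink API:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.jobgraph.JobGraph;
    import org.apache.flink.streaming.api.graph.StreamGraph;

    // Sketch of the proposed hook -- NOT an existing Flink interface. An Atlas
    // integration would build the lineage entity in preProcess() (sources and
    // sinks are still visible on the StreamGraph) and publish it once the job
    // has been submitted successfully.
    public interface JobGraphGeneratorHook {

        void preProcess(StreamGraph streamGraph);

        void postProcess(JobGraph jobGraph);

        void onSuccessfulSubmission(JobGraph jobGraph, Configuration configuration);
    }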


Re: [DISCUSS] FLIP-92: JDBC catalog and Postgres catalog

2020-01-21 Thread Flavio Pompermaier
Hi all,
I'm happy to see a lot of interest in easing the integration with JDBC data
sources. Maybe this could be a rare situation (not in my experience
however..) but what if I have to connect to the same type of source (e.g.
MySQL) with 2 incompatible versions? How can I load the 2 (or more)
connector jars without causing conflicts?

On Tue, Jan 14, 2020, 23:32 Bowen Li  wrote:

> Hi devs,
>
> I've updated the wiki according to feedbacks. Please take another look.
>
> Thanks!
>
>
> On Fri, Jan 10, 2020 at 2:24 PM Bowen Li  wrote:
>
> > Thanks everyone for the prompt feedback. Please see my response below.
> >
> > > In Postgres, the TIME/TIMESTAMP WITH TIME ZONE has the
> > java.time.Instant semantic, and should be mapped to Flink's
> TIME/TIMESTAMP
> > WITH LOCAL TIME ZONE
> >
> > Zhenghua, you are right that pg's 'timestamp with timezone' should be
> > translated into flink's 'timestamp with local timezone'. I don't find
> 'time
> > with (local) timezone' though, so we may not support that type from pg in
> > Flink.
> >
> > > I suggest that the parameters can be completely consistent with the
> > JDBCTableSource / JDBCTableSink. If you take a look to JDBC api:
> > "DriverManager.getConnection".
> > That allow "default db, username, pwd" things optional. They can included
> > in URL. Of course JDBC api also allows establishing connections to
> > different databases in a db instance. So I think we don't need provide a
> > "base_url", we can just provide a real "url". To be consistent with JDBC
> > api.
> >
> > Jingsong, what I'm saying is a builder can be added on demand later if
> > there's enough user requesting it, and doesn't need to be a core part of
> > the FLIP.
> >
> > Besides, unfortunately Postgres doesn't allow changing databases via
> JDBC.
> >
> > JDBC provides different connecting options as you mentioned, but I'd like
> > to keep our design and API simple and having to handle extra parsing
> logic.
> > And it doesn't shut the door for what you proposed as a future effort.
> >
> > > Since the PostgreSQL does not have catalog but schema under database,
> > why not mapping the PG-database to Flink catalog and PG-schema to Flink
> > database
> >
> > Danny, because 1) there are frequent use cases where users want to switch
> > databases or referencing objects across databases in a pg instance 2)
> > schema is an optional namespace layer in pg, it always has a default
> value
> > ("public") and can be invisible to users if they'd like to as shown in
> the
> > FLIP 3) as you mentioned it is specific to postgres, and I don't feel
> it's
> > necessary to map Postgres substantially different than others DBMSs with
> > additional complexity
> >
> > >'base_url' configuration: We are following the configuration format
> > guideline [1] which suggest to use dash (-) instead of underline (_). And
> > I'm a little confused the meaning of "base_url" at the first glance,
> > another idea is split it into several configurations: 'driver',
> 'hostname',
> > 'port'.
> >
> > Jark, I agreed we should use "base-url" in yaml config.
> >
> > I'm not sure about having hostname and port separately because you can
> > specify multiple hosts with ports in jdbc, like
> > "jdbc:dbms/host1:port1,host2:port2/", for connection failovers.
> Separating
> > them would make configurations harder.
> >
> > I will add clear doc and example to avoid any possible confusion.
> >
> > > 'default-database' is optional, then which database will be used or
> what
> > is the behavior when the default database is not selected.
> >
> > This should be DBMS specific. For postgres, it will be the 
> > database.
> >
> >
> > On Thu, Jan 9, 2020 at 9:48 PM Zhenghua Gao  wrote:
> >
> >> Hi Bowen, Thanks for driving this.
> >> I think it would be very convenience to use tables in external DBs with
> >> JDBC Catalog.
> >>
> >> I have one concern about "Flink-Postgres Data Type Mapping" part:
> >>
> >> In Postgres, the TIME/TIMESTAMP WITH TIME ZONE has the
> java.time.Instant
> >> semantic,
> >> and should be mapped to Flink's TIME/TIMESTAMP WITH LOCAL TIME ZONE
> >>
> >> *Best Regards,*
> >> *Zhenghua Gao*
> >>
> >>
> >> On Fri, Jan 10, 2020 at 11:09 AM Jingsong Li 
> >> wrote:
> >>
> >> > Hi Bowen, thanks for reply and updating.
> >> >
> >> > > I don't see much value in providing a builder for jdbc catalogs, as
> >> they
> >> > only have 4 or 5 required params, no optional ones. I prefer users
> just
> >> > provide a base url without default db, usrname, pwd so we don't need
> to
> >> > parse url all around, as I mentioned jdbc catalog may need to
> establish
> >> > connections to different databases in a db instance,
> >> >
> >> > I suggest that the parameters can be completely consistent with the
> >> > JDBCTableSource / JDBCTableSink.
> >> > If you take a look to JDBC api: "DriverManager.getConnection".
> >> > That allow "default db, username, pwd" things optional. They can
> >> included
> >> > in URL. Of course JDBC api also allows establi
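
For illustration, a JDBC catalog configured with the parameters discussed above
(base URL, default database, username, password) might appear in the SQL client
YAML roughly as follows; the option names follow the discussion, not
necessarily the final FLIP-92 design:

    catalogs:
      - name: mypg
        type: jdbc
        default-database: mydb
        username: user
        password: secret
        base-url: jdbc:postgresql://localhost:5432/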

Re: [jira] [Created] (FLINK-15644) Add support for SQL query validation

2020-01-19 Thread Flavio Pompermaier
Ok, thanks for the pointer, I wasn't aware of that!

On Sun, Jan 19, 2020, 03:00 godfrey he  wrote:

> hi Flavio, TableEnvironment.getCompletionHints may already meet the
> requirement.
>
> Flavio Pompermaier  于2020年1月18日周六 下午3:39写道:
>
> > Why not also add a suggest() method (also unimplemented initially) that
> > would return the list of suitable completions/tokens for the current query?
> > How complex would it be to implement, in your opinion?
> >
> > On Fri, Jan 17, 2020, 18:32 Fabian Hueske (Jira) wrote:
> >
> > > Fabian Hueske created FLINK-15644:
> > > -
> > >
> > >  Summary: Add support for SQL query validation
> > >  Key: FLINK-15644
> > >  URL:
> https://issues.apache.org/jira/browse/FLINK-15644
> > >  Project: Flink
> > >   Issue Type: New Feature
> > >   Components: Table SQL / API
> > > Reporter: Fabian Hueske
> > >
> > >
> > > It would be good if the {{TableEnvironment}} would offer methods to
> check
> > > the validity of SQL queries. Such a method could be used by services
> (CLI
> > > query shells, notebooks, SQL UIs) that are backed by Flink and execute
> > > their queries on Flink.
> > >
> > > Validation should be available in two levels:
> > >  # Validation of syntax and semantics: This includes parsing the query,
> > > checking the catalog for dbs, tables, fields, type checks for
> expressions
> > > and functions, etc. This will check if the query is a valid SQL query.
> > >  # Validation that query is supported: Checks if Flink can execute the
> > > given query. Some syntactically and semantically valid SQL queries are
> > not
> > > supported, esp. in a streaming context. This requires running the
> > > optimizer. If the optimizer generates an execution plan, the query can
> be
> > > executed. This check includes the first step and is more expensive.
> > >
> > > The reason for this separation is that the first check can be done much
> > > faster as it does not involve calling the optimizer. Hence, it would be
> > > suitable for fast checks in an interactive query editor. The second
> check
> > > might take more time (depending on the complexity of the query) and
> might
> > > not be suitable for rapid checks but only on explicit user request.
> > >
> > > Requirements:
> > >  * validation does not modify the state of the {{TableEnvironment}},
> i.e.
> > > it does not add plan operators
> > >  * validation does not require connector dependencies
> > >  * validation can identify the update mode of a continuous query result
> > > (append-only, upsert, retraction).
> > >
> > > Out of scope for this issue:
> > >  * better error messages for unsupported features as suggested by
> > > FLINK-7217
> > >
> > >
> > >
> > > --
> > > This message was sent by Atlassian Jira
> > > (v8.3.4#803005)
> > >
> >
>
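
For reference, the getCompletionHints method that godfrey points to already
exists on TableEnvironment and can be used roughly like this (a minimal sketch;
the quality of the hints depends on the planner and version):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class CompletionHintsExample {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.newInstance().build());

            String statement = "SELECT * FROM ";
            // Ask the planner for completion candidates at the cursor position.
            String[] hints = tEnv.getCompletionHints(statement, statement.length());
            for (String hint : hints) {
                System.out.println(hint);
            }
        }
    }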


Re: [jira] [Created] (FLINK-15644) Add support for SQL query validation

2020-01-17 Thread Flavio Pompermaier
Why not also add a suggest() method (also unimplemented initially) that
would return the list of suitable completions/tokens for the current query?
How complex would it be to implement, in your opinion?

On Fri, Jan 17, 2020, 18:32 Fabian Hueske (Jira) wrote:

> Fabian Hueske created FLINK-15644:
> -
>
>  Summary: Add support for SQL query validation
>  Key: FLINK-15644
>  URL: https://issues.apache.org/jira/browse/FLINK-15644
>  Project: Flink
>   Issue Type: New Feature
>   Components: Table SQL / API
> Reporter: Fabian Hueske
>
>
> It would be good if the {{TableEnvironment}} would offer methods to check
> the validity of SQL queries. Such a method could be used by services (CLI
> query shells, notebooks, SQL UIs) that are backed by Flink and execute
> their queries on Flink.
>
> Validation should be available in two levels:
>  # Validation of syntax and semantics: This includes parsing the query,
> checking the catalog for dbs, tables, fields, type checks for expressions
> and functions, etc. This will check if the query is a valid SQL query.
>  # Validation that query is supported: Checks if Flink can execute the
> given query. Some syntactically and semantically valid SQL queries are not
> supported, esp. in a streaming context. This requires running the
> optimizer. If the optimizer generates an execution plan, the query can be
> executed. This check includes the first step and is more expensive.
>
> The reason for this separation is that the first check can be done much
> faster as it does not involve calling the optimizer. Hence, it would be
> suitable for fast checks in an interactive query editor. The second check
> might take more time (depending on the complexity of the query) and might
> not be suitable for rapid checks but only on explicit user request.
>
> Requirements:
>  * validation does not modify the state of the {{TableEnvironment}}, i.e.
> it does not add plan operators
>  * validation does not require connector dependencies
>  * validation can identify the update mode of a continuous query result
> (append-only, upsert, retraction).
>
> Out of scope for this issue:
>  * better error messages for unsupported features as suggested by
> FLINK-7217
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
>


Re: [DISCUSS] Remove old WebUI

2019-11-22 Thread Flavio Pompermaier
+1 to drop the old UI

On Thu, Nov 21, 2019 at 1:05 PM Chesnay Schepler  wrote:

> Hello everyone,
>
> Flink 1.9 shipped with a new UI, with the old one being kept around as a
> backup in case something wasn't working as expected.
>
> Currently there are no issues indicating any significant problems
> (exclusive to the new UI), so I wanted to check what people think about
> dropping the old UI for 1.10.
>


Re: [DISCUSS] Semantic and implementation of per-job mode

2019-10-31 Thread Flavio Pompermaier
Hi all,
we use multiple jobs in one program a lot, and this is why: you fetch data
from a huge number of sources and, for each source, you do some
transformation, and then you want to write the union of all outputs into a
single directory (this assumes you're doing batch). When the number of
sources is large, if you want to do this in a single job the graph becomes
very big, and this is a problem for several reasons:

   - too many subtasks/threads per slot
   - increased back pressure
   - if a single "sub-job" fails, the whole job fails; this is very annoying
   if it happens after half a day, for example
   - in our use case, the big-graph mode takes much longer than running
   each job separately (but maybe this is true only if you don't have many
   hardware resources)
   - debugging the cause of a failure can become a daunting task if the job
   graph is too large
   - we faced many strange errors when trying to run the single big-job mode
   (due to serialization corruption)

So, summarizing our overall experience with Flink batch: the simpler the
job graph, the better!

Best,
Flavio
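
To illustrate the pattern described above (one small job graph per table
instead of one huge graph), the driver program looks roughly like this; the
table names, output paths and the per-table pipeline are placeholders:

    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.core.fs.FileSystem;

    public class PerTableIngestion {
        public static void main(String[] args) throws Exception {
            String[] tables = {"person", "address", "orders"};   // illustrative list

            for (String table : tables) {
                ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

                // placeholder pipeline: read the table, transform it, write the
                // result into the common output directory
                env.fromElements(table)
                   .map(t -> "ingested " + t)
                   .writeAsText("/tmp/output/" + table, FileSystem.WriteMode.OVERWRITE);

                // one execute() per table => one small job graph per table;
                // if this call fails, only this table's job has to be re-run
                env.execute("ingest-" + table);
            }
        }
    }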


On Thu, Oct 31, 2019 at 10:14 AM Yang Wang  wrote:

> Thanks for tison starting this exciting discussion. We also suffer a lot
> from the per job mode.
> I think the per-job cluster is a dedicated cluster for only one job and
> will not accept more other
> jobs. It has the advantage of one-step submission, do not need to start
> dispatcher first and
> then submit the job. And it does not matter where the job graph is
> generated and job is submitted.
> Now we have two cases.
>
> (1) Current Yarn detached cluster. The job graph is generated in client
> and then use distributed
> cache to flink master container. And the MiniDispatcher uses
> `FileJobGraphRetrieve` to get it.
> The job will be submitted at flink master side.
>
>
> (2) Standalone per job cluster. User jars are already built into image. So
> the job graph will be
> generated at flink master side and `ClasspathJobGraphRetriver` is used to
> get it. The job will
> also be submitted at flink master side.
>
>
> For the (1) and (2), only one job in user program could be supported. The
> per job means
> per job-graph, so it works just as expected.
>
>
> Tison suggests to add a new mode "per-program”. The user jar will be
> transferred to flink master
> container, and a local client will be started to generate job graph and
> submit job. I think it could
> cover all the functionality of current per job, both (1) and (2). Also the
> detach mode and attach
> mode could be unified. We do not need to start a session cluster to
> simulate per job for multiple parts.
>
>
> I am in favor of the “per-program” mode. Just two concerns.
> 1. How many users are using multiple jobs in one program?
> 2. Why do not always use session cluster to simulate per job? Maybe
> one-step submission
> is a convincing reason.
>
>
>
> Best,
> Yang
>
> On Thu, Oct 31, 2019 at 9:18 AM tison  wrote:
>
>> Thanks for your attentions!
>>
>> @shixiaoga...@gmail.com 
>>
>> Yes correct. We try to avoid jobs affect one another. Also a local
>> ClusterClient
>> in case saves the overhead about retry before leader elected and persist
>> JobGraph before submission in RestClusterClient as well as the net cost.
>>
>> @Paul Lam 
>>
>> 1. There is already a note[1] about multiple-part jobs. I was also confused
>> a bit by this concept at first :-) Things go in a similar way if your program
>> contains only one JobGraph, so per-program acts like per-job-graph in this
>> case, which provides compatibility for the many programs with a single job
>> graph.
>>
>> Besides, we have to respect the user program, which the current implementation
>> doesn't, because we return abruptly when calling env#execute, which hijacks
>> user control so that they cannot deal with the job result or the future of it.
>> I think this is why we have to add a detach/attach option.
>>
>> 2. For compilation part, I think it could be a workaround that you upload
>> those
>> resources in a commonly known address such as HDFS so that compilation
>> can read from either client or cluster.
>>
>> Best,
>> tison.
>>
>> [1]
>> https://issues.apache.org/jira/browse/FLINK-14051?focusedCommentId=16927430&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16927430
>>
>>
>>> On Wed, Oct 30, 2019 at 10:41 PM Newport, Billy  wrote:
>>
>>> We execute multiple job graphs routinely because we cannot submit a
>>> single graph without it blowing up. I believe Regina spoke to this in
>>> Berlin during her talk. We instead if we are processing a database
>>> ingestion with 200 tables in it, we do a job graph per table rather than a
>>> single job graph that does all tables instead. A single job graph can be in
>>> the tens of thousands of nodes in our largest cases and we have found flink
>>> (as os 1.3/1.6.4) cannot handle graphs of that size. We’re currently
>>> testing 1.9.1 but have not retested the large graph scenario.
>>>
>>>
>>>
>>> 

Re: [DISCUSS] Stateful Functions - in which form to contribute? (same or different repository)

2019-10-15 Thread Flavio Pompermaier
Definitely on the same page. +1 to keep it in a separate repo (at least
until the code becomes "stable" and widely adopted by the community).

On Tue, Oct 15, 2019, 23:17 Stephan Ewen  wrote:

> Hi Flink folks!
>
> After the positive reaction to the contribution proposal for Stateful
> Functions, I would like to kick off the discussion for the big question: In
> which form should it go into Flink?
>
> Before jumping into the "repository" question directly, let's get some
> clarity on what would be our high-level goal with this project and the
> contribution.
> My thinking so far was:
>
>   - Stateful Functions is a way for Flink and stream processing to become
> applicable for more general application development. That is a chance to
> grow our community to a new crowd of developers.
>
>   - While adding this to Flink gives synergies with the runtime it build on
> top of, it makes sense to offer the new developers a lightweight way to get
> involved. Simple setup, easy contributions.
>
>   - This is a new project, the API and many designs are not frozen at this
> point and may still change heavily.
> To become really good, the project needs to still make a bunch of
> iterations (no pun intended) and change many things quickly.
>
>   - The Stateful Functions project will likely try to release very
> frequently in its early days, to improve quickly and gather feedback fast.
> Being bound to Flink core release cycle would hurt here.
>
>
> I believe that with all those goals, adding Stateful Functions to the Flink
> core repository would not make sense. Flink core has processes that make
> sense for an established project that needs to guarantee stability. These
> processes are simply prohibitive for new projects to develop.
> In addition, the Flink main repository is gigantic, has a build system and
> CI system that cannot handle the size of the project any more. Not the best
> way to start expanding into a new community.
>
> In some sense, Stateful Functions could make sense as an independent
> project, but it is so tightly coupled to Flink right now that I think an
> even better fit is a separate repository in Flink.
> Think Hive and Hadoop in the early days. That way, we get the synergy
> between the two (the same community drives them) while letting both move at
> their own speed.
> It would somehow mean two closely related projects shepherded by the same
> community.
>
> It might be possible at a later stage to either merge this into Flink core
> (once Stateful Functions is more settled) or even spin this out as a
> standalone Apache project, if that is how the community develops.
>
> That is my main motivation. It is not driven primarily by technicalities
> like code versioning and dependencies, but much rather by what is the best
> setup to develop this as Flink's way to expand its community towards new
> users from a different background.
>
> Curious to hear if that makes sense to you.
>
> Best,
> Stephan
>


Re: [PROPOSAL] Contribute Stateful Functions to Apache Flink

2019-10-14 Thread Flavio Pompermaier
I'm obviously in favor of promoting the usage of this amazing library but,
maybe, at this early stage I'd try to keep it as a separate project.
However, this really depends on how frequently the code is going to change;
the Flink main repo is becoming more and more complex to handle due to the
increasing number of open issues and PRs.
I think this library is a perfect fit to test an alternative build system
based on multiple git repositories.

Best,
Flavio

On Mon, Oct 14, 2019, 14:53 vino yang  wrote:

> +1 to add Stateful Function to flink core to let it stay in the Flink
> repository.
>
> Best,
> Vino
>
> On Mon, Oct 14, 2019 at 7:29 PM Stephan Ewen  wrote:
>
> > Thank you all for the encouraging feedback! So far the reaction to add
> this
> > to Flink was exclusively positive, which is really great to see!
> >
> > To make this happen, here would be the next steps:
> >
> > (1) As per the bylaws, a contribution like that would need a PMC vote,
> > because it is a commitment to take this and shepherd
> > it in the future. I will kick that off next.
> >
> > (2) The biggest open question in the current discussion would be whether
> to
> > go with a separate repository, or put it into Flink core.
> > Related to the repository discussion is also how to link and present this
> > on the Flink website.
> > I will spin off a separate discussion for that, to keep the threads
> > focused.
> >
> > Best,
> > Stephan
> >
> >
> > On Mon, Oct 14, 2019 at 10:16 AM Becket Qin 
> wrote:
> >
> > > +1 to adding Stateful Function to Flink. It is a very useful addition
> to
> > > the Flink ecosystem.
> > >
> > > Given this is essentially a new top-level / first-class citizen API of Flink,
> > > it seems better to have it in the Flink core repo. This will also avoid
> > > letting this important new API be blocked on potential problems of
> > > maintaining multiple different repositories.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Sun, Oct 13, 2019 at 4:48 AM Hequn Cheng 
> > wrote:
> > >
> > >> Hi Stephan,
> > >>
> > >> Big +1 for adding this to Apache Flink!
> > >>
> > >> As for the question of whether this should be added to the Flink main
> > >> repository, from my side, I prefer to put it in the main repository.
> > >> Not only does Stateful Functions share very close relations with the
> > >> current Flink, but other libs or modules in Flink could also make use of
> > >> it the other way round in the future. At that point the Flink API stack
> > >> would also change a bit, and this would be cool.
> > >>
> > >> Best, Hequn
> > >>
> > >> On Sat, Oct 12, 2019 at 9:16 PM Biao Liu  wrote:
> > >>
> > >>> Hi Stephan,
> > >>>
> > >>> +1 for having Stateful Functions in Flink.
> > >>>
> > >>> Before discussing which repository it should belong to, I was wondering
> > >>> whether we have reached an agreement on "splitting the Flink repository"
> > >>> as Piotr mentioned. It seems there has simply been no further discussion.
> > >>> It's OK for me to add it to the core repository. After all, almost
> > >>> everything is in the core repository now. But if we decide to split the
> > >>> core repository someday, I would tend to create a separate repository for
> > >>> Stateful Functions. It might be a good time to take the first step of
> > >>> splitting.
> > >>>
> > >>> Thanks,
> > >>> Biao /'bɪ.aʊ/
> > >>>
> > >>>
> > >>>
> > >>> On Sat, 12 Oct 2019 at 19:31, Yu Li  wrote:
> > >>>
> >  Hi Stephan,
> > 
> >  Big +1 for adding stateful functions to Flink. I believe a lot of users
> >  would be interested in trying this out, and I can imagine how this could
> >  contribute to reducing the TCO for businesses requiring both stream
> >  processing and stateful functions.
> > 
> >  And my 2 cents is to put it into the Flink core repository, since I can
> >  see a tight connection between this library and Flink state.
> > 
> >  Best Regards,
> >  Yu
> > 
> > 
> >  On Sat, 12 Oct 2019 at 17:31, jincheng sun <
> sunjincheng...@gmail.com>
> >  wrote:
> > 
> > > Hi Stephan,
> > >
> > > Big +1 for adding this great feature to Apache Flink.
> > >
> > > Regarding where we should place it — put it into the Flink core repository
> > > or create a separate repository? I prefer putting it into the main
> > > repository and look forward to a more detailed discussion on this decision.
> > >
> > > Best,
> > > Jincheng
> > >
> > >
> > >> On Sat, Oct 12, 2019 at 11:32 AM Jingsong Li wrote:
> > >
> > >> Hi Stephan,
> > >>
> > >> Big +1 for this contribution. It provides another user interface that
> > >> is easy to use and popular at this time. These functions are hard for
> > >> users to write in SQL/Table API, while using DataStream is too complex
> > >> (we've done some StateFun-kind jobs using DataStream before). With
> > >> StateFun, it is very easy.
> > >>
> > >> I think it's also a good 

Re: [DISCUSS] FLIP-74: Flink JobClient API

2019-09-27 Thread Flavio Pompermaier
Hi all,
just a remark about the Flink REST APIs (and their client as well): almost
all the time we need a way to dynamically know which jobs are contained in
a jar file, and this could be exposed by the REST endpoint under
/jars/:jarid/entry-points (a simple way to implement this would be to check
the value of Main-Class or Main-classes inside the Manifest of the jar if
they exist [1]).

I understand that this is something that is not strictly required to
execute Flink jobs, but IMHO it would ease A LOT the work of UI developers,
who would then have a way to show users all the available jobs inside a jar,
plus their configurable parameters.
For example, right now in the WebUI, you can upload a jar and then you have
to set (without any autocomplete or UI support) the main class and its
params (for example using a string like --param1 xx --param2 yy).
Adding this functionality to the REST API and the respective client would
enable the WebUI (and all UIs interacting with a Flink cluster) to prefill
a dropdown list containing the entry-point classes (i.e. Flink
jobs) and, once selected, their required (typed) parameters.
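
For illustration, a rough sketch of how such an endpoint could discover the
entry points from the jar manifest (the "Main-classes" attribute name is taken
from the remark above and is an assumption, not a standard manifest attribute):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class EntryPointScanner {
  // Returns the entry-point classes declared in the jar manifest, if any.
  public static List<String> listEntryPoints(String jarPath) throws Exception {
    try (JarFile jar = new JarFile(jarPath)) {
      Manifest mf = jar.getManifest();
      if (mf == null) {
        return Collections.emptyList();
      }
      // Custom (assumed) attribute listing several main classes.
      String multi = mf.getMainAttributes().getValue("Main-classes");
      if (multi != null) {
        return Arrays.asList(multi.split("[,;\\s]+"));
      }
      // Fall back to the standard single Main-Class attribute.
      String single = mf.getMainAttributes().getValue("Main-Class");
      return single != null ? Collections.singletonList(single) : Collections.emptyList();
    }
  }
}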

Best,
Flavio

[1] https://issues.apache.org/jira/browse/FLINK-10864

On Fri, Sep 27, 2019 at 9:16 AM Zili Chen  wrote:

> modify
>
> /we just shutdown the cluster on the exit of client that running inside
> cluster/
>
> to
>
> we just shut down the cluster both on the exit of the client running inside
> the cluster and on the finish of the job.
> Since the client is running inside the cluster we can easily wait for both
> in ClusterEntrypoint.
>
>
> > On Fri, Sep 27, 2019 at 3:13 PM Zili Chen wrote:
>
> > About JobCluster
> >
> > Actually I am not quite sure what we gain from the DETACHED configuration on
> > the cluster side.
> > We don't in fact have a NON-DETACHED JobCluster in our codebase, right?
> >
> > It occurs to me that there is one major question we have to answer first.
> >
> > *What JobCluster conceptually is exactly*
> >
> > Related discussion can be found in JIRA[1] and mailing list[2]. Stephan
> > gives a nice
> > description of JobCluster:
> >
> > Two things to add: - The job mode is very nice in the way that it runs
> the
> > client inside the cluster (in the same image/process that is the JM) and
> > thus unifies both applications and what the Spark world calls the "driver
> > mode". - Another thing I would add is that during the FLIP-6 design, we
> > were thinking about setups where Dispatcher and JobManager are separate
> > processes. A Yarn or Mesos Dispatcher of a session could run
> independently
> > (even as privileged processes executing no code). Then you the "per-job"
> > mode could still be helpful: when a job is submitted to the dispatcher,
> it
> > launches the JM again in a per-job mode, so that JM and TM processes are
> > bound to teh job only. For higher security setups, it is important that
> > processes are not reused across jobs.
> >
> > However, currently in "per-job" mode we generate the JobGraph on the client
> > side, launch the JobCluster and retrieve the JobGraph for execution. So
> > actually, we don't "run the client inside the cluster".
> >
> > Besides, referring to the discussion with Till[1], it would be helpful if we
> > followed, from the user's perspective, the same process for "per-job" mode as
> > for session mode, i.e. we don't use OptimizedPlanEnvironment to create the
> > JobGraph, but directly deploy the Flink cluster in env.execute.
> >
> > Generally 2 points
> >
> > 1. Run the Flink job by invoking the user's main method and executing it
> > throughout, instead of creating the JobGraph from the main class.
> > 2. Run the client inside the cluster.
> >
> > If 1 and 2 are implemented, there is obviously no need for a DETACHED mode on
> > the cluster side, because we just shut down the cluster on the exit of the
> > client running inside the cluster. Whether or not the result is delivered is
> > up to user code.
> >
> > [1]
> >
> https://issues.apache.org/jira/browse/FLINK-14051?focusedCommentId=16931388&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16931388
> > [2]
> >
> https://lists.apache.org/x/thread.html/e8f14a381be6c027e8945f884c3cfcb309ce49c1ba557d3749fca495@%3Cdev.flink.apache.org%3E
> >
> >
> > On Fri, Sep 27, 2019 at 2:13 PM Zili Chen wrote:
> >
> >> Thanks for your replies Kostas & Aljoscha!
> >>
> >> Below are replies point by point.
> >>
> >> 1. For DETACHED mode, what I said there is about the DETACHED mode on the
> >> client side.
> >> There are two configurations that overload the term DETACHED[1].
> >>
> >> On the client side, it means whether or not client.submitJob blocks until
> >> the job execution result.
> >> Since client.submitJob returns a CompletableFuture, NON-DETACHED has no
> >> power at all. The caller of submitJob decides whether or not to block to
> >> get the JobClient and request the job execution result. If the client
> >> crashes, it is a user-scope exception that should be handled in user code;
> >> if the client loses the connection to the cluster, we have
> >> a ret

[jira] [Created] (FLINK-14209) Add jackson-dataformat-xml to flink-shaded

2019-09-25 Thread Flavio Pompermaier (Jira)
Flavio Pompermaier created FLINK-14209:
--

 Summary: Add jackson-dataformat-xml to flink-shaded
 Key: FLINK-14209
 URL: https://issues.apache.org/jira/browse/FLINK-14209
 Project: Flink
  Issue Type: Wish
  Components: BuildSystem / Shaded
Reporter: Flavio Pompermaier


This would greatly ease the integration with frameworks like Spring



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Flink client api enhancement for downstream project

2019-08-20 Thread Flavio Pompermaier
ubmission and execution return a CompletableFuture
> and
> > users control whether or not to wait for the result. At this point we don't
> > need a detached option, but the functionality is covered.
> >
> > 5. How does per-job mode interact with interactive programming.
> >
> > All of the YARN, Mesos and Kubernetes scenarios follow the pattern of
> > launching a JobCluster now. And I don't think there would be inconsistencies
> > between different resource managers.
> >
> > Best,
> > tison.
> >
> > [1]
> >
> https://lists.apache.org/x/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
> > [2]
> >
> https://docs.google.com/document/d/1UWJE7eYWiMuZewBKS0YmdVO2LUTqXPd6-pbOCof9ddY/edit?disco=DZaGGfs
> >
> > On Fri, Aug 16, 2019 at 9:20 PM Aljoscha Krettek wrote:
> >
> >> Hi,
> >>
> >> I read both Jeff's initial design document and the newer document by
> >> Tison. I also finally found the time to collect our thoughts on the
> issue,
> >> I had quite some discussions with Kostas and this is the result: [1].
> >>
> >> I think overall we agree that this part of the code is in dire need of
> >> some refactoring/improvements but I think there are still some open
> >> questions and some differences in opinion what those refactorings should
> >> look like.
> >>
> >> I think the API-side is quite clear, i.e. we need some JobClient API
> that
> >> allows interacting with a running Job. It could be worthwhile to spin
> that
> >> off into a separate FLIP because we can probably find consensus on that
> >> part more easily.
> >>
> >> For the rest, the main open questions from our doc are these:
> >>
> >>   - Do we want to separate cluster creation and job submission for
> >> per-job mode? In the past, there were conscious efforts to *not*
> separate
> >> job submission from cluster creation for per-job clusters for Mesos,
> YARN,
> >> Kubernetes (see StandaloneJobClusterEntryPoint). Tison suggests in his
> >> design document to decouple this in order to unify job submission.
> >>
> >>   - How to deal with plan preview, which needs to hijack execute() and
> >> let the outside code catch an exception?
> >>
> >>   - How to deal with Jar Submission at the Web Frontend, which needs to
> >> hijack execute() and let the outside code catch an exception?
> >> CliFrontend.run() “hijacks” ExecutionEnvironment.execute() to get a
> >> JobGraph and then execute that JobGraph manually. We could get around
> that
> >> by letting execute() do the actual execution. One caveat for this is
> that
> >> now the main() method doesn’t return (or is forced to return by
> throwing an
> >> exception from execute()) which means that for Jar Submission from the
> >> WebFrontend we have a long-running main() method running in the
> >> WebFrontend. This doesn’t sound very good. We could get around this by
> >> removing the plan preview feature and by removing Jar
> Submission/Running.
> >>
> >>   - How to deal with detached mode? Right now, DetachedEnvironment will
> >> execute the job and return immediately. If users control when they want
> to
> >> return, by waiting on the job completion future, how do we deal with
> this?
> >> Do we simply remove the distinction between detached/non-detached?
> >>
> >>   - How does per-job mode interact with “interactive programming”
> >> (FLIP-36). For YARN, each execute() call could spawn a new Flink YARN
> >> cluster. What about Mesos and Kubernetes?
> >>
> >> The first open question is where the opinions diverge, I think. The rest
> >> are just open questions and interesting things that we need to consider.
> >>
> >> Best,
> >> Aljoscha
> >>
> >> [1]
> >>
> https://docs.google.com/document/d/1E-8UjOLz4QPUTxetGWbU23OlsIH9VIdodpTsxwoQTs0/edit#heading=h.na7k0ad88tix
> >> <
> >>
> https://docs.google.com/document/d/1E-8UjOLz4QPUTxetGWbU23OlsIH9VIdodpTsxwoQTs0/edit#heading=h.na7k0ad88tix
> >> >
> >>
> >> > On 31. Jul 2019, at 15:23, Jeff Zhang  wrote:
> >> >
> >> > Thanks tison for the effort. I left a few comments.
> >> >
> >> >
> >> > On Wed, Jul 31, 2019 at 8:24 PM Zili Chen wrote:
> >> >
> >> >> Hi Flavio,
> >> >>
> >> >> Thanks for your reply.
> >> >>
> >> >

Re: [DISCUSS] Flink client api enhancement for downstream project

2019-07-31 Thread Flavio Pompermaier
Just one note on my side: it is not clear to me whether the client needs to
be able to generate a job graph or not.
In my opinion, the job jar must reside only on the server/jobManager side
and the client requires a way to get the job graph.
If you really want to access the job graph, I'd add a dedicated method
on the ClusterClient, like:

   - getJobGraph(jarId, mainClass): JobGraph
   - listMainClasses(jarId): List<String>

These would require some additions on the job manager endpoint as
well... what do you think?
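
For illustration, a rough sketch of what such methods could look like (the
interface name and exact signatures here are hypothetical, not an existing
Flink API):

import java.util.List;
import org.apache.flink.runtime.jobgraph.JobGraph;

// Hypothetical additions to a client-facing interface (sketch only).
public interface JobAwareClusterClient {

  // Compile the job graph server-side from an already-uploaded jar and a chosen main class.
  JobGraph getJobGraph(String jarId, String mainClass);

  // List the entry-point (main) classes contained in an uploaded jar.
  List<String> listMainClasses(String jarId);
}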

On Wed, Jul 31, 2019 at 12:42 PM Zili Chen  wrote:

> Hi all,
>
> Here is a document[1] on client api enhancement from our perspective.
> We have investigated current implementations. And we propose
>
> 1. Unify the implementation of cluster deployment and job submission in
> Flink.
> 2. Provide programmatic interfaces to allow flexible job and cluster
> management.
>
> The first proposal is aimed at reducing the code paths of cluster deployment
> and job submission so that one can adopt Flink easily. The second
> proposal is aimed at providing rich interfaces for advanced users
> who want precise control over these stages.
>
> Quick reference on open questions:
>
> 1. Exclude job cluster deployment from the client side, or redefine the
> semantics of the job cluster? It fits into a process quite different from
> session cluster deployment and job submission.
>
> 2. Maintain the codepaths handling class o.a.f.api.common.Program or
> implement customized program handling logic by customized CliFrontend?
> See also this thread[2] and the document[1].
>
> 3. Expose ClusterClient as a public API, or just expose an API in
> ExecutionEnvironment and delegate it to ClusterClient? Further, either way,
> is it worth introducing a JobClient which is an encapsulation of ClusterClient
> associated with a specific job?
>
> Best,
> tison.
>
> [1]
>
> https://docs.google.com/document/d/1UWJE7eYWiMuZewBKS0YmdVO2LUTqXPd6-pbOCof9ddY/edit?usp=sharing
> [2]
>
> https://lists.apache.org/thread.html/7ffc9936a384b891dbcf0a481d26c6d13b2125607c200577780d1e18@%3Cdev.flink.apache.org%3E
>
> On Wed, Jul 24, 2019 at 9:19 AM Jeff Zhang wrote:
>
> > Thanks Stephan, I will follow up on this issue in the next few weeks, and will
> > refine the design doc. We can discuss more details after the 1.9 release.
> >
> > On Wed, Jul 24, 2019 at 12:58 AM Stephan Ewen wrote:
> >
> > > Hi all!
> > >
> > > This thread has stalled for a bit, which I assume is mostly due to the
> > > Flink 1.9 feature freeze and release testing effort.
> > >
> > > I personally still recognize this issue as an important one to solve. I'd
> > > be happy to help resume this discussion soon (after the 1.9 release) and
> > > see if we can take some steps towards this in Flink 1.10.
> > >
> > > Best,
> > > Stephan
> > >
> > >
> > >
> > > On Mon, Jun 24, 2019 at 10:41 AM Flavio Pompermaier <
> > pomperma...@okkam.it>
> > > wrote:
> > >
> > > > That's exactly what I suggested a long time ago: the Flink REST client
> > > > should not require any Flink dependency, only an HTTP library to call
> > > > the REST services to submit and monitor a job.
> > > > What I also suggested in [1] was to have a way to automatically suggest
> > > > to the user (via a UI) the available main classes and their required
> > > > parameters [2].
> > > > Another problem we have with Flink is that the REST client and the CLI
> > > > one behave differently, and we use the CLI client (via ssh) because it
> > > > allows calling some other method after env.execute() [3] (we have to call
> > > > another REST service to signal the end of the job).
> > > > In this regard, a dedicated interface, like the JobListener suggested in
> > > > the previous emails, would be very helpful (IMHO).
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-10864
> > > > [2] https://issues.apache.org/jira/browse/FLINK-10862
> > > > [3] https://issues.apache.org/jira/browse/FLINK-10879
> > > >
> > > > Best,
> > > > Flavio
> > > >
> > > > On Mon, Jun 24, 2019 at 9:54 AM Jeff Zhang  wrote:
> > > >
> > > > > Hi, Tison,
> > > > >
> > > > > Thanks for your comments. Overall I agree with you that it is
> > difficult
> > > > for
> > > > > down stream project t

Re: [DISCUSS] Drop stale class Program

2019-07-19 Thread Flavio Pompermaier
+1 to directly removing the Program class (I think nobody uses it and it's not
supported at all by the REST services and UI).
Moreover, it requires a lot of transitive dependencies while it should be a
very simple thing.
+1 to adding this discussion to "Flink client api enhancement"

On Fri, Jul 19, 2019 at 11:14 AM Biao Liu  wrote:

> To Flavio, good point for the integration suggestion.
>
> I think it should be considered in the "Flink client api enhancement"
> discussion. But the outdated API should be deprecated somehow.
>
> On Fri, Jul 19, 2019 at 4:21 PM Flavio Pompermaier wrote:
>
> > In my experience a basic "official" (but optional) program description
> > would be very useful indeed (in order to ease the integration with other
> > frameworks).
> >
> > Of course it should be extended and integrated with the REST services and
> > the Web UI (when defined) in order to be useful.
> > It makes it easy to show the user what a job does and which parameters it
> > requires (optional or mandatory), with a proper help description.
> > Indeed, when we write a Flink job we implement the following interface:
> >
> > public interface FlinkJob {
> >   String getDescription();
> >   List<ClusterJobParameter> getParameters();
> >   boolean isStreamingOrBatch();
> > }
> >
> > public class ClusterJobParameter {
> >
> >   private String paramName;
> >   private String paramType = "string";
> >   private String paramDesc;
> >   private String paramDefaultValue;
> >   private Set<String> choices;
> >   private boolean mandatory;
> > }
> >
> > This really helps to launch a Flink job from a frontend (if the REST
> > services return back those infos).
> >
> > Best,
> > Flavio
> >
> > On Fri, Jul 19, 2019 at 9:57 AM Biao Liu  wrote:
> >
> > > Hi Zili,
> > >
> > > Thank you for bringing us this discussion.
> > >
> > > My gut feeling is +1 for dropping it.
> > > Usually it costs some time to deprecate a public (actually it's
> > > `PublicEvolving`) API. Ideally it should be marked as `Deprecated` first.
> > > Then it might be abandoned in some later version.
> > > I'm not sure how big the burden is to make it compatible with the enhanced
> > > client API. If it's a critical blocker, I support dropping it radically in
> > > 1.10. Of course a survey is necessary, and the result of the survey is
> > > acceptable.
> > > acceptable.
> > >
> > >
> > >
> > > On Fri, Jul 19, 2019 at 1:44 PM Zili Chen wrote:
> > >
> > > > Hi devs,
> > > >
> > > > Participating in the thread "Flink client api enhancement"[1], I just
> > > > noticed that inside the submission codepath of Flink we always have a
> > > > branch dealing with the case that the main class of the user program is a
> > > > subclass of o.a.f.api.common.Program, which is defined as
> > > >
> > > > @PublicEvolving
> > > > public interface Program {
> > > >   Plan getPlan(String... args);
> > > > }
> > > >
> > > > This class, as a user-facing interface, asks the user to implement #getPlan
> > > > which returns an almost Flink-internal class. FLINK-10862[2] pointed out
> > > > this confusion.
> > > >
> > > > Since our codebase contains quite a bit of code handling this stale class,
> > > > and it also obstructs the effort of enhancing the Flink client API,
> > > > I'd like to propose dropping it. Or we can start a survey on the user list
> > > > to see if there is any user depending on this class.
> > > >
> > > > best,
> > > > tison.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/ce99cba4a10b9dc40eb729d39910f315ae41d80ec74f09a356c73938@%3Cdev.flink.apache.org%3E
> > > > [2] https://issues.apache.org/jira/browse/FLINK-10862
> > > >
> > >
> >
>


Re: [DISCUSS] Drop stale class Program

2019-07-19 Thread Flavio Pompermaier
In my experience a basic "official" (but optional) program description
would be very useful indeed (in order to ease the integration with other
frameworks).

Of course it should be extended and integrated with the REST services and
the Web UI (when defined) in order to be useful.
It makes it easy to show the user what a job does and which parameters it
requires (optional or mandatory), with a proper help description.
Indeed, when we write a Flink job we implement the following interface:

public interface FlinkJob {
  String getDescription();
  List<ClusterJobParameter> getParameters();
  boolean isStreamingOrBatch();
}

public class ClusterJobParameter {

  private String paramName;
  private String paramType = "string";
  private String paramDesc;
  private String paramDefaultValue;
  private Set<String> choices;
  private boolean mandatory;
}

This really helps to launch a Flink job from a frontend (if the REST services
return back those infos).

Best,
Flavio

On Fri, Jul 19, 2019 at 9:57 AM Biao Liu  wrote:

> Hi Zili,
>
> Thank you for bringing us this discussion.
>
> My gut feeling is +1 for dropping it.
> Usually it costs some time to deprecate a public (actually it's
> `PublicEvolving`) API. Ideally it should be marked as `Deprecated` first.
> Then it might be abandoned in some later version.
> I'm not sure how big the burden is to make it compatible with the enhanced
> client API. If it's a critical blocker, I support dropping it radically in
> 1.10. Of course a survey is necessary, and the result of the survey is
> acceptable.
> acceptable.
>
>
>
> On Fri, Jul 19, 2019 at 1:44 PM Zili Chen wrote:
>
> > Hi devs,
> >
> > Participating in the thread "Flink client api enhancement"[1], I just noticed
> > that inside the submission codepath of Flink we always have a branch dealing
> > with the case that the main class of the user program is a subclass of
> > o.a.f.api.common.Program, which is defined as
> >
> > @PublicEvolving
> > public interface Program {
> >   Plan getPlan(String... args);
> > }
> >
> > This class, as a user-facing interface, asks the user to implement #getPlan
> > which returns an almost Flink-internal class. FLINK-10862[2] pointed out
> > this confusion.
> >
> > Since our codebase contains quite a bit of code handling this stale class,
> > and it also obstructs the effort of enhancing the Flink client API,
> > I'd like to propose dropping it. Or we can start a survey on the user list
> > to see if there is any user depending on this class.
> >
> > best,
> > tison.
> >
> > [1]
> >
> >
> https://lists.apache.org/thread.html/ce99cba4a10b9dc40eb729d39910f315ae41d80ec74f09a356c73938@%3Cdev.flink.apache.org%3E
> > [2] https://issues.apache.org/jira/browse/FLINK-10862
> >
>


Re: [DISCUSS] Flink client api enhancement for downstream project

2019-06-24 Thread Flavio Pompermaier
Flink configuration and passed
> > 2. execution environment knows all configs and handles execution(both
> > deployment and submission)
> >
> > to the issues above I propose eliminating inconsistencies by following
> > approach:
> >
> > 1) CliFrontend should be exactly a front end, at least for the "run" command.
> > That means it just gathers and passes all config from the command line to
> > the main method of the user program. The execution environment knows all the
> > info and, with an addition to the utils for ClusterClient, we gracefully get a
> > ClusterClient by deploying or retrieving. In this way, we don't need to
> > hijack the #execute/executePlan methods and can remove various hacky
> > subclasses of the exec env, as well as the #run methods in ClusterClient (for
> > an interface-ized ClusterClient). Now the control flow goes from CliFrontend
> > to the main method and never returns.
> >
> > 2) A job cluster means a cluster for a specific job. From another
> > perspective, it is an ephemeral session. We may decouple the deployment
> > from a compiled job graph, but start a session with an idle timeout
> > and submit the job afterwards.
> >
> > These topics, before we go into more details on design or implementation,
> > are better to be aware and discussed for a consensus.
> >
> > Best,
> > tison.
> >
> >
> > On Thu, Jun 20, 2019 at 3:21 AM Zili Chen wrote:
> >
> >> Hi Jeff,
> >>
> >> Thanks for raising this thread and the design document!
> >>
> >> As @Thomas Weise mentioned above, extending config to Flink
> >> requires far more effort than it should. Another example
> >> is that we achieve detached mode by introducing another execution
> >> environment which also hijacks the #execute method.
> >>
> >> I agree with your idea that the user should configure everything
> >> and Flink "just" respects it. On this topic I think the unusual
> >> control flow when CliFrontend handles the "run" command is the problem.
> >> It handles several configs, mainly about cluster settings, and
> >> thus the main method of the user program is unaware of them. Also it compiles
> >> the app to a job graph by running the main method with a hijacked exec env,
> >> which constrains the main method further.
> >>
> >> I'd like to write down a few notes on config/args passing and respecting,
> >> as well as decoupling job compilation and submission. I'll share them on this
> >> thread later.
> >>
> >> Best,
> >> tison.
> >>
> >>
> >>> On Mon, Jun 17, 2019 at 7:29 PM SHI Xiaogang wrote:
> >>
> >>> Hi Jeff and Flavio,
> >>>
> >>> Thanks Jeff a lot for proposing the design document.
> >>>
> >>> We are also working on refactoring ClusterClient to allow flexible and
> >>> efficient job management in our real-time platform.
> >>> We would like to draft a document to share our ideas with you.
> >>>
> >>> I think it's a good idea to have something like Apache Livy for Flink,
> >>> and
> >>> the efforts discussed here will take a great step forward to it.
> >>>
> >>> Regards,
> >>> Xiaogang
> >>>
> >>> On Mon, Jun 17, 2019 at 7:13 PM Flavio Pompermaier wrote:
> >>>
> >>> > Is there any possibility to have something like Apache Livy [1] also
> >>> for
> >>> > Flink in the future?
> >>> >
> >>> > [1] https://livy.apache.org/
> >>> >
> >>> > On Tue, Jun 11, 2019 at 5:23 PM Jeff Zhang  wrote:
> >>> >
> >>> > > >>>  Any API we expose should not have dependencies on the runtime
> >>> > > (flink-runtime) package or other implementation details. To me,
> this
> >>> > means
> >>> > > that the current ClusterClient cannot be exposed to users because
> it
> >>> >  uses
> >>> > > quite some classes from the optimiser and runtime packages.
> >>> > >
> >>> > > We should change ClusterClient from class to interface.
> >>> > > ExecutionEnvironment only use the interface ClusterClient which
> >>> should be
> >>> > > in flink-clients while the concrete implementation class could be
> in
> >>> > > flink-runtime.
> >>> > >
> >>> > > >>> What happens when a failure/restart in the client happens?
> There
> >>

Re: [DISCUSS] Flink client api enhancement for downstream project

2019-06-17 Thread Flavio Pompermaier
Is there any possibility to have something like Apache Livy [1] also for
Flink in the future?

[1] https://livy.apache.org/

On Tue, Jun 11, 2019 at 5:23 PM Jeff Zhang  wrote:

> >>>  Any API we expose should not have dependencies on the runtime
> (flink-runtime) package or other implementation details. To me, this means
> that the current ClusterClient cannot be exposed to users because it   uses
> quite some classes from the optimiser and runtime packages.
>
> We should change ClusterClient from class to interface.
> ExecutionEnvironment should only use the ClusterClient interface, which should
> be in flink-clients, while the concrete implementation class could be in
> flink-runtime.
>
> >>> What happens when a failure/restart in the client happens? There need
> to be a way of re-establishing the connection to the job, set up the
> listeners again, etc.
>
> Good point. First we need to define what failure/restart in the
> client means. IIUC, that usually means a network failure, which will happen in
> the RestClient class. If my understanding is correct, the restart/retry
> mechanism should be done in RestClient.
>
>
>
>
>
> > On Tue, Jun 11, 2019 at 11:10 PM Aljoscha Krettek wrote:
>
> > Some points to consider:
> >
> > * Any API we expose should not have dependencies on the runtime
> > (flink-runtime) package or other implementation details. To me, this
> means
> > that the current ClusterClient cannot be exposed to users because it
>  uses
> > quite some classes from the optimiser and runtime packages.
> >
> > * What happens when a failure/restart in the client happens? There need
> to
> > be a way of re-establishing the connection to the job, set up the
> listeners
> > again, etc.
> >
> > Aljoscha
> >
> > > On 29. May 2019, at 10:17, Jeff Zhang  wrote:
> > >
> > > Sorry folks, the design doc is later than you expected. Here's the design
> > > doc I drafted; any comments and feedback are welcome.
> > >
> > >
> >
> https://docs.google.com/document/d/1VavBrYn8vJeZs-Mhu5VzKO6xrWCF40aY0nlQ_UVVTRg/edit?usp=sharing
> > >
> > >
> > >
> > > On Thu, Feb 14, 2019 at 8:43 PM Stephan Ewen wrote:
> > >
> > >> Nice that this discussion is happening.
> > >>
> > >> In the FLIP, we could also revisit the entire role of the environments
> > >> again.
> > >>
> > >> Initially, the idea was:
> > >>  - the environments take care of the specific setup for standalone (no
> > >> setup needed), yarn, mesos, etc.
> > >>  - the session ones have control over the session. The environment
> holds
> > >> the session client.
> > >>  - running a job gives a "control" object for that job. That behavior
> is
> > >> the same in all environments.
> > >>
> > >> The actual implementation diverged quite a bit from that. Happy to see a
> > >> discussion about straightening this out a bit more.
> > >>
> > >>
> > >> On Tue, Feb 12, 2019 at 4:58 AM Jeff Zhang  wrote:
> > >>
> > >>> Hi folks,
> > >>>
> > >>> Sorry for the late response. It seems we have reached consensus on this;
> > >>> I will create a FLIP for this with a more detailed design.
> > >>>
> > >>>
> > >>> On Fri, Dec 21, 2018 at 11:43 AM Thomas Weise wrote:
> > >>>
> >  Great to see this discussion seeded! The problems you face with the
> >  Zeppelin integration are also affecting other downstream projects,
> > like
> >  Beam.
> > 
> >  We just enabled the savepoint restore option in
> > RemoteStreamEnvironment
> > >>> [1]
> >  and that was more difficult than it should be. The main issue is
> that
> >  environment and cluster client aren't decoupled. Ideally it should
> be
> >  possible to just get the matching cluster client from the
> environment
> > >> and
> >  then control the job through it (environment as factory for cluster
> >  client). But note that the environment classes are part of the
> public
> > >>> API,
> >  and it is not straightforward to make larger changes without
> breaking
> >  backward compatibility.
> > 
> >  ClusterClient currently exposes internal classes like JobGraph and
> >  StreamGraph. But it should be possible to wrap this with a new
> public
> > >> API
> >  that brings the required job control capabilities for downstream
> > >>> projects.
> >  Perhaps it is helpful to look at some of the interfaces in Beam
> while
> >  thinking about this: [2] for the portable job API and [3] for the
> old
> >  asynchronous job control from the Beam Java SDK.
> > 
> >  The backward compatibility discussion [4] is also relevant here. A
> new
> > >>> API
> >  should shield downstream projects from internals and allow them to
> >  interoperate with multiple future Flink versions in the same release
> > >> line
> >  without forced upgrades.
> > 
> >  Thanks,
> >  Thomas
> > 
> >  [1] https://github.com/apache/flink/pull/7249
> >  [2]
> > 
> > 
> > >>>
> > >>
> >
> https://github.com/apache/beam/blob/master/model/job-management/src/main/proto/beam_job_api.proto
> >  [3]
> > 
> > 
> > >>>
> > >>
> >
> 

Re: [Discuss] Add JobListener (hook) in flink job lifecycle

2019-05-09 Thread Flavio Pompermaier
Hi everybody,
any news on this? For us it would be VERY helpful to have such a feature
because we need to execute a call to a REST service once a job ends.
Right now we do this after env.execute(), but this works only if the job
is submitted via the CLI client; the REST client doesn't execute anything
after env.execute().
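
To make the use case concrete, this is roughly what we would like to write — a
sketch only, assuming a hypothetical registration hook on the environment and
the JobListener shape proposed further down in this thread; notifyRestService()
is our own helper:

// Sketch only: registerJobListener() and this JobListener shape are hypothetical here.
env.registerJobListener(new JobListener() {

  @Override
  public void onJobSubmitted(JobGraph graph, JobID jobId) {
    // nothing to do on submission in our case
  }

  @Override
  public void onJobExecuted(JobGraph graph, JobExecutionResult result) {
    // call our external REST service with the accumulators of the finished job
    notifyRestService(result.getJobID(), result.getAllAccumulatorResults());
  }

  @Override
  public void onJobCanceled(JobGraph graph, JobID jobId, String savepointPath) {
    // nothing to do on cancellation in our case
  }
});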

Best,
Flavio




On Thu, Apr 25, 2019 at 3:12 PM Jeff Zhang  wrote:

> Hi  Beckett,
>
> Thanks for your feedback, See my comments inline
>
> >>>  How do user specify the listener? *
> What I propose is to register the JobListener in ExecutionEnvironment. I
> don't think we should make ClusterClient a public API.
>
> >>> Where should the listener run? *
> I don't think it is proper to run the listener in the JobMaster. The listener is
> user code, and usually it depends on the user's other components. So running
> it on the client side makes more sense to me.
>
> >>> What should be reported to the Listener? *
> I am open to adding other APIs to this JobListener. But for now, I am afraid
> the ExecutionEnvironment is not aware of failover, so it is not possible to
> report failover events.
>
> >>> What can the listeners do on notifications? *
> Do you mean to pass the JobGraph to these methods? Like the following (I am
> afraid JobGraph is not a public and stable API, we should not expose it to
> users):
>
> public interface JobListener {
>
> void onJobSubmitted(JobGraph graph, JobID jobId);
>
> void onJobExecuted(JobGraph graph, JobExecutionResult jobResult);
>
> void onJobCanceled(JobGraph graph, JobID jobId, String savepointPath);
> }
>
>
> On Thu, Apr 25, 2019 at 7:40 PM Becket Qin wrote:
>
>> Thanks for the proposal, Jeff. Adding a listener to allow users handle
>> events during the job lifecycle makes a lot of sense to me.
>>
>> Here are my two cents.
>>
>> * How do user specify the listener? *
>> It is not quite clear to me whether we consider ClusterClient as a public
>> interface? From what I understand ClusterClient is not a public interface
>> right now. In contrast, ExecutionEnvironment is the de facto interface for
>> administrative work. After job submission, it is essentially bound to a job
>> as an administrative handle. Given this current state, personally I feel
>> acceptable to have the listener registered to the ExecutionEnvironment.
>>
>> * Where should the listener run? *
>> If the listener runs on the client side, the client have to be always
>> connected to the Flink cluster. This does not quite work if the Job is a
>> streaming job. Should we provide the option to run the listener in
>> JobMaster as well?
>>
>> * What should be reported to the Listener? *
>> Besides the proposed APIs, does it make sense to also report events such
>> as failover?
>>
>> * What can the listeners do on notifications? *
>> If the listeners are expected to do anything on the job, should some
>> helper class to manipulate the jobs be passed to the listener method?
>> Otherwise users may not be able to easily take action.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>>
>>
>> On Wed, Apr 24, 2019 at 2:43 PM Jeff Zhang  wrote:
>>
>>> Hi Till,
>>>
> >>> IMHO, allowing hooks to be added involves 2 steps.
> >>> 1. Provide the hook interface, and call these hooks in Flink (ClusterClient)
> >>> at the right place. This should be done by the framework (Flink).
> >>> 2. Implement new hook implementations and add/register them into the
> >>> framework (Flink).
> >>>
> >>> What I am doing is step 1, which should be done by Flink; step 2 is done
> >>> by users. But IIUC, your suggestion of using a custom ClusterClient seems to
> >>> mix these 2 steps together. Say I'd like to add new hooks: I have to
> >>> implement a new custom ClusterClient, add the new hooks and call them in the
> >>> custom ClusterClient at the right place.
> >>> This doesn't make sense to me. A user who wants to add hooks is
> >>> not supposed to understand the mechanism of ClusterClient, and should not
> >>> touch ClusterClient. What do you think?
>>>
>>>
>>>
>>>
> >>> On Tue, Apr 23, 2019 at 4:24 PM Till Rohrmann wrote:
>>>
 I think we should not expose the ClusterClient configuration via the
 ExecutionEnvironment (env.getClusterClient().addJobListener) because this
 is effectively the same as exposing the JobListener interface directly on
 the ExecutionEnvironment. Instead I think it could be possible to provide a
 ClusterClient factory which is picked up from the Configuration or some
 other mechanism for example. That way it would not need to be exposed via
 the ExecutionEnvironment at all.

 Cheers,
 Till

 On Fri, Apr 19, 2019 at 11:12 AM Jeff Zhang  wrote:

> >>>  The ExecutionEnvironment is usually used by the user who writes
> the code and this person (I assume) would not be really interested in 
> these
> callbacks.
>
> Usually the ExecutionEnvironment is used by the user who writes the code,
> but it doesn't need to be created and configured by this person. E.g. in a
> Zeppelin notebook, the ExecutionEnvironment is created by Zeppelin, the user just
>

Re: [DISCUSS] FLIP-39: Flink ML pipeline and ML libs

2019-05-02 Thread Flavio Pompermaier
Hi to all,
I have read many discussions about Flink ML and none of them take into
account the ongoing efforts carried out by the Streamline H2020 project
[1] on this topic.
Have you tried to ping them? I think that both projects could benefit from
a joint effort on this side.
[1] https://h2020-streamline-project.eu/objectives/

Best,
Flavio

On Thu, May 2, 2019 at 12:18 AM Rong Rong  wrote:

> Hi Shaoxuan/Weihua,
>
> Thanks for the proposal and driving the effort.
> I also replied to the original discussion thread, and I am still a +1 on moving
> towards the scikit-learn model.
> I just left a few comments on the API details and some general questions.
> Please kindly take a look.
>
> There's another thread regarding a close-to-merge FLIP-23 implementation
> [1]. I agree it might still be an early stage to talk about productionizing
> and model serving. But it would be nice to keep in mind in the design and
> implementation that ease of use for productionizing an ML pipeline is also very
> important.
> And if we can leverage the implementation in FLIP-23 in the future, (some
> adjustment might be needed) that would be super helpful.
>
> Best,
> Rong
>
>
> [1]
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-23-Model-Serving-td20260.html
>
>
> On Tue, Apr 30, 2019 at 1:47 AM Shaoxuan Wang  wrote:
>
> > Thanks for all the feedback.
> >
> > @Jincheng Sun
> > > I recommend It's better to add a detailed implementation plan to FLIP
> and
> > google doc.
> > Yes, I will add a subsection for implementation plan.
> >
> > @Chen Qin
> > >Just sharing some insights from operating SparkML at scale:
> > >- MapReduce may not be the best way to iteratively sync partitioned workers.
> > >- Native hardware acceleration is key to adopting rapid changes in ML
> > >improvements in the foreseeable future.
> > Thanks for sharing your experience on SparkML. The purpose of this FLIP
> is
> > mainly to provide the interfaces for ML pipeline and ML lib, and the
> > implementations of most standard algorithms. Besides this FLIP, for AI
> > computing on Flink, we will continue to contribute the efforts, like the
> > enhancement of iterative and the integration of deep learning engines
> (such
> > as Tensoflow/Pytorch). I have presented part of these work in
> >
> >
> https://www.ververica.com/resources/flink-forward-san-francisco-2019/when-table-meets-ai-build-flink-ai-ecosystem-on-table-api
> > I am not sure if I have fully got your comments. Can you please elaborate
> > them with more details, and if possible, please provide some suggestions
> > about what we should work on to address the challenges you have
> mentioned.
> >
> > Regards,
> > Shaoxuan
> >
> > On Mon, Apr 29, 2019 at 11:28 AM Chen Qin  wrote:
> >
> > > Just sharing some insights from operating SparkML at scale:
> > > - MapReduce may not be the best way to iteratively sync partitioned workers.
> > > - Native hardware acceleration is key to adopting rapid changes in ML
> > > improvements in the foreseeable future.
> > >
> > > Chen
> > >
> > > On Apr 29, 2019, at 11:02, jincheng sun 
> > wrote:
> > > >
> > > > Hi Shaoxuan,
> > > >
> > > > Thanks for putting more effort into enhancing the scalability and the
> > > > ease of use of Flink ML and taking it one step further. Thank you for
> > > > sharing a lot of context information.
> > > >
> > > > Big +1 for this proposal!
> > > >
> > > > Just one suggestion: there is only a short time until the
> > > > release of Flink 1.9, so I recommend adding a detailed
> > > > implementation plan to the FLIP and the Google doc.
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Jincheng
> > > >
> > > >> On Mon, Apr 29, 2019 at 10:34 AM Shaoxuan Wang wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> Weihua has proposed to rebuild Flink ML pipeline on top of TableAPI
> > > several
> > > >> months ago in this mail thread:
> > > >>
> > > >>
> > > >>
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Embracing-Table-API-in-Flink-ML-td25368.html
> > > >>
> > > >> Luogen, Becket, Xu, Weihua and I have been working on this proposal
> > > >> offline in
> > > >> the past few months. Now we want to share the first phase of the
> > > entire
> > > >> proposal with a FLIP. In this FLIP-39, we want to achieve several
> > things
> > > >> (and hope those can be accomplished and released in Flink-1.9):
> > > >>
> > > >>   - Provide a new set of ML core interfaces (on top of the Flink Table API)
> > > >>   - Provide an ML pipeline interface (on top of the Flink Table API)
> > > >>   - Provide the interfaces for parameter management and pipeline/model
> > > >>     persistence
> > > >>   - All the above interfaces should facilitate any new ML algorithm. We
> > > >>     will gradually add various standard ML algorithms on top of these
> > > >>     newly proposed interfaces to ensure their feasibility and scalability.

Re: [Discuss] Add support for Apache Arrow

2019-04-11 Thread Flavio Pompermaier
Very BIG +1 for the adoption of Apache Arrow. This would greatly simplify the
integration with other tools.

On Thu, Apr 11, 2019 at 2:21 PM Run  wrote:

> Hi guys,
>
>
> Apache Arrow provides a cross-language, standardized, columnar memory
> format for data.
> So it is highly desirable to import Arrow into Flink and make use of its
> memory layout and memory management facilities.
> More background on this can be found in
> https://issues.apache.org/jira/browse/FLINK-10929
>
>
> The problem is that incorporating Arrow into Flink is a big move, and it may
> affect many modules of Flink.
> In our efforts to vectorize Blink batch jobs, we have tried to incorporate
> Arrow to Flink. Those were incremental changes, which can be easily turned
> off with a single flag, and the changes are transparent to other parts of
> the code base.
>
>
> This is the draft of our design document:
>
> https://docs.google.com/document/d/1LstaiGzlzTdGUmyG_-9GSWleKpLRid8SJWXQCTqcQB4/edit?usp=sharing
>
>
> For the first step, I suggest providing a flag which is disabled by
> default, and letting the MemoryManager depend on the Arrow Buffer Allocator.
> With this change, all the MemorySegments will be based on Arrow buffers, but
> this is transparent to other components, and should never break them.
>
>
> After this step, we can apply the changes described in the design
> document, incrementally.
> These are our initial thoughts. Would you please give your valuable comments?
>
>
> Best,
> Liya Fan


Re: Row.setField returning row itself

2019-03-22 Thread Flavio Pompermaier
You're right Chesnay, I didn't remember that .of was introduced :(
Sorry!

On Fri, Mar 22, 2019 at 12:35 PM Chesnay Schepler 
wrote:

> You could even use a method reference here: "map(Row::of)"
>
> On 22/03/2019 12:33, Chesnay Schepler wrote:
> > I can see that this would be convenient but please find a better
> > example; yours can be solved easily using "Row.of(value)".
> >
> > On 22/03/2019 12:26, Flavio Pompermaier wrote:
> >> Hi all,
> >> many times I had the feeling that allowing Row.setField() to return the
> >> modified object instead of void would really make the (Java) code
> >> cleaner
> >> in a very unobtrusive way.
> >> For example, I could write something like:
> >>
> >> DataSet columnData = input.map(value -> new Row(1).setField(0,
> >> value))
> >>
> >> instead of:
> >>
> >> DataSet columnData = input//
> >>  .map(value -> {
> >>Row r = new Row(1);
> >>r.setField(0, value);
> >>return r;
> >>  })
> >>
> >> What do you think?
> >> May I open a JIRA issue about it?
> >>
> >> Best,
> >> Flavio
> >>
> >
> >
>
>


Row.setField returning row itself

2019-03-22 Thread Flavio Pompermaier
Hi all,
many times I had the feeling that allowing Row.setField() to return the
modified object instead of void would really make the (Java) code cleaner
in a very unobtrusive way.
For example, I could write something like:

DataSet<Row> columnData = input.map(value -> new Row(1).setField(0, value))

instead of:

DataSet<Row> columnData = input //
.map(value -> {
  Row r = new Row(1);
  r.setField(0, value);
  return r;
})

What do you think?
May I open a JIRA issue about it?
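
For reference, a minimal sketch of what the fluent variant could look like
(assuming Row stores its values in an internal array; this is only an
illustration, not the actual Flink implementation):

public Row setField(int pos, Object value) {
  this.fields[pos] = value; // "fields" is the assumed internal storage array
  return this;              // return the row itself to allow chaining
}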

Best,
Flavio


[jira] [Created] (FLINK-11852) Improve Processing function example

2019-03-07 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-11852:
--

 Summary: Improve Processing function example
 Key: FLINK-11852
 URL: https://issues.apache.org/jira/browse/FLINK-11852
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.7.2
Reporter: Flavio Pompermaier


In the processing function documentation 
([https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html)]
 there's an "abusive" usage of the timers since a new timer is registered for 
every new tuple coming in. This could cause problems in terms of allocated 
objects and could burden the overall application.

It could be worth mentioning this problem and removing useless timers, e.g.:

 
{code:java}
CountWithTimestamp current = state.value();
if (current == null) {
    current = new CountWithTimestamp();
    current.key = value.f0;
} else {
    ctx.timerService().deleteEventTimeTimer(current.lastModified + timeout);
}{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Some dev questions

2019-03-01 Thread Flavio Pompermaier
You're right Chesnay, I did a git fetch and I didn't remember that it
doesn't update the tag list.
After a git fetch --tags --all I was able to find the tags.
It would be nice to add this info to the "build from source"
documentation [1]:
the page states "This page covers how to build Flink 1.7.2 from sources."
but this is not true, unless you move to the right tag.
If you follow the listed commands you don't know which version you are
building.

Thanks also for the ML pointer. In this case it might make sense to move
flink-ml to another repo (IMHO).

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html

Best,
Flavio


Some dev questions

2019-03-01 Thread Flavio Pompermaier
Hi to all,
I was going to upgrade our Flink cluster to the 1.7.x release but I saw that
releases are no longer tagged and I need to download the sources for a
specific version. Is this a definitive choice? I think that tagging was
really helpful when recompiling the code. Why has this policy changed?
There are probably good reasons behind this, I'm just curious as a
developer.

Another thing I noticed is that all ML tickets have been closed since the
ML stuff has been frozen. Is there really no interest in improving it?

Best,
Flavio


Re: Dataset rowCount accumulator

2019-02-04 Thread Flavio Pompermaier
Thinking about it, I came to the conclusion that adding a map function after
the read is probably more general.
Is there any "significant" performance difference between using such a
dedicated map function (that just reads a row, increments an accumulator and
returns the row immediately) vs adding this accumulator directly to the input
formats?
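
For context, the dedicated map function we have in mind is roughly the
following (a sketch; the accumulator name is arbitrary):

import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.types.Row;

// Pass-through mapper that only counts the rows it sees via an accumulator.
public class RowCountingMapper extends RichMapFunction<Row, Row> {

  private final LongCounter rowCount = new LongCounter();

  @Override
  public void open(Configuration parameters) {
    getRuntimeContext().addAccumulator("rowCount", rowCount);
  }

  @Override
  public Row map(Row value) {
    rowCount.add(1L);
    return value;
  }
}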

On Mon, Feb 4, 2019 at 10:18 AM Flavio Pompermaier 
wrote:

> Hi to all,
> we often need to track the number of rows of a dataset.
> In order to burden on the job complexitye we use accumulators to track
> this information.
> The problem is that we have to extends all InputFormats that we use in
> order to properly handle such row-count accumulator...my question is: what
> about introducing it as a first class citizen (forcing all input format to
> handle a rowCount accumulator when required)?
>
> What do you think? Will it be useful in general?
>
> Best,
> Flavio
>


Dataset rowCount accumulator

2019-02-04 Thread Flavio Pompermaier
Hi to all,
we often need to track the number of rows of a dataset.
In order not to add to the job complexity, we use accumulators to track this
information.
The problem is that we have to extend all the InputFormats that we use in
order to properly handle such a row-count accumulator... my question is: what
about introducing it as a first-class citizen (forcing all input formats to
handle a rowCount accumulator when required)?

What do you think? Will it be useful in general?

Best,
Flavio


Re: [DISCUSS] Support Interactive Programming in Flink Table API

2018-11-26 Thread Flavio Pompermaier
What about also adding Apache Plasma + Arrow as an alternative to Apache
Ignite?
[1] https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/

On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske  wrote:

> Hi,
>
> Thanks for the proposal!
>
> To summarize, you propose a new method Table.cache(): Table that will
> trigger a job and write the result into some temporary storage as defined
> by a TableFactory.
> The cache() call blocks while the job is running and eventually returns a
> Table object that represents a scan of the temporary table.
> When the "session" is closed (closing to be defined?), the temporary tables
> are all dropped.
>
> I think this behavior makes sense and is a good first step towards more
> interactive workloads.
> However, its performance suffers from writing to and reading from external
> systems.
> I think this is OK for now. Changes that would significantly improve the
> situation (i.e., pinning data in-memory across jobs) would have large
> impacts on many components of Flink.
> Users could use in-memory filesystems or storage grids (Apache Ignite) to
> mitigate some of the performance effects.
>
> Best, Fabian
>
>
>
> On Mon, Nov 26, 2018 at 03:38 Becket Qin <becket@gmail.com> wrote:
>
> > Thanks for the explanation, Piotrek.
> >
> > Is there any extra thing user can do on a MaterializedTable that they
> > cannot do on a Table? After users call *table.cache(), *users can just
> use
> > that table and do anything that is supported on a Table, including SQL.
> >
> > Naming wise, either cache() or materialize() sounds fine to me. cache()
> is
> > a bit more general than materialize(). Given that we are enhancing the
> > Table API to also support non-relational processing cases, cache() might
> be
> > slightly better.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski  >
> > wrote:
> >
> > > Hi Becket,
> > >
> > > Ops, sorry I didn’t notice that you intend to reuse existing
> > > `TableFactory`. I don’t know why, but I assumed that you want to
> provide
> > an
> > > alternate way of writing the data.
> > >
> > > Now that I hopefully understand the proposal, maybe we could rename
> > > `cache()` to
> > >
> > > void materialize()
> > >
> > > or going step further
> > >
> > > MaterializedTable materialize()
> > > MaterializedTable createMaterializedView()
> > >
> > > ?
> > >
> > > The second option with returning a handle I think is more flexible and
> > > could provide features such as “refresh”/“delete” or generally speaking
> > > manage the the view. In the future we could also think about adding
> hooks
> > > to automatically refresh view etc. It is also more explicit -
> > > materialization returning a new table handle will not have the same
> > > implicit side effects as adding a simple line of code like `b.cache()`
> > > would have.
> > >
> > > It would also be more SQL like, making it more intuitive for users
> > already
> > > familiar with the SQL.
> > >
> > > Piotrek
> > >
> > > > On 23 Nov 2018, at 14:53, Becket Qin  wrote:
> > > >
> > > > Hi Piotrek,
> > > >
> > > > For the cache() method itself, yes, it is equivalent to creating a
> > > BUILT-IN
> > > > materialized view with a lifecycle. That functionality is missing
> > today,
> > > > though. Not sure if I understand your question. Do you mean we
> already
> > > have
> > > > the functionality and just need a syntax sugar?
> > > >
> > > > What's more interesting in the proposal is do we want to stop at
> > creating
> > > > the materialized view? Or do we want to extend that in the future to
> a
> > > more
> > > > useful unified data store distributed with Flink? And do we want to
> > have
> > > a
> > > > mechanism allow more flexible user job pattern with their own user
> > > defined
> > > > services. These considerations are much more architectural.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
> > pi...@data-artisans.com>
> > > > wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> Interesting idea. I’m trying to understand the problem. Isn’t the
> > > >> `cache()` call an equivalent of writing data to a sink and later
> > reading
> > > >> from it? Where this sink has a limited live scope/live time? And the
> > > sink
> > > >> could be implemented as in memory or a file sink?
> > > >>
> > > >> If so, what’s the problem with creating a materialised view from a
> > table
> > > >> “b” (from your document’s example) and reusing this materialised
> view
> > > >> later? Maybe we are lacking mechanisms to clean up materialised
> views
> > > (for
> > > >> example when current session finishes)? Maybe we need some syntactic
> > > sugar
> > > >> on top of it?
> > > >>
> > > >> Piotrek
> > > >>
> > > >>> On 23 Nov 2018, at 07:21, Becket Qin  wrote:
> > > >>>
> > > >>> Thanks for the suggestion, Jincheng.
> > > >>>
> > > >>> Yes, I think it makes sense to have a persist() w

Re: JIRA notifications

2018-11-21 Thread Flavio Pompermaier
+1

On Wed, Nov 21, 2018 at 12:05 PM Saar Bar  wrote:

> 💯 agree
>
> Sent from my iPhone
>
> > On 21 Nov 2018, at 13:03, Maximilian Michels  wrote:
> >
> > Hi!
> >
> > Do you think it would make sense to send JIRA notifications to a
> separate mailing list? Some people just want to casually follow the mailing
> list and it requires a filter to delete all the JIRA mails.
> >
> > We already have an "issues" mailing list which receives the JIRA
> notifications: https://mail-archives.apache.org/mod_mbox/flink-issues/
> >
> > What do you think?
> >
> > Thanks,
> > Max
>


[jira] [Created] (FLINK-10947) Document handling of null keys in the data types documentation

2018-11-20 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-10947:
--

 Summary: Document handling of null keys in the data types 
documentation
 Key: FLINK-10947
 URL: https://issues.apache.org/jira/browse/FLINK-10947
 Project: Flink
  Issue Type: Task
  Components: Documentation
Affects Versions: 1.6.2
Reporter: Flavio Pompermaier


The [Flink documentation about data 
types|https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/types_serialization.html]
 should be extended with a discussion about the handling of null keys. See 
discussion at 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Group-by-with-null-keys-td24594.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10879) Align Flink clients on env.execute()

2018-11-14 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-10879:
--

 Summary: Align Flink clients on env.execute()
 Key: FLINK-10879
 URL: https://issues.apache.org/jira/browse/FLINK-10879
 Project: Flink
  Issue Type: Improvement
  Components: Client
Affects Versions: 1.6.2
Reporter: Flavio Pompermaier


Right now the REST API does not support any code after env.execute(), while the 
Flink API, the CLI client and code executed within the IDE do.

Both clients should behave in the same way: either both let env.execute() return 
a result and continue executing the code that follows, or neither does.

See the discussion on the DEV ML for more details: 
http://mail-archives.apache.org/mod_mbox/flink-dev/201811.mbox/%3CCAELUF_DhjzL9FECvx040_GE3d85Ykb-HcGVCh0O4y9h-cThq7A%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: REST job submission

2018-11-14 Thread Flavio Pompermaier
Done: https://issues.apache.org/jira/browse/FLINK-10879

On Wed, Nov 14, 2018 at 10:14 AM Chesnay Schepler 
wrote:

> I wouldn't consider it a _bug_ in that sense, but agree that the current
> behavior isn't ideal. Running a job via the CLI or WebUI should behave
> the same way, as such please open a JIRA.
>
> On 12.11.2018 12:50, Flavio Pompermaier wrote:
> > Hi to all,
> > in our ETL we need to call an external (REST) service once a job ends: we
> > extract information about accumulators and we update the job status.
> > However this is only possible when using the CLI client: if we submit the job
> > via the REST API or Web UI (which is very useful to decouple our UI from
> > the Flink cluster) then this is not possible, because the REST API cannot
> > execute any code after env.execute().
> > I think this is a huge limitation: first of all, when writing
> > (and debugging) a Flink job, you assume that you can call execute()
> > multiple times and use the returned JobExecutionResult.
> > Secondly, the CLI client and the REST client behave
> > differently (with the CLI client everything works as expected).
> >
> > What do you think about this? Is this a bug or not?
> >
> > PS: I think also that the REST client should not be aware of any jar or
> > class instance, it should just call the job manager with the proper class
> > name and jar id (plus other options of course).
> >
> > Cheers,
> > Flavio
> >
>
>


[jira] [Created] (FLINK-10864) Support multiple Main classes per jar

2018-11-13 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-10864:
--

 Summary: Support multiple Main classes per jar
 Key: FLINK-10864
 URL: https://issues.apache.org/jira/browse/FLINK-10864
 Project: Flink
  Issue Type: Improvement
  Components: Job-Submission
Affects Versions: 1.6.2
Reporter: Flavio Pompermaier


Right now the REST API and the job submission system assume that a jar 
contains only a single main class. In my experience this is rarely the case in 
real scenarios: a jar often contains multiple jobs (with similar dependencies) that 
perform different tasks.

In our use case, for example, the shaded jar is around 200 MB and contains about 
10 jobs...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10862) REST API does not show program descriptions of "simple" ProgramDescription

2018-11-13 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-10862:
--

 Summary: REST API does not show program descriptions of "simple" 
ProgramDescription
 Key: FLINK-10862
 URL: https://issues.apache.org/jira/browse/FLINK-10862
 Project: Flink
  Issue Type: Bug
  Components: REST
Affects Versions: 1.6.2
Reporter: Flavio Pompermaier


When uploading a jar containing a main class that implements the ProgramDescription 
interface, the REST API doesn't list its description. It works only if the 
class implements Program (which I find pretty useless... why should I return the 
plan?).
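
A minimal sketch of the case described above (the job class name and description text 
are made up):

{code:java}
public class MyJob implements ProgramDescription {

    @Override
    public String getDescription() {
        return "Cleanses and deduplicates the input dataset";
    }

    public static void main(String[] args) throws Exception {
        // ... build and execute the job ...
    }
}
{code}

The description above is currently not shown by the REST API unless the class also 
implements Program.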



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


REST job submission

2018-11-12 Thread Flavio Pompermaier
Hi to all,
in our ETL we need to call an external (REST) service once a job ends: we
extract information about accumulators and we update the job status.
However this is only possible when using the CLI client: if we submit the job
via the REST API or Web UI (which is very useful to decouple our UI from the
Flink cluster) then this is not possible, because the REST API cannot
execute any code after env.execute().
I think this is a huge limitation: first of all, when writing
(and debugging) a Flink job, you assume that you can call execute()
multiple times and use the returned JobExecutionResult.
Secondly, the CLI client and the REST client behave
differently (with the CLI client everything works as expected).
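
To make it concrete, this is the kind of post-execute() code we rely on (sketch only:
the accumulator name and the notify call are placeholders for our actual ETL code):

    JobExecutionResult result = env.execute("our-etl-job");

    // read back an accumulator registered by the job
    long processed = result.getAccumulatorResult("processed-records");

    // placeholder for the external REST call we make once the job ends
    notifyJobStatusService(result.getJobID(), processed);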

What do you think about this? Is this a bug or not?

PS: I think also that the REST client should not be aware of any jar or
class instance, it should just call the job manager with the proper class
name and jar id (plus other options of course).

Cheers,
Flavio


[jira] [Created] (FLINK-10795) STDDEV_POP error

2018-11-05 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-10795:
--

 Summary: STDDEV_POP error
 Key: FLINK-10795
 URL: https://issues.apache.org/jira/browse/FLINK-10795
 Project: Flink
  Issue Type: Improvement
  Components: Table API & SQL
Affects Versions: 1.6.2
Reporter: Flavio Pompermaier
 Attachments: FlinkTableApiError.java, test.tsv

If STDDEV_POP is used in the attached job, the following error is thrown (with 
Flink 1.6.1):

 
{code:java}
Exception in thread "main" 
org.apache.flink.runtime.client.JobExecutionException: 
java.lang.NumberFormatException
 at 
org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:623)
 at org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:235)
 at org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:91)
 at 
org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:816)
 at org.apache.flink.api.java.DataSet.collect(DataSet.java:413)
 at org.apache.flink.api.java.DataSet.print(DataSet.java:1652)
 at 
it.okkam.datalinks.batch.flink.operations.FlinkTableApiError.main(FlinkTableApiError.java:466)
Caused by: java.lang.NumberFormatException
 at java.math.BigDecimal.<init>(BigDecimal.java:494)
 at java.math.BigDecimal.<init>(BigDecimal.java:383)
 at java.math.BigDecimal.<init>(BigDecimal.java:806)
 at java.math.BigDecimal.valueOf(BigDecimal.java:1274)
 at org.apache.calcite.runtime.SqlFunctions.sround(SqlFunctions.java:1242)
 at DataSetCalcRule$6909.flatMap(Unknown Source)
 at org.apache.flink.table.runtime.FlatMapRunner.flatMap(FlatMapRunner.scala:52)
 at org.apache.flink.table.runtime.FlatMapRunner.flatMap(FlatMapRunner.scala:31)
 at 
org.apache.flink.runtime.operators.chaining.ChainedFlatMapDriver.collect(ChainedFlatMapDriver.java:80)
 at 
org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
 at DataSetSingleRowJoinRule$6450.join(Unknown Source)
 at 
org.apache.flink.table.runtime.MapJoinLeftRunner.flatMap(MapJoinLeftRunner.scala:35)
 at org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMapDriver.java:109)
 at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503)
 at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
 at java.lang.Thread.run(Thread.java:748)
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Change underlying Frontend Architecture for Flink Web Dashboard

2018-10-31 Thread Flavio Pompermaier
I think that it is important to have a nice "official" (or at least free)
Flink UI; we use it to see the details of the jobs.
It's very useful for people starting to work with Flink and also for those
that do not have the resources to write a custom UI.
How are you going to monitor the status of a job otherwise?

On Wed, Oct 31, 2018 at 8:48 AM Fabian Wollert  wrote:

> Hi Till, I basically agree with all your points. I would stress the
> "dustiness" of the current architecture: the package manager used (bower)
> has been deprecated for a long time, and the chance that the builds of the Flink
> web dashboard stop working is increasing every day.
>
> About the knowledge in the community: Two days is not a lot of time, but
> interest in this topic seems to be minor anyways. Is someone using the
> Flink Web Dashboard at all, or is everyone running their own custom
> solutions? Because if there is no interest in using the Web UI AND no one
> interested in developing, there would be no need to package this as part of
> the official Flink package, but rather develop an independent solution (I'm
> not voting for this right now, just putting it out), if at all. The
> official package could then just ship with the API, which other solutions
> can build upon. This solution could be from an agile point of view also the
> best (enhanced testing, independent and more effective dev workflow, etc.),
> but is bad for the usage of the Flink UI, because people need to install
> two things individually (Flink and the web dashboard).
>
> How did the choice for Angular1 happen back then? Who was writing the
> Dashboard in the first place?
>
> Cheers
>
> --
>
>
> *Fabian WollertZalando SE*
>
> E-Mail: fab...@zalando.de
>
>
> Am Di., 30. Okt. 2018 um 15:07 Uhr schrieb Till Rohrmann <
> trohrm...@apache.org>:
>
> > Thanks for starting this discussion Fabian! I think our web UI technology
> > stack is quite dusty by now and it would be beneficial to think about its
> > technological future.
> >
> > On the one hand, our current web UI works more or less reliable and
> > changing the underlying technology has the risk of breaking things.
> > Moreover, there might be the risk that the newly chosen technology will
> be
> > deprecated at some point in time as well.
> >
> > On the other hand, we don't have much Angular 1 knowledge in the
> community
> > and extending the web UI is, thus, quite hard at the moment. Maybe by
> using
> > some newer web technologies we might be able to attract more people with
> a
> > web technology background to join the community.
> >
> > The lack of people working on the web UI is for me the biggest problem I
> > would like to address. If there is interest in the web UI, then I'm quite
> > sure that we will be able to even migrate to some other technology in the
> > future. The next important issue for me is to do the change incrementally
> > if possible. Ideally we never break the web UI in the process of
> migrating
> > to a new technology. I'm not an expert here so it might or might not be
> > possible. But if it is, then we should design the implementation steps in
> > such a way.
> >
> > Cheers,
> > Till
> >
> > On Mon, Oct 29, 2018 at 1:06 PM Fabian Wollert 
> wrote:
> >
> > > Hi everyone,
> > >
> > > in this email thread
> > > <
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-Cluster-Overview-Dashboard-Improvement-Proposal-td24531.html
> > > >
> > > and the tickets FLINK-10705
> > >  and FLINK-10706
> > >  the discussion
> came
> > up
> > > whether to change the underlying architecture of Flink's Web Dashboard
> > from
> > > Angular1 to something else. This email thread should be solely to
> discuss
> > > the pro's and con's of this, and what could be the target architecture.
> > >
> > > My choice would be React. Personally I agree with Till's comments in
> the
> > > ticket, Angular 1 being basically outdated and is not having a large
> > > following anymore. From my experience the choice between Angular 2-7 or
> > > React is subjective, you can get things done with both. I personally
> only
> > > have experience with React, so I  personally would be faster to develop
> > > with this one. I currently have not planned to learn Angular as well
> > (being
> > > a more backend focused developer in general) so if the decision would
> be
> > to
> > > go with Angular, i would be unfortunately out of this rework of the
> Flink
> > > Dashboard most certainly.
> > >
> > > Additionally I would like to get rid of bower, since it's officially
> > > deprecated.
> > > My
> > > idea would be to just use a create-react-app package with npm and
> webpack
> > > under the hood. no need for additional lib's here imho. But again:
> thats
> > > mostly what i've been working with recently, so thats a subjective
> > poin

[jira] [Created] (FLINK-10731) Support AVG on Date fields

2018-10-31 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-10731:
--

 Summary: Support AVG on Date fields
 Key: FLINK-10731
 URL: https://issues.apache.org/jira/browse/FLINK-10731
 Project: Flink
  Issue Type: Improvement
  Components: Table API & SQL
Affects Versions: 1.6.2
Reporter: Flavio Pompermaier


The AVG function does not work on date fields right now; see the minimal sketch below.
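
A minimal sketch of the kind of query that currently fails (the users table with a 
birth_date column is made up and assumed to be registered):

{code:java}
Table result = tableEnv.sqlQuery("SELECT AVG(birth_date) FROM users");
{code}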



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10562) Relax (or document) table name constraints

2018-10-16 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-10562:
--

 Summary: Relax (or document) table name constraints
 Key: FLINK-10562
 URL: https://issues.apache.org/jira/browse/FLINK-10562
 Project: Flink
  Issue Type: Improvement
  Components: Table API & SQL
Affects Versions: 1.6.1
Reporter: Flavio Pompermaier


At the moment it's not possible to register a table whose name starts with a 
number (e.g. 1_test). Moreover, this constraint is not reported in the 
documentation.

I propose to support table name escaping somehow, in order to enable more general 
scenarios such as names containing spaces (e.g. select * from 'my table'), as 
sketched below.
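
A minimal sketch of both scenarios (table names are made up; the escaped query is the 
proposal, not something that works today):

{code:java}
// currently rejected: table name starting with a digit
tableEnv.registerTable("1_test", myTable);

// proposal: allow escaped identifiers for more general names, e.g.
// tableEnv.sqlQuery("SELECT * FROM `my table`");
{code}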

Best,
Flavio

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Snapshots and RC-candidate maven artifacts

2018-05-14 Thread Flavio Pompermaier
Actually my solution was to recompile Flink on my own PC.
I was just questioning whether the current snapshot publishing should be
considered OK or not..

On Mon, May 14, 2018 at 4:59 PM, Ted Yu  wrote:

> Flavio:
> Can you use the snapshot for 1.5 RC ?
> https://repository.apache.org/content/repositories/orgapacheflink-1154/
>
> It was uploaded on Apr 2nd.
>
> FYI
>
> On Mon, May 14, 2018 at 7:54 AM, Fabian Hueske  wrote:
>
> > Hi,
> >
> > I'd assume that we stopped updating 1.5-SNAPSHOT jars when we forked off
> > the release-1.5 branch and updated the version on master to 1.6-SNAPSHOT.
> >
> > Best, Fabian
> >
> > 2018-05-14 15:51 GMT+02:00 Flavio Pompermaier :
> >
> > > Hi to all.
> > > we were trying to run a 1.5 Flink job and we set the version to
> > > 1.5-SNAPSHOT.
> > > Unfortunately the 1.5-SNAPSHOT version uploaded on the Apache snapshot
> > repo
> > > is very old (February 2018). Shouldn't this version be updated as
> > well?
> > >
> > > Best,
> > > Flavio
> > >
> >
>


Snapshots and RC-candidate maven artifacts

2018-05-14 Thread Flavio Pompermaier
Hi to all.
we were trying to run a 1.5 Flink job and we set the version to
1.5-SNAPSHOT.
Unfortunately the 1.5-SNAPSHOT version uploaded on the Apache snapshot repo
is very old (February 2018). Shouldn't this version be updated as well?

Best,
Flavio


Re: Elasticsearch Sink

2018-05-12 Thread Flavio Pompermaier
+1. Totally agree

On Sat, 12 May 2018, 18:14 Christophe Jolif,  wrote:

> Hi all,
>
> The Flink Elasticsearch sink has been broken for Elasticsearch 5.x for quite
> some time (nearly a year):
>
> https://issues.apache.org/jira/browse/FLINK-7386
>
> And there is no support for Elasticsearch 6.x:
>
> https://issues.apache.org/jira/browse/FLINK-8101
>
> However several PRs were issued:
>
> https://github.com/apache/flink/pull/4675
> https://github.com/apache/flink/pull/5374
>
> I also raised the issue on the mailing list in the 1.5 timeframe:
>
> http://apache-flink-mailing-list-archive.1008284.n3.
> nabble.com/DISCUSS-Releasing-Flink-1-5-0-td20867.html#a20905
>
> But things are still not really moving. However this seems something people
> are looking for, so I would really like the community to consider that for
> 1.6.
>
> The problems I see from comments on the PRs:
>
> - getting something that follows the Flink APIs initially created is a
> nightmare because Elastic is pretty good at breaking compatibility the hard
> way (see in particular in the last PR the casts that have to be made to get
> an API that works in all cases)
> - Elasticsearch is moving away from the "native" API Flink is using towards
> the REST APIs, so there is little common ground between pre-6 and post-6
> even if Elasticsearch tried to keep some level of compatibility in the APIs.
>
> My fear is that by trying to kill two birds with one stone, we actually get
> nothing done.
>
> In the hope of moving that forward I would like to propose for 1.6 a new
> Elasticsearch 6.x+ sink that would follow the design of the previous ones
> BUT only leverage the new REST API and not inherit from existing classes.
> It would really be close to what is in my previous PR:
> https://github.com/apache/flink/pull/5374 but just focus on E6+/REST and
> so
> avoid any "strange" cast.
>
> This would not fill the gap of the 5.2+ not working but at least we would
> be back on track with something for the future as REST API is where Elastic
> is going.
>
> If people feel there is actual interest and chances this can be merged I'll
> be working on issuing a new PR around that.
>
> Alternative is to get back working on the existing PR but it seems to be a
> dead-end at the moment and not necessarily the best option long term as
> anyway Elasticsearch is looking into promoting the REST API.
>
> Please let me know what you think?
>
> --
> Christophe
>


Re: [VOTE] Release 1.5.0, release candidate #1

2018-04-03 Thread Flavio Pompermaier
Sorry Till, but we had an error the first time we tried to build Flink
using mvn clean package... now we're trying to reproduce the error but with no
luck :)
So for the moment don't consider my note...

On Tue, Apr 3, 2018 at 10:56 AM, Till Rohrmann  wrote:

> Hi Sihua,
>
> Stefan will look into FLINK-8699 and then we will decide whether it should
> be part of 1.5.0 or not. The issue should, however, not block others from
> testing the RC.
>
> @Flavio, what exactly is the problem with maven 3.0.x when running `mvn
> clean package -DskipTests`?
>
> Cheers,
> Till
>
> On Tue, Apr 3, 2018 at 10:34 AM, Flavio Pompermaier 
> wrote:
>
> > Just one note.
> > There are 2 suggested ways of compiling Flink:
> >
> >1. From github Readme: mvn clean package -DskipTests (
> >https://github.com/apache/flink <https://github.com/apache/flink>)
> >2. From site documentation: mvn clean install -DskipTests (
> >https://ci.apache.org/projects/flink/flink-docs-
> >release-1.5/start/building.html
> ><https://ci.apache.org/projects/flink/flink-docs-
> > release-1.5/start/building.html>
> >)
> >
> > The first one doesn't work with maven 3.0.x while the second one works
> with
> > both (3.0.x and 3.3.x)
> >
> > On Tue, Apr 3, 2018 at 7:59 AM, 周思华  wrote:
> >
> > > Hi Till,
> > > Thanks for raising this candidate, I found this
> > https://github.com/apache/
> > > flink/pull/5705 is still not merged, should we get it into flink-1.5, I
> > > ask this because I found it on the bug list on the JIRA release notes
> > that
> > > you provided.
> > >
> > >
> > > Best Regards,
> > > Sihua Zhou
> > >
> > >
> > > 发自网易邮箱大师
> > >
> > >
> > > On 04/3/2018 03:47,Till Rohrmann wrote:
> > > Hi everyone,
> > > Please review and vote on the release candidate #1 for the version
> 1.5.0,
> > > as follows:
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > >
> > > The complete staging area is available for your review, which includes:
> > > * JIRA release notes [1],
> > > * the official Apache source release and binary convenience releases to
> > be
> > > deployed to dist.apache.org [2], which are signed with the key with
> > > fingerprint 1F302569A96CFFD5 [3],
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > * source code tag "release-1.5.0-rc1" [5],
> > >
> > > Please use this document for coordinating testing efforts: [6]
> > >
> > > The vote will be open for at least 72 hours. It is adopted by majority
> > > approval, with at least 3 PMC affirmative votes.
> > >
> > > Thanks,
> > > Your friendly Release Manager
> > >
> > > [1]
> > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> > > projectId=12315522&version=12341764
> > > [2] http://people.apache.org/~trohrmann/flink-1.5.0-rc1/
> > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > [4] https://repository.apache.org/content/repositories/
> > orgapacheflink-1154
> > > [5]
> > > https://git-wip-us.apache.org/repos/asf?p=flink.git;a=commit;h=
> > > 2af481a0fa88dd2f7964c9105cf4d735c3ac8303
> > > [6]
> > > https://docs.google.com/document/d/1Y-tB6yxs9HHoUxQKdTLPzJiJeNxx_
> > > uoPIvOVV9dlrPw/edit?usp=sharing
> > >
> > > Pro-tip: you can create a settings.xml file with these contents:
> > >
> > > 
> > > 
> > > flink-1.5.0
> > > 
> > > 
> > > 
> > > flink-1.5.0
> > > 
> > > 
> > > flink-1.5.0
> > > 
> > >
> > > https://repository.apache.org/content/repositories/
> orgapacheflink-1154/
> > > 
> > > 
> > > 
> > > archetype
> > > 
> > >
> > > https://repository.apache.org/content/repositories/
> orgapacheflink-1154/
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > >
> > > And reference that in you maven commands via --settings
> > > path/to/settings.xml. This is useful for creating a quickstart based on
> > the
> > > staged release and for building against the staged jars.
> > >
> >
> >
> >
> > --
> > Flavio Pompermaier
> > Development Department
> >
> > OKKAM S.r.l.
> > Tel. +(39) 0461 041809
> >
>



-- 
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 041809


Re: [VOTE] Release 1.5.0, release candidate #1

2018-04-03 Thread Flavio Pompermaier
Just one note.
There are 2 suggested ways of compiling Flink:

   1. From github Readme: mvn clean package -DskipTests (
   https://github.com/apache/flink <https://github.com/apache/flink>)
   2. From site documentation: mvn clean install -DskipTests (
   https://ci.apache.org/projects/flink/flink-docs-
   release-1.5/start/building.html
   
<https://ci.apache.org/projects/flink/flink-docs-release-1.5/start/building.html>
   )

The first one doesn't work with maven 3.0.x while the second one works with
both (3.0.x and 3.3.x)

On Tue, Apr 3, 2018 at 7:59 AM, 周思华  wrote:

> Hi Till,
> Thanks for raising this candidate, I found this https://github.com/apache/
> flink/pull/5705 is still not merged, should we get it into flink-1.5, I
> ask this because I found it on the bug list on the JIRA release notes that
> you provided.
>
>
> Best Regards,
> Sihua Zhou
>
>
> 发自网易邮箱大师
>
>
> On 04/3/2018 03:47,Till Rohrmann wrote:
> Hi everyone,
> Please review and vote on the release candidate #1 for the version 1.5.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 1F302569A96CFFD5 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "release-1.5.0-rc1" [5],
>
> Please use this document for coordinating testing efforts: [6]
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Your friendly Release Manager
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12315522&version=12341764
> [2] http://people.apache.org/~trohrmann/flink-1.5.0-rc1/
> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> [4] https://repository.apache.org/content/repositories/orgapacheflink-1154
> [5]
> https://git-wip-us.apache.org/repos/asf?p=flink.git;a=commit;h=
> 2af481a0fa88dd2f7964c9105cf4d735c3ac8303
> [6]
> https://docs.google.com/document/d/1Y-tB6yxs9HHoUxQKdTLPzJiJeNxx_
> uoPIvOVV9dlrPw/edit?usp=sharing
>
> Pro-tip: you can create a settings.xml file with these contents:
>
> <settings>
> <activeProfiles>
> <activeProfile>flink-1.5.0</activeProfile>
> </activeProfiles>
> <profiles>
> <profile>
> <id>flink-1.5.0</id>
> <repositories>
> <repository>
> <id>flink-1.5.0</id>
> <url>
>
> https://repository.apache.org/content/repositories/orgapacheflink-1154/
> </url>
> </repository>
> <repository>
> <id>archetype</id>
> <url>
>
> https://repository.apache.org/content/repositories/orgapacheflink-1154/
> </url>
> </repository>
> </repositories>
> </profile>
> </profiles>
> </settings>
>
> And reference that in you maven commands via --settings
> path/to/settings.xml. This is useful for creating a quickstart based on the
> staged release and for building against the staged jars.
>



-- 
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 041809


Re: [jira] [Created] (FLINK-8356) JDBCAppendTableSink does not work for Hbase Phoenix Driver

2018-01-03 Thread Flavio Pompermaier
I had a similar problem with the batch API... the problem is that you have to
enable autocommit in the connection URL. The JDBC connector should handle this
specific case better as well (IMHO).

See https://issues.apache.org/jira/browse/FLINK-7605
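
For reference, a minimal sketch of the batch-API variant I was referring to (the
Phoenix driver and URL are taken from the report below; the table and columns are
made up). Unless auto-commit is enabled on the underlying connection, the upserts
are silently never committed:

    JDBCOutputFormat output = JDBCOutputFormat.buildJDBCOutputFormat()
        .setDrivername("org.apache.phoenix.jdbc.PhoenixDriver")
        .setDBUrl("jdbc:phoenix:hosts:2181:/hbase-unsecure")
        .setQuery("upsert INTO MY_TABLE (ID, NAME) values (?, ?)")
        .finish();

    // rows: a DataSet<Row> whose fields match the two query parameters
    rows.output(output);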

On 3 Jan 2018 22:25, "Paul Wu (JIRA)"  wrote:

> Paul Wu created FLINK-8356:
> --
>
>  Summary: JDBCAppendTableSink does not work for Hbase Phoenix
> Driver
>  Key: FLINK-8356
>  URL: https://issues.apache.org/jira/browse/FLINK-8356
>  Project: Flink
>   Issue Type: Bug
>   Components: Table API & SQL
> Affects Versions: 1.4.0
> Reporter: Paul Wu
>
>
> The following code runs without errors, but the data is not inserted into
> the HBase table. However, it does work for MySQL (see the commented out
> code). The Phoenix driver is from https://mvnrepository.com/
> artifact/org.apache.phoenix/phoenix/4.7.0-HBase-1.1
>
> String query = "select CURRENT_DATE SEGMENTSTARTTIME, CURRENT_DATE
> SEGMENTENDTIME, cast (imsi as varchar) imsi, cast(imei as varchar) imei
> from ts ";
>
> Table table = ste.sqlQuery(query);
> JDBCAppendTableSinkBuilder jdbc = JDBCAppendTableSink.builder();
> jdbc.setDrivername("org.apache.phoenix.jdbc.PhoenixDriver");
> jdbc.setDBUrl("jdbc:phoenix:hosts:2181:/hbase-unsecure");
> jdbc.setQuery("upsert INTO GEO_ANALYTICS_STREAMING_DATA
> (SEGMENTSTARTTIME,SEGMENTENDTIME, imsi, imei) values (?,?,?, ?)");
> // JDBCAppendTableSinkBuilder jdbc = JDBCAppendTableSink.builder();
> //jdbc.setDrivername("com.mysql.jdbc.Driver");
> //jdbc.setDBUrl("jdbc:mysql://localhost/test");
> //jdbc.setUsername("root").setPassword("");
> //jdbc.setQuery("insert INTO GEO_ANALYTICS_STREAMING_DATA
> (SEGMENTSTARTTIME,SEGMENTENDTIME, imsi, imei) values (?,?,?, ?)");
> //jdbc.setBatchSize(1);
> jdbc.setParameterTypes(Types.SQL_DATE, Types.SQL_DATE,
> Types.STRING, Types.STRING);
> JDBCAppendTableSink sink = jdbc.build();
> table.writeToSink(sink);
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.4.14#64029)
>


[jira] [Created] (FLINK-7930) Support periodic jobs with state that gets restored and persisted in a savepoint

2017-10-26 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-7930:
-

 Summary: Support periodic jobs with state that gets restored and 
persisted in a savepoint 
 Key: FLINK-7930
 URL: https://issues.apache.org/jira/browse/FLINK-7930
 Project: Flink
  Issue Type: New Feature
  Components: DataStream API
Reporter: Flavio Pompermaier


As discussed in 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/State-snapshotting-when-source-is-finite-td16398.html,
 it could be useful to support the use case of periodic jobs with state that 
gets restored and persisted in a savepoint (in order to avoid the need for an 
external sink).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (FLINK-7845) Netty Exception when submitting batch job repeatedly

2017-10-16 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-7845:
-

 Summary: Netty Exception when submitting batch job repeatedly
 Key: FLINK-7845
 URL: https://issues.apache.org/jira/browse/FLINK-7845
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 1.3.2
Reporter: Flavio Pompermaier


We had some problems with Flink and Netty so we wrote a small unit test to 
reproduce the memory issues we have in production. It happens that we have to 
restart the Flink cluster because the memory is always increasing from job to 
job. 
The github project is https://github.com/okkam-it/flink-memory-leak and the 
JUnit test is contained in the MemoryLeakTest class (within src/main/test).
I don't know if this is the root of our problems, but at some point, usually 
around the 28th loop, the job fails with the following exception (actually we 
never faced that in production, but maybe it is related to the memory issue 
somehow...):

{code:java}
Caused by: java.lang.IllegalAccessError: 
org/apache/flink/runtime/io/network/netty/NettyMessage
at 
io.netty.util.internal.__matchers__.org.apache.flink.runtime.io.network.netty.NettyMessageMatcher.match(NoOpTypeParameterMatcher.java)
at 
io.netty.channel.SimpleChannelInboundHandler.acceptInboundMessage(SimpleChannelInboundHandler.java:95)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:102)
... 16 more
{code}





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (FLINK-7605) JDBCOutputFormat autoCommit

2017-09-08 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-7605:
-

 Summary: JDBCOutputFormat autoCommit
 Key: FLINK-7605
 URL: https://issues.apache.org/jira/browse/FLINK-7605
 Project: Flink
  Issue Type: Improvement
  Components: Batch Connectors and Input/Output Formats
Affects Versions: 1.3.2
Reporter: Flavio Pompermaier
Priority: Minor


Currently, if a connection is not created with autoCommit = true by default 
(e.g. with Apache Phoenix), no data is written into the database.

So, in JDBCOutputFormat.open(), autoCommit should be forced on the created 
Connection, i.e.:

{code:java}
if (!conn.getAutoCommit()) {
  conn.setAutoCommit(true);
}
{code}

This should also be well documented.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: [POLL] Who still uses Java 7 with Flink ?

2017-06-08 Thread Flavio Pompermaier
We stopped using Java 7 last year already.
So +1 to upgrade to Java 8 from Okkam

On 8 Jun 2017 5:39 pm, "Robert Metzger"  wrote:

> Hi all,
>
> as promised in March, I want to revive this discussion!
>
> Our users are begging for Scala 2.12 support [1], migration to Akka 2.4
> would solve a bunch of shading / dependency issues (Akka 2.4 will remove
> Akka's protobuf dependency [2][3]) and generally Java 8's new language
> features all speak for dropping Java 7.
>
> Java 8 has been released in March, 2014. Java 7 is unsupported since June
> 2016.
>
> So what's the feeling in the community regarding the step?
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-5005#
> [2] https://issues.apache.org/jira/browse/FLINK-5989
> [3] https://issues.apache.org/jira/browse/FLINK-3211?focused
> CommentId=15274018&page=com.atlassian.jira.plugin.system.
> issuetabpanels:comment-tabpanel#comment-15274018
>
>
> On Thu, Mar 23, 2017 at 2:42 PM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
>
>> Hello all,
>>
>> I'm sure you've considered this already, but what this data does not
>> include is all the potential future users,
>> i.e. slower moving organizations (banks etc.) which could be on Java 7
>> still.
>>
>> Whether those are relevant is up for debate.
>>
>> Cheers,
>> Theo
>>
>> On Thu, Mar 23, 2017 at 12:14 PM, Robert Metzger 
>> wrote:
>>
>>> Yeah, you are right :)
>>> I'll put something in my calendar for end of May.
>>>
>>> On Thu, Mar 23, 2017 at 12:12 PM, Greg Hogan  wrote:
>>>
 Robert,

 Thanks for the report. Shouldn’t we be revisiting this decision at the
 beginning of the new release cycle rather than near the end? There is
 currently little cost to staying with Java 7 since no Flink code or pull
 requests have been written for Java 8.

 Greg



 On Mar 23, 2017, at 6:37 AM, Robert Metzger 
 wrote:

 Looks like 9% on twitter and 24% on the mailing list are still using
 Java 7.

 I would vote to keep supporting Java 7 for Flink 1.3 and then revisit
 once we are approaching 1.4 in September.

 On Thu, Mar 16, 2017 at 8:00 AM, Bowen Li 
 wrote:

> There's always a tradeoff we need to make. I'm in favor of upgrading
> to Java 8 to bring in all new Java features.
>
> The common way I've seen (and I agree) other software upgrading major
> things like this is 1) upgrade for next big release without backward
> compatibility and notify everyone 2) maintain and patch current, old-tech
> compatible version at a reasonably limited scope. Building backward
> compatibility is too much for an open sourced project
>
>
>
> On Wed, Mar 15, 2017 at 7:10 AM, Robert Metzger 
> wrote:
>
>> I've put it also on our Twitter account:
>> https://twitter.com/ApacheFlink/status/842015062667755521
>>
>> On Wed, Mar 15, 2017 at 2:19 PM, Martin Neumann > >
>> wrote:
>>
>> > I think this easier done in a straw poll than in an email
>> conversation.
>> > I created one at: http://www.strawpoll.me/12535073
>> > (Note that you have multiple choices.)
>> >
>> >
>> > Though I prefer Java 8 most of the time I have to work on Java 7. A
>> lot of
>> > the infrastructure I work on still runs Java 7, one of the
>> companies I
>> > build a prototype for a while back just updated to Java 7 2 years
>> ago. I
>> > doubt we can ditch Java 7 support any time soon if we want to make
>> it easy
>> > for companies to use Flink.
>> >
>> > cheers Martin
>> >
>> > //PS sorry if this gets sent twice, we just migrated to a new mail
>> system
>> > and a lot of things are broken
>> >
>> > 
>> > From: Stephan Ewen 
>> > Sent: Wednesday, March 15, 2017 12:30:24 PM
>> > To: u...@flink.apache.org; dev@flink.apache.org
>> > Subject: [POLL] Who still uses Java 7 with Flink ?
>> >
>> > Hi all!
>> >
>> > I would like to get a feeling how much Java 7 is still being used
>> among
>> > Flink users.
>> >
>> > At some point, it would be great to drop Java 7 support and make
>> use of
>> > Java 8's new features, but first we would need to get a feeling how
>> much
>> > Java 7 is still used.
>> >
>> > Would be happy if users on Java 7 respond here, or even users that
>> have
>> > some insights into how widespread they think Java 7 still is.
>> >
>> > Thanks,
>> > Stephan
>> >
>> >
>> >
>> >
>> >
>>
>
>


>>>
>>
>


Operator name

2017-05-12 Thread Flavio Pompermaier
Hi to all,
in many of my Flink jobs it is helpful to give a name to operators in order
to make the JobManager UI simpler to read.
Currently, to give a name to an operator it is necessary to specify it on the
operator every time it is used, for example:

   - env.readAsCsv().map().name("My map function")

Wouldn't it be more convenient to allow an operator to override the default
name (at least for rich ones) by adding to *RichFunction* something like the
following (see the sketch below)?

   - public abstract String getOperatorName()
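
A rough, purely hypothetical sketch of how this could look from the user side
(getOperatorName() does NOT exist on RichFunction today; it only illustrates the
proposal):

    public class MyMapFunction extends RichMapFunction<String, String> {

        // proposed hook: would be used as the default operator name in the UI
        public String getOperatorName() {
            return "My map function";
        }

        @Override
        public String map(String value) {
            return value.trim();
        }
    }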

What do you think?

Best,
Flavio
-- 
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908


Re: [DISCUSS] Feature Freeze

2017-04-28 Thread Flavio Pompermaier
Any chance to cherry-pick this also into 1.2.1? We're using Flink 1.2.0 in
production and maybe an upgrade to 1.2.1 would be a safer option in the
short term..

Best,
Flavio

On Fri, Apr 28, 2017 at 2:00 PM, Aljoscha Krettek 
wrote:

> Ah, I see. The fix for that has been merged into master so it will be
> release in Flink 1.3.
>
> > On 28. Apr 2017, at 13:50, Flavio Pompermaier 
> wrote:
> >
> > Sorry, you're right Aljoscha... the issue number is correct, the link is
> > wrong! The correct one is https://issues.apache.org/
> jira/browse/FLINK-6398
> >
> > On Fri, Apr 28, 2017 at 11:48 AM, Aljoscha Krettek 
> > wrote:
> >
> >> I think there might be a typo. We haven’t yet reached issue number 6389,
> >> if I’m not mistaken. The latest as I’m writing this is 6410.
> >>
> >>> On 28. Apr 2017, at 10:00, Flavio Pompermaier 
> >> wrote:
> >>>
> >>> If it's not a problem it will be great for us to include also
> FLINK-6398
> >>> <https://issues.apache.org/jira/browse/FLINK-6938> if it's not a big
> >> deal
> >>>
> >>> Best,
> >>> Flavio
> >>>
> >>> On Fri, Apr 28, 2017 at 3:32 AM, Zhuoluo Yang <
> >> zhuoluo@alibaba-inc.com>
> >>> wrote:
> >>>
> >>>> Hi Devs,
> >>>>
> >>>> Thanks for the release plan.
> >>>>
> >>>> Could you also please add the feature FLINK-6196
> >>>> <https://issues.apache.org/jira/browse/FLINK-6196> Support dynamic
> >> schema
> >>>> in Table Function?
> >>>> I’d like to update the code as comments left on PR today.
> >>>> I will try to make sure the code is updated before the Apr 30th.
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Zhuoluo 😀
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> 在 2017年4月28日,上午8:48,Haohui Mai  写道:
> >>>>
> >>>> Hello,
> >>>>
> >>>> Thanks for starting this thread. It would be great to see the
> following
> >>>> features available in Flink 1.3:
> >>>>
> >>>> * Support for complex schema: FLINK-6033, FLINK-6377
> >>>> * Various improvements on SQL over group windows: FLINK-6335,
> FLINK-6373
> >>>> * StreamTableSink for JDBC and Cassandra: FLINK-6281, FLINK-6225
> >>>> * Decoupling Flink and Hadoop: FLINK-5998
> >>>>
> >>>> All of them have gone through at least one round of review so I'm
> >>>> optimistic that they can make it to 1.3 in a day or two.
> >>>>
> >>>> Additionally it would be great to see FLINK-6232 go in, but it depends
> >> on
> >>>> FLINK-5884 so it might be a little bit tough.
> >>>>
> >>>> Regards,
> >>>> Haohui
> >>>>
> >>>> On Thu, Apr 27, 2017 at 12:22 PM Chesnay Schepler  >
> >>>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> FLINK-5892 (Restoring state by operator) is also nearing completion,
> but
> >>>> with only 1 day left before the weekend we're cutting it really short.
> >>>>
> >>>> Since this eliminates a major pain point when updating jobs, as it
> >>>> allows the modification of chains, another day or 2 would be good i
> >> think.
> >>>>
> >>>> Regards,
> >>>> Chesnay
> >>>>
> >>>> On 27.04.2017 18:55, Bowen Li wrote:
> >>>>
> >>>> Hi Ufuk,
> >>>>I'd like to get FLINK-6013 (Adding Datadog Http metrics reporter)
> >>>>
> >>>> into
> >>>>
> >>>> release 1.3. It's in the final state of code review in
> >>>> https://github.com/apache/flink/pull/3736
> >>>>
> >>>> Thanks,
> >>>> Bowen
> >>>>
> >>>> On Thu, Apr 27, 2017 at 8:38 AM, Zhijiang(wangzhijiang999) <
> >>>> wangzhijiang...@aliyun.com> wrote:
> >>>>
> >>>> Hi Ufuk,
> >>>> Thank you for launching this topic!
> >>>> I wish my latest refinement of buffer provider (
> >>>>
> >>>> https://issues.apache.org/
> >>>>
> >>>> jira/browse/FLINK-6337)  to be included in 1.3 and most of the jobs
> can
> >>>> get benefit from it. And I think it can be completed with the help of
> >>>>
> >>>> your
> >>>>
> >>>> reviews this week.
> >>>>
> >>>> Cheers,Zhijiang-
> >>>> -发件人:Ufuk
> >>>>
> >>>> Celebi 发送时间:2017年4月27日(星期四) 22:25收件人:dev <
> >>>> dev@flink.apache.org>抄 送:Robert Metzger 主
> >>>> 题:[DISCUSS] Feature Freeze
> >>>> Hey devs! :-)
> >>>>
> >>>> We decided to follow a time-based release model with the upcoming 1.3
> >>>> release and the planned feature freeze is on Monday, May 1st.
> >>>>
> >>>> I wanted to start a discussion to get a quick overview of the current
> >>>> state of things.
> >>>>
> >>>> - Is everyone on track and aware of the feature freeze? ;)
> >>>> - Are there any major features we want in 1.3 that
> >>>> have not been merged yet?
> >>>> - Do we need to extend the feature freeze, because of an
> >>>> important feature?
> >>>>
> >>>> Would be great to gather a list of features/PRs that we want in the
> >>>> 1.3 release. This could be a good starting point for the release
> >>>> manager (@Robert?).
> >>>>
> >>>> Best,
> >>>>
> >>>> Ufuk
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >>
>


Re: [DISCUSS] Feature Freeze

2017-04-28 Thread Flavio Pompermaier
Sorry, you're right Aljoscha... the issue number is correct, the link is
wrong! The correct one is https://issues.apache.org/jira/browse/FLINK-6398

On Fri, Apr 28, 2017 at 11:48 AM, Aljoscha Krettek 
wrote:

> I think there might be a typo. We haven’t yet reached issue number 6389,
> if I’m not mistaken. The latest as I’m writing this is 6410.
>
> > On 28. Apr 2017, at 10:00, Flavio Pompermaier 
> wrote:
> >
> > If it's not a problem it will be great for us to include also FLINK-6398
> > <https://issues.apache.org/jira/browse/FLINK-6938> if it's not a big
> deal
> >
> > Best,
> > Flavio
> >
> > On Fri, Apr 28, 2017 at 3:32 AM, Zhuoluo Yang <
> zhuoluo@alibaba-inc.com>
> > wrote:
> >
> >> Hi Devs,
> >>
> >> Thanks for the release plan.
> >>
> >> Could you also please add the feature FLINK-6196
> >> <https://issues.apache.org/jira/browse/FLINK-6196> Support dynamic
> schema
> >> in Table Function?
> >> I’d like to update the code as comments left on PR today.
> >> I will try to make sure the code is updated before the Apr 30th.
> >>
> >>
> >> Thanks,
> >>
> >> Zhuoluo 😀
> >>
> >>
> >>
> >>
> >>
> >> 在 2017年4月28日,上午8:48,Haohui Mai  写道:
> >>
> >> Hello,
> >>
> >> Thanks for starting this thread. It would be great to see the following
> >> features available in Flink 1.3:
> >>
> >> * Support for complex schema: FLINK-6033, FLINK-6377
> >> * Various improvements on SQL over group windows: FLINK-6335, FLINK-6373
> >> * StreamTableSink for JDBC and Cassandra: FLINK-6281, FLINK-6225
> >> * Decoupling Flink and Hadoop: FLINK-5998
> >>
> >> All of them have gone through at least one round of review so I'm
> >> optimistic that they can make it to 1.3 in a day or two.
> >>
> >> Additionally it would be great to see FLINK-6232 go in, but it depends
> on
> >> FLINK-5884 so it might be a little bit tough.
> >>
> >> Regards,
> >> Haohui
> >>
> >> On Thu, Apr 27, 2017 at 12:22 PM Chesnay Schepler 
> >> wrote:
> >>
> >> Hello,
> >>
> >> FLINK-5892 (Restoring state by operator) is also nearing completion, but
> >> with only 1 day left before the weekend we're cutting it really short.
> >>
> >> Since this eliminates a major pain point when updating jobs, as it
> >> allows the modification of chains, another day or 2 would be good i
> think.
> >>
> >> Regards,
> >> Chesnay
> >>
> >> On 27.04.2017 18:55, Bowen Li wrote:
> >>
> >> Hi Ufuk,
> >> I'd like to get FLINK-6013 (Adding Datadog Http metrics reporter)
> >>
> >> into
> >>
> >> release 1.3. It's in the final state of code review in
> >> https://github.com/apache/flink/pull/3736
> >>
> >> Thanks,
> >> Bowen
> >>
> >> On Thu, Apr 27, 2017 at 8:38 AM, Zhijiang(wangzhijiang999) <
> >> wangzhijiang...@aliyun.com> wrote:
> >>
> >> Hi Ufuk,
> >> Thank you for launching this topic!
> >> I wish my latest refinement of buffer provider (
> >>
> >> https://issues.apache.org/
> >>
> >> jira/browse/FLINK-6337)  to be included in 1.3 and most of the jobs can
> >> get benefit from it. And I think it can be completed with the help of
> >>
> >> your
> >>
> >> reviews this week.
> >>
> >> Cheers,Zhijiang-
> >> -发件人:Ufuk
> >>
> >> Celebi 发送时间:2017年4月27日(星期四) 22:25收件人:dev <
> >> dev@flink.apache.org>抄 送:Robert Metzger 主
> >> 题:[DISCUSS] Feature Freeze
> >> Hey devs! :-)
> >>
> >> We decided to follow a time-based release model with the upcoming 1.3
> >> release and the planned feature freeze is on Monday, May 1st.
> >>
> >> I wanted to start a discussion to get a quick overview of the current
> >> state of things.
> >>
> >> - Is everyone on track and aware of the feature freeze? ;)
> >> - Are there any major features we want in 1.3 that
> >> have not been merged yet?
> >> - Do we need to extend the feature freeze, because of an
> >> important feature?
> >>
> >> Would be great to gather a list of features/PRs that we want in the
> >> 1.3 release. This could be a good starting point for the release
> >> manager (@Robert?).
> >>
> >> Best,
> >>
> >> Ufuk
> >>
> >>
> >>
> >>
> >>
>
>


Re: [DISCUSS] Feature Freeze

2017-04-28 Thread Flavio Pompermaier
If it's not a problem it will be great for us to include also FLINK-6398
 if it's not a big deal

Best,
Flavio

On Fri, Apr 28, 2017 at 3:32 AM, Zhuoluo Yang 
wrote:

> Hi Devs,
>
> Thanks for the release plan.
>
> Could you also please add the feature FLINK-6196
>  Support dynamic schema
> in Table Function?
> I’d like to update the code as comments left on PR today.
> I will try to make sure the code is updated before the Apr 30th.
>
>
> Thanks,
>
> Zhuoluo 😀
>
>
>
>
>
> 在 2017年4月28日,上午8:48,Haohui Mai  写道:
>
> Hello,
>
> Thanks for starting this thread. It would be great to see the following
> features available in Flink 1.3:
>
> * Support for complex schema: FLINK-6033, FLINK-6377
> * Various improvements on SQL over group windows: FLINK-6335, FLINK-6373
> * StreamTableSink for JDBC and Cassandra: FLINK-6281, FLINK-6225
> * Decoupling Flink and Hadoop: FLINK-5998
>
> All of them have gone through at least one round of review so I'm
> optimistic that they can make it to 1.3 in a day or two.
>
> Additionally it would be great to see FLINK-6232 go in, but it depends on
> FLINK-5884 so it might be a little bit tough.
>
> Regards,
> Haohui
>
> On Thu, Apr 27, 2017 at 12:22 PM Chesnay Schepler 
> wrote:
>
> Hello,
>
> FLINK-5892 (Restoring state by operator) is also nearing completion, but
> with only 1 day left before the weekend we're cutting it really short.
>
> Since this eliminates a major pain point when updating jobs, as it
> allows the modification of chains, another day or 2 would be good i think.
>
> Regards,
> Chesnay
>
> On 27.04.2017 18:55, Bowen Li wrote:
>
> Hi Ufuk,
>  I'd like to get FLINK-6013 (Adding Datadog Http metrics reporter)
>
> into
>
> release 1.3. It's in the final state of code review in
> https://github.com/apache/flink/pull/3736
>
> Thanks,
> Bowen
>
> On Thu, Apr 27, 2017 at 8:38 AM, Zhijiang(wangzhijiang999) <
> wangzhijiang...@aliyun.com> wrote:
>
> Hi Ufuk,
> Thank you for launching this topic!
> I wish my latest refinement of buffer provider (
>
> https://issues.apache.org/
>
> jira/browse/FLINK-6337)  to be included in 1.3 and most of the jobs can
> get benefit from it. And I think it can be completed with the help of
>
> your
>
> reviews this week.
>
> Cheers,Zhijiang-
> -发件人:Ufuk
>
> Celebi 发送时间:2017年4月27日(星期四) 22:25收件人:dev <
> dev@flink.apache.org>抄 送:Robert Metzger 主
> 题:[DISCUSS] Feature Freeze
> Hey devs! :-)
>
> We decided to follow a time-based release model with the upcoming 1.3
> release and the planned feature freeze is on Monday, May 1st.
>
> I wanted to start a discussion to get a quick overview of the current
> state of things.
>
> - Is everyone on track and aware of the feature freeze? ;)
> - Are there any major features we want in 1.3 that
> have not been merged yet?
> - Do we need to extend the feature freeze, because of an
> important feature?
>
> Would be great to gather a list of features/PRs that we want in the
> 1.3 release. This could be a good starting point for the release
> manager (@Robert?).
>
> Best,
>
> Ufuk
>
>
>
>
>


[jira] [Created] (FLINK-6271) NumericBetweenParametersProvider NullPinter

2017-04-06 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-6271:
-

 Summary: NumericBetweenParametersProvider NullPinter
 Key: FLINK-6271
 URL: https://issues.apache.org/jira/browse/FLINK-6271
 Project: Flink
  Issue Type: Bug
  Components: Batch Connectors and Input/Output Formats
Affects Versions: 1.2.0
Reporter: Flavio Pompermaier
Assignee: Flavio Pompermaier


Creating a NumericBetweenParametersProvider using fetchSize=1000, min=0 and 
max=999 fails with a NullPointerException; see the repro sketch below.
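
A minimal repro sketch with the values from the description (where exactly the 
NullPointerException is thrown has not been re-verified here):

{code:java}
ParameterValuesProvider provider =
    new NumericBetweenParametersProvider(1000, 0, 999);

Serializable[][] params = provider.getParameterValues();
{code}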



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: [VOTE] Release Apache Flink 1.2.1 (RC1)

2017-04-04 Thread Flavio Pompermaier
>>>>>>>>> it
>>>
>>>> later.
>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 29, 2017, at 23:08, Aljoscha Krettek wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I commented on FLINK-6214: I think it's working as intended,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> although
>>>>>>>>>>>
>>>>>>>>>>>> we
>>>>>>>>>>>>>
>>>>>>>>>>>>>> could fix the javadoc/doc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 29, 2017, at 17:35, Timo Walther wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> A user reported that all tumbling and slinding window
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> assigners
>>>>>>>
>>>>>>>> contain
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> a pretty obvious bug about offsets.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-6214
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think we should also fix this for 1.2.1. What do you
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> think?
>>>
>>>> Regards,
>>>>>>>>>>>>>>>>> Timo
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am 29/03/17 um 11:30 schrieb Robert Metzger:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Haohui,
>>>>>>>>>>>>>>>>>> I agree that we should fix the parallelism issue.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Otherwise,
>>>
>>>> the
>>>>>>>>>>>
>>>>>>>>>>>> 1.2.1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> release would introduce a new bug.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Mar 28, 2017 at 11:59 PM, Haohui Mai <
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ricet...@gmail.com>
>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -1 (non-binding)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We recently found out that all jobs submitted via UI will
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> have a
>>>>>>>>>>>
>>>>>>>>>>>> parallelism of 1, potentially due to FLINK-5808.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Filed FLINK-6209 to track it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ~Haohui
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Mar 27, 2017 at 2:59 AM Chesnay Schepler <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ches...@apache.org>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If possible I would like to include FLINK-6183 &
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> FLINK-6184
>>>
>>>> as
>>>>>>>>>>>
>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> They fix 2 metric-related issues that could arise when a
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Task is
>>>>>>>>>>>
>>>>>>>>>>>> cancelled very early. (like, right away)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> FLINK-6183 fixes a memory leak where the TaskMetricGroup
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> was
>>>>>>>
>>>>>>>> never closed
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> FLINK-6184 fixes a NullPointerExceptions in the buffer
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>
>>>>>>>>>>>> PR here: https://github.com/apache/flink/pull/3611
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 26.03.2017 12:35, Aljoscha Krettek wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I opened a PR for FLINK-6188:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://github.com/apache/
>>>
>>>> flink/pull/3616
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/flink/pull/3616>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This improves the previously very sparse test coverage
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> for
>>>
>>>> timestamp/watermark assigners and fixes the bug.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 25 Mar 2017, at 10:22, Ufuk Celebi 
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree with Aljoscha.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -1 because of FLINK-6188
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Sat, Mar 25, 2017 at 9:38 AM, Aljoscha Krettek <
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> aljos...@apache.org>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I filed this issue, which was observed by a user:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-6188
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I think that’s blocking for 1.2.1.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 24 Mar 2017, at 18:57, Ufuk Celebi <
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> u...@apache.org>
>>>
>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> RC1 doesn't contain Stefan's backport for the
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Asynchronous
>>>>>>>>>>>
>>>>>>>>>>>> snapshots
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> for heap-based keyed state that has been merged.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Should
>>>
>>>> we
>>>>>>>>>>>
>>>>>>>>>>>> create
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> RC2
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> with that fix since the voting period only starts on
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Monday?
>>>>>>>>>>>
>>>>>>>>>>>> I think
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> it would only mean rerunning the scripts on your
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> side,
>>>
>>>> right?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> – Ufuk
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Mar 24, 2017 at 3:05 PM, Robert Metzger <
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> rmetz...@apache.org>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Dear Flink community,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Please vote on releasing the following candidate as
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>
>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> version 1.2
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> .1.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The commit to be voted on:
>>>>>>>>>>>>>>>>>>>>>>>>> *732e55bd* (*
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> http://git-wip-us.apache.org/r
>>>>>>>>>>>>>>>>>>>> epos/asf/flink/commit/
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 732e55bd
>>>>>>>>>>>
>>>>>>>>>>>> <http://git-wip-us.apache.org/
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> repos/asf/flink/commit/732e55b
>>>>>>>>>>>>>
>>>>>>>>>>>>>> d>*)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Branch:
>>>>>>>>>>>>>>>>>>>>>>>>> release-1.2.1-rc1
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The release artifacts to be voted on can be found
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> at:
>>>
>>>> *http://people.apache.org/~
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> rmetzger/flink-1.2.1-rc1/
>>>
>>>> <http://people.apache.org/~
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> rmetzger/flink-1.2.1-rc1/
>>>
>>>> *
>>>>>>>>
>>>>>>>>> The release artifacts are signed with the key with
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> fingerprint
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> D9839159:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> http://www.apache.org/dist/flink/KEYS
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The staging repository for this release can be
>>>>>>>>>>>>>>>>>>>>>>>>> found
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> at:
>>>>>>>>>>>
>>>>>>>>>>>> https://repository.apache.org/
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> content/repositories/orgapache
>>>>>>>>>>>
>>>>>>>>>>>> flink-1116
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The vote ends on Wednesday, March 29, 2017, 3pm
>>>>>>>>>>>>>>>>>>>>>>>>> CET.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> [ ] +1 Release this package as Apache Flink 1.2.1
>>>>>>>>>>>>>>>>>>>>>>>>> [ ] -1 Do not release this package, because ...
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>
>>>
>


-- 
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908


Re: [DISCUSS] Release Flink 1.1.5 / Flink 1.2.1

2017-03-17 Thread Flavio Pompermaier
I propose to fix https://issues.apache.org/jira/browse/FLINK-6103 before
issuing a release.

On Fri, Mar 17, 2017 at 8:12 AM, Ufuk Celebi  wrote:

> Cool! Thanks for taking care of this Gordon :-)
>
> On Fri, Mar 17, 2017 at 7:13 AM, Tzu-Li (Gordon) Tai
>  wrote:
> > Update for 1.1.5:
> > The last fixes for 1.1.5 are in! I will create the RC today and start
> the vote.
> >
> > Cheers,
> > Gordon
> >
> >
> > On March 17, 2017 at 1:14:53 AM, Robert Metzger (rmetz...@apache.org)
> wrote:
> >
> > The cassandra connector is probably not usable in Flink 1.2.0. I would
> like
> > to include a fix in 1.2.1:
> > https://issues.apache.org/jira/browse/FLINK-6084
> >
> > Please let me know if this fix becomes a blocker for the 1.2.1 release.
> If
> > so, I can validate the fix myself to speed up things.
> >
> > On Thu, Mar 16, 2017 at 9:41 AM, Jinkui Shi 
> wrote:
> >
> >> @Tzu-li(Fordon)Tai
> >>
> >> FLINK-5650 is fix by [1]. Chesnay Scheduler push a PR please.
> >>
> >> [1] https://github.com/zentol/flink/tree/5650_python_test_debug <
> >> https://github.com/zentol/flink/tree/5650_python_test_debug>
> >>
> >>
> >> > 在 2017年3月16日,上午3:37,Stephan Ewen  写道:
> >> >
> >> > Thanks for the update!
> >> >
> >> > Just merged to 1.2.1 also: [FLINK-5962] [checkpoints] Remove scheduled
> >> > cancel-task from timer queue to prevent memory leaks
> >> >
> >> > The remaining issue list looks good, but I would say that (5) is
> >> optional.
> >> > It is not a critical production bug.
> >> >
> >> >
> >> >
> >> > On Wed, Mar 15, 2017 at 5:38 PM, Tzu-Li (Gordon) Tai <
> >> tzuli...@apache.org>
> >> > wrote:
> >> >
> >> >> Thanks a lot for the updates so far everyone!
> >> >>
> >> >> From the discussion so far, the below is the still unfixed pending
> >> issues
> >> >> for 1.1.5 / 1.2.1 release.
> >> >>
> >> >> Since there’s only one backport for 1.1.5 left, I think having an RC
> for
> >> >> 1.1.5 near the end of this week / early next week is very promising,
> as
> >> >> basically everything is already in.
> >> >> I’d be happy to volunteer to help manage the release for 1.1.5, and
> >> >> prepare the RC when it’s ready :)
> >> >>
> >> >> For 1.2.1, we can leave the pending list here for tracking, and come
> >> back
> >> >> to update it in the near future.
> >> >>
> >> >> If there’s anything I missed, please let me know!
> >> >>
> >> >>
> >> >> === Still pending for Flink 1.1.5 ===
> >> >>
> >> >> (1) https://issues.apache.org/jira/browse/FLINK-5701
> >> >> Broken at-least-once Kafka producer.
> >> >> Status: backport PR pending - https://github.com/apache/flink/pull/3549.
> >> >> Since it is a relatively self-contained change, I expect this to be a
> >> fast
> >> >> fix.
> >> >>
> >> >>
> >> >>
> >> >> === Still pending for Flink 1.2.1 ===
> >> >>
> >> >> (1) https://issues.apache.org/jira/browse/FLINK-5808
> >> >> Fix Missing verification for setParallelism and setMaxParallelism
> >> >> Status: PR - https://github.com/apache/flink/pull/3509, review in
> >> progress
> >> >>
> >> >> (2) https://issues.apache.org/jira/browse/FLINK-5713
> >> >> Protect against NPE in WindowOperator window cleanup
> >> >> Status: PR - https://github.com/apache/flink/pull/3535, review
> pending
> >> >>
> >> >> (3) https://issues.apache.org/jira/browse/FLINK-6044
> >> >> TypeSerializerSerializationProxy.read() doesn't verify the read
> buffer
> >> >> length
> >> >> Status: Fixed for master, 1.2 backport pending
> >> >>
> >> >> (4) https://issues.apache.org/jira/browse/FLINK-5985
> >> >> Flink treats every task as stateful (making topology changes
> impossible)
> >> >> Status: PR - https://github.com/apache/flink/pull/3543, review in
> >> progress
> >> >>
> >> >> (5) https://issues.apache.org/jira/browse/FLINK-5650
> >> >> Flink-python tests taking up too much time
> >> >> Status: I think Chesnay currently has some progress with this one, we
> >> can
> >> >> see if we want to make this a blocker
> >> >>
> >> >>
> >> >> Cheers,
> >> >> Gordon
> >> >>
> >> >> On March 15, 2017 at 7:16:53 PM, Jinkui Shi (shijinkui...@163.com)
> >> wrote:
> >> >>
> >> >> Can we fix this issue in 1.2.1:
> >> >>
> >> >> Flink-python tests cost too long time
> >> >> https://issues.apache.org/jira/browse/FLINK-5650 <
> >> >> https://issues.apache.org/jira/browse/FLINK-5650>
> >> >>
> >> >>> On March 15, 2017, at 6:29 PM, Vladislav Pernin wrote:
> >> >>>
> >> >>> I just tested it in my reproducer. It works.
> >> >>>
> >> >>> 2017-03-15 11:22 GMT+01:00 Aljoscha Krettek :
> >> >>>
> >>  I did in fact just open a PR for
> >> > https://issues.apache.org/jira/browse/FLINK-6001
> >> > NPE on TumblingEventTimeWindows with ContinuousEventTimeTrigger
> and
> >> > allowedLateness
> >> 
> >> 
> >>  On Tue, Mar 14, 2017, at 18:20, Vladislav Pernin wrote:
> >> > Hi,
> >> >
> >> > I would also include the following (not yet resolved) issue in the
> >> >> 1.2.1
> >> > scope :
> >> >
> >> > https://issues.apache.org/j

[jira] [Created] (FLINK-6103) LocalFileSystem rename() uses File.renameTo()

2017-03-17 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-6103:
-

 Summary: LocalFileSystem rename() uses File.renameTo()
 Key: FLINK-6103
 URL: https://issues.apache.org/jira/browse/FLINK-6103
 Project: Flink
  Issue Type: Bug
Reporter: Flavio Pompermaier


I've tried to move a directory to another one on the LocalFileSystem and it doesn't 
work (in my case the fs is an instance of java.io.UnixFileSystem).
As with FLINK-1840 (there was a PR to fix the issue - 
https://github.com/apache/flink/pull/578), the problem is that 
{{File.renameTo()}} is not reliable.

Indeed, the Javadoc says:

{{
Renames the file denoted by this abstract pathname.

Many aspects of the behavior of this method are inherently platform-dependent: 
The rename operation might not be able to move a file from one filesystem to 
another, it might not be atomic, and it might not succeed if a file with the 
destination abstract pathname already exists. The return value should always be 
checked to make sure that the rename operation was successful.

Note that the java.nio.file.Files class defines the move method to move or 
rename a file in a platform independent manner
}}
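
For reference, this is roughly what a rename based on java.nio.file.Files.move() 
could look like (a minimal sketch for illustration only, not the actual Flink fix; 
the class and method names are made up):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RenameSketch {

    // Attempts to rename/move src to dst and reports the outcome explicitly.
    public static boolean rename(String src, String dst) {
        Path source = Paths.get(src);
        Path target = Paths.get(dst);
        try {
            // Files.move() is platform independent and throws a descriptive
            // IOException on failure, instead of silently returning false
            // the way File.renameTo() does.
            Files.move(source, target);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}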



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Flink CSV parsing

2017-03-10 Thread Flavio Pompermaier
If you already have an idea on how to proceed, maybe I can take care of it and
open a PR using commons-csv or whatever library you prefer

On 10 Mar 2017 22:07, "Fabian Hueske"  wrote:

Hi Flavio,

Flink's CsvInputFormat was originally meant to be an efficient way to parse
structured text files and dates back to the very early days of the project
(probably 2011 or so).
It was never meant to be compliant with the RFC specification and initially
didn't support many features like quoting, quote escaping, etc. Some of
these were later added but others not.

I agree that the requirements for the CsvInputFormat have changed as more
people are using the project and that a standard compliant parser would be
desirable.
We could definitely look into using an existing library for the parsing,
but it would still need to be integrated with the way that Flink's
InputFormats work. For instance, your approach isn't standard compliant
either, because TextInputFormat is not aware of quotes and would break
records with quoted record delimiters (FLINK-6016 [1]).
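
To illustrate the point with a made-up example (not taken from FLINK-6016 itself),
consider this small file:

id,comment
1,"first line
second line of the same field"
2,"plain"

It contains two data records, but a purely line-based splitter sees four lines and
breaks record 1 at the embedded newline, while an RFC 4180 compliant parser keeps
the quoted field intact.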

I would be OK with having a less efficient format which is not based on the
current implementation but which is standard compliant.
IMO that would be a very useful contribution.

Best, Fabian

[1] https://issues.apache.org/jira/browse/FLINK-6016





2017-03-10 11:28 GMT+01:00 Flavio Pompermaier :

> Hi to all,
> I want to discuss with the dev group something about CSV parsing.
> Since I started using Flink with CSVs I have always faced some little problems
> here and there, and the new tickets about CSV parsing seem to confirm
> that this part is still problematic.
> In my production jobs I gave up using Flink CSV parsing in favour of Apache
> commons-csv and it works great. It's perfectly configurable and robust.
> A working example is available at [1].
>
> Thus, why not use that library directly and contribute back (if needed) to
> that Apache library if improvements are required to speed up the parsing?
> Have you ever tried to compare the performance of the two parsers?
>
> Best,
> Flavio
>
> [1]
> https://github.com/okkam-it/flink-examples/blob/master/
> src/main/java/it/okkam/datalinks/batch/flink/datasourcemanager/importers/
> Csv2RowExample.java
>


Flink CSV parsing

2017-03-10 Thread Flavio Pompermaier
Hi to all,
I want to discuss with the dev group something about CSV parsing.
Since I started using Flink with CSVs I have always faced some little problems
here and there, and the new tickets about CSV parsing seem to confirm
that this part is still problematic.
In my production jobs I gave up using Flink CSV parsing in favour of Apache
commons-csv and it works great. It's perfectly configurable and robust.
A working example is available at [1].

Thus, why not use that library directly and contribute back (if needed) to
that Apache library if improvements are required to speed up the parsing?
Have you ever tried to compare the performance of the two parsers?

Best,
Flavio

[1]
https://github.com/okkam-it/flink-examples/blob/master/src/main/java/it/okkam/datalinks/batch/flink/datasourcemanager/importers/Csv2RowExample.java
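
For reference, the parsing side with commons-csv looks roughly like this (a minimal
sketch under the assumed file name "input.csv", not the code from [1]):

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CsvSketch {
    public static void main(String[] args) throws Exception {
        try (Reader in = Files.newBufferedReader(Paths.get("input.csv"));
             CSVParser parser = CSVFormat.RFC4180.parse(in)) {
            for (CSVRecord record : parser) {
                // Quoting, quote escaping and embedded record delimiters are
                // handled by the parser, so each CSVRecord is a full record.
                System.out.println(record.get(0));
            }
        }
    }
}

The harder part, as noted in the reply above, is integrating such a parser with
Flink's InputFormat splitting, since an input split cannot safely start in the
middle of a quoted field.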


[jira] [Created] (FLINK-5988) Show start-cluster.sh failures

2017-03-07 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-5988:
-

 Summary: Show start-cluster.sh failures
 Key: FLINK-5988
 URL: https://issues.apache.org/jira/browse/FLINK-5988
 Project: Flink
  Issue Type: Bug
  Components: Startup Shell Scripts
Affects Versions: 1.2.0
Reporter: Flavio Pompermaier
Priority: Minor


Currently, if start-cluster.sh fails to execute the taskmanager.sh start 
command on the remote machine (via ssh), nothing is shown to the user and this 
can be very tricky to detect. It would be very helpful to always print the 
result of the ssh command execution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

