Scylla connector

2019-08-11 Thread Lian Jiang
Hi,

I am new to Flink. Is there a Scylla connector equivalent to the Cassandra
connector:
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/cassandra.html?
Or can Flink use Scylla as a sink via the Cassandra connector? Thanks.


Re: Why Job Manager die/restarted when Task Manager die/restarted?

2019-08-11 Thread Zhu Zhu
Another possibility is the JM is killed externally, e.g. K8s may kill JM/TM
if it exceeds the resource limit.

Thanks,
Zhu Zhu

Zhu Zhu  wrote on Mon, Aug 12, 2019 at 1:45 PM:

> Hi Cam,
>
> The Flink master should not die when getting disconnected from task
> managers. It may exit in the cases below:
> 1. The job terminated (FINISHED/FAILED/CANCELED). If your job is
> configured with no restart retry, a TM failure can cause the job to
> become FAILED.
> 2. The JM lost HA leadership, e.g. lost its connection to ZK.
> 3. It encountered another unexpected fatal error. In this case we need to
> check the log to see what happened.
>
> Thanks,
> Zhu Zhu
>
Cam Mach  wrote on Mon, Aug 12, 2019 at 12:15 PM:
>
>> Hello Flink experts,
>>
>> We are running Flink under Kubernetes and see that the Job Manager
>> dies/restarts whenever a Task Manager dies/restarts or they cannot connect
>> to each other. Are there any specific configurations/parameters that we
>> need to turn on to stop this? Or is this expected?
>>
>> Thanks,
>> Cam
>>
>>


Re: Why Job Manager die/restarted when Task Manager die/restarted?

2019-08-11 Thread Zhu Zhu
Hi Cam,

The Flink master should not die when getting disconnected from task
managers. It may exit in the cases below:
1. The job terminated (FINISHED/FAILED/CANCELED). If your job is
configured with no restart retry, a TM failure can cause the job to
become FAILED.
2. The JM lost HA leadership, e.g. lost its connection to ZK.
3. It encountered another unexpected fatal error. In this case we need to
check the log to see what happened.

Thanks,
Zhu Zhu
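Regarding case 1: if no restart strategy is configured, a single TM failure fails the whole job. A sketch of enabling one in flink-conf.yaml (key names as in the Flink 1.8-era docs; the values are only examples, tune them for your job):

```yaml
# Retry the job a few times instead of failing it on the first TM loss.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```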

Cam Mach  wrote on Mon, Aug 12, 2019 at 12:15 PM:

> Hello Flink experts,
>
> We are running Flink under Kubernetes and see that the Job Manager
> dies/restarts whenever a Task Manager dies/restarts or they cannot connect
> to each other. Are there any specific configurations/parameters that we
> need to turn on to stop this? Or is this expected?
>
> Thanks,
> Cam
>
>


Re: Why available task slots are not leveraged for pipeline?

2019-08-11 Thread Zhu Zhu
Hi Cam,
This case is expected due to slot sharing.
A slot can be shared by one instance of each different task, so the number
of used slots equals the maximum parallelism among your tasks.
You can specify the sharing group with slotSharingGroup(String
slotSharingGroup) on operators.

Thanks,
Zhu Zhu
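To illustrate the slot-sharing arithmetic, here is a small back-of-the-envelope sketch (plain Python, not the Flink API; the grouping model is simplified):

```python
# With slot sharing, one slot can host one subtask of every task in the same
# sharing group, so a group needs as many slots as its most parallel task.
# The job's total slot demand is the sum over its sharing groups.
def required_slots(parallelism_by_group):
    """parallelism_by_group maps group name -> list of task parallelisms."""
    return sum(max(ps) for ps in parallelism_by_group.values())

# Cam's job: 13 tasks with parallelism 5, all in the default sharing group.
print(required_slots({"default": [5] * 13}))  # -> 5 used slots, 55 free
# Moving some tasks into a second group (slotSharingGroup("other") in Flink)
# raises the slot usage accordingly.
print(required_slots({"default": [5] * 7, "other": [5] * 6}))  # -> 10
```

This matches the point above: used slots track the maximum task parallelism, not the total subtask count.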

Abhishek Jain  wrote on Mon, Aug 12, 2019 at 1:23 PM:

> What you're seeing is likely operator chaining. This is the default
> behaviour of grouping sub-tasks to avoid transfer overhead (from one slot
> to another). You can disable chaining if you need to. Please refer to the
> "task and operator chains" section of the documentation.
>
> - Abhishek
>
> On Mon, 12 Aug 2019 at 09:56, Cam Mach  wrote:
>
>> Hello Flink expert,
>>
>> I have a cluster with 10 Task Managers, configured with 6 task slots each,
>> and a pipeline that has 13 tasks/operators with a parallelism of 5. But when
>> running the pipeline I observe that only 5 slots are being used; the
>> other 55 slots are available/free. It should use all of my slots, right,
>> since I have 13 (tasks) x 5 = 65 sub-tasks? What configuration did I miss
>> in order to leverage all of the available slots for my pipelines?
>>
>> Thanks,
>> Cam
>>
>>
>


Re: Why available task slots are not leveraged for pipeline?

2019-08-11 Thread Abhishek Jain
What you're seeing is likely operator chaining. This is the default
behaviour of grouping sub-tasks to avoid transfer overhead (from one slot
to another). You can disable chaining if you need to. Please refer to the
"task and operator chains" section of the documentation.

- Abhishek
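As a toy illustration of the chaining rule described above (my own simplified model, not Flink internals): consecutive operators fuse into one task when they are connected by a forward edge and have equal parallelism.

```python
def chain(operators):
    """operators: list of (name, parallelism, forward_edge_to_next) tuples."""
    chains, current = [], [operators[0][0]]
    for (name, par, fwd), (nxt_name, nxt_par, _) in zip(operators, operators[1:]):
        if fwd and par == nxt_par:
            current.append(nxt_name)   # fuse into the current chain
        else:
            chains.append(current)     # chain boundary: start a new task
            current = [nxt_name]
    chains.append(current)
    return chains

# A keyBy between map and window breaks the forward edge, so the chain splits.
ops = [("source", 5, True), ("map", 5, False), ("window", 5, True), ("sink", 5, True)]
print(chain(ops))  # -> [['source', 'map'], ['window', 'sink']]
```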

On Mon, 12 Aug 2019 at 09:56, Cam Mach  wrote:

> Hello Flink expert,
>
> I have a cluster with 10 Task Managers, configured with 6 task slots each,
> and a pipeline that has 13 tasks/operators with a parallelism of 5. But when
> running the pipeline I observe that only 5 slots are being used; the
> other 55 slots are available/free. It should use all of my slots, right,
> since I have 13 (tasks) x 5 = 65 sub-tasks? What configuration did I miss
> in order to leverage all of the available slots for my pipelines?
>
> Thanks,
> Cam
>
>


Re: Kafka ProducerFencedException after checkpointing

2019-08-11 Thread Tony Wei
Hi,

I had the same exception recently. I want to confirm that if it is due to a
transaction timeout, then I will lose that data. Am I right? Can I make it
fall back to at-least-once semantics in this situation?

Best,
Tony Wei
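For reference, the rule Piotr spells out in the quoted thread below (a transaction opened at the start of checkpoint N must survive until checkpoint N+1 completes, plus any restart downtime) can be sketched as a quick sanity check. This is my own illustration, not a Flink utility, and the numbers are examples:

```python
def min_transaction_timeout_ms(checkpoint_interval_ms,
                               max_checkpoint_duration_ms,
                               max_restart_downtime_ms,
                               safety_factor=1.5):
    # Happy path: a transaction spans from the start of one checkpoint to the
    # completion of the next. On failure it must also outlive the restart.
    happy_path_ms = checkpoint_interval_ms + max_checkpoint_duration_ms
    return int((happy_path_ms + max_restart_downtime_ms) * safety_factor)

# 15 min checkpoint interval, checkpoints up to 5 min, ~10 min restart budget:
# a 15 min transaction.timeout.ms is far too small.
print(min_transaction_timeout_ms(15 * 60_000, 5 * 60_000, 10 * 60_000))  # -> 2700000
```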

Piotr Nowojski  wrote on Wed, Mar 21, 2018 at 10:28 PM:

> Hi,
>
> But that’s exactly the case: producer’s transaction timeout starts when
> the external transaction starts - but FlinkKafkaProducer011 keeps an active
> Kafka transaction for the whole period between checkpoints.
>
> As I wrote in the previous message:
>
> > in case of failure, your timeout must also be able to cover the
> additional downtime required for the successful job restart. Thus you
> should increase your timeout accordingly.
>
> I think that 15 minutes timeout is a way too small value. If your job
> fails because of some intermittent failure (for example worker
> crash/restart), you will only have a couple of minutes for a successful
> Flink job restart. Otherwise you will lose some data (because of the
> transaction timeouts).
>
> Piotrek
>
> On 21 Mar 2018, at 10:30, Dongwon Kim  wrote:
>
> Hi Piotr,
>
> Now my streaming pipeline is working without retries.
> I decreased Flink's checkpoint interval from 15min to 10min as you
> suggested [see screenshot_10min_ckpt.png].
>
> I thought that the producer's transaction timeout starts when the external
> transaction starts.
> The truth is that the producer's transaction timeout starts after the last
> external checkpoint is committed.
> Now that I have 15min for Producer's transaction timeout and 10min for
> Flink's checkpoint interval, and every checkpoint takes less than 5
> minutes, everything is working fine.
> Am I right?
>
> Anyway thank you very much for the detailed explanation!
>
> Best,
>
> Dongwon
>
>
>
> On Tue, Mar 20, 2018 at 8:10 PM, Piotr Nowojski 
> wrote:
>
>> Hi,
>>
>> Please increase transaction.timeout.ms to a greater value or decrease
>> Flink’s checkpoint interval, I’m pretty sure the issue here is that those
>> two values are overlapping. I think that’s even visible on the screenshots.
>> The first checkpoint started at 14:28:48 and ended at 14:30:43, while
>> the second one started at 14:45:53 and ended at 14:49:16. That gives you
>> minimal transaction duration of 15 minutes and 10 seconds, with maximal
>> transaction duration of 21 minutes.
>>
>> In HAPPY SCENARIO (without any failure and restarting), you should assume
>> that your timeout interval should cover with some safety margin the period
>> between start of a checkpoint and end of the NEXT checkpoint, since this is
>> the upper bound how long the transaction might be used. In your case at
>> least ~25 minutes.
>>
>> On top of that, as described in the docs,
>> https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/kafka.html#kafka-producers-and-fault-tolerance
>>  ,
>> in case of failure, your timeout must also be able to cover the additional
>> downtime required for the successful job restart. Thus you should increase
>> your timeout accordingly.
>>
>> Piotrek
>>
>>
>> On 20 Mar 2018, at 11:58, Dongwon Kim  wrote:
>>
>> Hi Piotr,
>>
>> We have set the producer's transaction.timeout.ms to 15 minutes and have
>> used the default setting for the broker (15 mins).
>> As Flink's checkpoint interval is 15 minutes, it is not a situation where
>> Kafka's timeout is smaller than Flink's checkpoint interval.
>> As our first checkpoint takes just 2 minutes, it seems like the transaction
>> is not committed properly.
>>
>> Best,
>>
>> - Dongwon
>>
>>
>>
>>
>>
>> On Tue, Mar 20, 2018 at 6:32 PM, Piotr Nowojski 
>> wrote:
>>
>>> Hi,
>>>
>>> What’s your Kafka’s transaction timeout setting? Please both check Kafka
>>> producer configuration (transaction.timeout.ms property) and Kafka
>>> broker configuration. The most likely cause of such an error message is
>>> that Kafka's timeout is smaller than Flink's checkpoint interval and
>>> transactions are not committed quickly enough before the timeout occurs.
>>>
>>> Piotrek
>>>
>>> On 17 Mar 2018, at 07:24, Dongwon Kim  wrote:
>>>
>>>
>>> Hi,
>>>
>>> I'm faced with the following ProducerFencedException after 1st, 3rd,
>>> 5th, 7th, ... checkpoints:
>>>
>>> --
>>>
>>> java.lang.RuntimeException: Error while confirming checkpoint
>>> at org.apache.flink.runtime.taskmanager.Task$3.run(Task.java:1260)
>>> at 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer 
>>> attempted an operation with an old epoch. Either there is a newer producer 
>>> with the same transactionalId, or the producer's transaction has been 
>>> expired by the broker.
>>>
>>>

Why available task slots are not leveraged for pipeline?

2019-08-11 Thread Cam Mach
Hello Flink expert,

I have a cluster with 10 Task Managers, configured with 6 task slots each,
and a pipeline that has 13 tasks/operators with a parallelism of 5. But when
running the pipeline I observe that only 5 slots are being used; the
other 55 slots are available/free. It should use all of my slots, right,
since I have 13 (tasks) x 5 = 65 sub-tasks? What configuration did I miss
in order to leverage all of the available slots for my pipelines?

Thanks,
Cam


Why Job Manager die/restarted when Task Manager die/restarted?

2019-08-11 Thread Cam Mach
Hello Flink experts,

We are running Flink under Kubernetes and see that the Job Manager
dies/restarts whenever a Task Manager dies/restarts or they cannot connect
to each other. Are there any specific configurations/parameters that we
need to turn on to stop this? Or is this expected?

Thanks,
Cam


[ANNOUNCE] Weekly Community Update 2019/32

2019-08-11 Thread Konstantin Knauf
Dear community,

Happy to share this week's community update. The first real release
candidate of Apache Flink 1.9.0 has been published and the developer
community has moved more towards the 1.10 development cycle with more
discussion threads popping up again.

As always, please feel free to add additional updates and news to this
thread!

*Personal Note:* There will be no weekly community update in the next three
weeks as I will be on vacation :). The next update will be on the 8th of
September, which will then cover four weeks instead of one.

Flink Development
===

* [releases] The voting thread on *RC2 for Apache Flink 1.9.0* has just
started. It will be open for at least 72 hours. [1,2]

* [development] Chesnay has started a discussion on *splitting up Flink's
main repository* to tackle build time issues and an ever increasing number
of pull requests. So far, the majority of answers are in favor of keeping
the mono repo and tackling the problems differently via a more
sophisticated build system, project structure, etc. It seems these
alternatives deserve a more detailed investigation. [3]

* [development process] Stephan has raised the point of *post-feature-freeze
contributions* to Flink 1.9.0 in a discussion thread on the mailing list
last week. The thread also contains a few proposals, e.g. to formalize the
process, which will probably need to be discussed in separate threads in
the future. [4]

* [resource management] Xintong has started a discussion on FLIP-49 to
improve and unify the *memory configuration of TaskManagers*. Its goals are
to unify the memory configuration for batch and stream processing, to
automatically configure RocksDB, and to simplify the configuration of the
different memory pools. [5]

* [documentation] Marta has written a *style guide for our documentation*
based on best practices and other successful Open Source projects. More
feedback welcome! [6]

* [documentation] Fabian and I have recently contributed a *docker-based
playground* to the Getting Started section of the documentation. This led
to a follow-up discussion on how to publish the required Docker images. [7]

* [development process] The *style and quality guide* has finally been
published on the project website. [8]

* [sources] Zhijang has started a discussion on integrating the
*SourceReader* with the mailbox of the StreamTask, which is proposed as
part of FLIP-27 (Refactor Source Interface). [9]

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Apache-Flink-Release-1-9-0-release-candidate-2-tp31542.html
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Progress-updates-for-Apache-Flink-1-9-0-release-tp30565.html
[3]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Repository-split-tp31417.html
[4]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Merging-new-features-post-feature-freeze-tp31469.html
[5]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-49-Unified-Memory-Configuration-for-TaskExecutors-tp31436.html
[6]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Discuss-FLIP-42-Documentation-Style-Guide-tp31370.html
[7]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-Docker-Playgrounds-tp31496.html
[8]
https://flink.apache.org/contributing/code-style-and-quality-preamble.html
[9]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-new-SourceReader-with-Mailbox-Model-in-StreamTask-tp31549.html

Notable Bugs
===

* [FLINK-13586] [1.8.1] The signature of
org.apache.flink.api.java.ClosureCleaner#clean was changed between 1.8.0
and 1.8.1. Hence, no application calling this method compiled against Flink
1.8.0 can be run on a Flink 1.8.1 cluster. For example, this includes any
application using the Kafka or Cassandra Connector. [10]

* [FLINK-13537] [1.8.1] [1.7.2] [1.6.4] If an exactly-once Kafka Producer
is rescaled and the "pool size" (# of transactional ids per subtask) is
changed at the same time, the transactional ids of different subtasks might
overlap, which should never happen. Fortunately, users can easily avoid
such a scenario until this is fixed. [11]

[10] https://issues.apache.org/jira/browse/FLINK-13586
[11] https://issues.apache.org/jira/browse/FLINK-13537

Events, Blog Posts, Misc


* *Hequn* is now an Apache Flink Committer. Congratulations! [12]
* Upcoming Meetups
* *Roshan Naik* of Uber and *Sijie Guo* of StreamNative speak at the next
Seattle Flink Meetup on the 22nd of August about Kappa+ and Flink with
Pulsar, respectively. [13]
* *Luca Giovagnoli *of Yelp will speak at the ACM San Francisco meetup
about a machine learning application on Apache Flink. [14]
* There is a full day of Apache Pulsar talks at the Apache Pulsar
Meetup Beijing on the 17th of August. This also includes a talk on state
management in Apache Flink.

Apache flink 1.7.2 security issues

2019-08-11 Thread V N, Suchithra (Nokia - IN/Bangalore)
Hello,

We are using Apache Flink version 1.7.2. During our security scans, the
following issues were reported by our scan tool. Please let us know your
comments on these issues.

[1] 150085 Slow HTTP POST vulnerability
Severity Potential Vulnerability - Level 3
Group Information Disclosure

Threat
The web application is possibly vulnerable to a "slow HTTP POST" Denial of
Service (DoS) attack. This is an application-level DoS that consumes server
resources by maintaining open connections for an extended period of time by
slowly sending traffic to the server. If the server maintains too many
connections open at once, then it may not be able to respond to new,
legitimate connections.

#1 Request
Payload N/A
Request POST https://:/
#1 Host: :
#3 Accept: */*
#4 Content-Type: application/x-www-form-urlencoded

#1 Response
Vulnerable to slow HTTP POST attack
Connection with partial POST body remained open for: 312932 milliseconds

[2] 150124 Clickjacking - Framable Page (10)
Severity Confirmed Vulnerability - Level 3
Group Information Disclosure
CVSS Base 6.4 CVSS Temporal5.8

Threat
The web page can be framed. This means that clickjacking attacks against users 
are possible.

#1 Request
Payload N/A
Request GET https://:/
#1 Host: :
#3 Accept: */*

#1 Response
The URI was framed.

The URLs below have also reported the same issues, and the response was the same.

Request GET 
https://:/partials/jobs/running-jobs.html
Request GET 
https://:/partials/submit.html
Request GET 
https://:/partials/jobmanager/stdout.html
Request GET 
https://:/partials/jobs/completed-jobs.html
Request GET 
https://:/partials/taskmanager/index.html
Request GET 
https://:/partials/jobmanager/log.html
Request GET 
https://:/partials/jobmanager/index.html
Request GET 
https:///partials/overview.html
Request GET 
https://:/partials/jobmanager/config.html

[3] 150162 Use of JavaScript Library with Known Vulnerability (4)

Threat
The web application is using a JavaScript library that is known to contain at 
least one vulnerability.

#1 Request
Payload -
Request GET https://:/
#1 Host: :
#3 Accept: */*

#1 Response
Vulnerable javascript library: jQuery
version: 2.2.0
Details:
CVE-2015-9251: jQuery versions on or above 1.4.0 and below 1.12.0 (versions
1.12.3 and above but below 3.0.0-beta1 as well) are vulnerable to XSS via 3rd
party text/javascript responses (a 3rd party
CORS request may execute). (https://github.com/jquery/jquery/issues/2432).
Solution: jQuery version 3.0.0 has been released to address the issue 
(http://blog.jquery.com/2016/01/08/jquery-2-2-and-1-12-released/). Please refer 
to vendor documentation (https://blog.jquery.com/)
for the latest security updates.

Found on the following pages (only first 10 pages are reported):
https://:/
https://:/#/completed-jobs
https://:/#/jobmanager/config
https://:/#/overview
https://:/#/running-jobs
https://:/#/submit
https://:/#/taskmanagers
https://:/#/jobmanager/log
https://:/#/jobmanager/stdout
https://:/#/taskmanager/100474b27dcd8eeb9f3ff38c952977c9/log


#1 Response
Vulnerable javascript library: Angular
version: 1.4.8
Details:
In Angular versions below 1.6.5, both Firefox and Safari are vulnerable to XSS
in $sanitize if an inert document created via
`document.implementation.createHTMLDocument()` is used. Angular version
1.6.5 checks for these vulnerabilities and then uses a DOMParser or XHR
strategy if needed. Please refer to vendor documentation
(https://github.com/angular/angular.js/commit/
8f31f1ff43b673a24f84422d5c13d6312b2c4d94) for the latest security updates.
Found on the following pages (only first 10 pages are reported):
https://:/
https://:/#/completed-jobs
https://:/#/jobmanager/config
https://:/#/overview
https://:/#/running-jobs
https://:/#/submit
https://:/#/taskmanagers
https://:/#/jobmanager/log
https://:/#/jobmanager/stdout
https://:/#/taskmanager/100474b27dcd8eeb9f3ff38c952977c9/log

#1 Response
Vulnerable javascript library: Bootstrap
version: 3.3.6
Details:
The data-target attribute in Bootstrap versions below 3.4.0 is vulnerable to
Cross-Site Scripting (XSS) attacks. Please refer to vendor documentation
(https://github.com/twbs/bootstrap/pull/23687, https://
github.com/twbs/bootstrap/issues/20184) for the latest security updates.
--
CVE-2019-8331: In bootstrap versions before 3.4.1, data-template, data-content 
and data-title properties of too

Re: How many task managers can Flink efficiently scale to?

2019-08-11 Thread Zhu Zhu
Hi Chad,

We have (Blink) jobs each running with over 10 thousand TMs.
In our experience, the main regression caused by large-scale TMs is in the
TM allocation stage in the ResourceManager: sometimes it fails to
allocate enough TMs before the allocation timeout.
It does not deteriorate much once the Flink cluster has reached a stable
state.

The main load, in my mind, increases with the task scale and edge scale of
a submitted job.
The JM can be overwhelmed by frequent and slow GCs caused by task deployment
if the JM memory is not fine-tuned.
The JM can also become slower due to more RPCs to the JM main thread and the
increased computation complexity of each RPC handling.

Thanks,
Zhu Zhu

qi luo  wrote on Sun, Aug 11, 2019 at 6:17 PM:

> Hi Chad,
>
> In our cases, 1~2k TMs with up to ~10k TM slots are used in one job. In
> general, the CPU/memory of Job Manager should be increased with more TMs.
>
> Regards,
> Qi
>
> > On Aug 11, 2019, at 2:03 AM, Chad Dombrova  wrote:
> >
> > Hi,
> > I'm still on my task management investigation, and I'm curious to know
> how many task managers people are reliably using with Flink.  We're
> currently using AWS | Thinkbox Deadline, and we're able to easily utilize
> over 300 workers, and I've heard from other customers who use several
> thousand, so I'm curious how Flink compares in this regard.  Also, what
> aspects of the system begin to deteriorate at higher scales?
> >
> > thanks in advance!
> >
> > -chad
> >
>
>


Re: How many task managers can Flink efficiently scale to?

2019-08-11 Thread qi luo
Hi Chad,

In our cases, 1~2k TMs with up to ~10k TM slots are used in one job. In 
general, the CPU/memory of Job Manager should be increased with more TMs.

Regards,
Qi

> On Aug 11, 2019, at 2:03 AM, Chad Dombrova  wrote:
> 
> Hi,
> I'm still on my task management investigation, and I'm curious to know how 
> many task managers people are reliably using with Flink.  We're currently 
> using AWS | Thinkbox Deadline, and we're able to easily utilize over 300 
> workers, and I've heard from other customers who use several thousand, so I'm 
> curious how Flink compares in this regard.  Also, what aspects of the system 
> begin to deteriorate at higher scales?
> 
> thanks in advance!
> 
> -chad
>