[jira] [Created] (FLINK-10836) Add repositories and pluginRepositories in root pom.xml

2018-11-08 Thread bjkonglu (JIRA)
bjkonglu created FLINK-10836:


 Summary: Add repositories and pluginRepositories in root pom.xml 
 Key: FLINK-10836
 URL: https://issues.apache.org/jira/browse/FLINK-10836
 Project: Flink
  Issue Type: Improvement
  Components: Build System
Affects Versions: 1.8.0
Reporter: bjkonglu


h3. Background

When developers build the Flink project with Maven, they may encounter 
dependency problems where Maven cannot download dependencies from the remote repository.
h3. Analysis

One way to resolve this problem is to add remote Maven repositories to the root 
pom.xml, for example:
{code:java}
<repositories>
  <repository>
    <id>alimaven</id>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
  </repository>
  <repository>
    <id>central</id>
    <name>Maven Repository</name>
    <url>https://repo.maven.apache.org/maven2</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
</repositories>

<pluginRepositories>
  <pluginRepository>
    <id>alimaven</id>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </pluginRepository>
  <pluginRepository>
    <id>central</id>
    <url>https://repo.maven.apache.org/maven2</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </pluginRepository>
</pluginRepositories>
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: how get job id which job run slot

2018-11-08 Thread vino yang
Hi lining,

Yes, currently you can't get slot information via the
"/taskmanagers/:taskmanagerid" REST API.

In addition, please ask such questions on the user mailing list. The dev mailing
list mainly discusses topics related to Flink development.

Thanks, vino.

lining jing  wrote on Fri, Nov 9, 2018 at 5:42 AM:

> Hi, dev. I have a question. The REST API can currently return TaskManagerInfo, but
> it cannot return information about the tasks running in each slot.
>


[jira] [Created] (FLINK-10835) Remove duplicated Round-robin ChannelSelector implementation in RecordWriterTest

2018-11-08 Thread zhijiang (JIRA)
zhijiang created FLINK-10835:


 Summary: Remove duplicated Round-robin ChannelSelector 
implementation in RecordWriterTest
 Key: FLINK-10835
 URL: https://issues.apache.org/jira/browse/FLINK-10835
 Project: Flink
  Issue Type: Sub-task
  Components: Network
Affects Versions: 1.8.0
Reporter: zhijiang
Assignee: zhijiang


{{RoundRobinChannelSelector}} exists as the default selector in {{RecordWriter}}, 
mainly for tests. Another similar {{RoundRobin}} implementation exists in 
{{RecordWriterTest}}, only because of a difference in the starting channel index 
for the round-robin selection.

We can adjust the test verification logic to match the behavior of 
{{RoundRobinChannelSelector}}, and then remove the duplicated {{RoundRobin}}.

This will simplify the follow-up work in 
[FLINK-10622|https://issues.apache.org/jira/browse/FLINK-10662]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10834) TableAPI flatten calculated value error

2018-11-08 Thread sunjincheng (JIRA)
sunjincheng created FLINK-10834:
---

 Summary: TableAPI flatten calculated value error
 Key: FLINK-10834
 URL: https://issues.apache.org/jira/browse/FLINK-10834
 Project: Flink
  Issue Type: Bug
  Components: Table API & SQL
Reporter: sunjincheng
 Fix For: 1.7.1


We have a UDF as follows:

object FuncRow extends ScalarFunction {
  def eval(v: Int): Row = {
    val version = "" + new Random().nextInt()
    val row = new Row(3)
    row.setField(0, version)
    row.setField(1, version)
    row.setField(2, version)
    row
  }

  override def isDeterministic: Boolean = false

  override def getResultType(signature: Array[Class[_]]): TypeInformation[_] =
    Types.ROW(Types.STRING, Types.STRING, Types.STRING)
}





...

val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
val ds = env.fromCollection(data).toTable(tEnv, 'a, 'b,'c)
 .select(FuncRow('a).flatten()).as('v1, 'v2, 'v3)

...



The result is: -1189206469, -151367792, 1988676906

The result expected by the user is v1 == v2 == v3.

It looks like the real cause is that the UDF result is not reused in the generated 
code, so the non-deterministic function is evaluated once per flattened field.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-11-08 Thread Zhang, Xuefu
Hi Piotr,

That seems to be a good idea!

Since the google doc for the design is currently under extensive review, I will 
leave it as it is for now. However, I'll convert it to two different FLIPs when 
the time comes.

How does it sound to you?

Thanks,
Xuefu


--
Sender:Piotr Nowojski 
Sent at:2018 Nov 9 (Fri) 02:31
Recipient:dev 
Cc:Bowen Li ; Xuefu ; Shuyi Chen 

Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi,

Maybe we should split this topic (and the design doc) into a couple of smaller 
ones, hopefully independent. The questions that you have asked, Fabian, for 
example have very little to do with reading metadata from the Hive Meta Store?

Piotrek 

> On 7 Nov 2018, at 14:27, Fabian Hueske  wrote:
> 
> Hi Xuefu and all,
> 
> Thanks for sharing this design document!
> I'm very much in favor of restructuring / reworking the catalog handling in
> Flink SQL as outlined in the document.
> Most changes described in the design document seem to be rather general and
> not specifically related to the Hive integration.
> 
> IMO, there are some aspects, especially those at the boundary of Hive and
> Flink, that need a bit more discussion. For example
> 
> * What does it take to make Flink schema compatible with Hive schema?
> * How will Flink tables (descriptors) be stored in HMS?
> * How do both Hive catalogs differ? Could they be integrated into a
> single one? When to use which one?
> * What meta information is provided by HMS? What of this can be leveraged
> by Flink?
> 
> Thank you,
> Fabian
> 
> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li :
> 
>> After taking a look at how other discussion threads work, I think it's
>> actually fine just keep our discussion here. It's up to you, Xuefu.
>> 
>> The google doc LGTM. I left some minor comments.
>> 
>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li  wrote:
>> 
>>> Hi all,
>>> 
>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>> ... Hive integration design ..." on only dev mailing list for community
>>> devs to review. The current thread sends to both dev and user list.
>>> 
>>> This email thread is more like validating the general idea and direction
>>> with the community, and it's been pretty long and crowded so far. Since
>>> everyone is pro for the idea, we can move forward with another thread to
>>> discuss and finalize the design.
>>> 
>>> Thanks,
>>> Bowen
>>> 
>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu 
>>> wrote:
>>> 
 Hi Shuiyi,
 
 Good idea. Actually the PDF was converted from a google doc. Here is its
 link:
 
 https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
 Once we reach an agreement, I can convert it to a FLIP.
 
 Thanks,
 Xuefu
 
 
 
 --
 Sender:Shuyi Chen 
 Sent at:2018 Nov 1 (Thu) 02:47
 Recipient:Xuefu 
 Cc:vino yang ; Fabian Hueske ;
 dev ; user 
 Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 
 Hi Xuefu,
 
 Thanks a lot for driving this big effort. I would suggest convert your
 proposal and design doc into a google doc, and share it on the dev mailing
 list for the community to review and comment with title like "[DISCUSS] ...
 Hive integration design ..." . Once approved,  we can document it as a FLIP
 (Flink Improvement Proposals), and use JIRAs to track the implementations.
 What do you think?
 
 Shuyi
 
 On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu 
 wrote:
 Hi all,
 
 I have also shared a design doc on Hive metastore integration that is
 attached here and also to FLINK-10556[1]. Please kindly review and share
 your feedback.
 
 
 Thanks,
 Xuefu
 
 [1] https://issues.apache.org/jira/browse/FLINK-10556
 --
 Sender:Xuefu 
 Sent at:2018 Oct 25 (Thu) 01:08
 Recipient:Xuefu ; Shuyi Chen <
 suez1...@gmail.com>
 Cc:yanghua1127 ; Fabian Hueske ;
 dev ; user 
 Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 
 Hi all,
 
 To wrap up the discussion, I have attached a PDF describing the
 proposal, which is also attached to FLINK-10556 [1]. Please feel free to
 watch that JIRA to track the progress.
 
 Please also let me know if you have additional comments or questions.
 
 Thanks,
 Xuefu
 
 [1] https://issues.apache.org/jira/browse/FLINK-10556
 
 
 --
 Sender:Xuefu 
 Sent at:2018 Oct 16 (Tue) 03:40
 Recipient:Shuyi Chen 
 Cc:yanghua1127 ; Fabian Hueske ;
 dev ; user 
 Subject:Re: [DI

[jira] [Created] (FLINK-10832) StreamExecutionEnvironment.execute() does not return when all sources end

2018-11-08 Thread Arnaud Linz (JIRA)
Arnaud Linz created FLINK-10832:
---

 Summary: StreamExecutionEnvironment.execute() does not return when 
all sources end
 Key: FLINK-10832
 URL: https://issues.apache.org/jira/browse/FLINK-10832
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 1.6.2, 1.5.5
Reporter: Arnaud Linz


In 1.5.5, 1.6.1, and 1.6.2 (it works in 1.6.0 and 1.3.2), the following code never ends:

 

    public void testFlink() throws Exception {
        // get the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // get input data
        final DataStreamSource<String> text = env.addSource(new SourceFunction<String>() {
            @Override
            public void run(final SourceContext<String> ctx) throws Exception {
                for (int count = 0; count < 5; count++) {
                    ctx.collect(String.valueOf(count));
                }
            }

            @Override
            public void cancel() {
            }
        });
        text.print().setParallelism(1);
        env.execute("Simple Test");
        // Never ends !
    }

 

This is critical for us, as we heavily rely on this "source exhaustion stop" 
mechanism to properly stop streaming applications from their own code, so it 
prevents us from upgrading to the latest Flink versions.

The log extract shows that the local cluster tried to shut down but could not, 
for no apparent reason:

 

{{[2018-11-07 11:11:13,837] INFO Source: Custom Source -> Sink: Print to Std. 
Out (1/1) (07ae66bef91de06205cf22a337ea1fe2) switched from DEPLOYING to 
RUNNING. (org.apache.flink.runtime.executiongraph.ExecutionGraph:1316)}}
{{[2018-11-07 11:11:13,840] INFO No state backend has been configured, using 
default (Memory / JobManager) MemoryStateBackend (data in heap memory / 
checkpoints to JobManager) (checkpoints: 'null', savepoints: 'null', 
asynchronous: TRUE, maxStateSize: 5242880) 
(org.apache.flink.streaming.runtime.tasks.StreamTask:230)}}
{{0}}
{{1}}
{{2}}
{{3}}
{{4}}
{{[2018-11-07 11:11:13,897] INFO Source: Custom Source -> Sink: Print to Std. 
Out (1/1) (07ae66bef91de06205cf22a337ea1fe2) switched from RUNNING to FINISHED. 
(org.apache.flink.runtime.taskmanager.Task:915)}}
{{[2018-11-07 11:11:13,897] INFO Freeing task resources for Source: Custom 
Source -> Sink: Print to Std. Out (1/1) (07ae66bef91de06205cf22a337ea1fe2). 
(org.apache.flink.runtime.taskmanager.Task:818)}}
{{[2018-11-07 11:11:13,898] INFO Ensuring all FileSystem streams are closed for 
task Source: Custom Source -> Sink: Print to Std. Out (1/1) 
(07ae66bef91de06205cf22a337ea1fe2) [FINISHED] 
(org.apache.flink.runtime.taskmanager.Task:845)}}
{{[2018-11-07 11:11:13,899] INFO Un-registering task and sending final 
execution state FINISHED to JobManager for task Source: Custom Source -> Sink: 
Print to Std. Out 07ae66bef91de06205cf22a337ea1fe2. 
(org.apache.flink.runtime.taskexecutor.TaskExecutor:1337)}}
{{[2018-11-07 11:11:13,904] INFO Source: Custom Source -> Sink: Print to Std. 
Out (1/1) (07ae66bef91de06205cf22a337ea1fe2) switched from RUNNING to FINISHED. 
(org.apache.flink.runtime.executiongraph.ExecutionGraph:1316)}}
{{[2018-11-07 11:11:13,907] INFO Job Simple Test 
(0ef8697ca98f6d2b565ed928d17c8a49) switched from state RUNNING to FINISHED. 
(org.apache.flink.runtime.executiongraph.ExecutionGraph:1356)}}
{{[2018-11-07 11:11:13,908] INFO Stopping checkpoint coordinator for job 
0ef8697ca98f6d2b565ed928d17c8a49. 
(org.apache.flink.runtime.checkpoint.CheckpointCoordinator:320)}}
{{[2018-11-07 11:11:13,908] INFO Shutting down 
(org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore:102)}}
{{[2018-11-07 11:11:23,579] INFO Shutting down Flink Mini Cluster 
(org.apache.flink.runtime.minicluster.MiniCluster:427)}}
{{[2018-11-07 11:11:23,580] INFO Shutting down rest endpoint. 
(org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint:265)}}
{{[2018-11-07 11:11:23,582] INFO Stopping TaskExecutor 
akka://flink/user/taskmanager_0. 
(org.apache.flink.runtime.taskexecutor.TaskExecutor:291)}}
{{[2018-11-07 11:11:23,583] INFO Shutting down 
TaskExecutorLocalStateStoresManager. 
(org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager:213)}}
{{[2018-11-07 11:11:23,591] INFO I/O manager removed spill file directory 
C:\Users\alinz\AppData\Local\Temp\flink-io-6a19871b-3b86-4a47-9b82-28eef7e55814 
(org.apache.flink.runtime.io.disk.iomanager.IOManager:110)}}
{{[2018-11-07 11:11:23,591] INFO Shutting down the network environment and its 
components. (org.apache.flink.runtime.io.network.NetworkEnvironment:344)}}
{{[2018-11-07 11:11:23,591] INFO Removing cache directory 
C:\Users\alinz\AppData\Local\Temp\flink-web-ui 
(org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint:733)}}
{{[2018-11-07 11:11:23,593] I

[jira] [Created] (FLINK-10829) Extend FlinkDistribution to support running jars

2018-11-08 Thread Chesnay Schepler (JIRA)
Chesnay Schepler created FLINK-10829:


 Summary: Extend FlinkDistribution to support running jars
 Key: FLINK-10829
 URL: https://issues.apache.org/jira/browse/FLINK-10829
 Project: Flink
  Issue Type: Improvement
  Components: Tests
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.8.0


To facilitate better adoption of the {{FlinkDistribution}} we should add 
support for running jars.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10819) The instability problem of CI, JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure test fail.

2018-11-08 Thread sunjincheng (JIRA)
sunjincheng created FLINK-10819:
---

 Summary: The instability problem of CI, 
JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure test fail.
 Key: FLINK-10819
 URL: https://issues.apache.org/jira/browse/FLINK-10819
 Project: Flink
  Issue Type: Test
  Components: Tests
Reporter: sunjincheng
 Fix For: 1.7.1


The following error was found during a CI run:

Results :
Tests in error: 
 JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure:331 » 
IllegalArgument
Tests run: 1463, Failures: 0, Errors: 1, Skipped: 29
18:40:55.828 [INFO] 

18:40:55.829 [INFO] BUILD FAILURE
18:40:55.829 [INFO] 

18:40:55.830 [INFO] Total time: 30:19 min
18:40:55.830 [INFO] Finished at: 2018-11-07T18:40:55+00:00
18:40:56.294 [INFO] Final Memory: 92M/678M
18:40:56.294 [INFO] 

18:40:56.294 [WARNING] The requested profile "include-kinesis" could not be 
activated because it does not exist.
18:40:56.295 [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (integration-tests) 
on project flink-tests_2.11: There are test failures.
18:40:56.295 [ERROR] 
18:40:56.295 [ERROR] Please refer to 
/home/travis/build/sunjincheng121/flink/flink-tests/target/surefire-reports for 
the individual test results.
18:40:56.295 [ERROR] -> [Help 1]
18:40:56.295 [ERROR] 
18:40:56.295 [ERROR] To see the full stack trace of the errors, re-run Maven 
with the -e switch.
18:40:56.295 [ERROR] Re-run Maven using the -X switch to enable full debug 
logging.
18:40:56.295 [ERROR] 
18:40:56.295 [ERROR] For more information about the errors and possible 
solutions, please read the following articles:
18:40:56.295 [ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
MVN exited with EXIT CODE: 1.
Trying to KILL watchdog (11329).
./tools/travis_mvn_watchdog.sh: line 269: 11329 Terminated watchdog
PRODUCED build artifacts.

But after rerunning, the error disappeared.

Currently, no specific cause has been found; we will continue to keep an eye on it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


how get job id which job run slot

2018-11-08 Thread lining jing
Hi, dev. I have a question. The REST API can currently return TaskManagerInfo, but it
cannot return information about the tasks running in each slot.


[jira] [Created] (FLINK-10820) Simplify the RebalancePartitioner implementation

2018-11-08 Thread zhijiang (JIRA)
zhijiang created FLINK-10820:


 Summary: Simplify the RebalancePartitioner implementation
 Key: FLINK-10820
 URL: https://issues.apache.org/jira/browse/FLINK-10820
 Project: Flink
  Issue Type: Sub-task
  Components: Network
Affects Versions: 1.8.0
Reporter: zhijiang
Assignee: zhijiang


The current {{RebalancePartitioner}} implementation seems a little hacky: it selects 
a random number as the first channel index, and the following selections proceed 
from that random index in round-robin fashion.

We can define a constant as the first channel index to make the implementation 
simpler and more readable. Doing so does not change the rebalance semantics.
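
For illustration, a minimal sketch of the intended behavior (not the actual Flink class): round-robin selection that starts from a fixed index instead of a random one.

{code:java}
// Minimal sketch, not the actual RebalancePartitioner: round-robin channel
// selection starting from a constant, so the first call always returns channel 0.
class SimpleRoundRobinSelector {
    private int nextChannel = -1;

    int selectChannel(int numberOfChannels) {
        nextChannel = (nextChannel + 1) % numberOfChannels;
        return nextChannel;
    }
}
{code}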



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10822) Configurable MetricQueryService interval

2018-11-08 Thread Chesnay Schepler (JIRA)
Chesnay Schepler created FLINK-10822:


 Summary: Configurable MetricQueryService interval
 Key: FLINK-10822
 URL: https://issues.apache.org/jira/browse/FLINK-10822
 Project: Flink
  Issue Type: Improvement
  Components: Metrics
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.8.0


The {{MetricQueryService}} is used for transmitting metrics from TaskManagers 
to the JobManager, in order to expose them via REST and by extension the WebUI.

By default the JM will poll metrics at most every 10 seconds. This has an 
adverse effect on the duration of our end-to-end tests, which for example query 
metrics via the REST API to determine whether the cluster has started. If 
during the first poll no TM is available, it will take another 10 seconds for 
updated information to be available.

By making this interval configurable we could reduce the test duration. 
Additionally, this could serve as a switch to disable the {{MetricQueryService}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10828) Enforce that all TypeSerializers are tested through SerializerTestBase

2018-11-08 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-10828:
--

 Summary: Enforce that all TypeSerializers are tested through 
SerializerTestBase
 Key: FLINK-10828
 URL: https://issues.apache.org/jira/browse/FLINK-10828
 Project: Flink
  Issue Type: Test
  Components: Tests
Reporter: Stefan Richter


As pointed out in FLINK-10827, type serializers are a common source of bugs and 
we should try to enforce that every type serializer (that is not exclusive to 
tests) is tested at least through a test that extends the 
{{SerializerTestBase}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10826) Heavy deployment end-to-end test produces no output on Travis

2018-11-08 Thread Timo Walther (JIRA)
Timo Walther created FLINK-10826:


 Summary: Heavy deployment end-to-end test produces no output on 
Travis
 Key: FLINK-10826
 URL: https://issues.apache.org/jira/browse/FLINK-10826
 Project: Flink
  Issue Type: Bug
  Components: E2E Tests
Reporter: Timo Walther


The Heavy deployment end-to-end test produces no output on Travis such that it 
is killed after 10 minutes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10818) RestartStrategies.fixedDelayRestart Occur NoResourceAvailableException: Not enough free slots available to run the job.

2018-11-08 Thread ambition (JIRA)
ambition created FLINK-10818:


 Summary: RestartStrategies.fixedDelayRestart Occur  
NoResourceAvailableException: Not enough free slots available to run the job.
 Key: FLINK-10818
 URL: https://issues.apache.org/jira/browse/FLINK-10818
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 1.6.2
 Environment: JDK 1.8

Flink 1.6.0 

Hadoop 2.7.3
Reporter: ambition


Our online Flink-on-YARN job sets its restart strategy in code like this:
{code:java}
exeEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(5, 1000L));
{code}
After the job had been running for some days, the following exception occurred:
{code:java}
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not 
enough free slots available to run the job. You can decrease the operator 
parallelism or increase the number of slots per TaskManager in the 
configuration. Task to schedule: < Attempt #5 (Source: KafkaJsonTableSource -> 
Map -> where: (AND(OR(=(app_key, _UTF-16LE'C4FAF9CE1569F541'), =(app_key, 
_UTF-16LE'F5C7F68C7117630B'), =(app_key, _UTF-16LE'57C6FF4B5A064D29')), 
OR(=(LOWER(TRIM(FLAG(BOTH), _UTF-16LE' ', os_type)), _UTF-16LE'ios'), 
=(LOWER(TRIM(FLAG(BOTH), _UTF-16LE' ', os_type)), _UTF-16LE'android')), IS NOT 
NULL(server_id))), select: (MT_Date_Format_Mode(receive_time, 
_UTF-16LE'MMddHHmm', 10) AS date_p, LOWER(TRIM(FLAG(BOTH), _UTF-16LE' ', 
os_type)) AS os_type, MT_Date_Format_Mode(receive_time, _UTF-16LE'HHmm', 10) AS 
date_mm, server_id) (1/6)) @ (unassigned) - [SCHEDULED] > with groupID < 
cbc357ccb763df2852fee8c4fc7d55f2 > in sharing group < 
690dbad267a8ff37c8cb5e9dbedd0a6d >. Resources available to scheduler: Number of 
instances=6, total number of slots=6, available slots=0
   at 
org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:281)
   at 
org.apache.flink.runtime.jobmanager.scheduler.Scheduler.allocateSlot(Scheduler.java:155)
   at 
org.apache.flink.runtime.executiongraph.Execution.lambda$allocateAndAssignSlotForExecution$2(Execution.java:491)
   at 
org.apache.flink.runtime.executiongraph.Execution$$Lambda$44/1664178385.apply(Unknown
 Source)
   at 
java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:981)
   at 
java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2116)
   at 
org.apache.flink.runtime.executiongraph.Execution.allocateAndAssignSlotForExecution(Execution.java:489)
   at 
org.apache.flink.runtime.executiongraph.ExecutionJobVertex.allocateResourcesForAll(ExecutionJobVertex.java:521)
   at 
org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleEager(ExecutionGraph.java:945)
   at 
org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:875)
   at 
org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1262)
   at 
org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
   at 
org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
   at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
{code}
 

This exception happened when the job started. It seems related to

https://issues.apache.org/jira/browse/FLINK-4486

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10823) Missing scala suffixes

2018-11-08 Thread Chesnay Schepler (JIRA)
Chesnay Schepler created FLINK-10823:


 Summary: Missing scala suffixes
 Key: FLINK-10823
 URL: https://issues.apache.org/jira/browse/FLINK-10823
 Project: Flink
  Issue Type: Bug
  Components: Batch Connectors and Input/Output Formats, Metrics
Affects Versions: 1.7.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.7.0


The jdbc connector and the jmx/prometheus reporters have provided dependencies on 
scala-infected modules (streaming-java/runtime) and thus also require a scala 
suffix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10830) Consider making processing time provider pluggable

2018-11-08 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-10830:
---

 Summary: Consider making processing time provider pluggable
 Key: FLINK-10830
 URL: https://issues.apache.org/jira/browse/FLINK-10830
 Project: Flink
  Issue Type: Improvement
Reporter: Andrey Zagrebin


At the moment, processing time is implemented in a fixed way as 
System.currentTimeMillis() and is not configurable by users.

If this implementation does not fit the application's business logic for some reason, 
there is no way for users to change it.

Examples:
 * The timestamp provided by currentTimeMillis is not guaranteed to be 
monotonically increasing. It can jump back for a while because of periodic 
synchronisation of the local clock with another, more accurate system. This can 
be a problem for application business logic if the general notion of time is 
that it always increases.
 * End-to-end tests are hard to implement because the synchronisation between the 
time in the test and the time in Flink is out of our control.

We can make this configurable and let users optionally set their own factory to 
create the processing time provider. All features which depend on querying the 
current processing time can use this implementation. The default can still be 
System.currentTimeMillis().
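
A minimal sketch of what such a pluggable provider could look like (the names below are hypothetical, not existing Flink API):

{code:java}
// Hypothetical sketch only; ProcessingTimeProvider and ProcessingTimeProviderFactory
// are illustrative names, not existing Flink interfaces.
interface ProcessingTimeProvider {
    long currentProcessingTime();
}

interface ProcessingTimeProviderFactory extends java.io.Serializable {
    ProcessingTimeProvider create();
}

// The default would stay equivalent to today's behaviour.
class SystemClockProvider implements ProcessingTimeProvider {
    @Override
    public long currentProcessingTime() {
        return System.currentTimeMillis();
    }
}
{code}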



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10831) Consider making processing time monotonically increasing by default

2018-11-08 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-10831:
---

 Summary: Consider making processing time monotonically increasing 
by default
 Key: FLINK-10831
 URL: https://issues.apache.org/jira/browse/FLINK-10831
 Project: Flink
  Issue Type: Improvement
Reporter: Andrey Zagrebin


At the moment, processing time is implemented in a fixed way as 
System.currentTimeMillis() and is not configurable by users.

The timestamp provided this way is not guaranteed to be monotonically 
increasing. It can jump back for a while because of periodic synchronisation of 
the local clock with another, more accurate system. This can be a problem for 
application business logic if the general notion of time is that it always 
increases.

We can change SystemProcessingTimeService to emit only timestamps that are not 
less than the latest emitted one, at least within the current JVM process.

This change in behaviour could also be made configurable, e.g. for users who rely 
on accurate wall-clock time.
Another option is that users who need monotonic processing time provide a custom 
processing time service, as suggested in FLINK-10830.
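
A minimal sketch of the clamping behaviour described above (illustrative only, not the actual SystemProcessingTimeService code):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: never emit a processing-time value smaller than the
// latest value emitted within this JVM process.
class MonotonicProcessingClock {
    private final AtomicLong lastEmitted = new AtomicLong(Long.MIN_VALUE);

    public long currentProcessingTime() {
        long now = System.currentTimeMillis();
        // max(previous, now) clamps backwards jumps of the wall clock
        return lastEmitted.accumulateAndGet(now, Math::max);
    }
}
{code}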



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-11-08 Thread Piotr Nowojski
Hi,

Maybe we should split this topic (and the design doc) into a couple of smaller 
ones, hopefully independent. The questions that you have asked, Fabian, for 
example have very little to do with reading metadata from the Hive Meta Store?

Piotrek 

> On 7 Nov 2018, at 14:27, Fabian Hueske  wrote:
> 
> Hi Xuefu and all,
> 
> Thanks for sharing this design document!
> I'm very much in favor of restructuring / reworking the catalog handling in
> Flink SQL as outlined in the document.
> Most changes described in the design document seem to be rather general and
> not specifically related to the Hive integration.
> 
> IMO, there are some aspects, especially those at the boundary of Hive and
> Flink, that need a bit more discussion. For example
> 
> * What does it take to make Flink schema compatible with Hive schema?
> * How will Flink tables (descriptors) be stored in HMS?
> * How do both Hive catalogs differ? Could they be integrated into a
> single one? When to use which one?
> * What meta information is provided by HMS? What of this can be leveraged
> by Flink?
> 
> Thank you,
> Fabian
> 
> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li :
> 
>> After taking a look at how other discussion threads work, I think it's
>> actually fine just keep our discussion here. It's up to you, Xuefu.
>> 
>> The google doc LGTM. I left some minor comments.
>> 
>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li  wrote:
>> 
>>> Hi all,
>>> 
>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>> ... Hive integration design ..." on only dev mailing list for community
>>> devs to review. The current thread sends to both dev and user list.
>>> 
>>> This email thread is more like validating the general idea and direction
>>> with the community, and it's been pretty long and crowded so far. Since
>>> everyone is pro for the idea, we can move forward with another thread to
>>> discuss and finalize the design.
>>> 
>>> Thanks,
>>> Bowen
>>> 
>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu 
>>> wrote:
>>> 
 Hi Shuiyi,
 
 Good idea. Actually the PDF was converted from a google doc. Here is its
 link:
 
 https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
 Once we reach an agreement, I can convert it to a FLIP.
 
 Thanks,
 Xuefu
 
 
 
 --
 Sender:Shuyi Chen 
 Sent at:2018 Nov 1 (Thu) 02:47
 Recipient:Xuefu 
 Cc:vino yang ; Fabian Hueske ;
 dev ; user 
 Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 
 Hi Xuefu,
 
 Thanks a lot for driving this big effort. I would suggest convert your
 proposal and design doc into a google doc, and share it on the dev mailing
 list for the community to review and comment with title like "[DISCUSS] ...
 Hive integration design ..." . Once approved,  we can document it as a FLIP
 (Flink Improvement Proposals), and use JIRAs to track the implementations.
 What do you think?
 
 Shuyi
 
 On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu 
 wrote:
 Hi all,
 
 I have also shared a design doc on Hive metastore integration that is
 attached here and also to FLINK-10556[1]. Please kindly review and share
 your feedback.
 
 
 Thanks,
 Xuefu
 
 [1] https://issues.apache.org/jira/browse/FLINK-10556
 --
 Sender:Xuefu 
 Sent at:2018 Oct 25 (Thu) 01:08
 Recipient:Xuefu ; Shuyi Chen <
 suez1...@gmail.com>
 Cc:yanghua1127 ; Fabian Hueske ;
 dev ; user 
 Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 
 Hi all,
 
 To wrap up the discussion, I have attached a PDF describing the
 proposal, which is also attached to FLINK-10556 [1]. Please feel free to
 watch that JIRA to track the progress.
 
 Please also let me know if you have additional comments or questions.
 
 Thanks,
 Xuefu
 
 [1] https://issues.apache.org/jira/browse/FLINK-10556
 
 
 --
 Sender:Xuefu 
 Sent at:2018 Oct 16 (Tue) 03:40
 Recipient:Shuyi Chen 
 Cc:yanghua1127 ; Fabian Hueske ;
 dev ; user 
 Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 
 Hi Shuyi,
 
 Thank you for your input. Yes, I agreed with a phased approach and like
 to move forward fast. :) We did some work internally on DDL utilizing babel
 parser in Calcite. While babel makes Calcite's grammar extensible, at
 first impression it still seems too cumbersome for a project when too
 much extensions are made. It's even challenging to find where the extension
 is needed! It would be certa

Re: [DISCUSS] Table API Enhancement Outline

2018-11-08 Thread jincheng sun
Hi all,

We are discussing this proposal in great detail. We are trying
to design the API from many angles (functionality, compatibility, ease of
use, etc.). I think this is a very good process; only with such a detailed
discussion can the PR be developed clearly and smoothly in the later
stage. I am very grateful to @Fabian and @Xiaowei for sharing a lot of
good ideas.
Regarding the definition of method signatures, I want to share the points here
that I am discussing with Fabian in the google doc (not yet completed), as
follows:

Assume we have a table:
val tab = util.addTable[(Long, String)]("MyTable", 'long, 'string,
'proctime.proctime)

Approach 1:
case1: Map follows Source Table
val result =
  tab.map(udf('string)).as('proctime, 'col1, 'col2)// proctime implied in
the output
  .window(Tumble over 5.millis on 'proctime as 'w)

case 2: FlatAgg follows Window (as Fabian mentioned above)
val result =
tab.window(Tumble ... as 'w)
   .groupBy('w, 'k1, 'k2) // 'w should be a group key.
   .flatAgg(tabAgg('a)).as('k1, 'k2, 'w, 'col1, 'col2)
   .select('k1, 'col1, 'w.rowtime as 'rtime)

Approach 2: Similar to Fabian's approach, where the result schema is clearly
defined, but with a built-in append UDF added. This makes the
map/flatMap/agg/flatAgg interfaces accept only one Expression.
val result =
tab.map(append(udf('string), 'long, 'proctime)) as ('col1, 'col2,
'long, 'proctime)
 .window(Tumble over 5.millis on 'proctime as 'w)

Note: append is a special built-in UDF that can pass through any
column.

So maybe we can define table.map(Expression) first, and if necessary extend it
to table.map(Expression*) in the future? Of course, I also hope that we can
further improve this proposal through discussion.

Thanks,
Jincheng





Xiaowei Jiang  wrote on Wed, Nov 7, 2018 at 11:45 PM:

> Hi Fabian,
>
> I think that the key question you raised is if we allow extra parameters in
> the methods map/flatMap/agg/flatAgg. I can see why allowing that may appear
> more convenient in some cases. However, it might also cause some confusions
> if we do that. For example, do we allow multiple UDFs in these expressions?
> If we do, the semantics may be weird to define, e.g. what does
> table.groupBy('k).flatAgg(TableAggA('a), TableAggB('b)) mean? Even though
> not allowing it may appear less powerful, but it can make things more
> intuitive too. In the case of agg/flatAgg, we can define the keys to be
> implied in the result table and appears at the beginning. You can use a
> select method if you want to modify this behavior. I think that eventually
> we will have some API which allows other expressions as additional
> parameters, but I think it's better to do that after we introduce the
> concept of nested tables. A lot of things we suggested here can be
> considered as special cases of that. But things are much simpler if we
> leave that to later.
>
> Regards,
> Xiaowei
>
> On Wed, Nov 7, 2018 at 5:18 PM Fabian Hueske  wrote:
>
> > Hi,
> >
> > * Re emit:
> > I think we should start with a well understood semantics of full
> > replacement. This is how the other agg functions work.
> > As was said before, there are open questions regarding an append mode
> > (checkpointing, whether supporting retractions or not and if yes how to
> > declare them, ...).
> > Since this seems to be an optimization, I'd postpone it.
> >
> > * Re grouping keys:
> > I don't think we should automatically add them because the result schema
> > would not be intuitive.
> > Would they be added at the beginning of the tuple or at the end? What
> > metadata fields of windows would be added? In which order would they be
> > added?
> >
> > However, we could support syntax like this:
> > val t: Table = ???
> > t
> >   .window(Tumble ... as 'w)
> >   .groupBy('a, 'b)
> >   .flatAgg('b, 'a, myAgg(row('*)), 'w.end as 'wend, 'w.rowtime as 'rtime)
> >
> > The result schema would be clearly defined as [b, a, f1, f2, ..., fn,
> wend,
> > rtime]. (f1, f2, ...fn) are the result attributes of the UDF.
> >
> > * Re Multi-staged evaluation:
> > I think this should be an optimization that can be applied if the UDF
> > implements the merge() method.
> >
> > Best, Fabian
> >
> > Am Mi., 7. Nov. 2018 um 08:01 Uhr schrieb Shaoxuan Wang <
> > wshaox...@gmail.com
> > >:
> >
> > > Hi xiaowei,
> > >
> > > Yes, I agree with you that the semantics of TableAggregateFunction emit
> > is
> > > much more complex than AggregateFunction. The fundamental difference is
> > > that TableAggregateFunction emits a "table" while AggregateFunction
> > outputs
> > > (a column of) a "row". In the case of AggregateFunction it only has one
> > > mode which is “replacing” (complete update). But for
> > > TableAggregateFunction, it could be incremental (only emit the new
> > updated
> > > results) update or complete update (always emit the entire table when
> > > “emit" is triggered).  From the performance perspective, we might want
> to
> > > use incremental update. But we nee

[jira] [Created] (FLINK-10827) Add test for duplicate() to SerializerTestBase

2018-11-08 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-10827:
--

 Summary: Add test for duplicate() to SerializerTestBase
 Key: FLINK-10827
 URL: https://issues.apache.org/jira/browse/FLINK-10827
 Project: Flink
  Issue Type: Test
  Components: Tests
Reporter: Stefan Richter
Assignee: Stefan Richter
 Fix For: 1.7.0


In the past, we had many bugs from type serializers that did not implement the 
{{duplicate()}} method properly. A very common error is forgetting to create a 
deep copy of some fields, which can lead to concurrency problems in the backend.

We should add a test case that exercises duplicated serializers from different 
threads to expose concurrency problems.
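
For illustration, a minimal standalone sketch of such a check (not the actual SerializerTestBase code): each thread serializes with its own duplicate() of the serializer, so a duplicate that shares mutable internal state with the original would surface as corrupted output or exceptions.

{code:java}
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.common.typeutils.base.StringSerializer;
import org.apache.flink.core.memory.DataOutputSerializer;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DuplicateConcurrencySketch {
    public static void main(String[] args) throws Exception {
        TypeSerializer<String> serializer = StringSerializer.INSTANCE;
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> results = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            // one duplicate per thread, as intended by the duplicate() contract
            TypeSerializer<String> duplicated = serializer.duplicate();
            results.add(pool.submit(() -> {
                DataOutputSerializer out = new DataOutputSerializer(64);
                for (int n = 0; n < 10_000; n++) {
                    duplicated.serialize("record-" + n, out);
                }
                return null;
            }));
        }
        for (Future<?> f : results) {
            f.get(); // propagates any failure from a worker thread
        }
        pool.shutdown();
    }
}
{code}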



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] FLIP-27: Refactor Source Interface

2018-11-08 Thread Becket Qin
Hi Piotrek,

> But I don’t see a reason why we should expose both blocking `take()` and
non-blocking `poll()` methods to the Flink engine. Someone (Flink engine or
connector) would have to do the same busy
> looping anyway and I think it would be better to have a simpler connector
API (that would solve our problems) and force connectors to comply one way
or another.

If we let the block happen inside the connector, the blocking does not have
to be a busy loop. For example, to do the block waiting efficiently, the
connector can use java NIO selector().select which relies on OS syscall
like epoll[1] instead of busy looping. But if Flink engine blocks outside
the connector, it pretty much has to do the busy loop. So if there is only
one API to get the element, a blocking getNextElement() makes more sense.
In any case, we should avoid ambiguity. It has to be crystal clear about
whether a method is expected to be blocking or non-blocking. Otherwise it
would be very difficult for Flink engine to do the right thing with the
connectors. At the first glance at getCurrent(), the expected behavior is
not quite clear.

That said, I do agree that functionality wise, poll() and take() kind of
overlap. But they are actually not quite different from
isBlocked()/getNextElement(). Compared with isBlocked(), the only
difference is that poll() also returns the next record if it is available.
But I agree that the isBlocked() + getNextElement() is more flexible as
users can just check the record availability, but not fetch the next
element.
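
For reference, here is a rough sketch of the two interface shapes being compared
(names simplified, not the actual proposed API):

import java.util.concurrent.CompletableFuture;

// Option A: availability signal plus non-blocking fetch; the engine drives the loop.
interface ReaderWithAvailability<T> {
    CompletableFuture<?> isBlocked();  // completes when data may be available
    T getNextElement();                // may return null/empty if nothing is available yet
}

// Option B: blocking and non-blocking fetch exposed side by side.
interface ReaderWithPollAndTake<T> {
    T take() throws InterruptedException;  // blocks until an element is available
    T poll();                              // returns null if nothing is available
}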

> In case of thread-less readers with only non-blocking `queue.poll()` (is
that really a thing? I can not think about a real implementation that
enforces such constraints)
Right, it is pretty much a syntax sugar to allow user combine the
check-and-take into one method. It could be achieved with isBlocked() +
getNextElement().

[1] http://man7.org/linux/man-pages/man7/epoll.7.html

Thanks,

Jiangjie (Becket) Qin

On Wed, Nov 7, 2018 at 11:58 PM Piotr Nowojski 
wrote:

> Hi Becket,
>
> With my proposal, both of your examples would have to be solved by the
> connector and solution to both problems would be the same:
>
> Pretend that connector is never blocked (`isBlocked() { return
> NOT_BLOCKED; }`) and implement `getNextElement()` in blocking fashion (or
> semi blocking with return of control from time to time to allow for
> checkpointing, network flushing and other resource management things to
> happen in the same main thread). In other words, exactly how you would
> implement `take()` method or how the same source connector would be
> implemented NOW with current source interface. The difference with current
> interface would be only that main loop would be outside of the connector,
> and instead of periodically releasing checkpointing lock, periodically
> `return null;` or `return Optional.empty();` from `getNextElement()`.
>
> In case of thread-less readers with only non-blocking `queue.poll()` (is
> that really a thing? I can not think about a real implementation that
> enforces such constraints), we could provide a wrapper that hides the busy
> looping. The same applies how to solve forever blocking readers - we could
> provider another wrapper running the connector in separate thread.
>
> But I don’t see a reason why we should expose both blocking `take()` and
> non-blocking `poll()` methods to the Flink engine. Someone (Flink engine or
> connector) would have to do the same busy looping anyway and I think it
> would be better to have a simpler connector API (that would solve our
> problems) and force connectors to comply one way or another.
>
> Piotrek
>
> > On 7 Nov 2018, at 10:55, Becket Qin  wrote:
> >
> > Hi Piotr,
> >
> > I might have misunderstood you proposal. But let me try to explain my
> > concern. I am thinking about the following case:
> > 1. a reader has the following two interfaces,
> >boolean isBlocked()
> >T getNextElement()
> > 2. the implementation of getNextElement() is non-blocking.
> > 3. The reader is thread-less, i.e. it does not have any internal thread.
> > For example, it might just delegate the getNextElement() to a
> queue.poll(),
> > and isBlocked() is just queue.isEmpty().
> >
> > How can Flink efficiently implement a blocking reading behavior with this
> > reader? Either a tight loop or a backoff interval is needed. Neither of
> > them is ideal.
> >
> > Now let's say in the reader mentioned above implements a blocking
> > getNextElement() method. Because there is no internal thread in the
> reader,
> > after isBlocked() returns false. Flink will still have to loop on
> > isBlocked() to check whether the next record is available. If the next
> > record reaches after 10 min, it is a tight loop for 10 min. You have
> > probably noticed that in this case, even isBlocked() returns a future,
> that
> > future() will not be completed if Flink does not call some method from
> the
> > reader, because the reader has no internal thread to complete that fu

[jira] [Created] (FLINK-10821) Resuming Externalized Checkpoint E2E test does not Resume from Externalized Checkpoint

2018-11-08 Thread Gary Yao (JIRA)
Gary Yao created FLINK-10821:


 Summary: Resuming Externalized Checkpoint E2E test does not Resume 
from Externalized Checkpoint
 Key: FLINK-10821
 URL: https://issues.apache.org/jira/browse/FLINK-10821
 Project: Flink
  Issue Type: Bug
  Components: E2E Tests
Affects Versions: 1.7.0
Reporter: Gary Yao
 Fix For: 1.7.0


The path to the externalized checkpoint is not passed as the {{-s}} argument:
https://github.com/apache/flink/blob/483507a65c7547347eaafb21a24967c470f94ed6/flink-end-to-end-tests/test-scripts/test_resume_externalized_checkpoints.sh#L128

That is, the test currently restarts the job without the checkpoint.
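
For reference, resuming from the externalized checkpoint would pass the path roughly like this (illustrative invocation, not the exact script line):

{code}
bin/flink run -s <path-to-externalized-checkpoint> <test-job-jar> [job args...]
{code}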



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Kinesis consumer e2e test

2018-11-08 Thread Stefan Richter
Hi,

I was also just planning to work on it before Stephan contacted Thomas to ask 
about this test.

Thomas, you are right about the structure, the test should also go into the 
`run-nightly-tests.sh`. What I was planning to do is a simple job that consists 
of a Kinesis consumer, a mapper that fails once after n records, and a kinesis 
producer. I was hoping that creation, filling, and validation of the Kinesis 
topics can be done with the Java API, not by invoking commands in a bash 
script. In general I would try to minimise the amount of scripting and do as 
much in Java as possible. It would also be nice if the test was generalised, 
e.g. that abstract Producer/Consumer are created from a Supplier and also the 
validation is done over some abstraction that lets us iterate over the produced 
output. Ideally, this would be a test that we can reuse for all 
Consumer/Producer cases and we could also port the tests for Kafka to that. 
What do you think?
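
For reference, a rough sketch of that generalisation (names are illustrative only, not existing test classes):

// Illustrative sketch: the end-to-end test is parameterised with the external
// system's source/sink and an accessor used to validate the produced output.
// SourceFunction/SinkFunction are Flink's streaming source/sink interfaces.
interface ExternalSystemTestContext<T> {
    SourceFunction<T> createConsumer(String streamName);
    SinkFunction<T> createProducer(String streamName);
    Iterable<T> readProducedRecords(String streamName);  // for result validation
}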

Best,
Stefan

> On 8. Nov 2018, at 07:22, Tzu-Li (Gordon) Tai  wrote:
> 
> Hi Thomas,
> 
> I think Stefan Richter is also working on the Kinesis end-to-end test, and
> seems to be planning to implement it against a real Kinesis service instead
> of Kinesalite.
> Perhaps efforts should be synced here.
> 
> Cheers,
> Gordon
> 
> 
> On Thu, Nov 8, 2018 at 1:38 PM Thomas Weise  wrote:
> 
>> Hi,
>> 
>> I'm planning to add an end-to-end test for the Kinesis consumer. We have
>> done something similar at Lyft, using Kinesalite, which can be run as
>> Docker container.
>> 
>> I see that some tests already make use of Docker, so we can assume it to be
>> present in the target environment(s)?
>> 
>> I also found the following ticket:
>> https://issues.apache.org/jira/browse/FLINK-9007
>> 
>> It suggest to also cover the producer, which may be a good way to create
>> the input data as well. The stream itself can be created with the Kinesis
>> Java SDK.
>> 
>> Following the existing layout, there would be a new module
>> flink-end-to-end-tests/flink-kinesis-test
>> 
>> Are there any suggestions or comments regarding this?
>> 
>> Thanks,
>> Thomas
>> 



Re: [VOTE] Release 1.7.0, release candidate #1

2018-11-08 Thread jincheng sun
Hi Till,

Today, while running CI before merging code, I found an unstable test case:
JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure fails.
I'm not sure whether this issue should block the 1.7 release, but I think
it's best to find the cause and fix it before releasing 1.7.
The corresponding JIRA issue is:
https://issues.apache.org/jira/browse/FLINK-10819

Thanks,
Jincheng

Till Rohrmann  wrote on Wed, Nov 7, 2018 at 10:13 PM:

> I hereby cancel the release vote because of the Scala suffix problems. I
> will create the next RC in the next days. Until then, please continue
> testing with the current release candidate.
>
> Cheers,
> Till
>
> On Wed, Nov 7, 2018 at 2:39 PM Till Rohrmann  wrote:
>
> > Thanks for spotting and addressing the Scala problem Chesnay. The
> > corresponding JIRA issue is
> > https://issues.apache.org/jira/browse/FLINK-10811.
> >
> > Cheers,
> > Till
> >
> > On Wed, Nov 7, 2018 at 12:36 PM Chesnay Schepler 
> > wrote:
> >
> >> This isn't quite correct (as test-scoped dependencies are not
> >> transitive, but all compile dependencies still are, even for the
> >> test-jar).
> >>
> >> But effectively this means we don't need additional rules for test-jars
> >> as compile dependencies already have to be taken care of separately from
> >> tests anyway.
> >>
> >> I'll open JIRA for the hcatalog issue and scan through the remaining
> >> modules for other violations.
> >>
> >> On 07.11.2018 11:46, Aljoscha Krettek wrote:
> >> > I looked into this issue and my conclusion was that test-jars don't
> >> pull in transitive dependencies when you depend on them. I verified this
> >> with an example maven project where I also verified that a test-jar
> built
> >> with Scala 2.12 works on a project that uses Scala 2.11.
> >> >
> >> > On the hcatalog connector: This is unfortunate and we should add the
> >> Scala suffix here. It's unfortunate since flink-hcatalog and
> >> flink-hadoop-compatibility wouldn't have to have a Scala suffix, they
> don't
> >> depend on any other suffixed dependencies; the only reason is that they
> >> themselves contain Scala code. This could have been avoided by putting
> the
> >> Scala code in a separate module.
> >> >
> >> > Aljoscha
> >> >
> >> >> On 7. Nov 2018, at 10:55, Chesnay Schepler 
> wrote:
> >> >>
> >> >> What was the conclusion in regards to modules requiring a
> scala-suffix
> >> if their test-jar depends on scala-infected modules? (Which basically
> >> affects all modules)
> >> >>
> >> >> Beyond that, the hcatalog connector has a dependency on
> >> flink-hadoop-compatibility_2.12, and should thus also have a scala
> suffix.
> >> There are probably other instances as well.
> >> >>
> >> >> On 05.11.2018 22:26, Till Rohrmann wrote:
> >> >>> Hi everyone,
> >> >>> Please review and vote on the release candidate #1 for the version
> >> 1.7.0,
> >> >>> as follows:
> >> >>> [ ] +1, Approve the release
> >> >>> [ ] -1, Do not approve the release (please provide specific
> comments)
> >> >>>
> >> >>>
> >> >>> The complete staging area is available for your review, which
> >> includes:
> >> >>> * JIRA release notes [1],
> >> >>> * the official Apache source release and binary convenience releases
> >> to be
> >> >>> deployed to dist.apache.org [2], which are signed with the key with
> >> >>> fingerprint 1F302569A96CFFD5 [3],
> >> >>> * all artifacts to be deployed to the Maven Central Repository [4],
> >> >>> * source code tag "release-1.7.0-rc1" [5],
> >> >>>
> >> >>> Please use this document for coordinating testing efforts: [6]
> >> >>>
> >> >>> The vote will be open for at least 72 hours. It is adopted by
> majority
> >> >>> approval, with at least 3 PMC affirmative votes.
> >> >>>
> >> >>> Thanks,
> >> >>> Till
> >> >>>
> >> >>> [1]
> >> >>>
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12343585
> >> >>> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.7.0/
> >> >>> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> >> >>> [4]
> >> https://repository.apache.org/content/repositories/orgapacheflink-1191
> >> >>> [5] https://github.com/apache/flink/tree/release-1.7.0-rc1
> >> >>> [6]
> >> >>>
> >>
> https://docs.google.com/document/d/12JY_Xyy6umGR1vvrBFbqtDvf6ZdAYSAiljchrnsMUZs/edit?usp=sharing
> >> >>>
> >> >>> Pro-tip: you can create a settings.xml file with these contents:
> >> >>>
> >> >>> <settings>
> >> >>>   <activeProfiles>
> >> >>>     <activeProfile>flink-1.7.0</activeProfile>
> >> >>>   </activeProfiles>
> >> >>>   <profiles>
> >> >>>     <profile>
> >> >>>       <id>flink-1.7.0</id>
> >> >>>       <repositories>
> >> >>>         <repository>
> >> >>>           <id>flink-1.7.0</id>
> >> >>>           <url>
> >> >>> https://repository.apache.org/content/repositories/orgapacheflink-1191/
> >> >>>           </url>
> >> >>>         </repository>
> >> >>>         <repository>
> >> >>>           <id>archetype</id>
> >> >>>           <url>
> >> >>> https://repository.apache.org/content/repositories/orgapacheflink-1191/
> >> >>>           </url>
> >> >>>         </repository>
> >> >>>       </repositories>
> >> >>>     </profile>
> >> >>>   </profiles>
> >> >>> </settings>
> >> >>>
> >> >>> And reference that in your maven commands via -

[jira] [Created] (FLINK-10824) Compile Flink unit test failed with 1.6.0 branch

2018-11-08 Thread Hongtao Zhang (JIRA)
Hongtao Zhang created FLINK-10824:
-

 Summary: Compile Flink unit test failed with 1.6.0 branch
 Key: FLINK-10824
 URL: https://issues.apache.org/jira/browse/FLINK-10824
 Project: Flink
  Issue Type: Bug
  Components: Build System
Affects Versions: 1.6.2
 Environment: IDE: Intellij

JDK Version:  "1.8.0_111"

scala Version: 2.11
Reporter: Hongtao Zhang


Reproduce steps:
 # check out the flink 1.6.0 branch
 # compile the test file PageRankITCase.java in the flink-tests module, package 
org.apache.flink.test.example.java
 # the compiler reports that some packages do not exist
 # all unit test files that reference the flink-examples package fail to 
compile

 

Information:java: Errors occurred while compiling module 'flink-tests_2.11'
Information:javac 1.8.0_111 was used to compile java sources
Information:2018/11/8 6:10 PM - Compilation completed with 17 errors and 0 
warnings in 3 s 311 ms
/Users/hongtaozhang/workspace/flink/flink-tests/src/test/java/org/apache/flink/test/example/java/PageRankITCase.java
 Error:Error:line (22)java: package org.apache.flink.examples.java.graph does not exist
 Error:Error:line (23)java: package org.apache.flink.test.testdata does not exist
 Error:Error:line (24)java: package org.apache.flink.test.util does not exist
 Error:Error:line (25)java: package org.apache.flink.util does not exist
 Error:Error:line (42)java: cannot find symbol
 symbol: class MultipleProgramsTestBase
 Error:Error:line (44)java: cannot find symbol
 symbol: class TestExecutionMode
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (63)java: cannot find symbol
 symbol: variable PageRankData
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (63)java: cannot find symbol
 symbol: variable FileUtils
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (66)java: cannot find symbol
 symbol: variable PageRankData
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (66)java: cannot find symbol
 symbol: variable FileUtils
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (74)java: cannot find symbol
 symbol: method compareKeyValuePairsWithDelta(java.lang.String,java.lang.String,java.lang.String,double)
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (83)java: cannot find symbol
 symbol: variable PageRankData
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (79)java: cannot find symbol
 symbol: variable PageRank
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (85)java: cannot find symbol
 symbol: variable PageRankData
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (94)java: cannot find symbol
 symbol: variable PageRankData
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (90)java: cannot find symbol
 symbol: variable PageRank
 location: class org.apache.flink.test.example.java.PageRankITCase
 Error:Error:line (96)java: cannot find symbol
 symbol: variable PageRankData
 location: class org.apache.flink.test.example.java.PageRankITCase



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10825) ConnectedComponents test instable on Travis

2018-11-08 Thread Timo Walther (JIRA)
Timo Walther created FLINK-10825:


 Summary: ConnectedComponents test instable on Travis
 Key: FLINK-10825
 URL: https://issues.apache.org/jira/browse/FLINK-10825
 Project: Flink
  Issue Type: Bug
  Components: E2E Tests
Reporter: Timo Walther


The "ConnectedComponents iterations with high parallelism end-to-end test" 
succeeds on Travis but the log contains with the following exception:

{code}
2018-11-08 10:15:13,698 ERROR 
org.apache.flink.runtime.taskexecutor.rpc.RpcResultPartitionConsumableNotifier  
- Could not schedule or update consumers at the JobManager.
org.apache.flink.runtime.executiongraph.ExecutionGraphException: Cannot find 
execution for execution Id 5b02c2f51e51f68b66bfab07afc1bf17.
at 
org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleOrUpdateConsumers(ExecutionGraph.java:1635)
at 
org.apache.flink.runtime.jobmaster.JobMaster.scheduleOrUpdateConsumers(JobMaster.java:637)
at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:162)
at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at 
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor.aroundReceive(Actor.scala:502)
at akka.actor.Actor.aroundReceive$(Actor.scala:500)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS]Rethink the rescale operation, can we do it async

2018-11-08 Thread Stefan Richter
Hi,

in general, I think it is clearly a good idea to make rescaling as cheap and as 
dynamic as possible and this is a goal that is surely on the long term roadmap 
of Flink.
However, from a technical point of view I think that things are not as simple 
once you go into details, and details are what the proposal lacks so far. For 
example, right now it is not yet possible to even modify the shape of an 
execution dynamically while the job is running (changing parallelism without 
restart) and the scheduling is not really aware of the position of keyed state 
partitions. Also, the state repartitioning itself has some tricky details: while 
state is being repartitioned in a consistent way, the job is still making 
progress, so how does the state of the new operators catch up with those changes, 
and all of that without violating exactly-once? We have 
a bunch of ideas how to tackle those problems in different stages towards a 
goal that might be similar to what you describe. For example, an intermediate 
step could be that you still need to briefly stop and restart the job, but we 
leverage local recovery to speed up the redeployment and each operator is 
scheduled to an instance that is preloaded with the repartitioned state to 
continue, to minimise downtime. I think we would also solve it in a general way 
that does not have limitations like being only able to rescale up and down by a 
factor of 2. So you can expect to see many steps towards this in the future, 
but I doubt that there is a quick fix by “just make it async”.

Best,
Stefan

> On 8. Nov 2018, at 03:13, shimin yang  wrote:
> 
> Currently, the rescale operation stops the whole job and restarts it
> with different parallelism. But the rescale operation costs a lot and takes
> a lot of time to recover if the state size is quite big.
> 
> And a long-time rescale might cause other problems like latency increase
> and back pressure. For some circumstances like a streaming computing cloud
> service, users may be very sensitive to latency and resource usage. So it
> would be better to make the rescale a cheaper operation.
> 
> I wonder if we could make it an async operation, just like a checkpoint. But
> how to deal with the keyed state would be a pain in the ass. Currently I
> just want to make some assumptions to keep things simpler: the async rescale
> operation can only double the parallelism or halve it.
> 
> In the scale-up case, we can copy the state to the newly created
> worker and change the partitioner of the upstream. The best timing might be
> when we get notified that a checkpoint has completed. But we still need to
> change the partitioner of the upstream, so the upstream should buffer the
> results or block the computation until the state copy finishes, and then make
> the partitioner send different elements with the same key to the same
> downstream operator.
> 
> In the scale-down scenario, we can merge the keyed state of two operators
> and also change the partitioner of the upstream.
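
To make the double-or-halve assumption above more concrete, the following is a
minimal, self-contained sketch (not Flink code) of how keys could be hashed into
a fixed number of key groups and key groups mapped to operator instances by
contiguous ranges. With such a scheme, doubling the parallelism only splits each
range in two and halving merges neighbouring ranges, which is what keeps the
state movement limited. The class and method names are illustrative assumptions,
not existing Flink APIs.

{code:java}
import java.util.Objects;

/**
 * Minimal sketch (not Flink code): keys are first hashed into a fixed number of
 * key groups; each operator instance owns a contiguous range of key groups.
 * Because the ranges are contiguous, doubling the parallelism splits every
 * range in two and halving merges neighbouring ranges, which is what makes the
 * "only double or halve" restriction cheap to implement.
 */
public final class KeyGroupSketch {

    private KeyGroupSketch() {}

    /** Assign a key to one of {@code maxParallelism} key groups. */
    static int assignKeyToGroup(Object key, int maxParallelism) {
        return Math.floorMod(Objects.hashCode(key), maxParallelism);
    }

    /** Map a key group to an operator index for the current parallelism. */
    static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        Object key = "user-42";
        int keyGroup = assignKeyToGroup(key, maxParallelism);

        int before = operatorIndexForKeyGroup(maxParallelism, 4, keyGroup);
        int after = operatorIndexForKeyGroup(maxParallelism, 8, keyGroup);

        // After doubling the parallelism, the key lands either on operator
        // 2 * before or 2 * before + 1, so only the state of the split range
        // has to be handed over, not the state of all operators.
        System.out.printf("key group %d: operator %d -> operator %d%n", keyGroup, before, after);
    }
}
{code}

Flink's keyed state is already organised along similar key-group lines today,
but, as noted above, the scheduler and the runtime would still need additional
work to exploit this without a full restart.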



[jira] [Created] (FLINK-10833) FlinkKafkaProducerITCase.testFlinkKafkaProducerFailTransactionCoordinatorBeforeNotify failed on Travis

2018-11-08 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-10833:
---

 Summary: 
FlinkKafkaProducerITCase.testFlinkKafkaProducerFailTransactionCoordinatorBeforeNotify
 failed on Travis
 Key: FLINK-10833
 URL: https://issues.apache.org/jira/browse/FLINK-10833
 Project: Flink
  Issue Type: Bug
  Components: Kafka Connector
Affects Versions: 1.7.0
Reporter: Andrey Zagrebin
 Fix For: 1.7.0


FlinkKafkaProducerITCase.testFlinkKafkaProducerFailTransactionCoordinatorBeforeNotify
 failed on Travis

https://travis-ci.org/apache/flink/jobs/452290475

https://api.travis-ci.org/v3/job/452290475/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: StreamingFileSink Bug? Committing results on stream close

2018-11-08 Thread Aljoscha Krettek
Hi Addison,

unfortunately, there is a long-standing problem that user functions cannot 
differentiate between successful and erroneous shutdown [1]. I had this high on 
my private list of things that I finally want to see fixed in Flink 1.8. And 
your message further confirms this.

Best,
Aljoscha

[1] https://issues.apache.org/jira/browse/FLINK-2646

> On 8. Nov 2018, at 13:39, Till Rohrmann  wrote:
> 
> Hi Addison,
> 
> thanks for reporting this issue. I've pulled in Kostas who worked on the
> StreamingFileSink and knows the current behaviour as well as its
> limitations best.
> 
> Cheers,
> Till
> 
> On Wed, Nov 7, 2018 at 11:49 PM Addison Higham  wrote:
> 
>> Hi all,
>> 
>> Just ran into a bit of a problem, and I am not sure what the behavior should
>> be and if this should be considered a bug? Or if there is some other way
>> this should be handled?
>> 
>> I have a streaming job with a stream that eventually closes; this job sinks
>> to a StreamingFileSink.
>> The problem I am experiencing is that any data written to the sink between
>> the last checkpoint and the close of the stream is lost.
>> 
>> This happens (AFAICT) because the StreamingFileSink relies on checkpoints
>> to commit files and closing the stream currently does not try and commit
>> anything.
>> 
>> It seems like just making close() call
>> `buckets.commitUpToCheckpoint(Long.MAX_VALUE)` would work pretty well,
>> assuming it is an actual stream close, but could be problematic in the
>> events of a savepoint/cancel and resuming later (it may only mean some
>> files would be prematurely committed). Ideally, we would be able to
>> differentiate between the two different types of close (an actual stream
>> finishing vs a cancel), but at the moment that doesn't seem supported.
>> 
>> If this is considered a bug, please let me know and I will file a Jira; if
>> not, what is the "correct" way to handle getting all the data out with any
>> sinks that rely on a checkpoint to commit data?
>> 
>> Thanks
>> 



Re: Kinesis consumer e2e test

2018-11-08 Thread Till Rohrmann
Hi Thomas,

the community is really interested in adding an end-to-end test for the
Kinesis connector (producer as well as consumer). Thus, it would be really
helpful if you could contribute the work you've already done.

Using Kinesalite sounds good to me, and you're right that we assume
Docker is available in our testing environment. The testing job would go
into a separate module as you've suggested
(flink-end-to-end-tests/flink-kinesis-test) and the entrypoint to the test
would go into flink-end-to-end-tests/test-scripts/ plus
flink-end-to-end-tests/run-nightly-tests.sh.

I think Stefan stopped his work on the end-to-end test but he had some
ideas about reusing testing infrastructure for the Kafka and Kinesis tests
(e.g. having a test base for similar connectors). This is something we can
also address after the release if it would entail too much work.

Cheers,
Till

On Thu, Nov 8, 2018 at 7:22 AM Tzu-Li (Gordon) Tai 
wrote:

> Hi Thomas,
>
> I think Stefan Richter is also working on the Kinesis end-to-end test, and
> seems to be planning to implement it against a real Kinesis service instead
> of Kinesalite.
> Perhaps efforts should be synced here.
>
> Cheers,
> Gordon
>
>
> On Thu, Nov 8, 2018 at 1:38 PM Thomas Weise  wrote:
>
> > Hi,
> >
> > I'm planning to add an end-to-end test for the Kinesis consumer. We have
> > done something similar at Lyft, using Kinesalite, which can be run as a
> > Docker container.
> >
> > I see that some tests already make use of Docker, so we can assume it to
> be
> > present in the target environment(s)?
> >
> > I also found the following ticket:
> > https://issues.apache.org/jira/browse/FLINK-9007
> >
> > It suggests also covering the producer, which may be a good way to create
> > the input data as well. The stream itself can be created with the Kinesis
> > Java SDK.
> >
> > Following the existing layout, there would be a new module
> > flink-end-to-end-tests/flink-kinesis-test
> >
> > Are there any suggestions or comments regarding this?
> >
> > Thanks,
> > Thomas
> >
>
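
As a rough illustration of the Kinesalite-based setup discussed above, the
sketch below creates the test input stream with the Kinesis Java SDK pointed at
a locally running Kinesalite container (the default Kinesalite port is 4567).
The endpoint, region, credentials, and stream name are placeholder assumptions
for a test harness that does not exist yet.

{code:java}
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/**
 * Sketch: create a stream in a local Kinesalite instance (e.g. started as a
 * Docker container exposing port 4567) and write a test record. Endpoint,
 * region, credentials, and stream name are placeholders.
 */
public class KinesaliteSetupSketch {

    public static void main(String[] args) throws Exception {
        // Kinesalite speaks JSON, not CBOR; depending on the SDK version this
        // property may be needed when talking to it.
        System.setProperty("com.amazonaws.sdk.disableCbor", "true");

        AmazonKinesis kinesis = AmazonKinesisClientBuilder.standard()
                .withEndpointConfiguration(
                        new EndpointConfiguration("http://localhost:4567", "us-east-1"))
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("fakeAccessKey", "fakeSecretKey")))
                .build();

        // Create the input stream the Flink job under test will read from.
        kinesis.createStream("flink-e2e-test-stream", 1);

        // Wait briefly for the stream to become ACTIVE before writing.
        Thread.sleep(1000);

        // Put one record so the consumer side has something to read.
        kinesis.putRecord(
                "flink-e2e-test-stream",
                ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8)),
                "partition-key-0");
    }
}
{code}

The Flink job under test would then need to point its Kinesis connector at the
same local endpoint; the exact configuration key for the endpoint override
depends on the connector version.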


Re: StreamingFileSink Bug? Committing results on stream close

2018-11-08 Thread Till Rohrmann
Hi Addison,

thanks for reporting this issue. I've pulled in Kostas who worked on the
StreamingFileSink and knows the current behaviour as well as its
limitations best.

Cheers,
Till

On Wed, Nov 7, 2018 at 11:49 PM Addison Higham  wrote:

> Hi all,
>
> Just ran into a bit of a problem, and I am not sure what the behavior should
> be and if this should be considered a bug? Or if there is some other way
> this should be handled?
>
> I have a streaming job with a stream that eventually closes; this job sinks
> to a StreamingFileSink.
> The problem I am experiencing is that any data written to the sink between
> the last checkpoint and the close of the stream is lost.
>
> This happens (AFAICT) because the StreamingFileSink relies on checkpoints
> to commit files and closing the stream currently does not try and commit
> anything.
>
> It seems like just making close() call
> `buckets.commitUpToCheckpoint(Long.MAX_VALUE)` would work pretty well,
> assuming it is an actual stream close, but could be problematic in the
> events of a savepoint/cancel and resuming later (it may only mean some
> files would be prematurely committed). Ideally, we would be able to
> differentiate between the two different types of close (an actual stream
> finishing vs a cancel), but at the moment that doesn't seem supported.
>
> If this is considered a bug, please let me know and I will file a Jira; if
> not, what is the "correct" way to handle getting all the data out with any
> sinks that rely on a checkpoint to commit data?
>
> Thanks
>