Re: Consuming data from dynamoDB streams to flink

2018-06-27 Thread Tzu-Li (Gordon) Tai
Hi!

I think it would definitely be nice to have this feature.

No actual work has been done on this issue yet, but AFAIK we should
be able to build this on top of the FlinkKinesisConsumer.
Whether this should live within the Kinesis connector module or an
independent module of its own is still TBD.
If you want, I would be happy to look at any concrete design proposals you
have for this before you start the actual development efforts.

Cheers,
Gordon

On Thu, Jun 28, 2018 at 2:12 AM Ying Xu  wrote:

> Thanks Fabian for the suggestion.
>
> *Ying Xu*
> Software Engineer
> 510.368.1252 <+15103681252>
> [image: Lyft] 
>
> On Wed, Jun 27, 2018 at 2:01 AM, Fabian Hueske  wrote:
>
> > Hi Ying,
> >
> > I'm not aware of any effort for this issue.
> > You could check with the assigned contributor in Jira if there is some
> > previous work.
> >
> > Best, Fabian
> >
> > 2018-06-26 9:46 GMT+02:00 Ying Xu :
> >
> > > Hello Flink dev:
> > >
> > > We have a number of use cases which involves pulling data from DynamoDB
> > > streams into Flink.
> > >
> > > Given that this issue is tracked by Flink-4582
> > > . we would like to
> > check
> > > if any prior work has been completed by the community.   We are also
> very
> > > interested in contributing to this effort.  Currently, we have a
> > high-level
> > > proposal which is based on extending the existing FlinkKinesisConsumer
> > and
> > > making it work with DynamoDB streams (via integrating with the
> > > AmazonDynamoDBStreams API).
> > >
> > > Any suggestion is welcome. Thank you very much.
> > >
> > >
> > > -
> > > Ying
> > >
> >
>


Re: FLINK-6222

2018-06-27 Thread zhangminglei
Hi, Craig

The patch you attached does not seem to follow the Flink community's contribution process.
Could you please open a GitHub pull request and link it there?

Cheers
Minglei

> On Jun 28, 2018, at 3:56 AM, Foster, Craig  wrote:
> 
> Pinging. Is it possible for someone to take a look at this or is this message 
> going into a black hole?
> 
> Thanks.
> 
> From: "Foster, Craig" 
> Date: Thursday, June 14, 2018 at 11:22 AM
> To: "dev@flink.apache.org" 
> Subject: FLINK-6222
> 
> Greetings:
> I’m wondering how to get someone to review my patch? The JIRA itself was 
> created over a year ago with no response…so wanted to ping here.
> 
> 
> https://issues.apache.org/jira/browse/FLINK-6222
> 
> 
> Thanks!
> Craig
> 
> 
> 
> 




[jira] [Created] (FLINK-9685) Flink should support hostname-substitution for security.kerberos.login.principal

2018-06-27 Thread Ethan Li (JIRA)
Ethan Li created FLINK-9685:
---

 Summary: Flink should support hostname-substitution for 
security.kerberos.login.principal
 Key: FLINK-9685
 URL: https://issues.apache.org/jira/browse/FLINK-9685
 Project: Flink
  Issue Type: Improvement
Reporter: Ethan Li


[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/SecurityConfiguration.java#L83]

 

We could have something like this:
{code:java}
String rawPrincipal = flinkConf.getString(SecurityOptions.KERBEROS_LOGIN_PRINCIPAL);
if (rawPrincipal != null) {
   try {
      rawPrincipal = rawPrincipal.replace("HOSTNAME",
         InetAddress.getLocalHost().getCanonicalHostName());
   } catch (UnknownHostException e) {
      LOG.error("Failed to replace HOSTNAME with the local hostname", e);
   }
}
this.principal = rawPrincipal;
{code}

This would make it easier to deploy Flink to a cluster: instead of setting a different 
principal on every node, we could use the same principal 
headless_user/HOSTNAME@DOMAIN everywhere.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9684) HistoryServerArchiveFetcher not working properly with secure hdfs cluster

2018-06-27 Thread Ethan Li (JIRA)
Ethan Li created FLINK-9684:
---

 Summary: HistoryServerArchiveFetcher not working properly with 
secure hdfs cluster
 Key: FLINK-9684
 URL: https://issues.apache.org/jira/browse/FLINK-9684
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.4.2
Reporter: Ethan Li


With my current setup, the JobManager and TaskManager are able to talk to the HDFS 
cluster (with Kerberos set up). However, running the history server yields:

 

 
{code:java}
2018-06-27 19:03:32,080 WARN org.apache.hadoop.ipc.Client - Exception 
encountered while connecting to the server : 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name
2018-06-27 19:03:32,085 ERROR 
org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher - 
Failed to access job archive location for path 
hdfs://openqe11blue-n2.blue.ygrid.yahoo.com/tmp/flink/openstorm10-blue/jmarchive.
java.io.IOException: Failed on local exception: java.io.IOException: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: 
"openstorm10blue-n2.blue.ygrid.yahoo.com/10.215.79.35"; destination host is: 
"openqe11blue-n2.blue.ygri
d.yahoo.com":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1414)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy9.getListing(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy9.getListing(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:515)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1743)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1726)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:650)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.listStatus(HadoopFileSystem.java:146)
at 
org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher$JobArchiveFetcherTask.run(HistoryServerArchiveFetcher.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.IllegalArgumentException: Failed to 
specify server's Kerberos principal name
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:677)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at 
org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:640)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:724)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 28 more
{code}
 

 

After changing the log level to DEBUG, I see:

 
{code:java}
2018-06-27 19:03:30,931 INFO 
org.apache.flink.runtime.webmonitor.history.HistoryServer - Enabling SSL for 
the history server.
2018-06-27 19:03:30,931 DEBUG org.apache.flink.runtime.net.SSLUtils - Creating 
server SSL context from configuration
2018-06-27 19:03:31,091 DEBUG org.apache.flink.core.fs.FileSystem - Loading 
extension file systems via services
2018-06-27 19:03:31,094 DEBUG 

Re: FLINK-6222

2018-06-27 Thread Foster, Craig
Pinging. Is it possible for someone to take a look at this or is this message 
going into a black hole?

Thanks.

From: "Foster, Craig" 
Date: Thursday, June 14, 2018 at 11:22 AM
To: "dev@flink.apache.org" 
Subject: FLINK-6222

Greetings:
I’m wondering how to get someone to review my patch? The JIRA itself was 
created over a year ago with no response…so wanted to ping here.


https://issues.apache.org/jira/browse/FLINK-6222


Thanks!
Craig






[jira] [Created] (FLINK-9683) inconsistent behaviors when setting historyserver.archive.fs.dir

2018-06-27 Thread Ethan Li (JIRA)
Ethan Li created FLINK-9683:
---

 Summary: inconsistent behaviors when setting 
historyserver.archive.fs.dir
 Key: FLINK-9683
 URL: https://issues.apache.org/jira/browse/FLINK-9683
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.4.2
Reporter: Ethan Li


I am using release-1.4.2, with fs.default-scheme and fs.hdfs.hadoopconf set correctly.

When I set
{code:java}
historyserver.archive.fs.dir: /tmp/flink/cluster-name/jmarchive
{code}
I am seeing
{code:java}
2018-06-27 18:51:12,692 WARN 
org.apache.flink.runtime.webmonitor.history.HistoryServer - Failed to create 
Path or FileSystem for directory '/tmp/flink/cluster-name/jmarchive'. Directory 
will not be monitored.
java.lang.IllegalArgumentException: The scheme (hdfs://, file://, etc) is null. 
Please specify the file system scheme explicitly in the URI.
at 
org.apache.flink.runtime.webmonitor.WebMonitorUtils.validateAndNormalizeUri(WebMonitorUtils.java:300)
at 
org.apache.flink.runtime.webmonitor.history.HistoryServer.(HistoryServer.java:168)
at 
org.apache.flink.runtime.webmonitor.history.HistoryServer.(HistoryServer.java:132)
at 
org.apache.flink.runtime.webmonitor.history.HistoryServer$1.call(HistoryServer.java:113)
at 
org.apache.flink.runtime.webmonitor.history.HistoryServer$1.call(HistoryServer.java:110)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at 
org.apache.flink.runtime.webmonitor.history.HistoryServer.main(HistoryServer.java:110)
{code}
 

And then if I set 
{code:java}
historyserver.archive.fs.dir: hdfs:///tmp/flink/cluster-name/jmarchive{code}
I am seeing:

 
{code:java}
java.io.IOException: The given file system URI 
(hdfs:///tmp/flink/cluster-name/jmarchive) did not describe the authority (like 
for example HDFS NameNode address/port or S3 host). The attempt to use a 
configured default authority failed: Hadoop configuration for default file 
system ('fs.default.name' or 'fs.defaultFS') contains no valid authority 
component (like hdfs namenode, S3 host, etc)
at 
org.apache.flink.runtime.fs.hdfs.HadoopFsFactory.create(HadoopFsFactory.java:149)
at 
org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:401)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:320)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:293)
at 
org.apache.flink.runtime.webmonitor.history.HistoryServer.(HistoryServer.java:169)
at 
org.apache.flink.runtime.webmonitor.history.HistoryServer.(HistoryServer.java:132)
{code}
The only way it works is to provide the full HDFS path, like:
{code:java}
historyserver.archive.fs.dir: hdfs:///tmp/flink/cluster-name/jmarchive
{code}
 

The situations above arise because two parts of the code treat the scheme differently:

https://github.com/apache/flink/blob/release-1.4.2/flink-runtime/src/main/java/org/apache/flink/runtime/webmonitor/WebMonitorUtils.java#L299-L302

https://github.com/apache/flink/blob/release-1.4.2/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L335-L338

I believe the first case should be supported if users have set fs.default-scheme.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Flink-9407] Question about proposed ORC Sink !

2018-06-27 Thread sagar loke
Thanks @zhangminglei and @Fabian for confirming.

I also looked at the ORC parsing code, and it seems that using a struct type at the
top level is mandatory for now.
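
For reference, the ORC TypeDescription API parses exactly that kind of schema string
(a small sketch; the field names are just examples):

TypeDescription schema = TypeDescription.fromString("struct<name:string,age:int>");
// the top-level type is the struct; the individual columns are its children
schema.getFieldNames();  // -> [name, age]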

Thanks,
Sagar

On Wed, Jun 27, 2018 at 12:59 AM, Fabian Hueske  wrote:

> Hi Sagar,
>
> That's more a question for the ORC community, but AFAIK, the top-level
> type is always a struct because it needs to wrap the fields, e.g.,
> struct(name:string, age:int)
>
> Best, Fabian
>
> 2018-06-26 22:38 GMT+02:00 sagar loke :
>
>> @zhangminglei,
>>
>> Question about the schema for ORC format:
>>
>> 1. Does it always need to be of a complex type like "struct<...>" ?
>>
>> 2. Or can it be created with individual data types directly ?
>> eg. "name:string, age:int" ?
>>
>>
>> Thanks,
>> Sagar
>>
>> On Fri, Jun 22, 2018 at 11:56 PM, zhangminglei <18717838...@163.com>
>> wrote:
>>
>>> Yes, it should be exit. Thanks to Ted Yu. Very exactly!
>>>
>>> Cheers
>>> Zhangminglei
>>>
>>> On Jun 23, 2018, at 12:40 PM, Ted Yu  wrote:
>>>
>>> For #1, the word exist should be exit, right ?
>>> Thanks
>>>
>>>  Original message 
>>> From: zhangminglei <18717838...@163.com>
>>> Date: 6/23/18 10:12 AM (GMT+08:00)
>>> To: sagar loke 
>>> Cc: dev , user 
>>> Subject: Re: [Flink-9407] Question about proposed ORC Sink !
>>>
>>> Hi, Sagar.
>>>
>>> 1. It solves the issue partially meaning files which have finished
>>> checkpointing don't show .pending status but the files which were in
>>> progress
>>> when the program exists are still in .pending state.
>>>
>>>
>>> Ans:
>>>
>>> Yea, Make the program exists and in that time if a checkpoint does not
>>> finished will lead the status keeps in .pending state then. Under the
>>> normal circumstances, the programs that running in the production env will
>>> never be stoped or existed if everything is fine.
>>>
>>> 2. Ideally, writer should work with default settings correct ? Meaning
>>> we don't have to explicitly set these parameters to make it work.
>>> Is this assumption correct ?
>>>
>>>
>>> Ans:
>>>
>>> Yes. Writer should work with default settings correct.
>>> Yes. We do not have to explicitly set these parameters to make it work.
>>> Yes. Assumption correct indeed.
>>>
>>> However, you know, flink is a real time streaming framework, so under
>>> normal circumstances,you don't really go to use the default settings when
>>> it comes to a specific business. Especially together work with *offline
>>> end*(Like hadoop mapreduce). In this case, you need to tell the offline
>>> end when time a bucket is close and when time the data for the specify
>>> bucket is ready. So, you can take a look on https://issues.apache.org/j
>>> ira/browse/FLINK-9609.
>>>
>>> Cheers
>>> Zhangminglei
>>>
>>>
>>> On Jun 23, 2018, at 8:23 AM, sagar loke  wrote:
>>>
>>> Hi Zhangminglei,
>>>
>>> Thanks for the reply.
>>>
>>> 1. It solves the issue partially meaning files which have finished
>>> checkpointing don't show .pending status but the files which were in
>>> progress
>>> when the program exists are still in .pending state.
>>>
>>> 2. Ideally, writer should work with default settings correct ? Meaning
>>> we don't have to explicitly set these parameters to make it work.
>>> Is this assumption correct ?
>>>
>>> Thanks,
>>> Sagar
>>>
>>> On Fri, Jun 22, 2018 at 3:25 AM, zhangminglei <18717838...@163.com>
>>> wrote:
>>>
 Hi, Sagar. Please use the below code and you will find the part files
 status from _part-0-107.in-progress   to _part-0-107.pending and
 finally to part-0-107. [For example], you need to run the program for a
 while. However, we need set some parameters, like the following. Moreover,
 *enableCheckpointing* IS also needed. I know why you always see the
 *.pending* file since the below parameters default value is 60 seconds
 even though you set the enableCheckpoint. So, that is why you can not see
 the finished file status until 60 seconds passed.

 Attached is the ending on my end, and you will see what you want!

 Please let me know if you still have the problem.

 Cheers
 Zhangminglei

 setInactiveBucketCheckInterval(2000)
 .setInactiveBucketThreshold(2000);


 public class TestOrc {
public static void main(String[] args) throws Exception {
   StreamExecutionEnvironment env = 
 StreamExecutionEnvironment.getExecutionEnvironment();
   env.setParallelism(1);
   env.enableCheckpointing(1000);
   env.setStateBackend(new MemoryStateBackend());

   String orcSchemaString = 
 "struct";
   String path = "hdfs://10.199.196.0:9000/data/hive/man";

   BucketingSink bucketingSink = new BucketingSink<>(path);

   bucketingSink
  .setWriter(new OrcFileWriter<>(orcSchemaString))
  .setInactiveBucketCheckInterval(2000)
  .setInactiveBucketThreshold(2000);

   DataStream dataStream = env.addSource(new ManGenerator());

   

Re: Consuming data from dynamoDB streams to flink

2018-06-27 Thread Ying Xu
Thanks Fabian for the suggestion.

*Ying Xu*
Software Engineer
510.368.1252 <+15103681252>
[image: Lyft] 

On Wed, Jun 27, 2018 at 2:01 AM, Fabian Hueske  wrote:

> Hi Ying,
>
> I'm not aware of any effort for this issue.
> You could check with the assigned contributor in Jira if there is some
> previous work.
>
> Best, Fabian
>
> 2018-06-26 9:46 GMT+02:00 Ying Xu :
>
> > Hello Flink dev:
> >
> > We have a number of use cases which involves pulling data from DynamoDB
> > streams into Flink.
> >
> > Given that this issue is tracked by Flink-4582
> > . we would like to
> check
> > if any prior work has been completed by the community.   We are also very
> > interested in contributing to this effort.  Currently, we have a
> high-level
> > proposal which is based on extending the existing FlinkKinesisConsumer
> and
> > making it work with DynamoDB streams (via integrating with the
> > AmazonDynamoDBStreams API).
> >
> > Any suggestion is welcome. Thank you very much.
> >
> >
> > -
> > Ying
> >
>


[jira] [Created] (FLINK-9682) Add setDescription to execution environment and display it in the UI

2018-06-27 Thread Elias Levy (JIRA)
Elias Levy created FLINK-9682:
-

 Summary: Add setDescription to execution environment and display 
it in the UI
 Key: FLINK-9682
 URL: https://issues.apache.org/jira/browse/FLINK-9682
 Project: Flink
  Issue Type: Improvement
  Components: DataStream API, Webfrontend
Affects Versions: 1.5.0
Reporter: Elias Levy


Currently you can provide a job name to {{execute}} in the execution 
environment.  In an environment where many versions of a job may be executing, 
such as a development or test environment, identifying which running job is of 
a specific version via the UI can be difficult unless the version is embedded 
into the job name passed to {{execute}}.  But the job name is used for other 
purposes, such as namespacing metrics.  Thus, it is not ideal to modify the 
job name, as that could require modifying metric dashboards and monitors each 
time versions change.

I propose a new method be added to the execution environment, 
{{setDescription}}, that would allow a user to pass in an arbitrary description 
that would be displayed in the dashboard, allowing users to distinguish jobs.
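
A sketch of the proposed usage ({{setDescription}} does not exist yet; the name and 
placement are part of this proposal):
{code:java}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setDescription("fraud-rules build 2018-06-27, branch feature/xyz"); // shown in the UI only

// ... define the job ...

env.execute("fraud-rules"); // job name stays stable, so metric names don't change
{code}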



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9681) Make sure minRetentionTime not equal to maxRetentionTime

2018-06-27 Thread Hequn Cheng (JIRA)
Hequn Cheng created FLINK-9681:
--

 Summary: Make sure minRetentionTime not equal to maxRetentionTime
 Key: FLINK-9681
 URL: https://issues.apache.org/jira/browse/FLINK-9681
 Project: Flink
  Issue Type: Improvement
Reporter: Hequn Cheng
Assignee: Hequn Cheng


Currently, for a group by, if minRetentionTime equals maxRetentionTime, the 
group-by operator registers a timer for each record, which causes performance 
problems.
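
For reference, users can already avoid the equal-retention case via the Table API 
query config (a sketch; {{tableEnv}}, {{resultTable}} and the concrete times are 
just examples):
{code:java}
StreamQueryConfig qConfig = tableEnv.queryConfig();
// a gap between min and max retention lets the operator coalesce timers
// instead of registering one timer per record
qConfig.withIdleStateRetentionTime(Time.hours(12), Time.hours(24));
tableEnv.toRetractStream(resultTable, Row.class, qConfig);
{code}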



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9680) Reduce heartbeat timeout for E2E tests

2018-06-27 Thread Chesnay Schepler (JIRA)
Chesnay Schepler created FLINK-9680:
---

 Summary: Reduce heartbeat timeout for E2E tests
 Key: FLINK-9680
 URL: https://issues.apache.org/jira/browse/FLINK-9680
 Project: Flink
  Issue Type: Improvement
  Components: Tests
Affects Versions: 1.5.0, 1.6.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.6.0, 1.5.1


Several end-to-end tests shut down job- and taskmanagers and wait for them to 
come back up before continuing the testing process.

{{heartbeat.timeout}} controls how long a container has to be unreachable to be 
considered lost. The default for this option is 50 seconds, causing significant 
idle times during the tests.
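
The E2E scripts set this value in flink-conf.yaml; programmatically the equivalent 
would look roughly like this (a sketch, the concrete values are just examples):
{code:java}
Configuration config = new Configuration();
// detect a killed JM/TM after a few seconds instead of the 50s default
config.setLong(HeartbeatManagerOptions.HEARTBEAT_INTERVAL, 1000L);
config.setLong(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, 5000L);
{code}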



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9679) Implement AvroSerializationSchema

2018-06-27 Thread Yazdan Shirvany (JIRA)
Yazdan Shirvany created FLINK-9679:
--

 Summary: Implement AvroSerializationSchema
 Key: FLINK-9679
 URL: https://issues.apache.org/jira/browse/FLINK-9679
 Project: Flink
  Issue Type: Improvement
Affects Versions: 1.6.0
Reporter: Yazdan Shirvany
Assignee: Yazdan Shirvany


Implement AvroSerializationSchema using Confluent Schema Registry



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9678) Remove hard-coded sleeps in HA E2E test

2018-06-27 Thread Chesnay Schepler (JIRA)
Chesnay Schepler created FLINK-9678:
---

 Summary: Remove hard-coded sleeps in HA E2E test
 Key: FLINK-9678
 URL: https://issues.apache.org/jira/browse/FLINK-9678
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination, Tests
Affects Versions: 1.5.0, 1.6.0
Reporter: Chesnay Schepler


{{test_ha.sh}} uses 2 hard-coded sleeps.
{code:java}
# let the job run for a while to take some checkpoints
sleep 20

for (( c=0; c<${JM_KILLS}; c++ )); do
# kill the JM and wait for watchdog to
# create a new one which will take over
kill_jm
sleep 60
done{code}
These sleeps are always troublesome: they either make the test brittle by being 
too small, or cause the test to idle when they are too large.

The first sleep should be replaced with {{wait_num_checkpoints}}.

I'm not entirely sure about the semantics of the second sleep, but I guess 
we're waiting for the new JM to continue the job execution. In this case I 
suggest instead querying the job status via REST and waiting until the job is 
running.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9677) RestClient fails for large uploads

2018-06-27 Thread Chesnay Schepler (JIRA)
Chesnay Schepler created FLINK-9677:
---

 Summary: RestClient fails for large uploads
 Key: FLINK-9677
 URL: https://issues.apache.org/jira/browse/FLINK-9677
 Project: Flink
  Issue Type: Bug
  Components: REST
Affects Versions: 1.6.0, 1.5.1
Reporter: Chesnay Schepler
 Fix For: 1.6.0, 1.5.1


Uploading a large file via the {{RestClient}} can lead to an exception on the 
server:
{code:java}
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostRequestDecoder$ErrorDataDecoderException:
 java.io.IOException: Out of size: 67115495 > 67108864
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.loadDataMultipart(HttpPostMultipartRequestDecoder.java:1377)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.getFileUpload(HttpPostMultipartRequestDecoder.java:907)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.decodeMultipart(HttpPostMultipartRequestDecoder.java:551)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.parseBodyMultipart(HttpPostMultipartRequestDecoder.java:442)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.parseBody(HttpPostMultipartRequestDecoder.java:411)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.offer(HttpPostMultipartRequestDecoder.java:336)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostMultipartRequestDecoder.offer(HttpPostMultipartRequestDecoder.java:53)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.http.multipart.HttpPostRequestDecoder.offer(HttpPostRequestDecoder.java:227)
at 
org.apache.flink.runtime.rest.FileUploadHandler.channelRead0(FileUploadHandler.java:112)
at 
org.apache.flink.runtime.rest.FileUploadHandler.channelRead0(FileUploadHandler.java:66)
at 
org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
org.apache.flink.shaded.netty4.io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
at 
org.apache.flink.shaded.netty4.io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at 

Re: Streaming

2018-06-27 Thread zhangminglei
Forwarding Shimin's mail to Aitozi.

Aitozi

We are using HyperLogLog to count daily UV, but it only provides an approximate 
value. I also tried count distinct in the Flink Table API without a window, but 
that requires setting the retention time.

However, the time resolution of that operator is 1 millisecond, so it ends up 
with too many timers on the Java heap, which might lead to OOM.

Cheers
Shimin


> On Jun 27, 2018, at 5:34 PM, zhangminglei <18717838...@163.com> wrote:
> 
> Aitozi
> 
> From my side, I do not think distinct is very easy to deal with. Even though 
> together work with kafka support exactly-once.
> 
> For uv, we can use a bloomfilter to filter pv for geting uv in the end. 
> 
> Window is usually used in an aggregate operation, so I think all should be 
> realized by windows.
> 
> I am not familiar with this fields, so I still want to know what others 
> response this question.
> 
> Cheers
> Minglei
> 
> 
> 
>> On Jun 27, 2018, at 5:12 PM, aitozi  wrote:
>> 
>> Hi, community
>> 
>> I am using flink to deal with some situation.
>> 
>> 1. "distinct count" to calculate the uv/pv.
>> 2.  calculate the topN of the past 1 hour or 1 day time.
>> 
>> Are these all realized by window? Or is there a best practice on doing this?
>> 
>> 3. And when deal with the distinct, if there is no need to do the keyBy
>> previous, how does the window deal with this.
>> 
>> Thanks 
>> Aitozi.
>> 
>> 
>> 
>> --
>> Sent from: 
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
> 
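
For anyone looking for a concrete starting point, here is a rough sketch of an exact 
per-page daily UV count (the event type and field layout are assumptions, not from 
this thread). Its state grows with the number of distinct users, which is why the 
HyperLogLog / Bloom filter approaches above trade accuracy for smaller state:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements(Tuple2.of("page-1", "user-a"), Tuple2.of("page-1", "user-b"))
   .keyBy(0)  // key by page id
   .window(TumblingProcessingTimeWindows.of(Time.days(1)))
   .aggregate(new AggregateFunction<Tuple2<String, String>, HashSet<String>, Long>() {
      @Override public HashSet<String> createAccumulator() { return new HashSet<>(); }
      @Override public HashSet<String> add(Tuple2<String, String> e, HashSet<String> acc) {
         acc.add(e.f1);  // collect distinct user ids
         return acc;
      }
      @Override public Long getResult(HashSet<String> acc) { return (long) acc.size(); }
      @Override public HashSet<String> merge(HashSet<String> a, HashSet<String> b) {
         a.addAll(b);
         return a;
      }
   })
   .print();

env.execute("daily-uv-sketch");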




Re: Streaming

2018-06-27 Thread zhangminglei
To aitozi. 

Cheers
Minglei

> On Jun 27, 2018, at 5:46 PM, shimin yang  wrote:
> 
> Aitozi
> 
> We are using HyperLogLog to count daily UV, but it only provides an 
> approximate value. I also tried count distinct in the Flink Table API without 
> a window, but that requires setting the retention time.
> 
> However, the time resolution of that operator is 1 millisecond, so it ends up 
> with too many timers on the Java heap, which might lead to OOM.
> 
> Cheers
> Shimin
> 
> 2018-06-27 17:34 GMT+08:00 zhangminglei <18717838...@163.com 
> >:
> Aitozi
> 
> From my side, I do not think distinct is very easy to deal with. Even though 
> together work with kafka support exactly-once.
> 
> For uv, we can use a bloomfilter to filter pv for geting uv in the end. 
> 
> Window is usually used in an aggregate operation, so I think all should be 
> realized by windows.
> 
> I am not familiar with this fields, so I still want to know what others 
> response this question.
> 
> Cheers
> Minglei
> 
> 
> 
> > On Jun 27, 2018, at 5:12 PM, aitozi  wrote:
> > 
> > Hi, community
> > 
> > I am using flink to deal with some situation.
> > 
> > 1. "distinct count" to calculate the uv/pv.
> > 2.  calculate the topN of the past 1 hour or 1 day time.
> > 
> > Are these all realized by window? Or is there a best practice on doing this?
> > 
> > 3. And when deal with the distinct, if there is no need to do the keyBy
> > previous, how does the window deal with this.
> > 
> > Thanks 
> > Aitozi.
> > 
> > 
> > 
> > --
> > Sent from: 
> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ 
> > 
> 
> 
> 



[jira] [Created] (FLINK-9676) Deadlock during canceling task and recycling exclusive buffer

2018-06-27 Thread zhijiang (JIRA)
zhijiang created FLINK-9676:
---

 Summary: Deadlock during canceling task and recycling exclusive 
buffer
 Key: FLINK-9676
 URL: https://issues.apache.org/jira/browse/FLINK-9676
 Project: Flink
  Issue Type: Bug
  Components: Network
Affects Versions: 1.5.0
Reporter: zhijiang
 Fix For: 1.5.1


It may cause a deadlock between the task canceler thread and the task thread.

The details are as follows:

Task canceler thread -> {{IC1#releaseAllResources}} -> recycle floating buffers 
-> lock ({{LocalBufferPool#availableMemorySegments}}) -> {{IC2#notifyBufferAvailable}} 
-> try to lock ({{IC2#bufferQueue}})

Task thread -> {{IC2#recycle}} -> lock ({{IC2#bufferQueue}}) -> 
{{bufferQueue#addExclusiveBuffer}} -> {{floatingBuffer#recycleBuffer}} -> 
try to lock ({{LocalBufferPool#availableMemorySegments}})

One solution is that {{listener#notifyBufferAvailable}} can be called outside 
the {{synchronized (availableMemorySegments)}} block in {{LocalBufferPool#recycle}}.
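
The fix pattern could look roughly like this (an illustrative sketch with made-up 
names, not the actual LocalBufferPool code):
{code:java}
import java.util.ArrayDeque;

// Sketch only: pick the listener while holding the pool lock, but invoke its
// callback after releasing it, so the callback can safely take the input
// channel's own bufferQueue lock without the lock-order inversion above.
public class PoolSketch {
    interface BufferListener {
        void notifyBufferAvailable(Object segment);
    }

    private final ArrayDeque<Object> availableMemorySegments = new ArrayDeque<>();
    private final ArrayDeque<BufferListener> registeredListeners = new ArrayDeque<>();

    void recycle(Object segment) {
        BufferListener listener;
        synchronized (availableMemorySegments) {
            listener = registeredListeners.poll();
            if (listener == null) {
                availableMemorySegments.add(segment);
            }
        }
        if (listener != null) {
            // no lock held here, so the callback may lock its own structures
            listener.notifyBufferAvailable(segment);
        }
    }
}
{code}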



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9675) Avoid FileInputStream/FileOutputStream

2018-06-27 Thread Ted Yu (JIRA)
Ted Yu created FLINK-9675:
-

 Summary: Avoid FileInputStream/FileOutputStream
 Key: FLINK-9675
 URL: https://issues.apache.org/jira/browse/FLINK-9675
 Project: Flink
  Issue Type: Improvement
Reporter: Ted Yu


They rely on finalizers (before Java 11), which create unnecessary GC load. The 
alternatives, Files.newInputStream and Files.newOutputStream, are just as easy to 
use and don't have this issue.
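
For example (illustrative only; the file names are made up):
{code:java}
// java.nio.file.Files / Paths with plain java.io streams
try (InputStream in = Files.newInputStream(Paths.get("flink-conf.yaml"));
     OutputStream out = Files.newOutputStream(Paths.get("flink-conf.copy"))) {
    byte[] buffer = new byte[4096];
    int read;
    while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
    }
}
{code}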



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9674) Remove 65s sleep in QueryableState E2E test

2018-06-27 Thread Chesnay Schepler (JIRA)
Chesnay Schepler created FLINK-9674:
---

 Summary: Remove 65s sleep in QueryableState E2E test
 Key: FLINK-9674
 URL: https://issues.apache.org/jira/browse/FLINK-9674
 Project: Flink
  Issue Type: Improvement
  Components: Queryable State, Tests
Affects Versions: 1.5.0, 1.6.0
Reporter: Chesnay Schepler


The {{test_queryable_state_restart_tm.sh}} kills a taskmanager, waits for the 
loss to be noticed, starts a new tm and waits for the job to continue.
{code}
kill_random_taskmanager

[...]

sleep 65 # this is a little longer than the heartbeat timeout so that the TM is 
gone

start_and_wait_for_tm
{code}

Instead of waiting for a fixed amount of time that is tied to some config value 
we should wait for a specific event, like the job being canceled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9673) Improve State efficiency of bounded OVER window operators

2018-06-27 Thread Fabian Hueske (JIRA)
Fabian Hueske created FLINK-9673:


 Summary: Improve State efficiency of bounded OVER window operators
 Key: FLINK-9673
 URL: https://issues.apache.org/jira/browse/FLINK-9673
 Project: Flink
  Issue Type: Improvement
  Components: Table API & SQL
Reporter: Fabian Hueske


Currently, the implementations of bounded OVER window aggregations store the 
complete input for the bound interval. For example for the query:

{code}
SELECT user_id, count(action) OVER (PARTITION BY user_id ORDER BY rowtime RANGE INTERVAL '14' DAY PRECEDING) action_count, rowtime
FROM (SELECT rowtime, user_id, action, val1, val2, val3, val4 FROM user)
{code}

The whole records with schema {{(rowtime, user_id, action, val1, val2, val3, 
val4)}} are stored for 14 days so that they can be retracted from the 
accumulators after 14 days.

However, it would be sufficient to only store those fields that are required 
for the aggregations, i.e., {{action}} in the example above. All other fields 
could be set to {{null}}, which would significantly reduce the amount of data that 
needs to be stored in state.
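
A sketch of the idea (the helper and the index handling are illustrative, not the 
operator's actual code):
{code:java}
// org.apache.flink.types.Row: keep only the fields the aggregates read before
// the row goes into the retraction state; every other position stays null.
Row projectForState(Row input, int[] requiredFieldIndexes) {
    Row slim = new Row(input.getArity());
    for (int i : requiredFieldIndexes) {
        slim.setField(i, input.getField(i));
    }
    return slim;
}
{code}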





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Consuming data from dynamoDB streams to flink

2018-06-27 Thread Fabian Hueske
Hi Ying,

I'm not aware of any effort for this issue.
You could check with the assigned contributor in Jira if there is some
previous work.

Best, Fabian

2018-06-26 9:46 GMT+02:00 Ying Xu :

> Hello Flink dev:
>
> We have a number of use cases which involves pulling data from DynamoDB
> streams into Flink.
>
> Given that this issue is tracked by Flink-4582
> . we would like to check
> if any prior work has been completed by the community.   We are also very
> interested in contributing to this effort.  Currently, we have a high-level
> proposal which is based on extending the existing FlinkKinesisConsumer and
> making it work with DynamoDB streams (via integrating with the
> AmazonDynamoDBStreams API).
>
> Any suggestion is welcome. Thank you very much.
>
>
> -
> Ying
>


[jira] [Created] (FLINK-9672) Fail fatally if we cannot submit job on added JobGraph signal

2018-06-27 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-9672:


 Summary: Fail fatally if we cannot submit job on added JobGraph 
signal
 Key: FLINK-9672
 URL: https://issues.apache.org/jira/browse/FLINK-9672
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination
Affects Versions: 1.5.0, 1.6.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.6.0, 1.5.1


The {{SubmittedJobGraphStore}} signals when new {{JobGraphs}} are added. If 
this happens, then the leader should recover this job and submit it. If the 
recovery/submission should fail for some reason, then we should fail fatally to 
restart the process which will then try to recover the jobs again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9671) Add configuration to enable task manager isolation.

2018-06-27 Thread Renjie Liu (JIRA)
Renjie Liu created FLINK-9671:
-

 Summary: Add configuration to enable task manager isolation.
 Key: FLINK-9671
 URL: https://issues.apache.org/jira/browse/FLINK-9671
 Project: Flink
  Issue Type: New Feature
  Components: Distributed Coordination, Scheduler
Affects Versions: 1.5.0
Reporter: Renjie Liu
Assignee: Renjie Liu
 Fix For: 1.5.1






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9670) Introduce slot manager factory

2018-06-27 Thread Renjie Liu (JIRA)
Renjie Liu created FLINK-9670:
-

 Summary: Introduce slot manager factory
 Key: FLINK-9670
 URL: https://issues.apache.org/jira/browse/FLINK-9670
 Project: Flink
  Issue Type: New Feature
  Components: Distributed Coordination, Scheduler
Affects Versions: 1.5.0
Reporter: Renjie Liu
Assignee: Renjie Liu
 Fix For: 1.5.1






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9669) Introduce task manager assignment store

2018-06-27 Thread Renjie Liu (JIRA)
Renjie Liu created FLINK-9669:
-

 Summary: Introduce task manager assignment store
 Key: FLINK-9669
 URL: https://issues.apache.org/jira/browse/FLINK-9669
 Project: Flink
  Issue Type: New Feature
  Components: Distributed Coordination, Scheduler
Affects Versions: 1.5.0
Reporter: Renjie Liu
Assignee: Renjie Liu
 Fix For: 1.5.1






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9668) In some cases Trigger.onProcessingTime doesn't execute

2018-06-27 Thread qi quan (JIRA)
qi quan created FLINK-9668:
--

 Summary: In some cases Trigger.onProcessingTime doesn't execute
 Key: FLINK-9668
 URL: https://issues.apache.org/jira/browse/FLINK-9668
 Project: Flink
  Issue Type: Bug
Reporter: qi quan


For example, I would like to have a statistics window of one day, but output the 
result every minute. So I implemented my Trigger like this:
onElement: if the value state has not stored a next fire time yet, register one.
onProcessingTime: register the next fire time (time + 1 min), update the value 
state, and return FIRE_AND_PURGE.
(The amount of data in one day is too large; I don't want to keep such a large 
window state.)
{code:java}
public class PayAmountTrigger extends Trigger<Tuple2, TimeWindow> {
private static final Logger LOGGER = LoggerFactory.getLogger(PayAmountTrigger.class);
private static final Long PERIOD = 1000L * 5;
ValueStateDescriptor<Long> stateDesc = new ValueStateDescriptor<>("fire-time", LongSerializer.INSTANCE);

@Override
public TriggerResult onElement(Tuple2 tuple2, long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
ValueState<Long> firstTimeState = triggerContext.getPartitionedState(stateDesc);
long time = triggerContext.getCurrentProcessingTime();
if (firstTimeState.value() == null) {
long start = time - (time % PERIOD);
long nextFireTimestamp = start + PERIOD;
triggerContext.registerProcessingTimeTimer(nextFireTimestamp);
firstTimeState.update(nextFireTimestamp);
return TriggerResult.CONTINUE;
}
return TriggerResult.CONTINUE;
}


@Override
public TriggerResult onProcessingTime(long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
ValueState<Long> state = triggerContext.getPartitionedState(stateDesc);
if (state.value().equals(l)) {
state.clear();
state.update(l + PERIOD);
triggerContext.registerProcessingTimeTimer(l + PERIOD);
return TriggerResult.FIRE_AND_PURGE;
}
return TriggerResult.CONTINUE;
}

@Override
public TriggerResult onEventTime(long l, TimeWindow timeWindow, 
TriggerContext triggerContext) throws Exception {
return TriggerResult.CONTINUE;
}

@Override
public void clear(TimeWindow timeWindow, TriggerContext triggerContext) 
throws Exception {
System.out.println("PayAmountTrigger_clear");
ValueState<Long> firstTimeState = triggerContext.getPartitionedState(stateDesc);
long timestamp = firstTimeState.value();
triggerContext.deleteProcessingTimeTimer(timestamp);
firstTimeState.clear();
}
}{code}
Then I found out that if no data arrives in that minute, onProcessingTime will 
not be executed and the trigger time is missed forever.
Then I dug through the code and found this in WindowOperator.onProcessingTime:
{code:java}
ACC contents = null;
if (windowState != null) {
   contents = windowState.get();
}

if (contents != null) {
   TriggerResult triggerResult = 
triggerContext.onProcessingTime(timer.getTimestamp());
   if (triggerResult.isFire()) {
  emitWindowContents(triggerContext.window, contents);
   }
   if (triggerResult.isPurge()) {
  windowState.clear();
   }
}{code}
This means that if no data arrives in that minute and I have also purged the window 
data, triggerContext.onProcessingTime will never be executed. I think this is a 
bug in Flink.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Flink-9407] Question about proposed ORC Sink !

2018-06-27 Thread Fabian Hueske
Hi Sagar,

That's more a question for the ORC community, but AFAIK, the top-level type
is always a struct because it needs to wrap the fields, e.g.,
struct(name:string, age:int)

Best, Fabian

2018-06-26 22:38 GMT+02:00 sagar loke :

> @zhangminglei,
>
> Question about the schema for ORC format:
>
> 1. Does it always need to be of a complex type like "struct<...>" ?
>
> 2. Or can it be created with individual data types directly ?
> eg. "name:string, age:int" ?
>
>
> Thanks,
> Sagar
>
> On Fri, Jun 22, 2018 at 11:56 PM, zhangminglei <18717838...@163.com>
> wrote:
>
>> Yes, it should be exit. Thanks to Ted Yu. Very exactly!
>>
>> Cheers
>> Zhangminglei
>>
>> On Jun 23, 2018, at 12:40 PM, Ted Yu  wrote:
>>
>> For #1, the word exist should be exit, right ?
>> Thanks
>>
>>  Original message 
>> From: zhangminglei <18717838...@163.com>
>> Date: 6/23/18 10:12 AM (GMT+08:00)
>> To: sagar loke 
>> Cc: dev , user 
>> Subject: Re: [Flink-9407] Question about proposed ORC Sink !
>>
>> Hi, Sagar.
>>
>> 1. It solves the issue partially meaning files which have finished
>> checkpointing don't show .pending status but the files which were in
>> progress
>> when the program exists are still in .pending state.
>>
>>
>> Ans:
>>
>> Yea, Make the program exists and in that time if a checkpoint does not
>> finished will lead the status keeps in .pending state then. Under the
>> normal circumstances, the programs that running in the production env will
>> never be stoped or existed if everything is fine.
>>
>> 2. Ideally, writer should work with default settings correct ? Meaning we
>> don't have to explicitly set these parameters to make it work.
>> Is this assumption correct ?
>>
>>
>> Ans:
>>
>> Yes. Writer should work with default settings correct.
>> Yes. We do not have to explicitly set these parameters to make it work.
>> Yes. Assumption correct indeed.
>>
>> However, you know, flink is a real time streaming framework, so under
>> normal circumstances,you don't really go to use the default settings when
>> it comes to a specific business. Especially together work with *offline
>> end*(Like hadoop mapreduce). In this case, you need to tell the offline
>> end when time a bucket is close and when time the data for the specify
>> bucket is ready. So, you can take a look on https://issues.apache.org/j
>> ira/browse/FLINK-9609.
>>
>> Cheers
>> Zhangminglei
>>
>>
>> On Jun 23, 2018, at 8:23 AM, sagar loke  wrote:
>>
>> Hi Zhangminglei,
>>
>> Thanks for the reply.
>>
>> 1. It solves the issue partially meaning files which have finished
>> checkpointing don't show .pending status but the files which were in
>> progress
>> when the program exists are still in .pending state.
>>
>> 2. Ideally, writer should work with default settings correct ? Meaning we
>> don't have to explicitly set these parameters to make it work.
>> Is this assumption correct ?
>>
>> Thanks,
>> Sagar
>>
>> On Fri, Jun 22, 2018 at 3:25 AM, zhangminglei <18717838...@163.com>
>> wrote:
>>
>>> Hi, Sagar. Please use the below code and you will find the part files
>>> status from _part-0-107.in-progress   to _part-0-107.pending and
>>> finally to part-0-107. [For example], you need to run the program for a
>>> while. However, we need set some parameters, like the following. Moreover,
>>> *enableCheckpointing* IS also needed. I know why you always see the
>>> *.pending* file since the below parameters default value is 60 seconds
>>> even though you set the enableCheckpoint. So, that is why you can not see
>>> the finished file status until 60 seconds passed.
>>>
>>> Attached is the ending on my end, and you will see what you want!
>>>
>>> Please let me know if you still have the problem.
>>>
>>> Cheers
>>> Zhangminglei
>>>
>>> setInactiveBucketCheckInterval(2000)
>>> .setInactiveBucketThreshold(2000);
>>>
>>>
>>> public class TestOrc {
>>>public static void main(String[] args) throws Exception {
>>>   StreamExecutionEnvironment env = 
>>> StreamExecutionEnvironment.getExecutionEnvironment();
>>>   env.setParallelism(1);
>>>   env.enableCheckpointing(1000);
>>>   env.setStateBackend(new MemoryStateBackend());
>>>
>>>   String orcSchemaString = 
>>> "struct";
>>>   String path = "hdfs://10.199.196.0:9000/data/hive/man";
>>>
>>>   BucketingSink bucketingSink = new BucketingSink<>(path);
>>>
>>>   bucketingSink
>>>  .setWriter(new OrcFileWriter<>(orcSchemaString))
>>>  .setInactiveBucketCheckInterval(2000)
>>>  .setInactiveBucketThreshold(2000);
>>>
>>>   DataStream dataStream = env.addSource(new ManGenerator());
>>>
>>>   dataStream.addSink(bucketingSink);
>>>
>>>   env.execute();
>>>}
>>>
>>>public static class ManGenerator implements SourceFunction {
>>>
>>>   @Override
>>>   public void run(SourceContext ctx) throws Exception {
>>>  for (int i = 0; i < 2147483000; i++) {
>>> Row row = new Row(3);
>>> row.setField(0, "Sagar");
>>> 

[jira] [Created] (FLINK-9667) Scala docs are only generated for flink-scala

2018-06-27 Thread Chesnay Schepler (JIRA)
Chesnay Schepler created FLINK-9667:
---

 Summary: Scala docs are only generated for flink-scala
 Key: FLINK-9667
 URL: https://issues.apache.org/jira/browse/FLINK-9667
 Project: Flink
  Issue Type: Bug
  Components: CEP, Documentation, Gelly, Machine Learning Library, 
Scala API
Affects Versions: 1.4.2, 1.5.0, 1.6.0
Reporter: Chesnay Schepler


The documentation buildbot currently only generates the scala docs for 
{{flink-scala}}, and completely ignores other modules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9666) short-circuit logic should be used in boolean contexts

2018-06-27 Thread lamber-ken (JIRA)
lamber-ken created FLINK-9666:
-

 Summary: short-circuit logic should be used in boolean contexts
 Key: FLINK-9666
 URL: https://issues.apache.org/jira/browse/FLINK-9666
 Project: Flink
  Issue Type: Improvement
  Components: Core, DataStream API
Affects Versions: 1.5.0
Reporter: lamber-ken
 Fix For: 1.6.0


short-circuit logic should be used in boolean contexts
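
For example (illustrative, not one of the actual occurrences in the code base):
{code:java}
List<String> list = null;
boolean bad  = list != null & !list.isEmpty();  // '&' evaluates both operands -> NPE here
boolean good = list != null && !list.isEmpty(); // '&&' skips the right operand when the left is false
{code}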



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)