Re: Over Window Not Processing Messages

2018-06-27 Thread Hequn Cheng
Hi Gregory, What was the cause of your problem? It would be great if you could share your experience, which I think will definitely help others. On Thu, Jun 28, 2018 at 11:30 AM, Gregory Fee wrote: > Yep, it was definitely a watermarking issue. I have that sorted out now. > Thanks! > > On Wed, Jun

Re: Over Window Not Processing Messages

2018-06-27 Thread Gregory Fee
Yep, it was definitely a watermarking issue. I have that sorted out now. Thanks! On Wed, Jun 27, 2018 at 6:49 PM, Hequn Cheng wrote: > Hi Gregory, > > As you are using a rowtime over window, it is probably a watermark > problem. The window only outputs when the watermark makes progress. You can

Re: DataSet with Multiple reduce Actions

2018-06-27 Thread Zhijiang(wangzhijiang999)
Hi Osh, As far as I know, currently one DataSet source cannot be consumed by several different vertices, and the API does not let you construct the topology you are asking for. I think your approach of merging the different reduce functions into one UDF is feasible. Maybe someone has a better solution. :) zhijiang

Re: Streaming

2018-06-27 Thread zhangminglei
Hi, Sihua & Aitozi I would like to add more here. As @Sihua said, we need to query the state frequently. Assuming you use Redis to store these states, it will consume a lot of your Redis resources. So, you can put a Bloom filter in front of the Redis access. If a pv is reported as present by the Bloom filter,
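The Bloom-filter-in-front-of-Redis idea above can be sketched in a few lines. This is a minimal illustration of the principle only, not a production filter (a real job would typically use something like Guava's BloomFilter, and the Redis client is omitted entirely); the class name and hashing scheme are made up for the example. The key property is that a "no" answer is definitive (no false negatives), so the expensive state lookup can be skipped, while a "maybe" answer still requires the real check.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch (illustrative only). A clear bit at any of
// the k probe positions means the key was definitely never added, so the
// backing store (e.g. Redis) does not need to be queried for it.
class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derives the i-th probe position from the key's hash (toy scheme).
    private int index(String key, int i) {
        return Math.floorMod(key.hashCode() * 31 + i * 0x9E3779B9, size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(key, i));
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;  // definitely not seen: skip the Redis lookup
            }
        }
        return true;  // maybe seen: fall through to the real state lookup
    }
}
```

Sizing matters in practice: the false-positive rate depends on the bit-array size and hash count relative to the number of distinct keys, so the filter must be dimensioned for the expected daily cardinality.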

Re: Over Window Not Processing Messages

2018-06-27 Thread Hequn Cheng
Hi Gregory, As you are using a rowtime over window, it is probably a watermark problem. The window only outputs when the watermark makes progress. You can use processing time (instead of row time) to verify the assumption. Also, make sure there is data in each of your source partitions, the
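Hequn's point — that an event-time over window only emits when the watermark advances — can be illustrated with a small self-contained simulation. This is a sketch of the general principle, not Flink's actual operator code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of an event-time operator: rows are buffered by timestamp and can
// only be emitted once a watermark passes them. If watermarks never make
// progress, nothing is ever emitted, no matter how many elements arrive.
class WatermarkBufferSketch {
    private final TreeMap<Long, List<String>> buffer = new TreeMap<>();

    void onElement(long timestamp, String row) {
        buffer.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(row);
    }

    // Emits (and drops from the buffer) every row with timestamp <= watermark.
    List<String> onWatermark(long watermark) {
        List<String> emitted = new ArrayList<>();
        NavigableMap<Long, List<String>> ready = buffer.headMap(watermark, true);
        for (List<String> rows : ready.values()) {
            emitted.addAll(rows);
        }
        ready.clear();  // headMap is a view, so this prunes the backing buffer
        return emitted;
    }
}
```

This is also why an idle source partition stalls the output: the operator's watermark is the minimum across inputs, so one partition that never advances holds everything back.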

Re: [Flink-9407] Question about proposed ORC Sink !

2018-06-27 Thread sagar loke
Thanks @zhangminglei and @Fabian for confirming. I also looked at the ORC parsing code, and it seems that using the struct type is mandatory for now. Thanks, Sagar On Wed, Jun 27, 2018 at 12:59 AM, Fabian Hueske wrote: > Hi Sagar, > > That's more a question for the ORC community, but AFAIK, the

Re: Over Window Not Processing Messages

2018-06-27 Thread Gregory Fee
Thanks for your answers! Yes, it was based on watermarks. Fabian, the state does indeed grow quite a bit in my scenario. I've observed in the range of 5GB. That doesn't seem to be an issue in itself. However, in my scenario I'm loading a lot of data from a historic store that is only partitioned

DataSet with Multiple reduce Actions

2018-06-27 Thread Osian Hedd Hughes
Hi, I am new to Flink, and I'd like to first use it to perform some in-memory aggregation in batch mode (in some months this will be migrated to permanent streaming, hence the choice of Flink). For this, I can successfully create the complex key that I require using a KeySelector and returning a

Re: Streaming

2018-06-27 Thread Hequn Cheng
For the above two non-window approaches, the second one achieves a better performance. => For the above two non-window approaches, the second one achieves better performance in most cases, especially when there are many duplicate rows. On Thu, Jun 28, 2018 at 12:25 AM, Hequn Cheng wrote: > Hi

Re: Streaming

2018-06-27 Thread Hequn Cheng
Hi aitozi, 1> CountDistinct Currently (Flink 1.5), CountDistinct is supported in SQL only with windows, as Rong Rong described. There are ways to implement non-window CountDistinct, for example: a) you can write a CountDistinct UDAF using MapView, or b) use two groupBys to achieve it. The first
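The two-groupBy approach Hequn mentions can be sketched in SQL. The table and column names below are hypothetical, chosen only to make the shape of the query concrete; the excerpt above is cut off before Hequn's own example.

```sql
-- Hypothetical table t(user_id, item): non-window distinct count per user.
-- The inner GROUP BY deduplicates (user_id, item) pairs; the outer GROUP BY
-- then counts rows, which equals COUNT(DISTINCT item) per user_id.
SELECT user_id, COUNT(item) AS distinct_items
FROM (
  SELECT user_id, item
  FROM t
  GROUP BY user_id, item
)
GROUP BY user_id
```

Note that on an unbounded stream both levels keep state indefinitely, so a state retention time is usually configured, as discussed elsewhere in this thread.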

Re: Streaming

2018-06-27 Thread Rong Rong
Hi, the streaming distinct accumulator is actually supported in the SQL API [1]. The syntax is pretty much identical to the batch case. A simple example using a tumbling window would be: > SELECT COUNT(DISTINCT col) > FROM t > GROUP BY TUMBLE(rowtime, INTERVAL '1' MINUTE) I haven't added the support but

Re: Streaming

2018-06-27 Thread sihua zhou
Hi aitozi, I think it can be implemented with a window or without one, but it cannot be implemented without keyBy(). A general approach to implement this is as follows. {code} process(Record records) { for (Record record : records) { if (!isFilter(record)) { agg(record);
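Sihua's truncated sketch can be fleshed out as a self-contained simulation of the filter-then-aggregate pattern. This is plain Java, not Flink operator code, and all names are hypothetical; the per-key map stands in for the keyed state that keyBy() makes available in a real job, with a distinct count as the aggregate.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java sketch of filter-then-aggregate per key: duplicates are
// filtered via a per-key "seen" set (the isFilter() role in the sketch
// above), and the aggregate here is simply the distinct count.
class KeyedDistinctCountSketch {
    private final Map<String, Set<String>> seenPerKey = new HashMap<>();

    // Processes one record and returns the updated distinct count for its key.
    long process(String key, String value) {
        Set<String> seen = seenPerKey.computeIfAbsent(key, k -> new HashSet<>());
        seen.add(value);  // no-op for duplicates, i.e. the dedup filter
        return seen.size();
    }
}
```

In a real streaming job the "seen" set would live in Flink keyed state (or, as discussed later in this thread, be replaced by a Bloom filter or HyperLogLog when exactness can be traded for bounded memory).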

Re: Streaming

2018-06-27 Thread zhangminglei
Forwarding shimin's mail to Aitozi. Aitozi, we are using HyperLogLog to count daily uv, but it only provides an approximate value. I also tried count distinct in a Flink table without a window, but needed to set the retention time. However, the time resolution of this operator is 1 millisecond, so

Re: Streaming

2018-06-27 Thread zhangminglei
To aitozi. Cheers Minglei > On Jun 27, 2018, at 5:46 PM, shimin yang wrote: > > Aitozi > > We are using HyperLogLog to count daily uv, but it only provides an > approximate value. I also tried count distinct in a Flink table without a > window, but needed to set the retention time. > > However, the

Re: high-availability.storageDir clean up?

2018-06-27 Thread Till Rohrmann
Hi Elias, Flink will remove these files once the job reaches a globally terminal state (FINISHED, FAILED, CANCELLED). The files should only remain if the cluster crashed. This should give you the opportunity to restart the cluster, which can then recover the jobs that have not yet reached a

Re: Streaming

2018-06-27 Thread zhangminglei
Aitozi, from my side, I do not think distinct is very easy to deal with, even when working together with Kafka's exactly-once support. For uv, we can use a Bloom filter to filter pv to get uv in the end. A window is usually used in an aggregate operation, so I think all should be realized by

Re: Restore state from save point with add new flink sql

2018-06-27 Thread Till Rohrmann
As long as the inputs don't change, this should be correct. On Tue, Jun 26, 2018 at 2:35 PM Hequn Cheng wrote: > Hi > > I'm not sure about the answer. I have a feeling that if we only add new > code below the old code(i.e., append new code after old code), the uid will > not be changed. > > On

Re: Over Window Not Processing Messages

2018-06-27 Thread Fabian Hueske
Hi, The OVER window operator can only emit results when the watermark advances, due to the SQL semantics, which require that all records with the same timestamp be processed together. Can you check whether the watermarks make sufficient progress? Btw, did you observe state size or IO issues? The

Streaming

2018-06-27 Thread aitozi
Hi, community I am using Flink to deal with some situations. 1. "distinct count" to calculate the uv/pv. 2. calculating the topN of the past 1 hour or 1 day. Are these all realized with windows, or is there a best practice for doing this? 3. And when dealing with the distinct, if there is no need

Re: high-availability.storageDir clean up?

2018-06-27 Thread Fabian Hueske
Hi Elias, Till (in CC) is familiar with Flink's HA implementation. He might be able to answer your question. Thanks, Fabian 2018-06-25 23:24 GMT+02:00 Elias Levy : > I noticed in one of our cluster that they are relatively old > submittedJobGraph* and completedCheckpoint* files. I was

Re: env.setStateBackend deprecated in 1.5 for FsStateBackend

2018-06-27 Thread Fabian Hueske
Hi, You can just add a cast to StateBackend to get rid of the deprecation warning: env.setStateBackend((StateBackend) new FsStateBackend("hdfs://myhdfsmachine:9000/flink/checkpoints")); Best, Fabian 2018-06-27 5:47 GMT+02:00 Rong Rong : > Hmm. > > If you have a wrapper function like this,

Re: [Flink-9407] Question about proposed ORC Sink !

2018-06-27 Thread Fabian Hueske
Hi Sagar, That's more a question for the ORC community, but AFAIK, the top-level type is always a struct because it needs to wrap the fields, e.g., struct(name:string, age:int) Best, Fabian 2018-06-26 22:38 GMT+02:00 sagar loke : > @zhangminglei, > > Question about the schema for ORC format: >

Re: TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?

2018-06-27 Thread jelmer
Thanks for taking a look. I took the liberty of creating a pull request for this. https://github.com/apache/flink/pull/6211 It would be great if you guys could take a look at it and see if it makes sense. I tried it out on our servers and it seems to do the job On Tue, 26 Jun 2018 at 18:47,