Hi Gregory,
What was the cause of your problem? It would be great if you could share your
experience, which I think will definitely help others.
On Thu, Jun 28, 2018 at 11:30 AM, Gregory Fee wrote:
> Yep, it was definitely a watermarking issue. I have that sorted out now.
> Thanks!
>
> On Wed, Jun
Yep, it was definitely a watermarking issue. I have that sorted out now.
Thanks!
On Wed, Jun 27, 2018 at 6:49 PM, Hequn Cheng wrote:
> Hi Gregory,
>
> As you are using a rowtime OVER window, it is probably a watermark
> problem. The window only emits output when the watermark makes progress. You can
Hi Osh,
As far as I know, currently one DataSet source cannot be consumed by several
different vertices, and the API does not let you construct the topology you
describe.
I think your approach of merging the different reduce functions into one UDF is feasible.
Maybe someone has a better solution. :)
zhijiang
Hi, Sihua & Aitozi
I would like to add more here. As @Sihua said, we need to query the state
frequently. If you use Redis to store these states, it will consume a lot of
your Redis resources, so you can put a Bloom filter in front of Redis.
If a pv is told to exist by the Bloom filter,
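The Bloom-filter-in-front-of-Redis idea can be sketched in plain Python (a minimal illustration of the pattern, not Flink or production code; the dict standing in for Redis and all names here are my own):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""
    def __init__(self, size=8192, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        # derive several bit positions from salted md5 digests
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def seen_before(key, bloom, store):
    """Only hit the expensive store when the filter says 'maybe seen'."""
    if bloom.might_contain(key) and key in store:
        return True                    # confirmed duplicate pv
    bloom.add(key)                     # first sighting: remember it
    store[key] = True
    return False

# `store` is a plain dict standing in for the Redis instance
store, bloom = {}, BloomFilter()
first = seen_before("user-1", bloom, store)    # new visitor
repeat = seen_before("user-1", bloom, store)   # duplicate
```

The point of the pattern is that a definite "not seen" answer from the filter avoids a Redis round trip entirely; only "maybe seen" answers need to be confirmed against the real store.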
Hi Gregory,
As you are using a rowtime OVER window, it is probably a watermark
problem. The window only emits output when the watermark makes progress. You can
use processing time (instead of row time) to verify this assumption. Also,
make sure there is data in each of your source partitions, the
Thanks @zhangminglei and @Fabian for confirming.
I also looked at the ORC parsing code, and it seems that using the struct type
is mandatory for now.
Thanks,
Sagar
On Wed, Jun 27, 2018 at 12:59 AM, Fabian Hueske wrote:
> Hi Sagar,
>
> That's more a question for the ORC community, but AFAIK, the
Thanks for your answers! Yes, it was based on watermarks.
Fabian, the state does indeed grow quite a bit in my scenario. I've
observed it in the range of 5 GB. That doesn't seem to be an issue in itself.
However, in my scenario I'm loading a lot of data from a historic store
that is only partitioned
Hi,
I am new to Flink, and I'd like to first use it to perform some in-memory
aggregation in batch mode (in a few months this will be migrated to
permanent streaming, hence the choice of Flink).
For this, I can successfully create the complex key that I require using
KeySelector & returning a
For the above two non-window approaches, the second one achieves a better
performance. => For the above two non-window approaches, the second one
achieves a better performance in most cases, especially when there are many
duplicate rows.
On Thu, Jun 28, 2018 at 12:25 AM, Hequn Cheng wrote:
> Hi
Hi aitozi,
1> CountDistinct
Currently (Flink 1.5), CountDistinct is supported in SQL only under windows,
as RongRong described.
There are ways to implement a non-window CountDistinct, for example: a) you
can write a CountDistinct UDAF using MapView, or b) use two groupBys to
achieve it. The first
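The two-groupBy trick (an inner GROUP BY on the key plus the distinct column to deduplicate, then an outer COUNT per key) can be simulated outside Flink in a few lines of Python; this is only a sketch of the idea, with names of my own choosing, not Flink code:

```python
from collections import defaultdict

def count_distinct_two_phase(rows):
    """rows: iterable of (group_key, value) pairs.
    Phase 1 mimics GROUP BY key, value (deduplication);
    phase 2 mimics GROUP BY key with COUNT(*) over the deduplicated rows."""
    deduplicated = set(rows)            # inner groupBy: one row per (key, value)
    counts = defaultdict(int)
    for key, _value in deduplicated:    # outer groupBy: count rows per key
        counts[key] += 1
    return dict(counts)

result = count_distinct_two_phase(
    [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 3)]
)
```

In the streaming setting the inner aggregation is what keeps the running "already seen" state per key, which is why its state can grow without a retention time.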
Hi,
Stream distinct accumulation is actually supported in the SQL API [1]. The
syntax is pretty much identical to the batch case. A simple example using a
tumbling window would be:
> SELECT COUNT(DISTINCT col)
> FROM t
> GROUP BY TUMBLE(rowtime, INTERVAL '1' MINUTE)
I haven't added the support but
Hi aitozi,
I think it can be implemented with a window or without one, but it cannot be
implemented without keyBy(). A general approach to implement this is as follows.
{code}
process(Records records) {
    for (Record record : records) {
        if (!isFilter(record)) {
            agg(record);
        }
    }
}
{code}
Forward shiming mail to Aitozi.
Aitozi
We are using HyperLogLog to count daily uv, but it only provides an approximate
value. I also tried count distinct in the Flink Table API without a window, but
you need to set the retention time.
However, the time resolution of this operator is 1 millisecond, so
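For readers unfamiliar with it, the HyperLogLog idea mentioned above can be sketched in self-contained Python (a toy version for illustration only, not the implementation in use here; all names are mine):

```python
import hashlib
import math

def _hash64(value):
    # stable 64-bit hash (built-in hash() is salted per process)
    return int.from_bytes(hashlib.md5(str(value).encode()).digest()[:8], "big")

class HyperLogLog:
    """Toy HyperLogLog: estimates distinct count in m = 2**p registers."""
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        h = _hash64(value)
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # small-range correction (linear counting) when many registers are empty
        if raw <= 2.5 * self.m and zeros:
            return int(self.m * math.log(self.m / zeros))
        return int(raw)

hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
estimate = hll.count()
```

The appeal for daily uv is that the state is a fixed 2**p registers regardless of cardinality; the price is exactly the approximation error mentioned above.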
To aitozi.
Cheers
Minglei
> On Jun 27, 2018, at 5:46 PM, shimin yang wrote:
>
> Aitozi
>
> We are using HyperLogLog to count daily uv, but it only provides an
> approximate value. I also tried count distinct in the Flink Table API without
> a window, but you need to set the retention time.
>
> However, the
Hi Elias,
Flink will remove these files once the job reaches a globally terminal state
(FINISHED, FAILED, CANCELLED). The files should only remain if the cluster
crashed. This should give you the opportunity to restart the cluster, which
can then recover the jobs that have not yet reached a
Aitozi
From my side, I do not think distinct is very easy to deal with, even when
working together with Kafka's exactly-once support.
For uv, we can use a Bloom filter to filter the pv stream and get the uv in the end.
Windows are usually used in aggregate operations, so I think all of these should be
realized by
As long as the inputs don't change, this should be correct.
On Tue, Jun 26, 2018 at 2:35 PM Hequn Cheng wrote:
> Hi
>
> I'm not sure about the answer. I have a feeling that if we only add new
> code below the old code(i.e., append new code after old code), the uid will
> not be changed.
>
> On
Hi,
The OVER window operator can only emit results when the watermark is
advanced, due to the SQL semantics, which define that all records with the same
timestamp need to be processed together.
Can you check whether the watermarks make sufficient progress?
Btw, did you observe state size or IO issues? The
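The "only emits when the watermark advances" behaviour can be illustrated with a toy event-time operator in Python (purely a didactic model of the semantics, not Flink's actual OVER-window operator; all names are mine):

```python
class EventTimeOperatorToy:
    """Buffers records and releases them only once the watermark has passed
    their timestamp, mimicking the OVER-window emission rule."""
    def __init__(self):
        self.pending = []    # (timestamp, value) not yet covered by a watermark
        self.emitted = []    # records released downstream, in timestamp order

    def process(self, timestamp, value):
        # nothing is emitted on arrival: output waits for the watermark
        self.pending.append((timestamp, value))

    def advance_watermark(self, watermark):
        ready = sorted(r for r in self.pending if r[0] <= watermark)
        self.pending = [r for r in self.pending if r[0] > watermark]
        self.emitted.extend(ready)

op = EventTimeOperatorToy()
op.process(1000, "a")
op.process(2000, "b")
stalled = list(op.emitted)      # nothing emitted yet: watermark has not moved
op.advance_watermark(1500)      # watermark passes t=1000 only
```

If the watermark never advances (for example, an idle source partition holding it back), `pending` grows and nothing is ever emitted, which matches the symptom described in this thread.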
Hi, community
I am using Flink to deal with some situations:
1. "distinct count" to calculate the uv/pv.
2. calculate the topN of the past 1 hour or 1 day.
Are these all realized with windows? Or is there a best practice for doing this?
3. And when dealing with distinct, if there is no need
Hi Elias,
Till (in CC) is familiar with Flink's HA implementation.
He might be able to answer your question.
Thanks,
Fabian
2018-06-25 23:24 GMT+02:00 Elias Levy :
> I noticed in one of our clusters that there are relatively old
> submittedJobGraph* and completedCheckpoint* files. I was
Hi,
You can just add a cast to StateBackend to get rid of the deprecation
warning:
env.setStateBackend((StateBackend) new
FsStateBackend("hdfs://myhdfsmachine:9000/flink/checkpoints"));
Best, Fabian
2018-06-27 5:47 GMT+02:00 Rong Rong :
> Hmm.
>
> If you have a wrapper function like this,
Hi Sagar,
That's more a question for the ORC community, but AFAIK, the top-level type
is always a struct because it needs to wrap the fields, e.g.,
struct(name:string, age:int)
Best, Fabian
2018-06-26 22:38 GMT+02:00 sagar loke :
> @zhangminglei,
>
> Question about the schema for ORC format:
>
Thanks for taking a look. I took the liberty of creating a pull request for
this.
https://github.com/apache/flink/pull/6211
It would be great if you guys could take a look at it and see if it makes
sense. I tried it out on our servers and it seems to do the job
On Tue, 26 Jun 2018 at 18:47,