Re: KeyedStream question

2018-04-04 Thread Amit Jain
Hi, the keyBy operation partitions the data on the given key and makes sure the same slot will get all future data belonging to the same key. In the default implementation, it can also map a subset of the keys in your DataStream to the same slot. Assuming you have a number of keys equal to the number of running slots, then you may
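The key-to-slot mapping Amit describes can be sketched in plain Java. This is a simplified model, not Flink's actual implementation (Flink hashes keys into key groups with a murmur hash before assigning them to operator subtasks); the class and method names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of keyBy partitioning: every key is deterministically
// mapped to one parallel slot, so a slot sees ALL records for its keys,
// but one slot may serve MANY keys when there are more keys than slots.
public class KeyBySketch {
    public static int slotFor(String key, int parallelism) {
        // Math.abs guards against negative hashCodes; result is stable per key
        return Math.abs(key.hashCode() % parallelism);
    }

    // Group a list of keys by the slot they would be routed to
    public static Map<Integer, List<String>> partition(List<String> keys, int parallelism) {
        Map<Integer, List<String>> slots = new HashMap<>();
        for (String k : keys) {
            slots.computeIfAbsent(slotFor(k, parallelism), s -> new ArrayList<>()).add(k);
        }
        return slots;
    }
}
```

With five keys and parallelism 2, at least two keys necessarily share a slot, which is the behavior the original question was observing.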

KeyedStream question

2018-04-04 Thread TechnoMage
I am new to Flink and trying to understand keyBy and KeyedStream. From the short doc description I expected it to partition the data such that the following flatMap would only see elements with the same key, and that events with different keys would be presented to different instances of

Re: Collect event which arrive after watermark

2018-04-04 Thread Fabian Hueske
Window operators drop late events by default. When they receive a late event, they have already computed and emitted a result. Since there is no good default behavior to handle a late event in this case, they are simply dropped. However, Flink offers multiple ways to explicitly handle late events
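The dropping rule Fabian mentions can be sketched as a single predicate. This is a simplified model under stated assumptions (Flink's actual check involves the window's max timestamp and cleanup time); the names are illustrative, not Flink API.

```java
// Sketch of a window operator's lateness rule: an element belonging to a
// window is dropped once the watermark has passed the end of that window
// plus the configured allowed lateness. Until then, a late element can
// still trigger an updated ("late firing") window result.
public class LatenessSketch {
    public static boolean isDropped(long windowEnd, long allowedLateness, long currentWatermark) {
        return currentWatermark > windowEnd + allowedLateness;
    }
}
```

With `allowedLateness == 0` (the default) any element arriving after the watermark passed its window end is discarded, which matches the behavior described above.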

Re: Graph Analytics on HBase With HGraphDB and Apache Flink Gelly

2018-04-04 Thread Jörn Franke
Have you checked the janusgraph source code? It also uses HBase as a storage backend: http://janusgraph.org/ It combines it with Elasticsearch for indexing. Maybe you can take inspiration from the architecture there. Generally, with HBase it depends a lot on how the data is written to regions, the order of

Re: Graph Analytics on HBase With HGraphDB and Apache Flink Gelly

2018-04-04 Thread santoshg
Restarting this thread since it is relevant to us. We are thinking of using HBase/Cassandra to store graph data and then load the data from there into Flink/Gelly. One of the issues we are concerned about is read performance. So far we have run our tests with data residing on HDFS and that worked

Re: Collect event which arrive after watermark

2018-04-04 Thread shishal
Thanks Fabian, my understanding was that a late event older than the watermark is dropped, so the ProcessFunction won't be called for late events. I guess my understanding was wrong, or is there something more to it? -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Bucketing Sink does not complete files, when source is from a collection

2018-04-04 Thread Fabian Hueske
Hi Josh, You are right, FLINK-2646 is related to the problem of non-finalized files. If we could distinguish the cases why close() is called, we could do a proper clean-up if the job terminated because all data was processed. Right now, the source and sink interfaces of the DataStream API are

Re: Getting runtime context from scalar and table functions

2018-04-04 Thread Fabian Hueske
Yes, metrics won't work for this use case. Before we had proper metrics support, accumulators were often used as a workaround. In general, the Table API & SQL in Flink are designed to keep all data in tables and not "leak" data on the side. Best, Fabian 2018-04-04 22:08 GMT+02:00 Darshan Singh

Re: Getting runtime context from scalar and table functions

2018-04-04 Thread Darshan Singh
I doubt the metrics will work, as I will need to get the String output. I will need to figure out something else. Basically, I will pass my function a string and will get some columns back, but if my string is something special, then I will need to get extra information. I am using the data set

Re: Getting runtime context from scalar and table functions

2018-04-04 Thread Fabian Hueske
Hi Darshan, Accumulators are not exposed to UDFs of the Table API / SQL. What's your use case for these? Would metrics do the job as well? Best, Fabian 2018-04-04 21:31 GMT+02:00 Darshan Singh : > Hi, > > I would like to use accumulators with table /scalar functions.

Re: Bucketing Sink does not complete files, when source is from a collection

2018-04-04 Thread joshlemer
Actually, sorry, I have found that this is most likely a manifestation of https://issues.apache.org/jira/browse/FLINK-2646 as discussed elsewhere on the mailing list. That is, in the second example ("fromCollection"), the entire stream ends before a checkpoint is made. Let's hope this is fixed some

Getting runtime context from scalar and table functions

2018-04-04 Thread Darshan Singh
Hi, I would like to use accumulators with table/scalar functions. However, I am not able to figure out how to get the runtime context from inside the scalar function's open() method. The only thing I see is the function context, which cannot provide the runtime context. I tried using

Bucketing Sink does not complete files, when source is from a collection

2018-04-04 Thread Josh Lemer
Hello, I was wondering if I could get some pointers on what I'm doing wrong here. I posted this on Stack Overflow, but I thought I'd also ask here. I'm trying to generate

Re: TaskManager deadlock on NetworkBufferPool

2018-04-04 Thread Amit Jain
+user@flink.apache.org On Wed, Apr 4, 2018 at 11:33 AM, Amit Jain wrote: > Hi, > > We are hitting TaskManager deadlock on NetworkBufferPool bug in Flink 1.3.2. > We have set of ETL's merge jobs for a number of tables and stuck with above > issue randomly daily. > > I'm

Re: Watermark Question on Failed Process

2018-04-04 Thread Chengzhi Zhao
Thanks Fabian! This is very helpful! Best, Chengzhi On Wed, Apr 4, 2018 at 9:02 AM, Fabian Hueske wrote: > Hi Chengzhi, > > You can access the current watermark from the Context object of a > ProcessFunction [1] and store it in operator state [2]. > In case of a restart,

Re: Side outputs never getting consumed

2018-04-04 Thread Julio Biason
Hey Timo, To be completely honest, I _think_ they are POJOs, although I use case classes (because I want our data to be immutable). I wrote some sample code, which basically reflects our pipeline:

Re: REST API "broken" on YARN because POST is not allowed via YARN proxy

2018-04-04 Thread Fabian Hueske
Hi Juho, Thanks for raising this point! I'll add Chesnay and Till to the thread who contributed to the REST API. Best, Fabian 2018-04-04 15:02 GMT+02:00 Juho Autio : > I just learned that Flink savepoints API was refactored to require using > HTTP POST. > > That's fine

Re: Collect event which arrive after watermark

2018-04-04 Thread Fabian Hueske
Hi, you can do that with a ProcessFunction [1]. The Context parameter of the ProcessFunction.processElement() method gives access to the current watermark and the timestamp of the current element. In case you don't just want to log the late data but want to send it to a different DataStream (and sink it
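The routing logic behind this ProcessFunction-based approach can be sketched in plain Java. In real Flink code this would be an `OutputTag` plus `ctx.output(...)` inside `processElement()`; the class here uses plain lists and hypothetical names purely to illustrate the decision being made per element.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative late-data routing: compare each element's timestamp with
// the current watermark and send late elements to a separate "side
// output" collection instead of silently dropping them.
public class SideOutputSketch {
    public final List<Long> mainOutput = new ArrayList<>();
    public final List<Long> lateOutput = new ArrayList<>();

    public void processElement(long timestamp, long currentWatermark) {
        if (timestamp <= currentWatermark) {
            // element arrived behind the watermark: route to the side output
            lateOutput.add(timestamp);
        } else {
            mainOutput.add(timestamp);
        }
    }
}
```

The side output can then be logged or sunk to external storage for the analysis the original question asked about.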

Re: cancel-with-savepoint: 404 Not Found

2018-04-04 Thread Fabian Hueske
Thanks for reporting this Juho. I've created FLINK-9130 [1] to address the issue. Best Fabian [1] https://issues.apache.org/jira/browse/FLINK-9130 2018-04-04 12:03 GMT+02:00 Juho Autio : > Thank you, it works! > > I would still expect this to be documented. > > If I

Re: Enriching DataStream using static DataSet in Flink streaming

2018-04-04 Thread Fabian Hueske
Hi, this type of application is not super well supported by Flink yet. The missing feature is on the roadmap and is called Side Inputs [1]. There are (at least) two alternatives, but both have some drawbacks: 1) Ingest the static data set as a regular DataStream, keyBy the static and the actual

Re: Temporary failure in name resolution

2018-04-04 Thread Fabian Hueske
Hi, The issue might be related to garbage collection pauses during which the TM JVM cannot communicate with the JM. The metrics contain stats for memory consumption [1] and GC activity [2] that can help to diagnose the problem. Best, Fabian [1]

Re: Updating Broadcast Variables

2018-04-04 Thread Fabian Hueske
Hi Pete, Broadcast variables are a feature of the DataSet API [1], i.e., available for batch processing. Broadcast variables are computed based on the complete input (which is possible because they are only available for bounded data sets and not for unbounded streams) and shared with all

Re: Watermark Question on Failed Process

2018-04-04 Thread Fabian Hueske
Hi Chengzhi, You can access the current watermark from the Context object of a ProcessFunction [1] and store it in operator state [2]. In case of a restart, the state will be restored with the watermark that was active when the checkpoint (or savepoint) was taken. Note, this won't be the last

REST API "broken" on YARN because POST is not allowed via YARN proxy

2018-04-04 Thread Juho Autio
I just learned that Flink savepoints API was refactored to require using HTTP POST. That's fine otherwise, but makes life harder when Flink is run on top of YARN. I've added example calls below to show how POST is declined by the hadoop-yarn-server-web-proxy*, which only supports GET and PUT.

Re: Multiple Async IO

2018-04-04 Thread Fabian Hueske
Hi Maxim, I think Ken's approach is a good idea. However, you would need to add a stateful operator to join the results of the individual queries if that is needed. In order to join the results, you would need a unique id on which you can keyBy() to collect all 20 records that originated from
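The buffering that such a stateful join operator performs can be sketched as follows. This is a minimal model, not Flink API: in a real job the buffer would be keyed state (e.g. a ListState per key), and the expected count (20 in the thread) would come from the job's design. All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of joining async lookup results: each result carries the unique
// id of the originating record; the operator buffers results per id and
// emits the joined list once all expected results have arrived.
public class AsyncJoinSketch {
    private final int expectedResults;
    private final Map<String, List<String>> buffer = new HashMap<>();

    public AsyncJoinSketch(int expectedResults) {
        this.expectedResults = expectedResults;
    }

    /** Returns the complete result list once all results for the id arrived, else null. */
    public List<String> collect(String id, String result) {
        List<String> results = buffer.computeIfAbsent(id, k -> new ArrayList<>());
        results.add(result);
        if (results.size() == expectedResults) {
            // emit and clean up state for this id so the buffer does not grow unboundedly
            return buffer.remove(id);
        }
        return null;
    }
}
```

Keying the stream by this id (via `keyBy()`) is what guarantees all partial results for one record reach the same operator instance.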

Re: SSL config on Kubernetes - Dynamic IP

2018-04-04 Thread Fabian Hueske
Thank you Edward and Christophe! 2018-03-29 17:55 GMT+02:00 Edward Alexander Rojas Clavijo < edward.roja...@gmail.com>: > Hi all, > > I did some tests based on the PR Christophe mentioned above and by making > a change on the NettyClient to use CanonicalHostName instead of > HostNameAddress to

Re: Anyway to read Cassandra as DataStream/DataSet in Flink?

2018-04-04 Thread Fabian Hueske
Hi James, The answer to your question depends on your use case. The AsyncIOFunction approach works if you have a DataStream that you would like to enrich with data in a Cassandra table but not if you would like to create a DataStream from a Cassandra table. The Flink code base contains a

Re: how to query the output of the scalar table function

2018-04-04 Thread Fabian Hueske
Hi Darshan, What you observe is the result of what's supposed to be an optimization. By fusing the two select() calls, we reduce the number of operators in the resulting plan (one MapFunction less). This optimization is only applied for ScalarFunctions but not for TableFunctions. With a better

Collect event which arrive after watermark

2018-04-04 Thread shishal
Hi Flink community members, I am new to Flink stream processing. I am using event-time processing and a keyed stream. Sorry if my question sounds silly, but is there a way to collect (or log) late events which arrive after the watermark? Somehow I need to gather these stats for further analysis.

Re: Record timestamp from kafka

2018-04-04 Thread Fabian Hueske
Hi Navneeth, Flink's KafkaConsumer automatically attaches Kafka's ingestion timestamp if you configure EventTime for an application [1]. Since Flink treats record timestamps as meta data, they are not directly accessible by most functions. You can implement a ProcessFunction [2] to access the

Re: Slow flink checkpoint

2018-04-04 Thread makeyang
The test is very promising: the time-sync part goes from a couple of seconds to a couple of milliseconds, a 1000x reduction (overall time is not saved, since the work just moves from sync to async). Are you guys interested in this change? -- Sent from:

Re: SideOutput Issue

2018-04-04 Thread Chesnay Schepler
Hi, which version of Flink are you using? Could you provide us with a reproducing example? I tried reproducing it based on the information you provided in the following code, but it runs fine for me: private static final OutputTag tag = new OutputTag("test"){}; public static void

Re: cancel-with-savepoint: 404 Not Found

2018-04-04 Thread Juho Autio
Thank you, it works! I would still expect this to be documented. If I understood correctly, the documentation is generated from these Javadocs:

Re: Restore from a savepoint is very slow

2018-04-04 Thread Dongwon Kim
It was due to too low parallelism. I increased the parallelism enough (actually set it to the total number of task slots on the cluster) and it made restoring from a savepoint much faster. This is somewhat related to the previous discussion I had with Robert and Aljoscha. Having a standalone

Re: Task Manager fault tolerance does not work

2018-04-04 Thread dhirajpraj
As suggested by Till, it works perfectly fine after increasing the no. of retries. Thanks people. -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Job restart hook

2018-04-04 Thread Kostas Kloudas
Hi Navneeth, I am sending the answer to the user mailing list so that we keep the discussion public. There may also be other users interested in the question. So the answer to the question is that you cannot restart from an externalized checkpoint with a different parallelism. To be able to

Enriching DataStream using static DataSet in Flink streaming

2018-04-04 Thread vijay kansal
Hi All, I am writing a Flink streaming program in which I need to enrich a DataStream of user events using a static data set (information base, IB). For example, let's say we have a static data set of buyers and an incoming clickstream of events; for each event we want to add a boolean
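The lookup at the heart of this enrichment can be sketched in plain Java. In a real Flink job the buyer set would live in operator state (e.g. loaded via a connected/broadcast stream, as discussed in the replies); the plain in-memory set here only illustrates the per-event flagging. All names are made up for illustration.

```java
import java.util.Set;

// Minimal model of stream enrichment against a static set: a set of
// buyer ids is loaded up-front, and each incoming click event is
// annotated with a boolean flag saying whether the user is a buyer.
public class EnrichmentSketch {
    private final Set<String> buyers;

    public EnrichmentSketch(Set<String> buyers) {
        this.buyers = buyers;
    }

    public String enrich(String userId, String event) {
        boolean isBuyer = buyers.contains(userId); // the boolean flag to attach
        return event + ",isBuyer=" + isBuyer;
    }
}
```

In streaming terms this corresponds to a map/flatMap whose open() (or broadcast state) holds the static set, so every parallel instance can perform the lookup locally.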

Re: Temporary failure in name resolution

2018-04-04 Thread miki haiat
Hi, I checked the code again to figure out where the problem could be. I just wondered if I'm implementing the Evictor correctly. Full code: https://gist.github.com/miko-code/6d7010505c3cb95be122364b29057237 public static class EsbTraceEvictor implements Evictor {