Re: 1.4.3 release/roadmap

2018-04-20 Thread Fabian Hueske
Hi Daniel, The discussion for releasing Flink 1.4.3 hasn't been started yet. The community is still working on the 1.5.0 release, but AFAIK there are no blockers for 1.4.3. Development and release discussions take place on the dev@f.a.o list. Would you mind kicking off a discussion there?

Re: gonna need more logs when task manager is shutting down

2018-04-20 Thread Fabian Hueske
Hi Makeyang, Would you mind opening a JIRA issue [1] for your improvement suggestion? It would be good to add the Flink version that you are running. Thanks, Fabian [1] https://issues.apache.org/jira/projects/FLINK 2018-04-20 6:21 GMT+02:00 makeyang : > one of my

Re: debug for Flink

2018-04-19 Thread Fabian Hueske
Hi, You can run Flink applications locally in your IDE and debug a Flink program just like a regular Java/Scala application. Best, Fabian 2018-04-19 0:53 GMT+02:00 Qian Ye : > Hi > > I’m wondering if new debugging methods/tools are urgent for Flink > development. I know
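
For illustration, a minimal sketch of a job that can be started and stepped through directly from the IDE (the job itself is made up); getExecutionEnvironment() returns a local, in-JVM environment when the program is launched from the IDE, so breakpoints in user functions are hit like in any Java program:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LocalDebugJob {
        public static void main(String[] args) throws Exception {
            // runs in a local, in-process mini cluster when started from the IDE
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements(1, 2, 3, 4)
               .filter(i -> i % 2 == 0)   // set a breakpoint here like in any Java application
               .print();

            env.execute("local debug job");
        }
    }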

Re: Efficiency with different approaches of aggregation in Flink

2018-04-19 Thread Fabian Hueske
Hi Teena, I'd go with approach 2. The performance difference compared to 1 shouldn't be significant, but it is much easier to implement, IMO. Avoid approach 3: it will be much slower because it requires at least one call to an external data store, and it is more difficult to implement. Flink's

Re: State-machine-based search logic in Flink ?

2018-04-18 Thread Fabian Hueske
> > > I did mean like “finding series of consecutive events”, as it was > described in [2]. > > > > Are these features already in Flink and how well they are documented ? > > > > Can I use Scala or only Java ? > > > > I would like some example codes,

Re: why doesn't the over-window-aggregation sort the element(considering watermark) before processing?

2018-04-18 Thread Fabian Hueske
The over window operates on an unbounded stream of data. Hence, it is not possible to sort the complete stream. Instead, we can sort ranges of the stream. Flink uses watermarks to define these ranges. The operator processes in timestamp order the records that are not late, i.e., that have timestamps

Re: assign time attribute after first window group when using Flink SQL

2018-04-18 Thread Fabian Hueske
most close preceding event in w2. > If condition meets, I'd like to use the price value and timestamp in that > event to get one matching event from another raw stream (r2). > > CEP sounds to be a good idea. But I need to refer to event in other stream > (r2) in current pattern condition

Re: State-machine-based search logic in Flink ?

2018-04-17 Thread Fabian Hueske
Hi Esa, What do you mean by "individual searches in the Table API"? There is some work (a pending PR [1]) to integrate the MATCH_RECOGNIZE clause (SQL 2016) [2] into Flink's SQL which basically adds a SQL syntax for the CEP library. Best, Fabian [1] https://github.com/apache/flink/pull/4502 [2]

Re: Question about parallelism

2018-04-16 Thread Fabian Hueske
retest without the explicit parallelism > when I get a chance. > > Michael > > > On Apr 16, 2018, at 2:05 AM, Fabian Hueske <fhue...@gmail.com> wrote: > > (re-adding user mailing list) > > A non-serializable function object should cause the job to fail, but not > to ignor

Re: Annotation in UDF dropped

2018-04-16 Thread Fabian Hueske
Hi Viktor, Flink does not modify user code. It distributes the job JAR file to the cluster and serializes the function objects using Java serialization to ship them to the worker nodes where they are deserialized. What type of annotation gets dropped? Can you show us a small example of the code?

Re: Kafka consumer to sync topics by event time?

2018-04-16 Thread Fabian Hueske
; > On Mon, Apr 16, 2018 at 3:14 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Fully agree Juho! >> >> Do you want to contribute the docs fix? >> If yes, we should update FLINK-5479 to make sure that the warning is >> removed once the bug is fixed. >&g

Re: User-defined aggregation function and parallelism

2018-04-16 Thread Fabian Hueske
Hi Bill, Flink's built-in aggregation functions are implemented against the same interface as UDAGGs and are applied in parallel. The performance depends of course on the implementation of the UDAGG. For example, you should try to keep the size of the accumulator as small as possible because it

Re: How to rebalance a table without converting to dataset

2018-04-16 Thread Fabian Hueske
Hi Darshan, You are right. There's currently no rebalancing operation on the Table API. I see that this might be a good feature, but I'm not sure how easy it would be to integrate because we need to pass it through the Calcite optimizer and rebalancing is not a relational operation. For now,

Re: Unable to launch job with 100 SQL queries in yarn cluster

2018-04-16 Thread Fabian Hueske
Hi Adrian, Thanks for reaching out to the community. I don't think that this is an issue with Flink's SQL support. SQL queries are translated into regular streaming (or batch) jobs. The JM might just be overloaded by too many jobs. Since you are running in a YARN environment, it might make sense to

Re: Kafka consumer to sync topics by event time?

2018-04-16 Thread Fabian Hueske
ts stalled on an idle partition > it blocks everything. > > Link to current documentation: > https://ci.apache.org/projects/flink/flink-docs- > master/dev/connectors/kafka.html#kafka-consumers-and- > timestamp-extractionwatermark-emission > > On Mon, Dec 4, 2017 at 4:29 PM, Fabi

Re: Tracking deserialization errors

2018-04-16 Thread Fabian Hueske
Thanks for starting the discussion Elias. I see two ways to address this issue. 1) Add an interface that a deserialization schema can implement to register metrics. Each source would need to check for the interface and call it to setup metrics. 2) Check for null returns in the source functions

Re: 答复: Slow flink checkpoint

2018-04-16 Thread Fabian Hueske
wse/FLINK-9182?jql= > project%20%3D%20FLINK%20AND%20issuetype%20%3D% > 20Improvement%20AND%20created%20%3E%3D%20-10m > >I'll pull the code ASAP. > > > > > MaKeyang > TIG.JD.COM > JD Infrastructure <http://tig.jd.com/> > tig.jd.com > TIG official site > > > -

Re: Question about parallelism

2018-04-16 Thread Fabian Hueske
. > > > Michael > > On Apr 14, 2018, at 6:12 AM, Fabian Hueske <fhue...@gmail.com> wrote: > > Hi, > > The number of Taskmanagers is irrelevant for the parallelism of a job or > operator. The scheduler only cares about the number of slots. > > How did you set

Re: assign time attribute after first window group when using Flink SQL

2018-04-16 Thread Fabian Hueske
Sorry, I forgot to CC the user mailing list in my reply. 2018-04-12 17:27 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > Hi, > > Assuming you are using event time, the right function to generate a row > time attribute from a window would be "w1.rowtime" instead of &q

Re: Volume question

2018-04-16 Thread Fabian Hueske
Sorry, I forgot to CC the user mailing list in my reply. 2018-04-12 17:26 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > Hi, > > Sorry for the long delay. Many contributors are traveling due to Flink > Forward. > > Your use case should be well supported by Fli

Re: Json KAFKA producer

2018-04-14 Thread Fabian Hueske
Hi, SerializationSchema is a public interface that you can implement. It has a single method to turn an object into a byte array. I would suggest implementing your own SerializationSchema. Best, Fabian 2018-04-11 15:56 GMT+02:00 Luigi Sgaglione : > Hi, > > I'm
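
As an illustration, a minimal sketch of such a schema that turns a made-up Event POJO into JSON bytes with Jackson (Event and its fields are assumptions, and the package of SerializationSchema differs between Flink versions):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.flink.api.common.serialization.SerializationSchema;

    public class EventJsonSerializationSchema implements SerializationSchema<Event> {

        // created lazily on the worker, so the ObjectMapper does not have to be serialized
        private transient ObjectMapper mapper;

        @Override
        public byte[] serialize(Event element) {
            if (mapper == null) {
                mapper = new ObjectMapper();
            }
            try {
                return mapper.writeValueAsBytes(element);
            } catch (Exception e) {
                throw new RuntimeException("Could not serialize " + element, e);
            }
        }
    }

The schema instance can then be passed to the Kafka producer constructor together with the topic and producer properties (constructor names vary by Kafka connector version).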

Re: Is Flink able to do real time stock market analysis?

2018-04-13 Thread Fabian Hueske
Hi Ivan, You can certainly do these things with Flink. Michael pointed you in a good direction if you want to implement the logic in the DataStream API / ProcessFunctions. Flink's SQL support should also be able to handle the use case you described. The "ingredients" would be - a TUMBLE window

Re: Get get file name when reading from files? Or assign timestamps from file creation time?

2018-04-06 Thread Fabian Hueske
Hi Josh, FileInputFormat stores the currently read FileInputSplit in a protected variable FileInputFormat.currentSplit. You can override your FileInputFormat and access the path of the read file from the FileInputSplit in the method that emits the records from the format. Best, Fabian
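
A minimal sketch of the suggested override, assuming a TextInputFormat and prefixing every line with the path of the file it came from (exact method signatures may differ slightly across Flink versions):

    import java.io.IOException;
    import org.apache.flink.api.java.io.TextInputFormat;
    import org.apache.flink.core.fs.Path;

    public class FileNameTextInputFormat extends TextInputFormat {

        public FileNameTextInputFormat(Path filePath) {
            super(filePath);
        }

        @Override
        public String readRecord(String reusable, byte[] bytes, int offset, int numBytes) throws IOException {
            String line = super.readRecord(reusable, bytes, offset, numBytes);
            // currentSplit is the protected field holding the split that is currently being read
            return currentSplit.getPath().toString() + "," + line;
        }
    }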

Re: KeyedSream question

2018-04-06 Thread Fabian Hueske
t; Thanks for the clarification. I was just trying to understand the intended > behavior. It would have been nice if Flink tracked state for downstream > operators by key, but I can do that with a map in the downstream functions. > > Michael > > Sent from my iPad > > On Apr 5, 2018

Re: KeyedSream question

2018-04-05 Thread Fabian Hueske
Amit is correct. keyBy() ensures that all records with the same key are processed by the same parallel instance of a function. This is different from "a parallel instance only sees records of one key". I had a look at the docs [1]. I agree that "Logically partitions a stream into disjoint
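
To illustrate the difference, a small sketch (input is an assumed DataStream<Tuple2<String, Integer>>; imports omitted): every parallel instance may handle many keys, but all records of one key go to the same instance, and the ValueState below is scoped to the individual key, not to the instance.

    // counts records per key; keyBy(0) partitions the stream on the first tuple field
    input.keyBy(0)
         .map(new RichMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {

             private transient ValueState<Integer> count;   // one value per key, not per instance

             @Override
             public void open(Configuration parameters) {
                 count = getRuntimeContext().getState(
                     new ValueStateDescriptor<>("count", Integer.class));
             }

             @Override
             public Tuple2<String, Integer> map(Tuple2<String, Integer> value) throws Exception {
                 Integer c = count.value();
                 c = (c == null) ? 1 : c + 1;
                 count.update(c);
                 return Tuple2.of(value.f0, c);
             }
         });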

Re: Task Manager fault tolerance does not work

2018-04-05 Thread Fabian Hueske
Hi, Thanks for the feedback! As Till explained, the problem is that the JM first tries to schedule the job to the failed TM (which hasn't been detected as failed yet). The configured three restart attempts are "consumed" by these attempts and the job fails afterwards. Best, Fabian 2018-04-05

Re: Collect event which arrive after watermark

2018-04-04 Thread Fabian Hueske
Window operators drop late events by default. When they receive a late event, they have already computed and emitted a result. Since there is no good default behavior to handle a late event in this case, they are simply dropped. However, Flink offers multiple ways to explicitly handle late events
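
One such way, sketched below for a keyed time window (Event, Result, and MyWindowFunction are made-up placeholders; imports omitted): allowedLateness() keeps the window state around so that late, but not too late, events update the result, and sideOutputLateData() routes everything later than that into a side output.

    final OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

    SingleOutputStreamOperator<Result> results = events
        .keyBy("key")
        .timeWindow(Time.minutes(15))
        .allowedLateness(Time.minutes(5))   // late events within 5 minutes update the result
        .sideOutputLateData(lateTag)        // anything later goes to the side output
        .apply(new MyWindowFunction());

    DataStream<Event> tooLate = results.getSideOutput(lateTag);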

Re: Bucketing Sink does not complete files, when source is from a collection

2018-04-04 Thread Fabian Hueske
Hi Josh, You are right, FLINK-2646 is related to the problem of non-finalized files. If we could distinguish the reasons why close() is called, we could do a proper clean-up if the job terminated because all data was processed. Right now, the source and sink interfaces of the DataStream API are

Re: Getting runtime context from scalar and table functions

2018-04-04 Thread Fabian Hueske
ata. The amount of appended >> data is quite small so I wanted to see if I can use accumulators and once >> the job is done I can get all the data and then use the way I want to use. >> >> I guess I will try the metrics as well to see how that goes. >> >> Thanks &g

Re: Getting runtime context from scalar and table functions

2018-04-04 Thread Fabian Hueske
Hi Darshan, Accumulators are not exposed to UDFs of the Table API / SQL. What's your use case for these? Would metrics do the job as well? Best, Fabian 2018-04-04 21:31 GMT+02:00 Darshan Singh : > Hi, > > I would like to use accumulators with table /scalar functions.

Re: REST API "broken" on YARN because POST is not allowed via YARN proxy

2018-04-04 Thread Fabian Hueske
Hi Juho, Thanks for raising this point! I'll add Chesnay and Till to the thread who contributed to the REST API. Best, Fabian 2018-04-04 15:02 GMT+02:00 Juho Autio : > I just learned that Flink savepoints API was refactored to require using > HTTP POST. > > That's fine

Re: Collect event which arrive after watermark

2018-04-04 Thread Fabian Hueske
Hi, you can do that with a ProcessFunction [1]. The Context parameter of the ProcessFunction.processElement() method gives access to the current watermark and the timestamp of the current element. In case you don't just want to log the late data but send it to a different DataStream (and sink it
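
A minimal sketch of that pattern (Event is a made-up type, events is an assumed DataStream<Event> with timestamps assigned, imports omitted): a record counts as late if its timestamp is behind the current watermark, and late records are sent to a side output.

    final OutputTag<Event> lateTag = new OutputTag<Event>("late") {};

    SingleOutputStreamOperator<Event> mainStream = events
        .process(new ProcessFunction<Event, Event>() {
            @Override
            public void processElement(Event value, Context ctx, Collector<Event> out) throws Exception {
                Long ts = ctx.timestamp();                                // timestamp of this record (may be null)
                long watermark = ctx.timerService().currentWatermark();  // current watermark
                if (ts != null && ts < watermark) {
                    ctx.output(lateTag, value);   // late record -> side output
                } else {
                    out.collect(value);
                }
            }
        });

    DataStream<Event> lateRecords = mainStream.getSideOutput(lateTag);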

Re: cancel-with-savepoint: 404 Not Found

2018-04-04 Thread Fabian Hueske
Thanks for reporting this Juho. I've created FLINK-9130 [1] to address the issue. Best Fabian [1] https://issues.apache.org/jira/browse/FLINK-9130 2018-04-04 12:03 GMT+02:00 Juho Autio : > Thank you, it works! > > I would still expect this to be documented. > > If I

Re: Enriching DataStream using static DataSet in Flink streaming

2018-04-04 Thread Fabian Hueske
Hi, This type of application is not super well supported by Flink yet. The missing feature is on the roadmap and called Side Inputs [1]. There are (at least) two alternatives, but both have some drawbacks: 1) Ingest the static data set as a regular DataStream, keyBy the static and the actual

Re: Temporary failure in name resolution

2018-04-04 Thread Fabian Hueske
Hi, The issue might be related to garbage collection pauses during which the TM JVM cannot communicate with the JM. The metrics contain stats for memory consumption [1] and GC activity [2] that can help to diagnose the problem. Best, Fabian [1]

Re: Updating Broadcast Variables

2018-04-04 Thread Fabian Hueske
Hi Pete, Broadcast variables are a feature of the DataSet API [1], i.e., available for batch processing. Broadcast variables are computed based on the complete input (which is possible because they are only available for bounded data sets and not for unbounded streams) and shared with all
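
For reference, a small DataSet API sketch of that pattern (names are made up, imports omitted, assumed to run inside a main method that declares throws Exception): the broadcast set is registered with withBroadcastSet() and read in open() of a rich function.

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<Integer> toBroadcast = env.fromElements(1, 2, 3);
    DataSet<String> data = env.fromElements("a", "b", "c");

    data.map(new RichMapFunction<String, String>() {

            private List<Integer> broadcastSet;

            @Override
            public void open(Configuration parameters) {
                // the complete broadcast set is available to every parallel instance
                broadcastSet = getRuntimeContext().getBroadcastVariable("numbers");
            }

            @Override
            public String map(String value) {
                return value + ":" + broadcastSet.size();
            }
        })
        .withBroadcastSet(toBroadcast, "numbers")   // register the DataSet under a name
        .print();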

Re: Watermark Question on Failed Process

2018-04-04 Thread Fabian Hueske
Hi Chengzhi, You can access the current watermark from the Context object of a ProcessFunction [1] and store it in operator state [2]. In case of a restart, the state will be restored with the watermark that was active when the checkpoint (or savepoint) was taken. Note, this won't be the last

Re: Multiple Async IO

2018-04-04 Thread Fabian Hueske
Hi Maxim, I think Ken's approach is a good idea. However, you would need to add a stateful operator to join the results of the individual queries if that is needed. In order to join the results, you would need a unique id on which you can keyBy() to collect all 20 records that originated from

Re: SSL config on Kubernetes - Dynamic IP

2018-04-04 Thread Fabian Hueske
Thank you Edward and Christophe! 2018-03-29 17:55 GMT+02:00 Edward Alexander Rojas Clavijo < edward.roja...@gmail.com>: > Hi all, > > I did some tests based on the PR Christophe mentioned above and by making > a change on the NettyClient to use CanonicalHostName instead of > HostNameAddress to

Re: Anyway to read Cassandra as DataStream/DataSet in Flink?

2018-04-04 Thread Fabian Hueske
Hi James, The answer to your question depends on your use case. The AsyncIOFunction approach works if you have a DataStream that you would like to enrich with data in a Cassandra table but not if you would like to create a DataStream from a Cassandra table. The Flink code base contains a

Re: how to query the output of the scalar table function

2018-04-04 Thread Fabian Hueske
Hi Darshan, What you observe is the result of what's supposed to be an optimization. By fusing the two select() calls, we reduce the number of operators in the resulting plan (one MapFunction less). This optimization is only applied for ScalarFunctions but not for TableFunctions. With a better

Re: Record timestamp from kafka

2018-04-04 Thread Fabian Hueske
Hi Navneeth, Flink's KafkaConsumer automatically attaches Kafka's ingestion timestamp if you configure EventTime for an application [1]. Since Flink treats record timestamps as meta data, they are not directly accessible by most functions. You can implement a ProcessFunction [2] to access the
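
A small sketch of that approach (Event is a made-up type, events is an assumed DataStream<Event> read from Kafka with EventTime enabled, imports omitted): the record timestamp is read from the Context and attached to the value so it becomes part of the data.

    DataStream<Tuple2<Long, Event>> withTimestamp = events
        .process(new ProcessFunction<Event, Tuple2<Long, Event>>() {
            @Override
            public void processElement(Event value, Context ctx, Collector<Tuple2<Long, Event>> out) throws Exception {
                // ctx.timestamp() is the record's event-time timestamp (here: the Kafka timestamp)
                out.collect(Tuple2.of(ctx.timestamp(), value));
            }
        });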

Re: Strange behavior on filter, group and reduce DataSets

2018-03-26 Thread Fabian Hueske
Hi, Yes, I've updated the PR. It needs a review and should be included in Flink 1.5. Cheers, Fabian simone <simone.povosca...@gmail.com> wrote on Mon., 26 March 2018, 12:01: > Hi Fabian, > > any update on this? Did you fix it? > > Best, Simone. > > On 22/03/2018

Re: Example PopularPlacesFromKafka fails to run

2018-03-23 Thread Fabian Hueske
Thanks for the feedback! Fabian 2018-03-23 8:02 GMT+01:00 James Yu : > Hi, > > When I proceed further to timePrediction exercise (http://training.data- > artisans.com/exercises/timePrediction.html), I realize that the > nycTaxiRides.gz's > format is fine. > The problem is in

Re: Restart hook and checkpoint

2018-03-23 Thread Fabian Hueske
Yes, that would be great! Thank you, Fabian 2018-03-23 3:06 GMT+01:00 Ashish Pokharel <ashish...@yahoo.com>: > Fabian, that sounds good. Should I recap some bullets in an email and > start a new thread then? > > Thanks, Ashish > > > On Mar 22, 2018, at 5:16 AM, Fabia

Re: Rowtime

2018-03-22 Thread Fabian Hueske
Hi Gregory, Event-time timestamps are handled a bit differently in Flink's SQL compared to the DataStream API. In the DataStream API, timestamps are hidden from the user and implicitly used for time-based operations such as windows. In SQL, the query semantics cannot depend on hidden fields.

Re: how does SQL mode work with PopularPlaces example?

2018-03-22 Thread Fabian Hueske
Hi James, the exercise does not require to filter on pickup events. It says: "This is done by counting every five minutes the number of taxi rides that started and ended in the same area within the last 15 minutes. Arrival and departure locations should be separately counted." That is achieved

Re: ListCheckpointed function - what happens prior to restoreState() being called?

2018-03-22 Thread Fabian Hueske
in restoreState() if you don't want to do it in the constructor. That method is always called when the operator is started. Best, Fabian 2018-03-21 19:55 GMT+01:00 Ken Krugler <kkrugler_li...@transpac.com>: > Hi Fabian, > > On Mar 20, 2018, at 6:38 AM, Fabian Hueske <fhue..

Re: Restart hook and checkpoint

2018-03-22 Thread Fabian Hueske
r a sub-set of job graph that are “healthy”. > > Thanks, Ashish > > > On Mar 20, 2018, at 9:53 AM, Fabian Hueske <fhue...@gmail.com> wrote: > > Well, that's not that easy to do, because checkpoints must be coordinated > and triggered the JobManager. > Also, the checkpointing

Re: CsvSink

2018-03-22 Thread Fabian Hueske
7. >>>>> val namedStream = dataStream.map((value:String) => { >>>>> >>>>> >>>>> Should i file a bug report ? >>>>> >>>>> On Tue, Mar 20, 2018 at 9:30 AM, karim amer <karim.amer...@gmail.com> >>

Re: Flink CEP window for 1 working day

2018-03-22 Thread Fabian Hueske
Hi, I don't think the CEP library is that flexible, but I loop in Kostas (CC) who knows more about it. I'm not exactly sure what you mean by "manipulate" event-time, but I don't think that's necessary. You can implement rules also with state and timers in the ProcessFunction. The function

Re: Flink remote debug not working

2018-03-22 Thread Fabian Hueske
Hi Ankit, The env.java.opts parameter is used for all JVMs started by Flink, i.e., JM and TM. Since the JM process is started before the TM, the port is already in use when you start the TM. You can use env.java.opts.taskmanager to pass parameters only for TM JVMs. Best, Fabian 2018-03-20

Re: Strange behavior on filter, group and reduce DataSets

2018-03-21 Thread Fabian Hueske
Hi, That was a bit too early. I found an issue with my approach. Will come back once I solved that. Best, Fabian 2018-03-21 23:45 GMT+01:00 Fabian Hueske <fhue...@gmail.com>: > Hi, > > I've opened a pull request [1] that should fix the problem. > It would be great if yo

Re: Strange behavior on filter, group and reduce DataSets

2018-03-21 Thread Fabian Hueske
e > problem go away? > > If either of that is the case, it must be a mutability bug somewhere in > either accidental object reuse or accidental serializer sharing. > > > On Tue, Mar 20, 2018 at 3:34 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Simone an

Re: Flink CEP window for 1 working day

2018-03-20 Thread Fabian Hueske
Hi, I'm afraid, Flink CEP does not distinguish work days from non-work days. Of course, you could implement the logic in a DataStream program (probably using ProcessFunction). Best, Fabian 2018-03-20 15:44 GMT+01:00 shishal : > I am using flink CEP , and to match a event

Re: Strange behavior on filter, group and reduce DataSets

2018-03-20 Thread Fabian Hueske
> Hi Fabian, > > This simple code reproduces the behavior -> https://github.com/xseris/ > Flink-test-union > > Thanks, Simone. > > On 19/03/2018 15:44, Fabian Hueske wrote: > > Hmmm, I still don't see the problem. > IMO, the result should be correct for both plans. T

Re: Metric Registry Warnings

2018-03-20 Thread Fabian Hueske
Hi Pedro, Can you reopen FLINK-7100 and post a comment with your error message and environment? Thanks, Fabian 2018-03-20 14:58 GMT+01:00 PedroMrChaves : > Hello, > > I still have the same issue with Flink Version 1.4.2. > > java.lang.IllegalArgumentException: A

Re: Help Required for Log4J

2018-03-20 Thread Fabian Hueske
Hi, TBH, I don't have much experience with logging, but you might want to consider using Side Outputs [1] to route invalid records into a separate stream. The stream can then be handled separately, e.g., written to files or Kafka or wherever. Best, Fabian [1]

Re: Updating Job dependencies without restarting Flink

2018-03-20 Thread Fabian Hueske
Hi, I'm quite sure that is not supported. You'd have to take a savepoint and restart the application. Depending on the sink system, you could start the new job before shutting the old job down. Best, Fabian 2018-03-20 10:31 GMT+01:00 Rohil Surana : > Hi, > > We have a

Re: Restart hook and checkpoint

2018-03-20 Thread Fabian Hueske
Well, that's not that easy to do, because checkpoints must be coordinated and triggered by the JobManager. Also, the checkpointing mechanism with flowing checkpoint barriers (to ensure checkpoint consistency) won't work once a task has failed because it cannot continue processing and forward barriers. If

Re: unable to addSource to StreamExecutionEnvironment?

2018-03-20 Thread Fabian Hueske
Thanks for reporting back! 2018-03-20 10:42 GMT+01:00 James Yu : > Just found out that IDE seems auto import wrong class. > While "org.apache.flink.streaming.api.datastream.DataStream" is required, > "org.apache.flink.streaming.api.scala.DataStream" was imported. > > This is a

Re: Let BucketingSink roll file on each checkpoint

2018-03-20 Thread Fabian Hueske
Hi, The BucketingSink closes files once they reached a certain size (BatchSize) or have not been written to for a certain amount of time (InactiveBucketThreshold). While being written to, files are in an in-progress state and only moved to completed once they are closed. When that happens, other
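
The thresholds mentioned above are configured on the sink, roughly like this (path and values are made-up examples, stream is an assumed DataStream<String>, imports omitted):

    BucketingSink<String> sink = new BucketingSink<>("hdfs:///data/out");
    sink.setBatchSize(128 * 1024 * 1024L);            // roll the in-progress part file after ~128 MB
    sink.setInactiveBucketThreshold(60 * 1000L);      // close files that were not written to for 60 s
    sink.setInactiveBucketCheckInterval(60 * 1000L);  // how often inactive buckets are checked

    stream.addSink(sink);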

Re: Strange behavior on filter, group and reduce DataSets

2018-03-19 Thread Fabian Hueske
is, it is either the Flink runtime code or your functions. Given that one plan produces good results, it might be the Flink runtime code. Coming back to my previous question. Can you provide a minimal program to reproduce the issue? Thanks, Fabian 2018-03-19 15:15 GMT+01:00 Fabian Hueske <f

Re: Queryable State

2018-03-19 Thread Fabian Hueske
nto...@gmail.com>: > Are there plans to address all or few of the above apart from the "JM LB > not possible" which seems understandable ? > > On Mon, Mar 19, 2018 at 9:58 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Queryable state is "just"

Re: Strange behavior on filter, group and reduce DataSets

2018-03-19 Thread Fabian Hueske
Ah, thanks for the update! I'll have a look at that. 2018-03-19 15:13 GMT+01:00 Fabian Hueske <fhue...@gmail.com>: > HI Simone, > > Looking at the plan, I don't see why this should be happening. The pseudo > code looks fine as well. > Any chance that you can create a minimal

Re: Strange behavior on filter, group and reduce DataSets

2018-03-19 Thread Fabian Hueske
> > reuse is not enabled. I attach the plan of the execution. > > Thanks, > Simone > > On 19/03/2018 11:36, Fabian Hueske wrote: > > Hi, > > Union is actually a very simple operator (not even an operator in Flink > terms). It just merges to inputs. There is no additiona

Re: Custom Processing per window

2018-03-19 Thread Fabian Hueske
Hi, Data is partitioned by key across machines and state is kept per key. It is not possible to interact with two keys at the same time. Best, Fabian 2018-03-19 14:47 GMT+01:00 Dhruv Kumar : > In other words, while using the Flink streaming APIs, is it possible to > take

Re: Queryable State

2018-03-19 Thread Fabian Hueske
ere are certain gaps in it being trully > considered a Highly Available Key Value store. > > Regards. > > > > > > On Mon, Mar 19, 2018 at 5:53 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Vishal, >> >> In general, Queryable State should b

Re: How to correct use timeWindow() with DataStream?

2018-03-19 Thread Fabian Hueske
| 2003-12-15 16:31:00.884 | 2003-12-15 16:31:00.884 > | 2003-12-15 16:31:00.884 | 2003-12-15 16:31:00.884 | 2003-12-15 > 16:31:00.884},107151667,12) > 3> (LogLine{time=2003-12-15 16:31:03.784, line=' | 2003-12-15 > 16:31:03.784},1071516670000,1) > 3> (LogLine{time=2003-12-15

Re: Strange behavior on filter, group and reduce DataSets

2018-03-19 Thread Fabian Hueske
Hi, Union is actually a very simple operator (not even an operator in Flink terms). It just merges two inputs. There is no additional logic involved. Therefore, it should also not emit records before either of the two ReduceFunctions has sorted its data. Once the data has been sorted for the

Re: Partial aggregation result sink

2018-03-19 Thread Fabian Hueske
plan to adding these features to flink SQL ? > > Thanks > LiYue > tig.jd.com > > > > 在 2018年3月14日,上午7:48,Fabian Hueske <fhue...@apache.org> 写道: > > Hi, > > Chesnay is right. > SQL and Table API do not support early window results and no allowed > lateness to upda

Re: Queryable State

2018-03-19 Thread Fabian Hueske
Hi Vishal, In general, Queryable State should be ready to use. There are a few things to consider though: - State queries are not synchronized with the application code, i.e., they can happen at the same time. Therefore, the Flink application should not modify objects that have been put into or

Re: How to correct use timeWindow() with DataStream?

2018-03-19 Thread Fabian Hueske
Hi, The timestamps of the stream records should be increasing (strict monotonicity is not required, a bit of out-of-orderness can be handled due to watermarks). So, the events should also be generated with increasing timestamps. It looks like your generator generates random dates. I'd also generate
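
For completeness, a sketch of a timestamp/watermark assigner that tolerates a bounded amount of out-of-orderness (LogLine and its time field are placeholders based on the quoted output and are assumed to hold the event time in milliseconds; imports omitted):

    DataStream<LogLine> withTimestampsAndWatermarks = lines.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<LogLine>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(LogLine element) {
                // event timestamp in milliseconds; should be (roughly) increasing
                return element.time;
            }
        });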

Re: Slow flink checkpoint

2018-03-19 Thread Fabian Hueske
gt; >I try modify the flink resource and start a thread to clean the > expired keyed sate, but it doesn't work well because of concurrency issues. > > > > > > > Best, > Deqiang > > 2018-03-16 16:03 GMT+08:00 Fabian Hueske <fhue...@gmail.com>: &

Re: SQL Table API: Naming operations done in query

2018-03-16 Thread Fabian Hueske
Hmmm, that's a strange behavior that is unexpected (to me). Flink optimizes the Table API / SQL queries when a Table is converted into a DataStream (or DataSet) or emitted to a TableSink. So, given that you convert the result tables in addSink() into a DataStream and write them to a sink function,

Re: Deserializing the InputFormat (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat@50bb10fd) failed: unread block data after upgrade to 1.4.2

2018-03-16 Thread Fabian Hueske
One thing that changed in Flink 1.4 with respect to Hadoop is that Hadoop is now an optional dependency. Since Hadoop dependencies are now dynamically loaded, you might use different versions on the client and the cluster? Also the order in which classes are loaded changed. You could try to

Re: [ANNOUNCE] Apache Flink 1.3.3 released

2018-03-16 Thread Fabian Hueske
Thanks for managing this release Gordon! Cheers, Fabian 2018-03-15 21:24 GMT+01:00 Tzu-Li (Gordon) Tai : > The Apache Flink community is very happy to announce the release of Apache > Flink 1.3.3, which is the third bugfix release for the Apache Flink 1.3 > series. > >

Re: Slow flink checkpoint

2018-03-16 Thread Fabian Hueske
Hi, AFAIK, that's not possible. The only "solution" is to reduce the number of timers. Whether that's possible or not depends on the application. For example, if you use timers to clean up state, you can work with an upper and lower bound and only register one timer for each (upper - lower)
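
A tiny sketch of that idea inside processElement() of a ProcessFunction (the interval and TTL values are made up): rounding the timer timestamp to a coarse granularity lets many records of the same key share a single timer, because duplicate registrations for the same timestamp result in only one timer.

    // inside processElement(...) of a ProcessFunction
    long cleanupInterval = 60 * 1000L;                       // at most one timer per minute and key
    long cleanupTime = ctx.timestamp() + 10 * 60 * 1000L;    // state may be cleaned up 10 minutes later
    long coalesced = cleanupTime - (cleanupTime % cleanupInterval) + cleanupInterval;
    ctx.timerService().registerEventTimeTimer(coalesced);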

Re: flink sql: "slow" performance on Over Window aggregation with time attribute set to event time

2018-03-15 Thread Fabian Hueske
gt; Yan > ------ > *From:* Fabian Hueske <fhue...@gmail.com> > *Sent:* Thursday, March 15, 2018 11:55:12 AM > > *To:* Yan Zhou [FDS Science] > *Cc:* user@flink.apache.org > *Subject:* Re: flink sql: "slow" performance on Over Window a

Re: flink sql: "slow" performance on Over Window aggregation with time attribute set to event time

2018-03-15 Thread Fabian Hueske
it's not the case. the end-to-end latency of application using > process time is only has 1/3 of the other(50ms vs 150ms). Why was that? > Please help me to understand. > > > Best > > Yan > > -- > *From:* Fabian Hueske <fhue...@gmail.c

Re: Implement a sort inside the WindowFunction

2018-03-15 Thread Fabian Hueske
Hi Felipe, Just like the ReduceFunction, the WindowFunction is applied in the context of a single key. So, it will be called for each key and always just see a single record (the reduced record of the key). You'd have to add a non-keyed window (windowAll) for your sorting WindowFunction. Note
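
A rough sketch of such a non-keyed window that buffers and sorts the pre-reduced records of each 10-second window (Event and its timestamp field are placeholders, reduced is the assumed pre-reduced DataStream<Event>, imports omitted); note that this operator runs with parallelism 1.

    reduced.timeWindowAll(Time.seconds(10))
        .apply(new AllWindowFunction<Event, Event, TimeWindow>() {
            @Override
            public void apply(TimeWindow window, Iterable<Event> values, Collector<Event> out) {
                List<Event> buffer = new ArrayList<>();
                for (Event e : values) {
                    buffer.add(e);
                }
                buffer.sort(Comparator.comparingLong((Event e) -> e.timestamp));  // sort by timestamp
                for (Event e : buffer) {
                    out.collect(e);
                }
            }
        });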

Re: Restart hook and checkpoint

2018-03-15 Thread Fabian Hueske
If I understand fine-grained recovery correctly, one would still need to take checkpoints. Ashish would like to avoid checkpointing and accept to lose the state of the failed task. However, he would like to avoid losing more state than necessary due to restarting of tasks that did not fail.

Re: flink sql: "slow" performance on Over Window aggregation with time attribute set to event time

2018-03-14 Thread Fabian Hueske
fixed amount of lateness > <https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/event_timestamp_extractors.html#assigners-allowing-a-fixed-amount-of-lateness> > is used. So even the slowest source task is not that slow. > > > Best > > Yan > > > > &

Re: flink sql: "slow" performance on Over Window aggregation with time attribute set to event time

2018-03-14 Thread Fabian Hueske
Hi, Flink advances watermarks based on all parallel source tasks. If one of the source tasks lags behind the others, the event time progresses as determined by the "slowest" source task. Hence, records ingested from a faster task might have a higher processing latency. Best, Fabian 2018-03-14

Re: sorting data into sink

2018-03-14 Thread Fabian Hueske
Hi, To be honest, I did not understand your requirements and what you are looking for. stream.keyBy("partition").addSink(...) will partition the output on the "partition" attribute before handing it to the sink. Hence, all records with the same "partition" value will be handled by the same

Re: State serialization problem when we add a new field in the object

2018-03-14 Thread Fabian Hueske
Hi, Flink supports upgrading of serializers [1] [2] since version 1.3. You probably need to upgrade to Flink 1.3 before you can use the feature. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/custom_serialization.html [2]

Re: Extremely large job serialization produced by union operator

2018-03-14 Thread Fabian Hueske
sink) > > I failed to reproduce it without actually used table schemas and SQL > queries in my production. And at last I wrote my own JDBC sink with > connection pooling to migrate this problem. Maybe someone familiar with the > implementation of union operator would figure out what's

Re: Partial aggregation result sink

2018-03-13 Thread Fabian Hueske
Hi, Chesnay is right. SQL and the Table API support neither early window results nor allowed lateness to update results with late-arriving data. If you need such features, you should use the DataStream API. Best, Fabian 2018-03-13 12:10 GMT+01:00 Chesnay Schepler : > I don't

Re: Extremely large job serialization produced by union operator

2018-03-13 Thread Fabian Hueske
Hi Bill, The size of the program depends on the number and complexity of the SQL queries that you are submitting. Each query might be translated into a sequence of multiple operators. Each operator has a string with generated code that will be compiled on the worker nodes. The size of the code depends

Re: UUIDs generated by Flink SQL

2018-03-13 Thread Fabian Hueske
Hi Gregory, Your understanding is correct. It is not possible to assign UUIDs to the operators generated by the SQL/Table planner. To be honest, I am not sure whether the use case that you are describing should be within the scope of the "officially" supported use cases of the API. It would require in

Re: Event time join

2018-03-08 Thread Fabian Hueske
; exhausted for all other pipes. I am very interested in how holding back the >> busy source does not create a pathological issue where that source is >> forever held back. Is there a FLIP for it ? >> >> On Thu, Mar 8, 2018 at 11:29 AM, Fabian Hueske <fhue...@gmail.com>

Re: Event time join

2018-03-08 Thread Fabian Hueske
Hi Gytis, Flink does not currently support holding back individual streams; for example, it is not possible to align streams on (offset) event time. However, the Flink community is working on a windowed join for the DataStream API that only holds the relevant tail of the stream as state. If your

Re: Dynamic CEP https://issues.apache.org/jira/browse/FLINK-7129?subTaskView=all

2018-03-07 Thread Fabian Hueske
We hope to pick up FLIP-20 after Flink 1.5.0 has been released. 2018-03-07 22:05 GMT-08:00 Shailesh Jain : > In addition to making the addition of patterns dynamic, any updates on > FLIP 20 ? > https://cwiki.apache.org/confluence/display/FLINK/FLIP- >

Re: Rocksdb in production

2018-03-06 Thread Fabian Hueske
Jayant Ameta <wittyam...@gmail.com>: > Thanks Fabian. > Will there be any performance issues if I use NFS as the shared filesystem > (as compared to HDFS or S3)? > > Jayant Ameta > > On Mon, Mar 5, 2018 at 10:31 PM, Fabian Hueske <fhue...@gmail.com> wrote: > &g

Re: Rocksdb in production

2018-03-05 Thread Fabian Hueske
filesystem (HDFS, S3, etc). Is my understanding correct? > > > Jayant Ameta > > On Mon, Mar 5, 2018 at 9:51 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi, >> RockDB is an embedded key-value storage that is internally used by Flink. >> There

Re: Rocksdb in production

2018-03-05 Thread Fabian Hueske
Hi, RocksDB is an embedded key-value store that is internally used by Flink. There is no need to set up a RocksDB database or service yourself. All of that is done by Flink. As a Flink user that uses the RocksDB state backend, you won't get in touch with RocksDB itself. Besides that, RocksDB is

Re: Timers and state

2018-03-05 Thread Fabian Hueske
Hi Alberto, You can also add another MapState. The key is a timestamp and the value is the key that you want to discard. When onTimer() is called, you look up the key in the MapState and remove it from the original MapState. Best, Fabian 2018-03-05 0:48 GMT-08:00
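
Sketched as the relevant methods of a ProcessFunction on a keyed stream (MyEvent, MyValue, the field names, and the TTL constant are made up; imports omitted):

    private transient MapState<String, MyValue> values;   // the original MapState
    private transient MapState<Long, String> expiry;      // timer timestamp -> key to discard

    @Override
    public void open(Configuration parameters) {
        values = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("values", String.class, MyValue.class));
        expiry = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("expiry", Long.class, String.class));
    }

    @Override
    public void processElement(MyEvent event, Context ctx, Collector<MyValue> out) throws Exception {
        values.put(event.id, event.value);
        long cleanupTime = ctx.timestamp() + TTL;
        expiry.put(cleanupTime, event.id);
        ctx.timerService().registerEventTimeTimer(cleanupTime);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<MyValue> out) throws Exception {
        String keyToDiscard = expiry.get(timestamp);
        if (keyToDiscard != null) {
            values.remove(keyToDiscard);   // discard the expired entry from the original MapState
            expiry.remove(timestamp);
        }
    }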

Re: Reading csv-files

2018-03-01 Thread Fabian Hueske
That does not matter. 2018-03-01 13:32 GMT+01:00 Esa Heikkinen <esa.heikki...@student.tut.fi>: > Hi > > > > Should the custom source function be written by Java, but no Scala, like > in that RideCleansing exercise ? > > > > Best, Esa > > > >

Re: Slow Flink program

2018-03-01 Thread Fabian Hueske
Are you sure the program is doing anything at all? Do you call execute() on the StreamExecutionEnvironment? 2018-03-01 15:55 GMT+01:00 Supun Kamburugamuve : > Thanks Piotrek, > > I did it this way on purpose to see how Flink performs. With 128000 > messages it takes an

Re: Flink Kafka reads too many bytes .... Very rarely

2018-03-01 Thread Fabian Hueske
t sure how that’ll go. If you’ve got an idea for a > work around, I’d be all ears too. > > > > > > *From: *Philip Doctor <philip.doc...@physiq.com> > *Date: *Tuesday, February 27, 2018 at 10:02 PM > *To: *"Tzu-Li (Gordon) Tai" <tzuli...@apache.org>
