Re: Timestamp and key preservation over operators

2019-05-03 Thread Fabian Hueske
Hi Averell, Yes, timestamps and watermarks do not (completely) move together. The watermark should always be lower than the timestamps of the currently processed records. Otherwise, the records might be processed as late records (depending on the logic). The easiest way to check the timestamp of

Re: assignTimestampsAndWatermarks not work after KeyedStream.process

2019-05-03 Thread Fabian Hueske
ortant rule documented anywhere in the official document? > > On 2019/04/30 08:47:29, Fabian Hueske wrote: > > An operator task broadcasts its current watermark to all downstream tasks > > that might receive its records. > > If you have the following code: > > >

Re: Preserve accumulators after failure in DataStream API

2019-05-02 Thread Fabian Hueske
> Hi Wouter, > > I've met the same issue and finally managed to use operator states to back > the accumulators, so they can be restored after restarts. > The downside is that we have to update the values in both accumulators and > states to make them consistent. FYI. > >

Re: Preserve accumulators after failure in DataStream API

2019-05-02 Thread Fabian Hueske
e40762e98f8d010/src/main/scala/org/codefeedr/experimental/stats/CommitsStatsProcess.scala#L34 > > > > Op do 2 mei 2019 om 09:36 schreef Fabian Hueske : > >> Hi Wouter, >> >> The DataStream API accumulators of the AggregateFunction [1] are stored >> in state and

Re: Timestamp and key preservation over operators

2019-05-02 Thread Fabian Hueske
Hi Averell, The watermark of a stream is always the low watermark of all its input streams. If one of the input streams does not have watermarks, Flink does not compute a watermark for the merged stream. If you do not need time-based operations on streams 3 and 4, setting the watermark to MAX_WATE

Re: Filter push-down not working for a custom BatchTableSource

2019-05-02 Thread Fabian Hueske
Hi Josh, Does your TableSource also implement ProjectableTableSource? If yes, you need to make sure that the filter information is also forwarded if ProjectableTableSource.projectFields() is called after FilterableTableSource.applyPredicate(). Also make sure to correctly implement FilterableTableS

Re: Flink dashboard+ program arguments

2019-05-02 Thread Fabian Hueske
Hi, The SQL client can be started with > ./bin/sql-client.sh embedded Best, Fabian Am Di., 30. Apr. 2019 um 20:13 Uhr schrieb Rad Rad : > Thanks, Fabian. > > The problem was incorrect java path. Now, everything works fine. > > I would ask about the command for running sql-client.sh > > These

Re: Preserve accumulators after failure in DataStream API

2019-05-02 Thread Fabian Hueske
Hi Wouter, The DataStream API accumulators of the AggregateFunction [1] are stored in state and should be recovered in case of a failure as well. If this does not work, it would be a serious bug. What's the type of your accumulator? Can you maybe share the code? How do you apply the AggregateFunc

Re: Flink dashboard+ program arguments

2019-04-30 Thread Fabian Hueske
Hi, With Flink 1.5.0, we introduced a new distributed architecture (see release announcement [1] and FLIP-6 [2]). From what you describe, I cannot tell what is going wrong. How do you submit your application? Which action resulted in the error message you shared? Btw. why do you go for Flink 1.

Re: Flink orc file write example

2019-04-30 Thread Fabian Hueske
Hi, I had a look but couldn't find an ORC writer in flink-orc, only an InputFormat and TableSource to read ORC data into DataSet programs or Table / SQL queries. Where did you find the ORC writer? Thanks, Fabian Am Di., 30. Apr. 2019 um 09:09 Uhr schrieb Hai : > Hi, > > > I found flink now supp

Re: Timestamp and key preservation over operators

2019-04-30 Thread Fabian Hueske
Hi, Actually all operators should preserve record timestamps if you set the correct TimeCharacteristic to event time. A window operator will set the timestamp of all emitted records to the end-timestamp of the window. Not sure what happens if you use a processing time window in an event time applicati

Re: can we do Flink CEP on event stream or batch or both?

2019-04-30 Thread Fabian Hueske
Hi, Stateful streaming applications are typically designed to run continuously (i.e., until forever or until they are not needed anymore or replaced). Many jobs run for weeks or months. IMO, using CEP for "simple" equality matches would add too much complexity for a use case that can be easily sol

Re: How to verify what maxParallelism is set to?

2019-04-30 Thread Fabian Hueske
Hi Sean, I was looking for the max-parallelism value in the UI, but couldn't find it. Also the REST API does not seem to provide it. Would you mind opening a Jira issue for adding it to the REST API and the Web UI? Thank you, Fabian Am Di., 30. Apr. 2019 um 06:36 Uhr schrieb Sean Bollin : > Tha

Re: assignTimestampsAndWatermarks not work after KeyedStream.process

2019-04-30 Thread Fabian Hueske
An operator task broadcasts its current watermark to all downstream tasks that might receive its records. If you have the following code: DataStream a = ... a.map(A).map(B).keyBy().window(C) and execute this with parallelism 2, your plan looks like this A.1 -- B.1 --\--/-- C.1

Re: Working around lack of SQL triggers

2019-04-30 Thread Fabian Hueske
You could implement aggregation functions that just do AVG, COUNT, etc. and a parameterizable aggregation function that can be configured to call the avg, count, etc. functions. When configuring, you would specify the input and output, for example like this: input: [int, int, double] key: input.1

Re: Data Locality in Flink

2019-04-30 Thread Fabian Hueske
sing whether a Dataset could benefit from a > rebalance or not could be VERY nice (at least for batch) but I fear this > would be very hard to implement..am I wrong? > > On Mon, Apr 29, 2019 at 3:10 PM Fabian Hueske wrote: > >> Hi Flavio, >> >> These types of race c

Re: kafka partitions, data locality

2019-04-30 Thread Fabian Hueske
Hi Sergey, You are right, keys are managed in key groups. Each key belongs to a key group and one or more key groups are assigned to each parallel task of an operator. Key groups are not exposed to users and the assignments of keys -> key-groups and key-groups -> tasks cannot be changed without ch

Re: Data Locality in Flink

2019-04-29 Thread Fabian Hueske
> will be read by a slot (according to the job parallelism) for applying the >>> map logic. >>> >>> In reading from HDFS I read this >>> <https://stackoverflow.com/a/39153402/8110607> answer by Fabian Hueske >>> <https://stackoverflow.com/users/3609571/fabian-hueske> and I want to >>> know whether that is still the Flink strategy for reading from a distributed file >>> system? >>> >>> thanks >>> >> > >

Re: Emitting current state to a sink

2019-04-29 Thread Fabian Hueske
Nice! Thanks for the confirmation :-) Am Mo., 29. Apr. 2019 um 13:21 Uhr schrieb Avi Levi : > Thanks! Works like a charm :) > > On Mon, Apr 29, 2019 at 12:11 PM Fabian Hueske wrote: > >> *This Message originated outside your organization.* >> ---

Re: Data Locality in Flink

2019-04-29 Thread Fabian Hueske
from local file system I guess every line of the file will > be read by a slot (according to the job parallelism) for applying the map > logic. > > In reading from HDFS I read this > <https://stackoverflow.com/a/39153402/8110607> answer by Fabian Hueske > <https://stackover

Re: Apache Flink - Question about dynamically changing window end time at run time

2019-04-29 Thread Fabian Hueske
Hi Mans, I don't know if that would work or not. Would need to dig into the source code for that. TBH, I would recommend to check if you can implement the logic using a (Keyed-)ProcessFunction. IMO, process functions are a lot easier to reason about than Flink's windowing framework. You can manag

Re: FileInputFormat that processes files in chronological order

2019-04-29 Thread Fabian Hueske
Hi Sergei, It depends whether you want to process the file with the DataSet (batch) or DataStream (stream) API. Averell's answer was addressing the DataStream API part. The DataSet API does not have any built-in support to distinguish files (or file splits) by folders and process them in order. F

Re: Emitting current state to a sink

2019-04-29 Thread Fabian Hueske
Hi Avi, I'm not sure if you cannot emit data from the keyed state when you receive a broadcasted message. The Context parameter of the processBroadcastElement() method in the KeyedBroadcastProcessFunction has the applyToKeyedState() method. The method takes a KeyedStateFunction that is applied to

Re: Working around lack of SQL triggers

2019-04-29 Thread Fabian Hueske
Hi, I don't think that (the current state of) Flink SQL is a good fit for your requirements. Each query will be executed as an independent job. So there won't be any sharing of intermediate results. You can do some of this manually if you use the Table API, but even then it won't allow for early r

Re: Exceptions when launching counts on a Flink DataSet concurrently

2019-04-29 Thread Fabian Hueske
Hi Juan, count() and collect() trigger the execution of a job. Since Flink does not cache intermediate results (yet), all operations from the sink (count()/collect()) to the sources are executed. So in a sense a DataSet is immutable (given that the input of the sources does not change) but completel

Re: How to calculate moving average result using flink sql ?

2019-04-16 Thread Fabian Hueske
Hi Lifei, This sounds to me like you need an OVER window aggregation. OVER is a standard SQL clause to compute aggregates for each row over a group of surrounding rows (defined by ordering and partitioning). Check out the documentation [1]. The example only shows ROW based windows, but Flink also

Re: Flink JDBC: Disable auto-commit mode

2019-04-15 Thread Fabian Hueske
se/FLINK-12198 > > > > Best, > > Konstantinos > > > > *From:* Papadopoulos, Konstantinos > > *Sent:* Δευτέρα, 15 Απριλίου 2019 12:30 μμ > *To:* Fabian Hueske > *Cc:* Rong Rong ; user > *Subject:* RE: Flink JDBC: Disable auto-commit mode > > > > Hi F

Re: Hbase Connector failed when deployed to yarn

2019-04-15 Thread Fabian Hueske
That's great! Thank you. Let me know if you have any questions. Fabian Am Mo., 15. Apr. 2019 um 11:32 Uhr schrieb Hai : > Hi Fabian: > > > OK ,I am glad to do that. > > > Regards > > Original Message > *Sender:* Fabian Hueske > *Recipient:* hai > *Cc:

Re: Hbase Connector failed when deployed to yarn

2019-04-15 Thread Fabian Hueske
Hi, The Jira issue is still unassigned. Would you be up for working on a fix? Best, Fabian On Fri, Apr 12, 2019 at 05:07, hai wrote: > Hi, Tang: > > > Thanks for your reply, will this issue be fixed soon? I don’t think putting the > flink-hadoop-compatibility > jar under FLINK_HOME/lib is an elegant soluti

Re: Flink JDBC: Disable auto-commit mode

2019-04-15 Thread Fabian Hueske
Hi Konstantinos, This sounds like a useful extension to me. Would you like to create a Jira issue and contribute the improvement? In the meantime, you can just fork the code of JDBCInputFormat and adjust it to your needs. Best, Fabian Am Mo., 15. Apr. 2019 um 08:53 Uhr schrieb Papadopoulos, Kon

Re: Join of DataStream and DataSet

2019-04-15 Thread Fabian Hueske
Hi Reminia, What Hequn said is correct. However, I would *not* use a regular join but model the problem as a time-versioned table join. A regular join will materialize both inputs which is probably not what you want to do for a stream. For a time-versioned table join, only the time-versioned table wou

Re: Apache Flink - Question about broadcast state pattern usage

2019-04-15 Thread Fabian Hueske
s-release-1.8/dev/stream/state/broadcast_state.html) > and still not sure how that will help ? > > Thanks for your help. > > Mans > > > > > > On Thursday, April 11, 2019, 3:53:59 AM EDT, Fabian Hueske < > fhue...@gmail.com> wrote: > > > Hi, > > you would s

Re: Apache Flink - How to destroy global window and release it's resources

2019-04-15 Thread Fabian Hueske
is Long.MAX_VALUE, and that is my concern. So, is > there any other way of clean up the now purged global windows ? > > Thanks again. > > > > On Thursday, April 11, 2019, 4:16:24 AM EDT, Fabian Hueske < > fhue...@gmail.com> wrote: > > > Hi, > > As far as I

Re: Does HadoopOutputFormat create MapReduce job internally?

2019-04-15 Thread Fabian Hueske
Hi Morven, You posted the same question a few days ago and it was also answered correctly. Please do not repost the same question again. You can reply to the earlier thread if you have a follow up question. To answer your question briefly: No, Flink does not trigger a MapReduce job. The whole job

Flink Forward Europe 2019 - Call for Presentations open until 17th May

2019-04-11 Thread Fabian Hueske
Hi all, Flink Forward Europe returns to Berlin on October 7-9th, 2019. We are happy to announce that the Call for Presentations is open! Please submit a proposal if you'd like to present your Apache Flink experience, best practices, new features, or use cases in front of an international audience

Re: FlinkCEP and SQL?

2019-04-11 Thread Fabian Hueske
Hi Esa, Flink's implementation of SQL MATCH_RECOGNIZE is based on its CEP library, i.e., they share the same implementation. Best, Fabian On Thu, Apr 11, 2019 at 10:29, Esa Heikkinen (TAU) < esa.heikki...@tuni.fi> wrote: > Hi > > > > Is SQL CEP based (old) FlinkCEP at all and are SQL CEP

Re: Apache Flink - How to destroy global window and release it's resources

2019-04-11 Thread Fabian Hueske
Hi, As far as I know, a window is only completely removed when time (event or processing time, depending on the window type) passes the window's end timestamp. Since, GlobalWindow's end timestamp is Long.MAX_VALUE, it is never completely removed. I'm not 100% sure what state is kept around. It mig

Re: Apache Flink - Question about broadcast state pattern usage

2019-04-11 Thread Fabian Hueske
Hi, you would simply pass multiple MapStateDescriptors to the broadcast method: MapStateDescriptor bcState1 = ... MapStateDescriptor bcState2 = ... DataStream stream = ... BroadcastStream bcStream = stream.broadcast(bcState1, bcState2); Best, Fabian Am Mi., 10. Apr. 2019 um 19:44 Uhr schrieb

Re: Default Kafka producers pool size for FlinkKafkaProducer.Semantic.EXACTLY_ONCE

2019-04-11 Thread Fabian Hueske
Hi Min, I think the pool size is per parallel sink task, i.e., it should be independent of the parallelism of the sink operator. From my understanding a pool size of 5 should be fine if the maximum number of concurrent checkpoints is 1. Running out of connections would mean that there are 5 in-fl

Re: What is the best way to handle data skew processing in Data Stream applications?

2019-04-11 Thread Fabian Hueske
Hi Felipe, three comments: 1) applying rebalance(), shuffle(), or rescale() before a keyBy() has no effect: keyBy() introduces a hash partitioning such that any data partitioning that you do immediately before keyBy() is destroyed. You only change the distribution for the call of the key extracto

Re: Does HadoopOutputFormat create MapReduce job internally?

2019-04-10 Thread Fabian Hueske
Hi, Flink's Hadoop compatibility functions just wrap functions that were implemented against Hadoop's interfaces in wrapper functions that are implemented against Flink's interfaces. There is no Hadoop cluster started or MapReduce job being executed. Job is just a class of the Hadoop API. It does

Re: Is copying flink-hadoop-compatibility jar to FLINK_HOME/lib the only way to make it work?

2019-04-10 Thread Fabian Hueske
Hi, Packaging the flink-hadoop-compatibility dependency with your code into a "fat" job jar should work as well. Best, Fabian Am Mi., 10. Apr. 2019 um 15:08 Uhr schrieb Morven Huang < morven.hu...@gmail.com>: > Hi, > > > > I’m using Flink 1.5.6 and Hadoop 2.7.1. > > > > *My requirement is to re

Re: [ANNOUNCE] Apache Flink 1.8.0 released

2019-04-10 Thread Fabian Hueske
Congrats to everyone! Thanks Aljoscha and all contributors. Cheers, Fabian Am Mi., 10. Apr. 2019 um 11:54 Uhr schrieb Congxian Qiu < qcx978132...@gmail.com>: > Cool! > > Thanks Aljoscha a lot for being our release manager, and all the others > who make this release possible. > > Best, Congxian

Re: Schema Evolution on Dynamic Schema

2019-04-09 Thread Fabian Hueske
s again, and congrats on an awesome conference, I had learned a lot > Shahar > > From: Fabian Hueske > Sent: Monday, April 8, 02:54 > Subject: Re: Schema Evolution on Dynamic Schema > To: Shahar Cizer Kobrinsky > Cc: Rong Rong, user > > > Hi Shahar, > > Sorry for

Re: HA and zookeeper

2019-04-08 Thread Fabian Hueske
Hi Boris, ZooKeeper is also used by the JobManager to store metadata about the running job. The JM writes information like the JobGraph, JAR file, checkpoint metadata to a persistent storage (like HDFS, S3, ...) and a pointer to this information to ZooKeeper. In case of a recovery, the new JM look

Re: Flink SQL hangs in StreamTableEnvironment.sqlUpdate, keeps executing and seems never stop, finally lead to java.lang.OutOfMemoryError: GC overhead limit exceeded

2019-04-08 Thread Fabian Hueske
Hi Henry, It seems that the optimizer is not handling this case well. The search space might be too large (or rather the optimizer explores too much of the search space). Can you share the query? Did you add any optimization rules? Best, Fabian On Wed, Apr 3, 2019 at 12:36, 徐涛 wrote: > Hi

Re: End-to-end exactly-once semantics in simple streaming app

2019-04-08 Thread Fabian Hueske
Hi Patrick, In general, you could also implement the 2PC logic in a regular operator. It does not have to be a sink. You would need to add the logic of TwoPhaseCommitSinkFunction to your operator. However, the TwoPhaseCommitSinkFunction does not work well with JDBC. The problem is that you would n

Re: Schema Evolution on Dynamic Schema

2019-04-08 Thread Fabian Hueske
the map elements > when doing *SELECT map['a', sum(a), 'b', sum(b).. ] FROM.. group by .. *? > > On Wed, Mar 20, 2019 at 2:00 AM Fabian Hueske wrote: > >> Hi, >> >> I think this would work. >> However, you should be aware that all keys

Re: InvalidProgramException when trying to sort a group within a dataset

2019-04-08 Thread Fabian Hueske
o success. > > I got the same NotSerializableException. > > > > Best, > > Konstantinos > > > > *From:* Fabian Hueske > *Sent:* Σάββατο, 6 Απριλίου 2019 2:26 πμ > *To:* Papadopoulos, Konstantinos > > *Cc:* Chesnay Schepler ; user > *Subject:* Re: InvalidProgramExcep

Re: Lantency caised Flink Checkpoint on EXACTLY_ONCE mode

2019-04-08 Thread Fabian Hueske
Hi Min, Guowei is right, the comment in the documentation about exactly-once in embarrassingly parallel data flows refers to exactly-once *state consistency*, not *end-to-end* exactly-once. However, in strictly forwarding pipelines, enabling exactly-once checkpoints should not have drawbacks compa

Re: Partitioning key range

2019-04-08 Thread Fabian Hueske
Hi Davood, Flink uses hash partitioning to assign keys to key groups. Each key group is then assigned to a task for processing (a task might process multiple key groups). There is no way to directly assign a key to a particular key group or task. All you can do is to experiment with different cust
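The two-step assignment described above (key to key group, key group to task) can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the hash used here is a simple stand-in, not the hash function Flink applies internally. The range-style mapping from key group to task is how I understand Flink to distribute key groups, but treat it as an assumption as well.

```java
/** Simplified, hypothetical model of key -> key group -> parallel task assignment. */
final class KeyGroupSketch {

    /** Maps a key to one of maxParallelism key groups (stand-in hash, not Flink's). */
    static int keyGroupFor(Object key, int maxParallelism) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Maps a key group to a task index so that each task owns a contiguous range of key groups. */
    static int taskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // number of key groups
        int parallelism = 4;      // number of parallel tasks
        for (String key : new String[] {"sensor-1", "sensor-2", "sensor-3"}) {
            int group = keyGroupFor(key, maxParallelism);
            System.out.println(key + " -> key group " + group
                    + " -> task " + taskFor(group, maxParallelism, parallelism));
        }
    }
}
```

The sketch also shows why a key cannot be pinned to a task: the only inputs are the key's hash, the number of key groups, and the parallelism.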

Re: InvalidProgramException when trying to sort a group within a dataset

2019-04-05 Thread Fabian Hueske
Hi, Your POJO should implement the Serializable interface. Otherwise it's not considered to be serializable. Best, Fabian On Wed, Apr 3, 2019 at 07:22, Papadopoulos, Konstantinos wrote: > Hi Chesnay, > > > > Thanks for your support. ThresholdAcvFact class is a simple POJO with the > following d

Re: Source reinterpretAsKeyedStream

2019-04-05 Thread Fabian Hueske
Hi, Konstantin is right. reinterpretAsKeyedStream only works if you call it on a DataStream that was keyBy'ed before (with the same parallelism). Flink cannot reuse the partitioning of another system like Kafka. Best, Fabian On Thu, Apr 4, 2019 at 14:33, Adrienne Kole wrote: > Thanks a lot for

Re: Support for custom triggers in Table / SQL

2019-03-29 Thread Fabian Hueske
Hi Piyush, Custom triggers (or early firing) are currently not supported by SQL or the Table API. They are also not on the roadmap [1]. Currently, most efforts on the relational API are focused on restructuring the code and working towards the integration of the Blink contribution [2]. AFAIK, there a

[REMINDER] Flink Forward San Francisco in a few days

2019-03-20 Thread Fabian Hueske
Hi everyone, *Flink Forward San Francisco 2019 will take place in a few days on April 1st and 2nd.* If you haven't done so already and are planning to attend, you should register soon at: -> https://sf-2019.flink-forward.org/register Don't forget to use the 25% discount code *MailingList* for ma

Re: How to split tuple2 returned by UDAF into two columns in a result table

2019-03-20 Thread Fabian Hueske
Hi Dongwon, Couldn't you just return a tuple from the aggregation function and extract the fields from the nested tuple using a value access function [1]? Table table2 = table1 .window(Slide.over("3.rows").every("1.rows").on("time").as("w")) .groupBy("w, name") .select("name, my

Re: Schema Evolution on Dynamic Schema

2019-03-20 Thread Fabian Hueske
p doesnt work for me. >> Trying something like >> Select a, map(sum(b) as metric_sum_c ,sum(c) as metric_sum_c) as >> metric_map >> group by a >> >> results with "Non-query expression encountered in illegal context" >> is my train of thought the rig

Re: Set partition number of Flink DataSet

2019-03-20 Thread Fabian Hueske
mber in > Flink may be compelling in batch processing. > > Could you help explain a bit more about what work needs to be done, so > Flink can support custom partition numbers? We would be willing to > help improve this area. > > Thanks, > Qi > > On Mar 15, 2019,

Re: Schema Evolution on Dynamic Schema

2019-03-19 Thread Fabian Hueske
Hi, Restarting a changed query from a savepoint is currently not supported. In general this is a very difficult problem as new queries might result in completely different execution plans. The special case of adding and removing aggregates is easier to solve, but the schema of the stored state cha

Re: Set partition number of Flink DataSet

2019-03-15 Thread Fabian Hueske
Hi, Flink works a bit differently than Spark. By default, Flink uses pipelined shuffles which push results of the sender immediately to the receivers (btw. this is one of the building blocks for stream processing). However, pipelined shuffles require that all receivers are online. Hence, there num

Re: Joining two streams of different priorities

2019-03-11 Thread Fabian Hueske
Hi, This is not possible with Flink. Events in transport channels cannot be reordered and functions cannot pick which input to read from. There are some upcoming changes for the unified batch-stream integration that make it possible to choose which input to read from, but this is not there yet, AFAIK. Best,

Re: EventCountJob for Flink 1.7.2

2019-03-05 Thread Fabian Hueske
Thanks Flavio! On Tue, Mar 5, 2019 at 11:23, Flavio Pompermaier < pomperma...@okkam.it> wrote: > I discovered that now (in Flink 1.7.2) queryable state server is enabled if > queryable state client is found on the classpath, i.e.: > > > org.apache.flink > flink-queryable-state-client-java_$

Re: Using Flink in an university course

2019-03-04 Thread Fabian Hueske
Hi Wouter, We are using Docker Compose (Flink JM, Flink TM, Kafka, Zookeeper) setups for our trainings and it is working very well. We have an additional container that feeds a Kafka topic via the commandline producer to simulate a somewhat realistic behavior. Of course, you can do it without Kafk

Re: KeyBy distribution across taskslots

2019-02-28 Thread Fabian Hueske
Hi, The answer is in fact no. Flink hash-partitions keys into Key Groups [1] which are uniformly assigned to tasks, i.e., a task can process more than one key group. AFAIK, there are no plans to change this behavior. Stefan (in CC) might be able to give more details on this. Something that might

Re: Assigning timestamps and watermarks several times, several datastreams?

2019-02-20 Thread Fabian Hueske
Hi, Watermarks of streams are independent as long as the streams are not connected with each other. When you union, join, or connect two streams in any other way, their watermarks are fused, which means that they are synced to the "slower" stream, i.e., the stream with the earlier watermarks. Bes
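As a rough illustration of the "synced to the slower stream" rule, here is a plain-Java model of an operator that tracks one watermark per input and reports the minimum. The class name and methods are hypothetical and this is not Flink code; using `Long.MIN_VALUE` for an input that has not reported a watermark yet is an assumption of the sketch.

```java
import java.util.Arrays;

/** Hypothetical plain-Java model of how an operator with several inputs derives its watermark. */
final class CombinedWatermark {
    // Latest watermark seen per input; Long.MIN_VALUE means "no watermark received yet".
    private final long[] inputWatermarks;

    CombinedWatermark(int numInputs) {
        inputWatermarks = new long[numInputs];
        Arrays.fill(inputWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark from one input and returns the operator's combined watermark. */
    long onWatermark(int input, long watermark) {
        // A watermark never moves backwards on a single input.
        inputWatermarks[input] = Math.max(inputWatermarks[input], watermark);
        return current();
    }

    /** The combined watermark is the minimum over all inputs, so the slowest input wins. */
    long current() {
        return Arrays.stream(inputWatermarks).min().getAsLong();
    }

    public static void main(String[] args) {
        CombinedWatermark wm = new CombinedWatermark(2);
        System.out.println(wm.onWatermark(0, 1_000L)); // second input has not reported yet
        System.out.println(wm.onWatermark(1, 400L));   // second input is now the slower one
        System.out.println(wm.onWatermark(1, 2_000L)); // first input is now the slower one
    }
}
```

In this model an input that never emits watermarks keeps the combined watermark at its initial value, so time-based operations downstream would never fire.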

Re: subscribe

2019-02-18 Thread Fabian Hueske
Hi Artur, In order to subscribe to Flink's user mailing list you need to send a mail to user-subscr...@flink.apache.org Best, Fabian Am Mo., 18. Feb. 2019 um 20:34 Uhr schrieb Artur Mrozowski : > art...@gmail.com >

Re: Dataset sampling

2019-02-18 Thread Fabian Hueske
Hi Flavio, I'm not aware of any particular plan to add sampling operators to the Table API or SQL. However, I agree. It would be a good feature. Best, Fabian Am Mo., 18. Feb. 2019 um 15:44 Uhr schrieb Flavio Pompermaier < pomperma...@okkam.it>: > Hi to all, > is there any plan to support differ

Re: Test FileUtilsTest.testDeleteDirectory failed when building Flink

2019-02-18 Thread Fabian Hueske
Hi Paul, Which components (Flink, JDK, Docker base image, ...) are you upgrading and which versions do you come from? I think it would be good to check how (and with which options) the JVM in the container is started. Best, Fabian Am Fr., 15. Feb. 2019 um 09:50 Uhr schrieb Paul Lam : > Hi all,

Re: KafkaTopicPartition internal class treated as generic type serialization

2019-02-18 Thread Fabian Hueske
Hi Eric, I did a quick search in our Jira to check if this is a known issue but didn't find anything. Maybe Gordon (in CC) knows a bit more about this problem. Best, Fabian Am Fr., 15. Feb. 2019 um 11:08 Uhr schrieb Eric Troies : > Hi, I'm having the exact same issue with flink 1.4.0 using scal

Re: Limit in batch flink sql job

2019-02-18 Thread Fabian Hueske
Thanks for pointing this out! This is indeed a bug in the documentation. I'll fix that. Thank you, Fabian Am Mi., 13. Feb. 2019 um 02:04 Uhr schrieb yinhua.dai < yinhua.2...@outlook.com>: > OK, thanks. > It might be better to update the document which has the following example > that confused m

Re: Reduce one event under multiple keys

2019-02-18 Thread Fabian Hueske
Am Mo., 11. Feb. 2019 um 11:29 Uhr schrieb Stephen Connolly < stephen.alan.conno...@gmail.com>: > > > On Mon, 11 Feb 2019 at 09:42, Fabian Hueske wrote: > >> Hi Stephen, >> >> A window is created with the first record that is assigned to it. >> If the wind

Re: [Table] Types of query result and tablesink do not match error

2019-02-18 Thread Fabian Hueske
Hi François, I had a look at the code and the GenericTypeInfo checks equality by comparing the classes they represent (Class == Class). Class does not override the default implementation of equals, so this is an instance equality check. The check can evaluate to false, if Map was loaded by two diff

Re: [Meetup] Apache Flink+Beam+others in Seattle. Feb 21.

2019-02-18 Thread Fabian Hueske
Thank you Pablo! Am Fr., 15. Feb. 2019 um 20:42 Uhr schrieb Pablo Estrada : > Hello everyone, > There is an upcoming meetup happening in the Google Seattle office, on > February 21st, starting at 5:30pm: > https://www.meetup.com/seattle-apache-flink/events/258723322/ > > People will be chatting a

Re: How to load multiple same-format files with single batch job?

2019-02-15 Thread Fabian Hueske
roduce a DataSet > from a single geojson file. > This doesn't sound compatible with a custom InputFormat, does it? > > Thanks in advance for any additional hint, all the best > > François > > On Mon, Feb 4, 2019 at 12:10, Fabian Hueske wrote: > >> Hi

Re: Window elements for certain period for delayed processing

2019-02-14 Thread Fabian Hueske
Hi, I would not use a window for that. Implementing the logic with a ProcessFunction seems more straight-forward. The function simply collects all events between 00:00 and 01:00 in a ListState and emits them when the time passes 01:00. All other records are simply forwarded. Best, Fabian Am Fr.,
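The buffering logic described above can be modeled in plain Java as follows. This is a sketch, not a Flink ProcessFunction: the class `HoldBackBuffer` and its methods are hypothetical, the in-memory list stands in for the ListState mentioned in the message, and `onTime` stands in for whatever callback reports that time has passed the release point (in Flink this would presumably be a timer, which is an assumption here).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model: hold back records that fall into [holdStart, holdEnd), forward all others. */
final class HoldBackBuffer<T> {
    private final long holdStart; // inclusive, e.g. 00:00 as epoch millis
    private final long holdEnd;   // exclusive, e.g. 01:00 as epoch millis; also the release time
    private final List<T> buffered = new ArrayList<>(); // stands in for ListState

    HoldBackBuffer(long holdStart, long holdEnd) {
        this.holdStart = holdStart;
        this.holdEnd = holdEnd;
    }

    /** Returns the records to emit right now: nothing if the record is held back, else the record. */
    List<T> onElement(long timestamp, T value) {
        if (timestamp >= holdStart && timestamp < holdEnd) {
            buffered.add(value);
            return Collections.emptyList();
        }
        return Collections.singletonList(value);
    }

    /** Called when time advances; releases the buffered records once time reaches holdEnd. */
    List<T> onTime(long currentTime) {
        if (currentTime < holdEnd || buffered.isEmpty()) {
            return Collections.emptyList();
        }
        List<T> out = new ArrayList<>(buffered);
        buffered.clear();
        return out;
    }
}
```

Keeping the buffer in Flink state rather than in a field, as the message suggests, is what would make the held-back records survive a failure.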

Re: [DISCUSS] Adding a mid-term roadmap to the Flink website

2019-02-14 Thread Fabian Hueske
Hi, I like the idea of putting the roadmap on the website because it is much more visible (and IMO more credible, obligatory) there. However, I share the concerns about frequent updates. I think it would be great to update the "official" roadmap on the website once per release (-bugfix releases)

Re: Program of Flink Forward San Francisco 2019 announced

2019-02-12 Thread Fabian Hueske
Am Di., 12. Feb. 2019 um 16:26 Uhr schrieb Fabian Hueske : > Hi everyone, > > We announced the program of Flink Forward San Francisco 2019. > The conference takes place at the Hotel Nikko in San Francisco on April > 1st and 2nd. > > On the first day we offer three

Program of Flink Forward San Francisco 2019 announced

2019-02-12 Thread Fabian Hueske
Hi everyone, We announced the program of Flink Forward San Francisco 2019. The conference takes place at the Hotel Nikko in San Francisco on April 1st and 2nd. On the first day we offer three training sessions [1]: * Introduction to Streaming with Apache Flink * Analyzing Streaming Data with Flin

[ANNOUNCE] New Flink PMC member Thomas Weise

2019-02-12 Thread Fabian Hueske
Hi everyone, On behalf of the Flink PMC I am happy to announce Thomas Weise as a new member of the Apache Flink PMC. Thomas is a long time contributor and member of our community. He is starting and participating in lots of discussions on our mailing lists, working on topics that are of joint int

Re: Limit in batch flink sql job

2019-02-12 Thread Fabian Hueske
Hi, It's as the error message says. LIMIT 10 without ORDER BY would pick 10 random rows and hence lead to non-deterministic results. That's why it is not supported yet. Best, Fabian Am Di., 12. Feb. 2019 um 07:02 Uhr schrieb yinhua.dai < yinhua.2...@outlook.com>: > Why flink said "Limiting the

Re: fllink 1.7.1 and RollingFileSink

2019-02-11 Thread Fabian Hueske
Hi Vishal, Kostas (in CC) should be able to help here. Best, Fabian Am Mo., 11. Feb. 2019 um 00:05 Uhr schrieb Vishal Santoshi < vishal.santo...@gmail.com>: > Any one ? > > On Sun, Feb 10, 2019 at 2:07 PM Vishal Santoshi > wrote: > >> You don't have to. Thank you for the input. >> >> On Sun, F

Re: Is there a windowing strategy that allows a different offset per key?

2019-02-11 Thread Fabian Hueske
Hi Stephen, First of all, yes, windows computing and emitting at the same time can cause pressure on the downstream system. There are a few ways how you can achieve this: * use a custom window assigner. A window assigner decides into which window a record is assigned. This is the approach you sug
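One possible reading of the custom-assigner option is to give each key its own offset so that windows of different keys do not all fire at the same instant. The arithmetic can be sketched in plain Java; the class and method names are hypothetical, and deriving the offset from the key's hash code is an assumption of this sketch, not something the thread prescribes.

```java
/** Hypothetical helper: tumbling window boundaries shifted by a per-key offset. */
final class KeyOffsetWindows {

    /** Derives a stable offset in [0, size) from the key (assumption: hash-based). */
    static long offsetFor(Object key, long size) {
        return Math.floorMod((long) key.hashCode(), size);
    }

    /** Start of the window of length size, shifted by offset, that contains timestamp. */
    static long windowStart(long timestamp, long size, long offset) {
        return timestamp - Math.floorMod(timestamp - offset, size);
    }

    public static void main(String[] args) {
        long size = 60_000L; // one-minute windows
        for (String key : new String[] {"a", "b", "c"}) {
            long offset = offsetFor(key, size);
            long start = windowStart(125_000L, size, offset);
            System.out.println(key + ": window [" + start + ", " + (start + size) + ")");
        }
    }
}
```

Because the offset depends only on the key, every record of a key maps to the same window grid, while different keys close their windows at different times.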

Re: Reduce one event under multiple keys

2019-02-11 Thread Fabian Hueske
Hi Stephen, A window is created with the first record that is assigned to it. If the windows are based on time and a key, then no window will be created (and no space will be occupied) if there is not a first record for a key and time interval. Anyway, if tracking the number of open files & average o

Re: How to add caching to async function?

2019-02-04 Thread Fabian Hueske
Hi William, Does the cache need to be fault tolerant? If not you could use a regular in-memory map as cache (+some LRU cleaning). Or do you expect the cache to grow too large for the memory? Best, Fabian On Mon, Feb 4, 2019 at 18:00, William Saar wrote: > Hi, > I am trying to implement
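A minimal sketch of the "in-memory map plus LRU cleaning" idea, using only the JDK. The class name `LruCache` is hypothetical. The cache is not fault tolerant and not thread-safe as written, so concurrent access from asynchronous callbacks would need external synchronization.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded in-memory cache that evicts the least recently used entry. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true makes iteration order follow access recency.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked by put(): drop the eldest entry once the size limit is exceeded.
        return size() > maxEntries;
    }
}
```

A lookup would first consult the cache and only issue the asynchronous request on a miss, storing the result once it arrives.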

Re: Is the order guaranteed with Windowall

2019-02-04 Thread Fabian Hueske
rtStreamKeyed.window(TumblingProcessingTimeWindowsetParallelism(1).name("Aggregate > events"); > > Thanks > David > > On 2019/02/04 13:54:14, Fabian Hueske wrote: > > Hi, > > > > A WindowAll is executed in a single task. If you sort the data bef

Re: Reverse of KeyBy

2019-02-04 Thread Fabian Hueske
Key)? My understanding is that this will further partition the > already partitioned input stream (from 1 above) and will not help me, as I > need to process all LargeMessages for a given MyKey in order. > > > > Is there an implicit assumption here that the flatMap operation (2) a

Re: Add header to a file produced using the writeAsFormattedText method

2019-02-04 Thread Fabian Hueske
nos.papadopou...@iriworldwide.com>: > Hi Fabian, > > > > Do you know if there is any plan Flink core framework to support such > functionality? > > > > Best, > > Konstantinos > > > > *From:* Fabian Hueske > *Sent:* Δευτέρα, 4 Φεβρουαρίου 2019 3:49 μμ > *To:*

Re: Reverse of KeyBy

2019-02-04 Thread Fabian Hueske
Hi, Calling keyBy twice will not work, because the second call overrides the first. You can keyBy on a composite key (MyKey, LargeMessageId). You can do the following: InputStream .keyBy ( (MyKey, LargeMessageId) ) // use composite key .flatMap(new MyReassemblyFunction()) .keyBy(MyKey) .?

Re: Test harness for CoProcessFunction outputting Protobuf messages

2019-02-04 Thread Fabian Hueske
Hi Alexey, I think you are right. It does not seem to be possible to provide a TypeInformation for side outputs to a TestHarness. This sounds like a useful addition. Would you mind creating a Jira issue for that? Thank you, Fabian Am So., 3. Feb. 2019 um 19:13 Uhr schrieb Alexey Trenikhun : >

Re: Is the order guaranteed with Windowall

2019-02-04 Thread Fabian Hueske
Hi, A WindowAll is executed in a single task. If you sort the data before the window, the sorting must also happen in a single task, i.e., with parallelism 1. The reason is that an operator somewhat randomly merges multiple input partitions. So even if each input partition is sorted, the merging

Re: Add header to a file produced using the writeAsFormattedText method

2019-02-04 Thread Fabian Hueske
Hi Konstantinos, Writing headers to files is currently not supported by the underlying TextOutputFormat. You can implement a custom OutputFormat by extending TextOutputFormat to add this functionality. Best, Fabian Am Fr., 1. Feb. 2019 um 16:04 Uhr schrieb Papadopoulos, Konstantinos < konstantin

Re: How to load multiple same-format files with single batch job?

2019-02-04 Thread Fabian Hueske
> before to be processed or will all be streamed? > > All the best > François > > Le mar. 29 janv. 2019 à 22:20, Fabian Hueske a écrit : > >> Hi, >> >> You can point a file-based input format to a directory and the input >> format should read all fil

Re: How to load multiple same-format files with single batch job?

2019-01-29 Thread Fabian Hueske
Hi, You can point a file-based input format to a directory and the input format should read all files in that directory. That also works for TableSources that internally use file-based input formats. Is that what you are looking for? Best, Fabian Am Mo., 28. Jan. 2019 um 17:22 Uhr schrieb

Re: Select feilds in Table API

2019-01-29 Thread Fabian Hueske
The problem is that the table "lineitem" does not have a field "l_returnflag". The fields in "lineitem" are named [TMP_2, TMP_5, TMP_1, TMP_0, TMP_4, TMP_6, TMP_3]. I guess it depends on how you obtained lineitem. Best, Fabian Am Di., 29. Jan. 2019 um 16:38 Uhr schrieb Soheil Pourbafrani < soheil

Re: Flink Jdbc streaming source support in 1.7.1 or in future?

2019-01-23 Thread Fabian Hueske
I think this is very hard to build in a generic way. The common approach here would be to get access to the changelog stream of the table, writing it to a message queue / event log (like Kafka, Pulsar, Kinesis, ...) and ingesting the changes from the event log into a Flink application. You can of

Re: Kafka stream fed in batches throughout the day

2019-01-22 Thread Fabian Hueske
Hi Jonny, I think this is a good use case for event time stream processing. The idea of taking a savepoint, stopping and later resuming the job is good as it frees the resources that would otherwise be occupied by the idle job. In that sense it would behave like a batch job. However, in contrast to

Re: Temporal tables not behaving as expected

2019-01-22 Thread Fabian Hueske
Hi, The problem is that you are using processing time which is non-deterministic. Both inputs are consumed at the same time and joined based on which record arrived first. The result depends on a race condition. If you change the input table to have event time attributes and use these to register

Re: [DISCUSS] Towards a leaner flink-dist

2019-01-18 Thread Fabian Hueske
Hi Chesnay, Thank you for the proposal. I think this is a good idea. We follow a similar approach already for Hadoop dependencies and connectors (although in application space). +1 Fabian Am Fr., 18. Jan. 2019 um 10:59 Uhr schrieb Chesnay Schepler < ches...@apache.org>: > Hello, > > the binary

Re: Should the entire cluster be restarted if a single Task Manager crashes?

2019-01-18 Thread Fabian Hueske
Hi Harshith, No, you don't need to restart the whole cluster. Flink only needs enough processing slots to recover the job. If you have a standby TM, the job should restart immediately (according to its restart policy). Otherwise, you have to start a new TM to provide more slots. Once the slots are

Re: delete all available flink timers on app start

2019-01-17 Thread Fabian Hueske
Hi Vipul, I'm not aware of a way to do this. You could have a list of all registered timers per key as state to be able to delete them. However, the problem is to identify in user code when an application was restarted, i.e., to know when to delete timers. Also, timer deletion would need to be don
