Re: CSV writer/parser inconsistency when using the Table API?

2017-12-22 Thread Fabian Hueske
Hi Cliff, you are right. The CsvTableSink and the CsvInputFormat are not in sync. It would be great if you could open a JIRA ticket for this issue. As a workaround, you could implement your own CsvTableSink to add a delimiter after the last field. The code is straightforward, less than 150 lines

Re: Flink upgrade compatibility table

2017-12-21 Thread Fabian Hueske
Hi Colin, thanks for pointing out this gap in the docs! I created FLINK-8303 [1] to extend the table and updated the release process documentation [2] to update the page for new releases. Thank you, Fabian [1] https://issues.apache.org/jira/browse/FLINK-8303 [2]

Re: A question about Triggers

2017-12-21 Thread Fabian Hueske
Remove from a PriorityQueue is expensive >> ? Trigger Context does expose another version that has removal abilities >> so was wondering why this dissonance... >> >> On Tue, Dec 19, 2017 at 4:53 AM, Fabian Hueske <fhue...@gmail.com> wrote: >> >>> Hi Visha

Re: A question about Triggers

2017-12-19 Thread Fabian Hueske
rather than off an operator >>>> that precedes the Window? This is doable using ProcessWindowFunction using >>>> state but only in the case of non-mergeable windows. >>>> >>>> The best API option I think is a TimeBaseTrigger that fires every >>

Re: how flink extracts timestamp from transformed elements?

2017-12-18 Thread Fabian Hueske
results belonging to the > same window? > > 2017-12-18 18:51 GMT+08:00 Fabian Hueske <fhue...@gmail.com>: > > If you define a keyed window (use keyBy()), the results are not merged. > > For each key, the window is individually evaluated and all results of > > windows for

Re: how flink extracts timestamp from transformed elements?

2017-12-18 Thread Fabian Hueske
> > 2017-12-18 18:02 GMT+08:00 Fabian Hueske <fhue...@gmail.com>: > > Hi, > > > > timestamps are handled as meta-data in Flink's DataStream API. > > This means that Flink automatically maintains the timestamps and ensures > > that all records which w

Re: A question about Triggers

2017-12-18 Thread Fabian Hueske
Hi Vishal, the Trigger is not designed to augment records but just to control when a window is evaluated. I would recommend using a ProcessFunction to enrich records with the current watermark before passing them into the window operator. The context object of the processElement() method gives

Re: context in OnElement of Trigger

2017-12-18 Thread Fabian Hueske
Hi, TriggerContext.getWaterMark() returns the current watermark (i.e., event-time) of the window operator. An operator tracks for each of its inputs the maximum received watermark and computes its own watermark as the minimum of all these maximums. As long as an operator has not received watermarks
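The min-of-maximums rule described above can be sketched in plain Java. This is an illustration of the logic only, not Flink's internal code, and the class name is invented:

```java
import java.util.Arrays;

// Illustration of the rule described above (not Flink internals):
// an operator tracks the maximum watermark received per input and
// emits the minimum of these maximums as its own watermark.
public class WatermarkTracker {
    private final long[] maxPerInput;

    public WatermarkTracker(int numInputs) {
        maxPerInput = new long[numInputs];
        Arrays.fill(maxPerInput, Long.MIN_VALUE); // nothing received yet
    }

    /** Records a watermark from one input and returns the operator's watermark. */
    public long onWatermark(int input, long watermark) {
        maxPerInput[input] = Math.max(maxPerInput[input], watermark);
        return Arrays.stream(maxPerInput).min().getAsLong();
    }
}
```

Until every input has delivered a watermark, the operator's own watermark stays at its initial minimum, which is why an idle input holds back event time.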

Re: how flink extracts timestamp from transformed elements?

2017-12-18 Thread Fabian Hueske
Hi, timestamps are handled as meta-data in Flink's DataStream API. This means that Flink automatically maintains the timestamps and ensures that all records which were aligned with the watermarks (i.e., not late) are still aligned. If records are aggregated in a time window, the aggregation

Re: Job-level close()?

2017-12-18 Thread Fabian Hueske
Hi Andrew, I'm not aware of such a plan. Another way to address such issues is to run multiple TaskManagers with a single slot each. In that case, only one subtask is executed per TM process. Best, Fabian 2017-12-15 22:23 GMT+01:00 Andrew Roberts : > Hello, > > I’m writing a
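For reference, the single-slot-per-TaskManager setup mentioned above corresponds to one setting in flink-conf.yaml (how many TM processes run is then controlled by how many TaskManagers you start):

```yaml
# flink-conf.yaml — one slot per TaskManager process
taskmanager.numberOfTaskSlots: 1
```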

Re: [Docs] Can't add metrics to RichFilterFunction

2017-12-18 Thread Fabian Hueske
Thanks for reporting the issue. I've filed FLINK-8278 [1] to fix the issue. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-8278 2017-12-14 14:04 GMT+01:00 Kien Truong : > That syntax is incorrect, should be. > > @transient private var counter:Counter = _

Re: Flink vs Spark streaming benchmark

2017-12-16 Thread Fabian Hueske
Hi, In case you haven't seen it yet. Here's an analysis and response to Databricks' benchmark [1]. Best, Fabian [1] https://data-artisans.com/blog/curious-case-broken-benchmark-revisiting-apache-flink-vs-databricks-runtime 2017-11-13 11:44 GMT+01:00 G.S.Vijay Raajaa :

Re: Consecutive windowed operations

2017-12-15 Thread Fabian Hueske
Hi Ron, chaining windows as shown in the example was also possible before 1.4.0. So you can keep using Flink 1.3.2 if this would be the only reason to update to 1.4.0. Best, Fabian 2017-12-15 1:14 GMT+01:00 Ron Crocker : > In the 1.4 docs I stumbled on this section:

Re: Flink 1.4.0 can not override JAVA_HOME for single-job deployment on YARN

2017-12-15 Thread Fabian Hueske
Thanks for reporting back! 2017-12-15 4:52 GMT+01:00 杨光 : > Yes , i'm using Java8 , and i found the 1.4 version provided new > parameters : "containerized.master.env.ENV_VAR1" and > "containerized.taskmanager.env". > I change my start command from "-yD

Re: streaming Table API - strange behavior

2017-12-14 Thread Fabian Hueske
t trigger to fire in regular intervals > (e.g. every 5 seconds) using table API? > > > On 14.12.2017 17:57, Fabian Hueske wrote: > > Hi, > > you are using a BoundedOutOfOrdernessTimestampExtractor to generate > watermarks. > The BoundedOutOfOrdernessTimestampExtractor is a

Re: streaming Table API - strange behavior

2017-12-14 Thread Fabian Hueske
Hi, you are using a BoundedOutOfOrdernessTimestampExtractor to generate watermarks. The BoundedOutOfOrdernessTimestampExtractor is a periodic watermark assigner and only generates watermarks if a watermark interval is configured. Without watermarks, the query cannot "make progress" and only

Re: FlinkKafkaProducer011 and Flink 1.4.0 Kafka docs

2017-12-14 Thread Fabian Hueske
Hi Elias, thanks for reporting this issue. I created FLINK-8260 [1] to extend the documentation. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-8260 2017-12-14 1:07 GMT+01:00 Elias Levy : > Looks like the Flink Kafka connector page, in the Producer

Re: Watermark in broadcast

2017-12-14 Thread Fabian Hueske
Hi Seth, that's not possible with the current interface. There have been some discussions about how to address issues of idle sources (or partitions). Aljoscha (in CC) should know more about that. Best, Fabian 2017-12-13 18:13 GMT+01:00 Seth Wiesman : > Quick follow up

Re: Could not flush and close the file system output stream to s3a, is this fixed?

2017-12-14 Thread Fabian Hueske
Bowen Li (in CC) closed the issue but there is no fix (or at least it is not linked in the JIRA). Maybe it was resolved as part of another issue or can be resolved differently. @Bowen, can you comment on how to fix this problem? Will it work in Flink 1.4.0? Thank you, Fabian 2017-12-13 5:28 GMT+01:00

Re: Calling an operator serially

2017-12-12 Thread Fabian Hueske
Hi, you are right. The purpose of a KeyedStream is to process all events/records with the same key by the same operator task (which runs in a single thread). The operator itself can have a greater parallelism, such that different keys are processed by different tasks. Best, Fabian 2017-12-13

Re: what's the best practice to determine event-time watermark for unordered stream?

2017-12-12 Thread Fabian Hueske
by the > source? > > 2017-12-12 19:25 GMT+08:00 Fabian Hueske <fhue...@gmail.com>: > > Early events are usually not an issue because they can be kept in state > until > > they are ready to be processed. > > Also, depending on the watermark assigner often pus

Re: [ANNOUNCE] Apache Flink 1.4.0 released

2017-12-12 Thread Fabian Hueske
Thank you Aljoscha for managing the release! 2017-12-12 12:46 GMT+01:00 Aljoscha Krettek : > The Apache Flink community is very happy to announce the release of Apache > Flink 1.4.0. > > Apache Flink® is an open-source stream processing framework for > distributed,

Re: what's the best practice to determine event-time watermark for unordered stream?

2017-12-12 Thread Fabian Hueske
early event? > > > 2017-12-12 18:24 GMT+08:00 Fabian Hueske <fhue...@gmail.com>: > > Hi, > > > > this depends on how you generate watermarks [1]. > > You could generate watermarks with a four hour delay and be fine (at the > > cost of a four hour latenc

Re: what's the best practice to determine event-time watermark for unordered stream?

2017-12-12 Thread Fabian Hueske
Hi, this depends on how you generate watermarks [1]. You could generate watermarks with a four hour delay and be fine (at the cost of a four hour latency) or have some checks that you don't increment a watermark by more than x minutes at a time. These considerations are quite use case specific,
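The first option (a fixed four-hour delay) is essentially what a bounded-out-of-orderness assigner does. A pure-Java sketch of that logic follows; the class name is invented and this is not the actual Flink implementation:

```java
// Sketch of a bounded-out-of-orderness assigner's logic (invented class
// name, not the Flink implementation): the watermark trails the highest
// timestamp seen so far by a fixed delay.
public class BoundedLatenessWatermark {
    private final long maxDelayMillis;
    private long maxSeenTimestamp = Long.MIN_VALUE;

    public BoundedLatenessWatermark(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    public void onEvent(long timestampMillis) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, timestampMillis);
    }

    /** Events with a timestamp below this value are considered late. */
    public long currentWatermark() {
        return maxSeenTimestamp - maxDelayMillis;
    }
}
```

The trade-off from the mail is visible here: a larger delay tolerates more disorder but delays window results by the same amount.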

Re: Testing CoFlatMap correctness

2017-12-12 Thread Fabian Hueske
Hi Tovi, testing the behavior of a data flow with respect to the order of records from different sources is tricky. Source functions are working independently of each other and it is not easily possible to control the order in which records are shipped (and received) across source functions. You

Re: when does the timed window ends?

2017-12-12 Thread Fabian Hueske
...@gmail.com>: > OK, I see. > > But what if a window contains no elements? Is it still get fired and > invoke the window function? > > 2017-12-12 15:42 GMT+08:00 Fabian Hueske <fhue...@gmail.com>: > > Hi, > > > > this depends on the window type.

Re: could I chain two timed window?

2017-12-11 Thread Fabian Hueske
Hi, sliding windows replicate their records for each window. If you use an incrementally aggregating function (ReduceFunction, AggregateFunction) with a sliding window, the space requirement should not be an issue because each window stores a single value. However, this also means that each window

Re: when does the timed window ends?

2017-12-11 Thread Fabian Hueske
Hi, this depends on the window type. Tumbling and Sliding Windows are (by default) aligned with the epoch time (1970-01-01 00:00:00). For example a tumbling window of 2 hours starts and ends every two hours, i.e., from 12:00:00 to 13:59:59.999, from 14:00:00 to 15:59:59.999, etc. The
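The epoch alignment can be expressed as a one-line computation: round the event timestamp down to a multiple of the window size. This is a simplified sketch, not Flink's exact implementation, which also supports a configurable offset:

```java
// Epoch-aligned tumbling windows: the window start is the event timestamp
// rounded down to a multiple of the window size (simplified; no offset,
// non-negative timestamps assumed).
public class TumblingWindows {
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    static long windowEnd(long timestampMillis, long sizeMillis) {
        return windowStart(timestampMillis, sizeMillis) + sizeMillis; // exclusive
    }
}
```

For a 2-hour window (7,200,000 ms), an event at 13:30 on 1970-01-01 (timestamp 48,600,000) lands in the window from 12:00:00 to 13:59:59.999, matching the example above.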

Re: The timing operation is similar to storm’s tick

2017-12-11 Thread Fabian Hueske
Hi, I think you are looking for a ProcessFunction with timers [1]. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/process_function.html 2017-12-11 9:03 GMT+01:00 Marvin777 : > hi, > > I'm new to apache Flink. I want to

Re: aggregate does not allow RichAggregateFunction ?

2017-12-11 Thread Fabian Hueske
On Fri, Dec 8, 2017 at 9:11 AM, Aljoscha Krettek <aljos...@apache.org> >> wrote: >> >>> Hi, >>> >>> If you use an AggregatingFunction in this way (i.e. for a

Re: deserilize nested json

2017-12-08 Thread Fabian Hueske
Hi Sendoh, it is certainly possible to deserialize nested JSON. However, the JsonRowDeserializationSchema doesn't support it yet. You would either have to extend the class or implement a new one. Best, Fabian 2017-12-08 12:33 GMT+01:00 Sendoh : > Hi Flink users, > >

Re: Data loss in Flink Kafka Pipeline

2017-12-08 Thread Fabian Hueske
Hmm, I see... I'm running out of ideas. You might be right with your assumption about a bug in the Beam Flink runner. In this case, this would be an issue for the Beam project which hosts the Flink runner. But it might also be an issue on the Flink side. Maybe Aljoscha (in CC), one of the

Re: ClassNotFoundException in custom SourceFunction

2017-12-08 Thread Fabian Hueske
Hi, thanks a lot for investigating this problem and for the results you shared. This looks like a bug to me. I'm CCing Aljoscha who knows the internals of the DataStream API very well. Which Flink version are you using? Would you mind creating a JIRA issue [1] with all the info you provided so

Re: TableException: GenericTypeInfo cannot be converted to Table

2017-12-08 Thread Fabian Hueske
Yes. Adding .returns(typeInfo) works as well. :-) 2017-12-08 11:29 GMT+01:00 Fabian Hueske <fhue...@gmail.com>: > Hi, > > you give the TypeInformation to your user code but you don't expose it to > the DataStream API (the code of the FlatMapFunction is a black box for th

Re: TableException: GenericTypeInfo cannot be converted to Table

2017-12-08 Thread Fabian Hueske
Hi, you give the TypeInformation to your user code but you don't expose it to the DataStream API (the code of the FlatMapFunction is a black box for the API). Your FlatMapFunction should implement the ResultTypeQueryable interface and return the TypeInformation. Best, Fabian 2017-12-08 11:19

Re: Data loss in Flink Kafka Pipeline

2017-12-07 Thread Fabian Hueske
m> wrote: > >> Nishu >> >> You might consider sideouput with metrics at least after window. I would >> suggest having that to catch data screw or partition screw in all flink >> jobs and amend if needed. >> >> Chen >> >> On Thu, Dec 7, 201

Re: Flink Batch Performance degradation at scale

2017-12-07 Thread Fabian Hueske
TM in time > to see what it looks like. But each one I do look at the heap usage is > ~150MB/6.16GB (with fraction: 0.1) > > On Thu, Dec 7, 2017 at 11:59 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hmm, the OOM sounds like a bug to me. Can you provide the stacktrace? &

Re: Data loss in Flink Kafka Pipeline

2017-12-07 Thread Fabian Hueske
Is it possible that the data is dropped due to being late, i.e., records with timestamps behind the current watermark? What kind of operations does your program consist of? Best, Fabian 2017-12-07 10:20 GMT+01:00 Sendoh : > I would recommend to also print the count of

Re: ClassNotFoundException in custom SourceFunction

2017-12-07 Thread Fabian Hueske
Hi, A ClassNotFoundException should not be expected behavior. Can you post the stacktrace of the exception? We had a few issues in the past where Flink didn't use the correct classloader. So this would not be an unusual bug. Thanks, Fabian 2017-12-07 10:44 GMT+01:00 Tugdual Grall

Re: save points through REST API not supported ?

2017-12-07 Thread Fabian Hueske
We are currently voting on the third release candidate for 1.4.0. Feel free to propose this feature on the dev mailing list [1], but I don't think this will result in cancelling the vote. If we identify a blocking issue for 1.4.0, it could be included as well. But we're already a few weeks behind

Re: How to perform efficient DataSet reuse between iterations

2017-12-07 Thread Fabian Hueske
> There are some join functions though, I will look into applying them. > > Besides this, can you recommend an initial place in the code where one > should look to begin studying the optimizer? > > Thanks for your time once more, > > Best regards, > > Miguel E. Coimbra

Re: Flink Batch Performance degradation at scale

2017-12-07 Thread Fabian Hueske
s it used to be a lot smaller, I broke it out > manually by adding the sort/partition to see which steps were causing me > the slowdown, thinking it was my code, I wanted to separate the operations. > > Thank you again for your help. > > On Thu, Dec 7, 2017 at 4:49 AM, Fabian Hueske

Re: Trace jar file name from jobId, is that possible?

2017-12-07 Thread Fabian Hueske
AFAIK, a job keeps its ID in case of a recovery. Did you observe something else? 2017-12-07 17:32 GMT+01:00 Hao Sun <ha...@zendesk.com>: > I mean restarted during failure recovery > > On Thu, Dec 7, 2017 at 8:29 AM Fabian Hueske <fhue...@gmail.com> wrote: > >> W

Re: Trace jar file name from jobId, is that possible?

2017-12-07 Thread Fabian Hueske
ed somewhere and >> expose it through the api as well? >> >> On Mon, Dec 4, 2017 at 4:52 AM Fabian Hueske <fhue...@gmail.com> wrote: >> >>> Hi, >>> >>> you can submit jar files and start jobs via the REST interface [1]. >>> When s

Re: Flink Batch Performance degradation at scale

2017-12-07 Thread Fabian Hueske
different config for taking overall the same amount of ram? > > > > > On Wed, Dec 6, 2017 at 11:49 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Garrett, >> >> data skew might be a reason for the performance degradation. >> >> The plan you

Re: aggregate does not allow RichAggregateFunction ?

2017-12-06 Thread Fabian Hueske
Hi Vishal, you are right, it is not possible to use state in an AggregateFunction because windows need to be mergeable. An AggregateFunction knows how to merge its accumulators but merging generic state is not possible. I am not aware of an efficient and easy workaround for this. If you want to
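To illustrate why windows can merge accumulators but not generic state: the accumulator type itself defines how two instances combine. A minimal plain-Java sketch follows (the class is invented; Flink's actual AggregateFunction interface is not shown):

```java
// Why windows can merge accumulators but not generic state: the
// accumulator type defines its own merge operation, so two windows'
// partial results can always be combined into one.
public class AvgAccumulator {
    long sum;
    long count;

    void add(long value) {
        sum += value;
        count += 1;
    }

    // Well-defined for the accumulator; no equivalent exists for
    // arbitrary user state, which is why state access is disallowed.
    AvgAccumulator merge(AvgAccumulator other) {
        AvgAccumulator merged = new AvgAccumulator();
        merged.sum = this.sum + other.sum;
        merged.count = this.count + other.count;
        return merged;
    }

    double average() {
        return (double) sum / count;
    }
}
```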

Re: How to perform efficient DataSet reuse between iterations

2017-12-06 Thread Fabian Hueske
"; > ConfigOption akkaConfig = ConfigOptions.key(AkkaOptions. > FRAMESIZE.key()).defaultValue(akkaFrameSize); > clientConfig.setString(akkaConfig, akkaFrameSize); > env = new RemoteEnvironment(remoteAddress, remotePort, clientConfig, > jarFiles, null); > > I have run out of

Re: Flink Batch Performance degradation at scale

2017-12-05 Thread Fabian Hueske
Hi, Flink's operators are designed to work in memory as long as possible and spill to disk once the memory budget is exceeded. Moreover, Flink aims to run programs in a pipelined fashion, such that multiple operators can process data at the same time. This behavior can make it a bit tricky to

Re: Maintain heavy hitters in Flink application

2017-12-05 Thread Fabian Hueske
Hi, I haven't done that before either. The query API will change with the next version (Flink 1.4.0), which is currently being prepared for release. Kostas (in CC) might be able to help you. Best, Fabian 2017-12-05 9:52 GMT+01:00 m@xi : > Hi Fabian, > > Thanks for your

Re: Window function support on SQL

2017-12-05 Thread Fabian Hueske
d() > ) > > val kinesisStream = env.fromCollection(testData) > > tableEnv.registerDataStream(streamName, avroStream); > > val query = "SELECT nd_key, sum(concept_rank) FROM "+streamName + " GROUP > BY nd_key" > > Thanks, > Tao > > On Mon, Dec 4, 2017

Re: non-shared TaskManager-specific config file

2017-12-04 Thread Fabian Hueske
Can you create a JIRA issue to propose the feature? Thank you, Fabian 2017-12-04 16:15 GMT+01:00 Hao Sun : > Thanks. If we can support include configuration dir that will be very > helpful. > > On Mon, Dec 4, 2017, 00:50 Chesnay Schepler wrote: > >> You

Re: Kafka consumer to sync topics by event time?

2017-12-04 Thread Fabian Hueske
if > there are any windows that remain open for a very long time, but in general > it would be useful IMHO. Or Flink could even commit both (read vs. > triggered) offsets to kafka for monitoring purposes. > > On Mon, Dec 4, 2017 at 3:30 PM, Fabian Hueske <fhue...@gmail.com> wrote: >

Re: Kafka consumer to sync topics by event time?

2017-12-04 Thread Fabian Hueske
Hi Juho, the partitions of both topics are independently consumed, i.e., at their own speed without coordination. With the configuration that Gordon linked, watermarks are generated per partition. Each source task maintains the latest (and highest) watermark per partition and propagates the

Re: Trace jar file name from jobId, is that possible?

2017-12-04 Thread Fabian Hueske
Hi, you can submit jar files and start jobs via the REST interface [1]. When starting a job, you get the jobId. You can link jar files and savepoints via the jobId. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/rest_api.html#submitting-programs

Re: Flik typesafe configuration

2017-12-04 Thread Fabian Hueske
Hi Georg, The recommended approach to configure user functions is to pass parameters as (typesafe) arguments to the constructor. Flink serializes user function objects using Java serialization and distributes them to the workers. Hence, the configuration during plan construction is preserved.
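The pattern can be sketched in plain Java; the class and field names here are made up for illustration. The parameter is captured in a final field at plan-construction time, and because the object is Java-serializable, the value survives shipping to the workers (simulated below by a serialization round trip):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of the recommended pattern (invented names): configuration is
// captured in a final field at construction time; Java serialization
// carries it to the workers. roundTrip() simulates that shipping step.
public class ThresholdFilter implements Serializable {
    private final int threshold; // set when the plan is constructed

    public ThresholdFilter(int threshold) {
        this.threshold = threshold;
    }

    public boolean filter(int value) {
        return value > threshold;
    }

    static ThresholdFilter roundTrip(ThresholdFilter f) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(f);
            oos.flush();
            ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return (ThresholdFilter) ois.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```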

Re: Maintain heavy hitters in Flink application

2017-12-04 Thread Fabian Hueske
Hi Max, state (keyed or operator state) is always local to the task. By default it is not accessible (read or write) from the outside or other tasks of the application. You can expose keyed state as queryable state [1] to perform key look ups. This feature was designed for external application

Re: flink local & interactive development

2017-12-01 Thread Fabian Hueske
Hi Georg, I have no experience with SBT's console mode, so I cannot comment on that, but Flink provides a Scala REPL that might be useful [1]. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/scala_shell.html 2017-11-30 23:09 GMT+01:00 Georg Heiler

Re: ElasticSearch Connector for version 6.x and scala 2.11

2017-12-01 Thread Fabian Hueske
Hi Rahul, Flink does not provide a connector for ElasticSearch 6 yet. There is this JIRA issue to track the development progress [1]. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-8101 2017-12-01 7:22 GMT+01:00 Rahul Raj : > Hi All, > > Is there a Flink

Flink Forward San Francisco 2018 - Call for Presentation is open

2017-11-30 Thread Fabian Hueske
Dear Flink community, The Call for Presentations for Flink Forward San Francisco 2018 is now open! Share your experiences and best practices in stream processing, real-time analytics, and managing mission-critical Flink deployments in production. We’re happy to receive your talk ideas until

Re: user driven stream processing

2017-11-29 Thread Fabian Hueske
Another example is King's RBEA platform [1] which was built on Flink. In a nutshell, RBEA runs a single large Flink job, to which users can add queries that should be computed. Of course, the query language is restricted because the queries must match the structure of the running job. Hope

Re: How to perform efficient DataSet reuse between iterations

2017-11-29 Thread Fabian Hueske
in the > operator's execution. > > Is this possible? I was hoping I could retrieve this information in the > Java program itself and avoid processing logs. > > Thanks again. > > Best regards, > > > Miguel E. Coimbra > Email: miguel.e.coim...@gmail.com <miguel.e.coim

Re: How to perform efficient DataSet reuse between iterations

2017-11-28 Thread Fabian Hueske
utions and not just the results > of the last execution. > > > > Best regards, > > Miguel E. Coimbra > Email: miguel.e.coim...@gmail.com <miguel.e.coim...@ist.utl.pt> > Skype: miguel.e.coimbra > > On 27 November 2017 at 22:56, Fabian Hueske <fhue...@gmail.c

Re: How to perform efficient DataSet reuse between iterations

2017-11-27 Thread Fabian Hueske
Hi Miguel, I'm sorry but AFAIK, the situation has not changed. Is it possible that you are calling execute() multiple times? In that case, the 1st and 2nd graph would be recomputed before the 3rd graph is computed. That would explain the increasing execution time of 15 seconds. Best, Fabian

Re: Status of Kafka011JsonTableSink for 1.4.0 release?

2017-11-27 Thread Fabian Hueske
Hi George, Flink 1.4 will not include a KafkaTableSink for Kafka 0.11 but a DataStream API SinkFunction (KafkaProducer). As an alternative to using the Kafka010TableSink, you can also convert the result Table into a DataStream and use the KafkaProducer for Kafka 0.11 to emit the DataStream. We

Re: Apache Flink - Non equi joins

2017-11-27 Thread Fabian Hueske
Hi Mans, no, non-equi joins are not supported by the relational APIs because they can be prohibitively expensive to compute. There's one exception. Cross joins where one of the input tables is guaranteed to have a single row (because it is the result of a non-grouped aggregation) are supported in

Re: Dataset read csv file problem

2017-11-24 Thread Fabian Hueske
Hi Ebru, this case is not supported by Flink's CsvInputFormat. The problem is that such a file could not be read in parallel because it is not possible to identify record boundaries if you start reading in the middle of the file. We have a new CsvInputFormat under development that follows the RFC

Re: ElasticSearch 6

2017-11-24 Thread Fabian Hueske
Hi Fritz, the ElasticSearch connector has not been updated for ES6 yet. There is a JIRA issue [1] to add support for ES6 and somebody working on it as it seems. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-8101 2017-11-18 2:24 GMT+01:00 Fritz Budiyanto : >

Re: Accessing Cassandra for reading and writing

2017-11-24 Thread Fabian Hueske
Hi Andre, Do you have a batch or streaming use case? Flink provides Cassandra Input and OutputFormats for DataSet (batch) jobs and a Cassandra Sink for DataStream applications. There is no Cassandra source for DataStream applications. Regarding your error, this looks more like a Zeppelin

Re: external checkpoints

2017-11-24 Thread Fabian Hueske
Hi Aviad, sorry for the late reply. You can configure the checkpoint directory (which is also used for externalized checkpoints) when you create the state backend: env.setStateBackend(new RocksDBStateBackend("hdfs:///checkpoints-data/")); This configures the checkpoint directory to be

Re: How to write dataset as parquet format

2017-11-22 Thread Fabian Hueske
Hi Ebru, AvroParquetOutputFormat seems to implement Hadoop's OutputFormat interface. Flink provides a wrapper for Hadoop's OutputFormat [1], so you can try to wrap AvroParquetOutputFormat in Flink's HadoopOutputFormat. Hope this helps, Fabian [1]

Re: Streaming User-defined Aggregate Function gives exception when using Table API jars

2017-11-15 Thread Fabian Hueske
Hi Colin, thanks for reporting the bug. I had a look at it and it seems that the wrong classloader is used when compiling the code (both for the batch as well as the streaming queries). I have a fix that I need to verify. It's not necessary to open a new JIRA for that. We can cover all cases

Re: Flink drops messages?

2017-11-13 Thread Fabian Hueske
Thanks for the correction! :-) 2017-11-13 13:05 GMT+01:00 Kien Truong <duckientru...@gmail.com>: > Getting late elements from side-output is already available with Flink 1.3 > :) > > Regards, > > Kien > On 11/13/2017 5:00 PM, Fabian Hueske wrote: > > Hi Andrea,

Re: AvroParquetWriter may cause task managers to get lost

2017-11-13 Thread Fabian Hueske
Hi Ivan, I don't have much experience with Avro, but extracting the schema and creating a writer for each record sounds like a pretty expensive approach. This might result in significant load and increased GC activity. Do all records have a different schema or might it make sense to cache the

Re: Flink drops messages?

2017-11-13 Thread Fabian Hueske
Hi Andrea, you are right. Flink's window operators can drop messages which are too late, i.e., have a timestamp smaller than the last watermark. This is expected behavior and documented at several places [1] [2]. There are a couple of options how to deal with late elements: 1. Use more

Re: Metric Registry Warnings

2017-11-13 Thread Fabian Hueske
Hi Ashish, this is a known issue and has been fixed for the next version [1]. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-7100 2017-11-11 16:02 GMT+01:00 Ashish Pokharel : > All, > > Hopefully this is a quick one. I enabled Graphite reporter in my App and

Re: Testing / Configuring event windows with Table API and SQL

2017-11-10 Thread Fabian Hueske
Hi Colin, Flink's SQL runner does not support handling of late data yet. At the moment, late events are simply dropped. We plan to add support for late data in a future release. The "withIdleStateRetentionTime" parameter only applies to non-windowed aggregation functions and controls when they

Re: ExecutionGraph not serializable

2017-11-07 Thread Fabian Hueske
Hi XiangWei, I don't think this is a public interface, but Till (in CC) might know better. Best, Fabian 2017-11-06 3:27 GMT+01:00 XiangWei Huang : > Hi Flink users, > Flink Jobmanager throw a NotSerializableException when i used > JobMasterGateway to get ExecutionGraph

Re: DataStream to Table Api idioms

2017-11-06 Thread Fabian Hueske
Hi Seth, I think the Table API is not there yet to address you use case. 1. Allowed lateness cannot be configured but it is on the list of features that we plan to add in the future. 2. Custom triggers are not supported. We are planning to add an option to support your use case (early firing and

Re: Facing issues with Logback

2017-11-06 Thread Fabian Hueske
Hi Teena, thanks for reaching out to the mailing list for this issue. This indeed sounds like a bug in Flink and should be investigated. We are currently working on a new release 1.4 and the testing phase will start soon. So it would make sense to include this problem in the testing and hopefully

Re: Batch job per stream message?

2017-11-02 Thread Fabian Hueske
s will be ignored. > > so basically there has to be an accumulator implemented inside > AsyncFunction to gather up all results and return them in a single > .collect() call. > but how to know when to do so? or I am completely off track here > > > > On Wed, 1 Nov 2017 at 03:57 Fabian H

Re: Reprocessing the data after config change

2017-11-01 Thread Fabian Hueske
Hi Tomasz, that sounds like a sound design. You have to make sure that the output of the application is idempotent such that the reprocessing job overrides all! output data of the earlier job. Best, Fabian 2017-10-23 16:24 GMT+02:00 Tomasz Dobrzycki : > Hi all, >

Re: Batch job per stream message?

2017-11-01 Thread Fabian Hueske
Hi Tomas, triggering a batch DataSet job from a DataStream program for each input record doesn't sound like a good idea to me. You would have to make sure that the cluster always has sufficient resources and handle failures. It would be preferable to have all data processing in a DataStream job.

Re: Not enough free slots to run the job

2017-10-27 Thread Fabian Hueske
Hi David, that's correct. A TM is a single process. A slot is just a virtual concept in the TM process and runs its program slice in multiple threads. Besides managed memory (which is split into chunks and assigned to slots) all other resources (CPU, heap, network, disk) are not isolated and free

Re: Tasks, slots, and partitioned joins

2017-10-26 Thread Fabian Hueske
Hi David, please find my answers below: 1. For high utilization, all slots should be filled. Each slot will process a slice of the program on a slice of the data. In case of partitioning or changed parallelism, the data is shuffled accordingly. 2. That's a good question. I think the default

Re: Passing Configuration & State

2017-10-26 Thread Fabian Hueske
Hi Navneeth, configuring user functions using a Configuration object and setting the parameters in the open() method of a RichFunction is no longer recommended. In fact, that only works for the DataSet API and has not been added for the DataStream API. The open() method with the Configuration

Re: Tasks, slots, and partitioned joins

2017-10-26 Thread Fabian Hueske
Hi David, Flink's DataSet API schedules one slice of a program to a task slot. A program slice is one parallel instance of each operator of a program. When all operators of your program run with a parallelism of 1, you end up with only 1 slice that runs on a single slot. Flink's DataSet API

Re: Local combiner on each mapper in Flink

2017-10-26 Thread Fabian Hueske
Hi, in a MapReduce context, combiners are used to reduce the amount of data 1) to shuffle and fully sort (to group the data by key) and 2) to reduce the impact of skewed data. The question is, why do you need a combiner in your use case. - To reduce the data to shuffle: You should not use a

Re: Case Class TypeInformation

2017-10-25 Thread Fabian Hueske
nformation to case classes: > https://issues.apache.org/jira/browse/FLINK-7859 > > Joshua > > > On Oct 17, 2017, at 3:01 AM, Fabian Hueske <fhue...@gmail.com> wrote: > > Hi Joshua, > > that's a limitation of the Scala API. > Row requires to explicitly specify

Re: Delta iteration not spilling to disk

2017-10-25 Thread Fabian Hueske
null values that this error might be thrown? > > Thank you, > > Joshua > > On Oct 25, 2017, at 3:12 AM, Fabian Hueske <fhue...@gmail.com> wrote: > > Hi Joshua, > > that is correct. Delta iterations cannot spill to disk. The solution set > is managed in an in-me

Re: Delta iteration not spilling to disk

2017-10-25 Thread Fabian Hueske
Hi Joshua, that is correct. Delta iterations cannot spill to disk. The solution set is managed in an in-memory hash table. Spilling that hash table to disk would have a significant impact on the performance. By default the hash table is organized in Flink's managed memory. You can try to

Re: problem with increase job parallelism

2017-10-20 Thread Fabian Hueske
Hi Lei, setting explicit operator IDs should solve this issue. As far as I know, the auto-generated operator IDs also depended on the operator parallelism in previous versions of Flink (not sure until which point). Which version are you running? Best, Fabian 2017-10-17 3:15 GMT+02:00 Lei Chen

Re: Stumped writing to KafkaJSONSink

2017-10-18 Thread Fabian Hueske
No worries :-) Thanks for the notice. 2017-10-18 15:07 GMT+02:00 Kenny Gorman <ke...@eventador.io>: > Yep we hung out and got it working. I should have replied sooner! Thx for > the reply. > > -kg > > On Oct 18, 2017, at 7:06 AM, Fabian Hueske <fhue...@gma

Re: start-cluster.sh not working in HA mode

2017-10-18 Thread Fabian Hueske
Hi Hayden, I tried to reproduce the problem you described and followed the HA setup instructions of the documentation [1]. For me the instructions worked and start-cluster.sh started two JobManagers on my local machine (master contained two localhost entries). The bash scripts tend to be a bit

Re: Stumped writing to KafkaJSONSink

2017-10-18 Thread Fabian Hueske
Hi Kenny, this looks almost correct. The Table class has a method writeToSink(TableSink) that should address your use case (so the same as yours but without the TableEnvironment argument). Does that work for you? If not what kind of error and error message do you get? Best, Fabian 2017-10-18

Re: Split a dataset

2017-10-17 Thread Fabian Hueske
Is there any way to "downgrade" or convert a DataSet to a DataStream? > > BR > /Magnus > > On 17 Oct 2017, at 10:54, Fabian Hueske <fhue...@gmail.com> wrote: > > Hi Magnus, > > there is no Split operator on the DataSet API. > > As you said, this can be done usi

Re: GROUP BY TUMBLE on ROW range

2017-10-17 Thread Fabian Hueske
Hi Stefano, this is not supported in Flink's SQL and we would need new Group Window functions (like TUMBLE) for this. A TUMBLE_COUNT function would be somewhat similar to SESSION, which also requires checks on the sorted neighboring rows to identify the window of a row. Such a function would

Re: Split a dataset

2017-10-17 Thread Fabian Hueske
Hi Magnus, there is no Split operator on the DataSet API. As you said, this can be done using a FilterFunction. This also allows for non-binary splits: DataSet setToSplit = ... DataSet firstSplit = setToSplit.filter(new SplitCondition1()); DataSet secondSplit = setToSplit.filter(new
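The same filter-based split can be shown with plain Java collections as an analogue of the DataSet code above (the predicates are invented for the example): each "split" is the same input filtered by a different condition, so non-binary and even overlapping splits work naturally.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Plain-Java analogue of splitting a DataSet with filters: the same
// input is filtered by several independent conditions, one per split.
public class FilterSplit {
    static List<Integer> split(List<Integer> input, Predicate<Integer> condition) {
        return input.stream().filter(condition).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> toSplit = List.of(1, 2, 3, 4, 5, 6);
        List<Integer> firstSplit = split(toSplit, x -> x % 2 == 0); // even values
        List<Integer> secondSplit = split(toSplit, x -> x > 4);     // large values
        System.out.println(firstSplit + " " + secondSplit);
    }
}
```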

Re: Unbalanced job scheduling

2017-10-17 Thread Fabian Hueske
Setting the slot sharing group is Flink's mechanism to solve this issue. I'd consider this a limitation of the library that provides LEARN and SELECT. Did you consider opening an issue with (or contributing to) the library to support setting the slot sharing group? 2017-10-17 9:38 GMT+02:00

Re: Case Class TypeInformation

2017-10-17 Thread Fabian Hueske
Hi Joshua, that's a limitation of the Scala API. Row requires explicitly specifying a TypeInformation[Row], but it is not possible to inject custom types into a CaseClassTypeInfo, which is automatically generated by a Scala compiler plugin. The probably easiest solution is to use Flink's Java

Re: Unbalanced job scheduling

2017-10-17 Thread Fabian Hueske
Hi Andrea, have you looked into assigning slot sharing groups [1]? Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/datastream_api.html#task-chaining-and-resource-groups 2017-10-16 18:01 GMT+02:00 AndreaKinn : > Hi all, > I want to expose
