Re: TIMESTAMP TypeInformation

2016-10-27 Thread Fabian Hueske

Re: Flink Cassandra Connector is not working

2016-10-27 Thread Fabian Hueske
Hi, a NoSuchMethod error indicates that you are using incompatible versions. You should check that the versions of your job dependencies and the version of the cluster you want to run the job on are the same. Best, Fabian 2016-10-27 7:13 GMT+02:00 NagaSaiPradeep : > Hi, > I am

Re: TIMESTAMP TypeInformation

2016-10-26 Thread Fabian Hueske
Hi Radu, I might not have completely understood your problem, but if you do val env = StreamExecutionEnvironment.getExecutionEnvironment val tEnv = TableEnvironment.getTableEnvironment(env) val ds = env.fromElements( (1, 1L, new Time(1,2,3)) ) val t = ds.toTable(tEnv, 'a, 'b, 'c) val results = t

Re: Flushing the result of a groupReduce to a Sink before all reduces complete

2016-10-26 Thread Fabian Hueske
Hi Paul, Flink pushes the results of operators (including GroupReduce) to the next operator or sink as soon as they are computed. So what you are asking for is actually happening. However, before the GroupReduceFunction can be applied, the whole data is sorted in order to group the data. This

Re: Bug: Plan generation for Unions picked a ship strategy between binary plan operators.

2016-10-25 Thread Fabian Hueske
Hi Yassine, I thought I had fixed that bug a few weeks ago, but apparently the fix did not catch all cases. Can you please reopen FLINK-2662 and post the program to reproduce the bug there? Thanks, Fabian [1] https://issues.apache.org/jira/browse/FLINK-2662 2016-10-25 12:33 GMT+02:00 Yassine

Re: Trigger evaluate

2016-10-24 Thread Fabian Hueske
The window is evaluated when a watermark arrives that is past the window's end time. For instance, given the window definition in your example, there are windows that end at 1:00:00, 1:00:30, 1:01:00, 1:01:30, ... (every 30 seconds). Given the windows above, the window from 00:59:00 to 1:00:00 will be

Re: multiple processing of streams

2016-10-24 Thread Fabian Hueske
Fold + Window first, which > should get rid of my heap issues. > > > > Thanks! > > > > *From: *Fabian Hueske <fhue...@gmail.com> > *Reply-To: *"user@flink.apache.org" <user@flink.apache.org> > *Date: *Monday, October 24, 2016 at 2:27 PM > &g

Re: multiple processing of streams

2016-10-24 Thread Fabian Hueske
ion), so I’m looking at a way to customize the > EventTimeSessionWindow, or perhaps create a custom EventTrigger, to force a > session to close after either X seconds of inactivity or Y seconds of > duration (or perhaps after Z events). > > > > > > > > *From: *Fabian Hueske <f

Re: multiple processing of streams

2016-10-21 Thread Fabian Hueske
Hi Robert, it is certainly possible to feed the same DataStream into two (or more) operators. Both operators should then process the complete input stream. What you describe is an unintended behavior. Can you explain how you figure out that both window operators only receive half of the events?
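The same stream feeding two operators can be sketched as follows (a minimal, hypothetical example; the tuple layout and window sizes are assumptions, not taken from the original thread):

  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.streaming.api.windowing.time.Time

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  // assumed input: (key, value) pairs
  val events: DataStream[(String, Long)] = env.fromElements(("a", 1L), ("b", 2L), ("a", 3L))

  // operator 1: per-key sums over 1-minute windows
  val perKey = events.keyBy(_._1).timeWindow(Time.minutes(1)).sum(1)
  // operator 2: global sums over 1-minute windows, fed by the very same stream
  val global = events.timeWindowAll(Time.minutes(1)).sum(1)

  perKey.print()
  global.print()
  env.execute("two consumers of one stream")

Both window operators are attached to the same DataStream, so each of them should see every event.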

Re: Flink SQL Stream Parser based on calcite

2016-10-17 Thread Fabian Hueske
The translation is done in multiple stages. 1. Parsing (syntax check) 2. Validation (semantic check) 3. Query optimization (rule and cost based) 4. Generation of physical plan, incl. code generation (DataStream program) The final translation happens in the DataStream nodes, e.g., DataStreamCalc

Re: Flink SQL Stream Parser based on calcite

2016-10-17 Thread Fabian Hueske
Hi Pedro, The sql() method calls the Calcite parser in line 129. Best, Fabian 2016-10-17 16:43 GMT+02:00 PedroMrChaves : > Hello, > > I am pretty new to Apache Flink. > > I am trying to figure out how does Flink parses an Apache Calcite sql query > to its own

Re: BoundedOutOfOrdernessTimestampExtractor and allowedlateness

2016-10-17 Thread Fabian Hueske
is evaluated and late events are processed alone, i.e., in my example <12:09, G> would be processed without [A, B, C, D]. When the allowed lateness is passed, all window state is purged regardless of the trigger. Best, Fabian 2016-10-17 16:24 GMT+02:00 Fabian Hueske <fhue...@gmail.com>:

Re: BoundedOutOfOrdernessTimestampExtractor and allowedlateness

2016-10-17 Thread Fabian Hueske
Hi Yassine, the difference is the following: 1) The BoundedOutOfOrdernessTimestampExtractor is a built-in timestamp extractor and watermark assigner. A timestamp extractor tells Flink when an event happened, i.e., it extracts a timestamp from the event. A watermark assigner tells Flink what the
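A minimal sketch of such an extractor (the event type and its timestamp field are assumptions): it extracts the event-time timestamp from each record and emits watermarks that trail the largest seen timestamp by the configured bound:

  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
  import org.apache.flink.streaming.api.windowing.time.Time

  case class Event(id: String, eventTime: Long)
  val events: DataStream[Event] = ???   // placeholder source

  val withTimestamps = events.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor[Event](Time.seconds(10)) {
      // tells Flink when the event happened
      override def extractTimestamp(e: Event): Long = e.eventTime
    })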

Re: [DISCUSS] Deprecate Hadoop source method from (batch) ExecutionEnvironment

2016-10-14 Thread Fabian Hueske
oop integration for > AvroParquetInputFormat and CqlBulkOutputFormat in Flink (although we won't > be using CqlBulkOutputFormat any longer because it doesn't seem to be > reliable). > > -Shannon > > From: Fabian Hueske <fhue...@gmail.com> > Date: Friday, October 14, 2016

Re: HivePartitionTap is not working with cascading flink

2016-10-14 Thread Fabian Hueske
Hi Santlal, I'm afraid I don't know what is going wrong here either. Debugging and correctly configuring the Taps was one of the major obstacles when implementing the connector. Best, Fabian 2016-10-14 14:40 GMT+02:00 Aljoscha Krettek : > +Fabian directly looping in Fabian

[DISCUSS] Deprecate Hadoop source method from (batch) ExecutionEnvironment

2016-10-14 Thread Fabian Hueske
Hi everybody, I would like to propose to deprecate the utility methods to read data with Hadoop InputFormats from the (batch) ExecutionEnvironment. The motivation for deprecating these methods is to reduce Flink's dependency on Hadoop and instead have Hadoop as an optional dependency for users that

Re: Error with table sql query - No match found for function signature TUMBLE(, )

2016-10-13 Thread Fabian Hueske
apply() accepts a WindowFunction which is essentially the same as a GroupReduceFunction, i.e., you have an iterator over all events in the window. If you only want to count, you should have a look at incremental window aggregation with a ReduceFunction or FoldFunction [1]. Best, Fabian [1]
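A hedged sketch of the incremental variant (event type and fields are assumptions): the ReduceFunction is applied as elements arrive, so the window state holds a single running count instead of all events:

  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.streaming.api.windowing.time.Time

  case class AccessEvent(user: String, ip: String, action: String)
  val events: DataStream[AccessEvent] = ???   // placeholder source

  val counts = events
    .map(e => ((e.user, e.ip), 1L))
    .keyBy(_._1)
    .timeWindow(Time.hours(1))
    .reduce((a, b) => (a._1, a._2 + b._2))   // incrementally maintained count per window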

Re: Current alternatives for async I/O

2016-10-13 Thread Fabian Hueske
Hi Ken, FYI: we just received a pull request for FLIP-12 [1]. Best, Fabian [1] https://github.com/apache/flink/pull/2629 2016-10-11 9:35 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > Hi Ken, > > I think your solution should work. > You need to make sure though, that y

Re: Error with table sql query - No match found for function signature TUMBLE(, )

2016-10-13 Thread Fabian Hueske
Hi Pedro, the DataStream program would look like this: val eventData: DataStream[?] = ??? val result = eventData .filter("action = denied") .keyBy("user", "ip") .timeWindow(Time.hours(1)) .apply("window.end, user, ip, count(*)") .filter("count > 5") .map("windowEnd, user, ip") Note,

Re: Error with table sql query - No match found for function signature TUMBLE(, )

2016-10-12 Thread Fabian Hueske
Hi Pedro, support for window aggregations in SQL and Table API is currently work in progress. We have a pull request for the Table API and will add this feature for the next release. For SQL we depend on Apache Calcite to include the TUMBLE keyword in its parser and optimizer. At the moment the

Re: mapreduce.HadoopOutputFormat config value issue

2016-10-12 Thread Fabian Hueske
Hi Shannon, I tried to reproduce the problem in a unit test without success. My test configures a HadoopOutputFormat object, serializes and deserializes it, calls open(), and verifies that a configured String property is present in the getRecordWriter() method. Next I would try to reproduce the

Re: Handling decompression exceptions

2016-10-11 Thread Fabian Hueske
er(",") >> .includeFields("101") >> .ignoreInvalidLines() >> .types(String.class, String.class); >> withReadCSV.writeAsText("C:\\Users\\yassine\\Desktop\\withreadcsv.txt", >> FileSystem.WriteMode.OVERWRITE).setPar

Re: Handling decompression exceptions

2016-10-11 Thread Fabian Hueske
returnRecord = null; > do { > try { > returnRecord = super.nextRecord(record); > } catch (IOException e) { > e.printStackTrace(); > } > } while (returnRecord == null && !reachedEnd())

Re: Current alternatives for async I/O

2016-10-11 Thread Fabian Hueske
Hi Ken, I think your solution should work. You need to make sure though, that you properly manage the state of your function, i.e., memorize all records which have been received but haven't been emitted yet. Otherwise records might get lost in case of a failure. Alternatively, you can implement

Re: Wordindex conversation.

2016-10-10 Thread Fabian Hueske
Hi, you can do it like this: 1) you have to split each label record of the main dataset into separate records: (0,List(a, b, c, d, e, f, g)) -> (0, a), (0, b), (0, c), ..., (0, g) (1,List(b, c, f, a, g)) -> (1, b), (1, c), ..., (1, g) 2) join the word index dataset with the split main dataset:
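A hedged DataSet sketch of both steps (data and field layout are assumptions):

  import org.apache.flink.api.scala._

  val env = ExecutionEnvironment.getExecutionEnvironment
  val main: DataSet[(Int, List[String])] =
    env.fromElements((0, List("a", "b", "c", "d", "e", "f", "g")), (1, List("b", "c", "f", "a", "g")))
  val wordIndex: DataSet[(String, Int)] = env.fromElements(("a", 0), ("b", 1), ("c", 2))

  // 1) split each label record into separate (id, word) records
  val split = main.flatMap { case (id, words) => words.map(w => (id, w)) }

  // 2) join the word index dataset with the split main dataset on the word
  val joined = split.join(wordIndex).where(_._2).equalTo(_._1) { (l, r) => (l._1, r._2) }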

Re: Regarding

2016-10-09 Thread Fabian Hueske
Hi Rashmi, as Marton said, you do not need to start a local Flink instance (start-local.bat) if you want to run programs from your IDE. Maybe running a local instance causes a conflict when starting an instance from the IDE. Developing and running Flink programs on Windows should work, both from the

Re: readCsvFile

2016-10-07 Thread Fabian Hueske
rto...@gmail.com>: > Humm > > Your solution compile with out errors, but IncludedFields Isn't working: > [image: Imágenes integradas 1] > > The output is incorrect: > [image: Imágenes integradas 2] > > The correct result must be only 1º Column > (a,1) > (aa,1) >

Re: jdbc.JDBCInputFormat

2016-10-07 Thread Fabian Hueske
As the exception says the class org.apache.flink.api.scala.io.jdbc.JDBCInputFormat does not exist. You have to do: import org.apache.flink.api.java.io.jdbc.JDBCInputFormat There is no Scala implementation of this class but you can also use Java classes in Scala. 2016-10-07 21:38 GMT+02:00
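A minimal Scala sketch using the Java class directly (driver, URL, and query are placeholders):

  import org.apache.flink.api.java.io.jdbc.JDBCInputFormat

  val jdbcInput = JDBCInputFormat.buildJDBCInputFormat()
    .setDrivername("org.apache.derby.jdbc.EmbeddedDriver")   // placeholder driver
    .setDBUrl("jdbc:derby:memory:testdb")                     // placeholder URL
    .setQuery("SELECT id, name FROM users")                   // placeholder query
    .finish()
  // the built format can then be handed to the ExecutionEnvironment, e.g. via createInput(...)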

Re: Merging N parallel/partitioned WindowedStreams together, one-to-one, into a global window stream

2016-10-07 Thread Fabian Hueske
_OBJECT> using a private long value in > the mapper that increments on every map call). It works, but by any chance > is there a more succinct way to do it? > > On Thu, Oct 6, 2016 at 1:50 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Maybe this can be done by

Re: Merging N parallel/partitioned WindowedStreams together, one-to-one, into a global window stream

2016-10-06 Thread Fabian Hueske
Maybe this can be done by assigning the same window id to each of the N local windows, and do a .keyBy(windowId) .countWindow(N) This should create a new global window for each window id and collect all N windows. Best, Fabian 2016-10-06 22:39 GMT+02:00 AJ Heller : > The
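A hedged sketch of that idea (record layout, N, and the merge logic are assumptions):

  import org.apache.flink.streaming.api.scala._

  // assumed: each local window emits (windowId, partial result); N = 4 parallel windows share a windowId
  val localResults: DataStream[(Long, Seq[String])] = ???

  val merged = localResults
    .keyBy(_._1)                                 // group the partial results of the same window id
    .countWindow(4)                              // collect one element from each of the 4 local windows
    .reduce((a, b) => (a._1, a._2 ++ b._2))      // placeholder merge of the partial results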

Re: Presented Flink use case in Japan

2016-10-05 Thread Fabian Hueske
> > > > On Tue, Oct 4, 2016 at 11:56 PM, Hironori Ogibayashi < > ogibaya...@gmail.com> > > wrote: > >> > >> Thank you for the response. > >> Regarding adding to the page, I will check with our PR department. > >> > >> Regards, >

Re: Handling decompression exceptions

2016-10-04 Thread Fabian Hueske
Hi Yassine, AFAIK, there is no built-in way to ignore corrupted compressed files. You could try to implement a FileInputFormat that wraps the CsvInputFormat and forwards all calls to the wrapped CsvIF. The wrapper would also catch and ignore the EOFException. If you do that, you would not be

Re: Enriching a tuple mapped from a datastream with data coming from a JDBC source

2016-10-04 Thread Fabian Hueske
Hi Philipp, If I got your requirements right you would like to: 1) load an initial hashmap via JDBC 2) update the hashmap from a stream 3) use the hashmap to enrich another stream. You can use a CoFlatMap to do this: stream1.connect(stream2).flatMap(new YourCoFlatMapFunction).
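A minimal sketch of such a CoFlatMapFunction (tuple layouts are assumptions; the initial JDBC result would be fed in as the first stream):

  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
  import org.apache.flink.util.Collector
  import scala.collection.mutable

  // stream 1: (key, value) updates for the hashmap; stream 2: (key, payload) events to enrich
  class EnrichFunction extends CoFlatMapFunction[(String, String), (String, Long), (String, Long, String)] {
    private val lookup = mutable.Map[String, String]()

    override def flatMap1(update: (String, String), out: Collector[(String, Long, String)]): Unit =
      lookup(update._1) = update._2                  // 1)+2) build and update the hashmap

    override def flatMap2(event: (String, Long), out: Collector[(String, Long, String)]): Unit =
      lookup.get(event._1).foreach(v => out.collect((event._1, event._2, v)))   // 3) enrich
  }

  val updates: DataStream[(String, String)] = ???   // placeholder update stream
  val events: DataStream[(String, Long)] = ???      // placeholder stream to enrich
  val enriched = updates.connect(events).flatMap(new EnrichFunction)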

Re: Using Flink and Cassandra with Scala

2016-10-04 Thread Fabian Hueske
FYI: FLINK-4497 [1] requests Scala tuple and case class support for the Cassandra sink and was opened about a month ago. [1] https://issues.apache.org/jira/browse/FLINK-4497 2016-09-30 23:14 GMT+02:00 Stephan Ewen : > How hard would it be to add case class support? > >

Re: Counting latest state of stateful entities in streaming

2016-09-30 Thread Fabian Hueske
et but for what I got, this is slightly different from what I need. > > 2016-09-30 10:04 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > >> Hi Simone, >> >> I think I have a solution for your problem: >> >> val s: DataStream[(Long, Int, ts)] = ??? // (id, state,

Re: Counting latest state of stateful entities in streaming

2016-09-30 Thread Fabian Hueske
Hi Simone, I think I have a solution for your problem: val s: DataStream[(Long, Int, ts)] = ??? // (id, state, time) val stateChanges: DataStream[(Int, Int)] = s // (state, cntUpdate) .keyBy(_._1) // key by id .flatMap(new StateUpdater) // StateUpdater is a stateful FlatMapFunction. It has

Re: AW: Problem with CEPPatternOperator when taskmanager is killed

2016-09-29 Thread Fabian Hueske
Great, thanks! I gave you contributor permissions in JIRA. You can now also assign issues to yourself if you decide to continue to contribute. Best, Fabian 2016-09-29 16:48 GMT+02:00 jaxbihani : > Hi Fabian > > My JIRA user is: jaxbihani > I have created a pull request

Re: Iterations vs. combo source/sink

2016-09-29 Thread Fabian Hueske
Hi Ken, you can certainly have partitioned sources and sinks. You can control the parallelism by calling .setParallelism() method. If you need a partitioned sink, you can call .keyBy() to hash partition. I did not completely understand the requirements of your program. Can you maybe provide

Re: Flink-HBase connectivity through Kerberos keytab | Simple get/put sample implementation

2016-09-29 Thread Fabian Hueske
Hi Anchit, Flink does not yet have a streaming sink connector for HBase. Some members of the community are working on this though [1]. I think we resolved a similar issue for the Kafka connector recently [2]. Maybe the related commits contain some relevant code for your problem. Best, Fabian

Re: Complex batch workflow needs (too) much time to create executionPlan

2016-09-26 Thread Fabian Hueske
Hi Markus, thanks for the stacktraces! The client is indeed stuck in the optimizer. I have to look a bit more into this. Did you try to set JoinHints in your plan? That should reduce the plan space that is enumerated and therefore reduce the optimization time (maybe enough to run your application

Re: Simple batch job hangs if run twice

2016-09-23 Thread Fabian Hueske
Hi Yassine, can you share a stacktrace of the job when it got stuck? Thanks, Fabian 2016-09-22 14:03 GMT+02:00 Yassine MARZOUGUI : > The input splits are correctly assigned. I noticed that whenever the job > is stuck, that is because the task *Combine (GroupReduce at

Re: Why tuples are not ignored after watermark?

2016-09-15 Thread Fabian Hueske
No, this is not possible unless you use an external service such as a database. The assigners might run on different machines and Flink does not provide utilities for reading/writing shared state. Best, Fabian 2016-09-15 20:17 GMT+02:00 Saiph Kappa : > And is it possible to share

Re: NoClassDefFoundError with ElasticsearchSink on Yarn

2016-09-09 Thread Fabian Hueske
+1 I ran into that issue as well. Would be great to have that in the docs! 2016-09-09 11:49 GMT+02:00 Robert Metzger : > Hi Steffen, > > I think it would be good to add it to the documentation. > Would you like to open a pull request? > > > Regards, > Robert > > > On Mon,

Re: scala version of flink mongodb example

2016-09-08 Thread Fabian Hueske
dence$3: scala.reflect.ClassTag[R])org.apache.flink.api.scala. > DataSet[R] > match expected type ? > > Thanks! > Frank > > > On Thu, Sep 8, 2016 at 6:56 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Frank, >> >> input should be of

Re: scala version of flink mongodb example

2016-09-08 Thread Fabian Hueske
Hi Frank, input should be of DataSet[(BSONWritable, BSONWritable)], so a Tuple2[BSONWritable, BSONWritable], right? Something like this should work: input.map( pair => pair._1.toString ) Pair is a Tuple2[BSONWritable, BSONWritable], and pair._1 accesses the key of the pair. Alternatively you

Re: assignTimestamp after keyBy

2016-09-08 Thread Fabian Hueske
I would assign timestamps directly at the source. Timestamps are not stripped off by operators. Reassigning timestamps somewhere in the middle of a job can cause very unexpected results. 2016-09-08 9:32 GMT+02:00 Dong-iL, Kim : > Thanks for replying. pushpendra. > The

Re: Sharing Java Collections within Flink Cluster

2016-09-07 Thread Fabian Hueske
nst a 'bunch > of keys' from DS2 and DS2 could shrink/expand in terms of the no., of > keys will the key-value shard work in this case? > > On Wed, Sep 7, 2016 at 7:44 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Operator state is always local in Flink. However

Re: Sharing Java Collections within Flink Cluster

2016-09-07 Thread Fabian Hueske
cross the cluster within this job1 ? > > > > On Wed, Sep 7, 2016 at 6:33 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Is writing DataStream2 to a Kafka topic and reading it from the other job >> an option? >> >> 2016-09-07 19:07 GMT+02:00

Re: Sharing Java Collections within Flink Cluster

2016-09-07 Thread Fabian Hueske
localize the cache within the cluster. > Is there a way? > > Best Regards > CVP > > On Wed, Sep 7, 2016 at 5:00 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi, >> >> Flink does not provide shared state. >> However, you can broadcast a str

Re: Sharing Java Collections within Flink Cluster

2016-09-07 Thread Fabian Hueske
Hi, Flink does not provide shared state. However, you can broadcast a stream to a CoFlatMapFunction, such that each operator has its own local copy of the state. If that does not work for you because the state is too large and if it is possible to partition the state (and both streams), you can

Re: [SUGGESTION] Stack Overflow Documentation for Apache Flink

2016-09-05 Thread Fabian Hueske
Thanks for the suggestion Vishnu! Stackoverflow documentation looks great. I like the easy contribution and versioning features. However, I am a bit skeptical. IMO, Flink's primary documentation must be hosted by Apache. Out-sourcing such an important aspect of a project to an external service is

Re: Windows and Watermarks Clarification

2016-09-01 Thread Fabian Hueske
(x)) then > evaluated relative to maxEventTime - lastWaterMarkTime. So if (maxEventTime > - lastWaterMarkTime) > x * 1000 then the window is evaluated? > > > Paul > ------ > *From:* Fabian Hueske <fhue...@gmail.com> > *Sent:* Thursday, September 1, 2016 1:

Re: Streaming - memory management

2016-09-01 Thread Fabian Hueske
roblem , or there is no harm in storing the > DTO ? > > I think the documentation should specify the point that the state will be > maintained for user-defined operators to avoid confusion. > > Regards, > Vinay Patil > > On Thu, Sep 1, 2016 at 1:12 PM, Fabian Hueske-2 [vi

Re: Windows and Watermarks Clarification

2016-09-01 Thread Fabian Hueske
Hi Paul, BoundedOutOfOrdernessTimestampExtractor implements the AssignerWithPeriodicWatermarks interface. This means, Flink will ask the assigner in regular intervals (configurable via StreamExecutionEnvironment.getConfig().setAutoWatermarkInterval()) for the current watermark. The watermark will

Re: Streaming - memory management

2016-09-01 Thread Fabian Hueske
ame key on matchingAndNonMatching and > flatmap to take care of rest logic. > > Or are you suggestion to use Co-FlatMapFunction after the outer-join > operation (I mean after doing the window and > getting matchingAndNonMatching stream )? > > Regards, > Vinay Patil > > On

Re: Streaming - memory management

2016-09-01 Thread Fabian Hueske
using RocksDB as state >> backend since the state is not gone after checkpointing ? >> >> P.S I have kept the watermark behind by 1500 secs just to be safe on >> handling late elements but to tackle edge case scenarios like the one >> mentioned above we are having a

Re: CountTrigger FIRE or FIRE_AND_PURGE

2016-09-01 Thread Fabian Hueske
is that > the previous windows will not purge, is that correct? > > final DataStream alertingMsgs = keyedStream > .window(TumblingEventTimeWindows.of(Time.minutes(1))) > .trigger(CountTrigger.of(1)) > .apply(new MyWindowProcessor())

Re: CountTrigger FIRE or FIRE_AND_PURGE

2016-08-29 Thread Fabian Hueske
Hi Paul, This blog post [1] includes an example of an early trigger that should pretty much do what you are looking for. This one [2] explains the windowing mechanics of Flink (window assigner, trigger, function, etc). Hope this helps, Fabian [1]

Re: Joda exclude in java quickstart maven archetype

2016-08-29 Thread Fabian Hueske
Hi Flavio, yes, Joda should not be excluded. This will be fixed in Flink 1.1.2. Cheers, Fabian 2016-08-29 11:00 GMT+02:00 Flavio Pompermaier : > Hi to all, > I've tried to upgrade from Flink 1.0.2 to 1.1.1 so I've copied the > excludes of the maven shade plugin from the

Re: Complex batch workflow needs (too) much time to create executionPlan

2016-08-22 Thread Fabian Hueske
Hi Markus, you might be right, that a lot of time is spent in optimization. The optimizer enumerates all alternatives and chooses the plan with the least estimated cost. The degrees of freedom of the optimizer are rather restricted (execution strategies and the used partitioning & sorting keys).
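A hedged sketch of setting a JoinHint on a DataSet join (data sets and key positions are placeholders):

  import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint
  import org.apache.flink.api.scala._

  val large: DataSet[(Int, String)] = ???
  val small: DataSet[(Int, Double)] = ???

  // pinning the strategy (here: broadcast the second input and build a hash table from it)
  // removes the other join strategies from the enumerated plan space
  val joined = large.join(small, JoinHint.BROADCAST_HASH_SECOND)
    .where(0).equalTo(0)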

Re: [ANNOUNCE] Flink 1.1.0 Released

2016-08-09 Thread Fabian Hueske
Thanks Ufuk and everybody who contributed to the release! Cheers, Fabian 2016-08-08 20:41 GMT+02:00 Henry Saputra : > Great work all. Great Thanks to Ufuk as RE :) > > On Monday, August 8, 2016, Stephan Ewen wrote: > > > Great work indeed, and big

Re: Optimizations not performed - please confirm

2016-06-29 Thread Fabian Hueske
Yes, that was my fault. I'm used to auto reply-all on my desktop machine, but my phone just did a simple reply. Sorry for the confusion, Fabian 2016-06-29 19:24 GMT+02:00 Ovidiu-Cristian MARCU < ovidiu-cristian.ma...@inria.fr>: > Thank you, Aljoscha! > I received a similar update from Fabian,

Re: Kinesis connector classpath issue when running Flink 1.1-SNAPSHOT on YARN

2016-06-17 Thread Fabian Hueske
Hi Josh, I assume that you build the SNAPSHOT version yourself. I had similar version conflicts for Apache HttpCore with Flink SNAPSHOT versions on EMR. The problem is caused by a changed behavior in Maven 3.3 and following versions. Due to these changes, the dependency shading is not working

Re: HBase reads and back pressure

2016-06-13 Thread Fabian Hueske
1:04 Christophe Salperwyck >> >> <christophe.salperw...@gmail.com> wrote: >> >>> >> >>> Thanks for the feedback and sorry that I can't try all this straight >> >>> away. >> >>> >> >>> Is there a

Re: Strange behavior of DataStream.countWindow

2016-06-13 Thread Fabian Hueske
epartitioning by key, which is unnecessary since > I don't really care about keys. > > On 9 June 2016 at 22:00, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Yukun, >> >> the problem is that the KeySelector is internally invoked multiple times. >&g

Re: HBase reads and back pressure

2016-06-09 Thread Fabian Hueske
< christophe.salperw...@gmail.com>: > Hi Fabian, > > Thanks for the help, I will try that. The backpressure was on the source > (HBase). > > Christophe > > 2016-06-09 16:38 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > >> Hi Christophe, >> >>

Re: FlinkKafkaProducer API

2016-06-09 Thread Fabian Hueske
Great, thank you! 2016-06-09 17:38 GMT+02:00 Elias Levy <fearsome.lucid...@gmail.com>: > > On Thu, Jun 9, 2016 at 5:16 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> thanks for your feedback. I think those are good observations and >> suggestions to improve

Re: Data Source Generator emits 4 instances of the same tuple

2016-06-09 Thread Fabian Hueske
We solved this problem yesterday at the Flink Hackathon. The issue was that the source function was started with parallelism 4 and each function read the whole file. Cheers, Fabian 2016-06-06 16:53 GMT+02:00 Biplob Biswas : > Hi, > > I tried streaming the source data 2

Re: How to maintain the state of a variable in a map transformation.

2016-06-09 Thread Fabian Hueske
nces of map? > > 3) Question Cleared. > > 4) My question was can I use same ExecutionEnvironment for all flink > programs in a module. > > 5) Question Cleared. > > > Regards > Ravikumar > > > > On 9 June 2016 at 17:58, Fabian Hueske <fhue...@gmail.com&

Re: HBase reads and back pressure

2016-06-09 Thread Fabian Hueske
Hi Christophe, where does the backpressure appear? In front of the sink operator or before the window operator? In any case, I think you can improve your WindowFunction if you convert parts of it into a FoldFunction. The FoldFunction would take care of the statistics

Re: Maxby() and KeyBy() question

2016-06-09 Thread Fabian Hueske
Hi, you are computing a running aggregate, i.e., you're getting one output record for each input record and the output record is the record with the largest value observed so far. If the record with the largest value is the first, the record is sent out another time. This is what happened with

Re: Strange behavior of DataStream.countWindow

2016-06-09 Thread Fabian Hueske
Hi Yukun, the problem is that the KeySelector is internally invoked multiple times. Hence it must be deterministic, i.e., it must extract the same key for the same object if invoked multiple times. The documentation is not discussing this aspect and should be extended. Thanks for pointing out
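A short hedged illustration of the determinism requirement (event type is an assumption):

  import org.apache.flink.streaming.api.scala._

  case class Event(id: String, value: Long)
  val stream: DataStream[Event] = ???   // placeholder source

  // NOT deterministic: repeated invocations on the same record may return different keys
  // stream.keyBy(_ => scala.util.Random.nextInt(10))

  // deterministic: the key is a pure function of the record
  val keyed = stream.keyBy(_.id)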

Re: How to maintain the state of a variable in a map transformation.

2016-06-09 Thread Fabian Hueske
Hi Ravikumar, I'll try to answer your questions: 1) If you set the parallelism of a map function to 1, there will be only a single instance of that function regardless of whether it is executed locally or remotely in a cluster. 2) Flink also supports aggregations (reduce, groupReduce, combine,

Re: FlinkKafkaProducer API

2016-06-09 Thread Fabian Hueske
Hi Elias, thanks for your feedback. I think those are good observations and suggestions to improve the Kafka producers. The best place to discuss such improvements is the dev mailing list. Would you like to repost your mail there or open JIRAs where the discussion about these changes can continue?

Re: NotSerializableException

2016-06-09 Thread Fabian Hueske
Hi Tarandeep, the exception suggests that Flink tries to serialize RecordsFilterer as a user function (this happens via Java Serialization). I said suggests because the code that uses RecordsFilterer is not included. To me it looks like RecordsFilterer should not be used as a user function. It

Re: WindowedStream aggregation methods pre-aggregate?

2016-05-27 Thread Fabian Hueske
Hi Elias, yes, reduce, fold, and the aggregation functions (sum, min, max, minBy, maxBy) on WindowedStream perform eager aggregation, i.e., the functions are applied for each value that enters the window and the state of the window will consist of a single value. In case you need access to the

Re: Apache Beam and Flink

2016-05-26 Thread Fabian Hueske
No, that is not supported yet. Beam provides a common API but the Flink runner translates programs against batch sources into the DataSet API programs and Beam programs against streaming source into DataStream programs. It is not possible to mix both. 2016-05-26 10:00 GMT+02:00 Ashutosh Kumar

Re: writeAsCSV with partitionBy

2016-05-23 Thread Fabian Hueske
Hi Kirsti, I'm not aware of anybody working on this issue. Would you like to create a JIRA issue for it? Best, Fabian 2016-05-23 16:56 GMT+02:00 KirstiLaurila : > Is there any plans to implement this kind of feature (possibility to write > to > data specified

Re: keyBy on a collection of Pojos

2016-05-23 Thread Fabian Hueske
Actually, the program works correctly (according to the DataStream API). Let me explain what happens: 1) You do not initialize the count variable, so it will be 0 (summing 0s results in 0). 2) DataStreams are considered to be unbounded (have an infinite size). KeyBy does not group the records because

Re: Weird Kryo exception (Unable to find class: java.ttil.HashSet)

2016-05-20 Thread Fabian Hueske
The problem seems to occur quite often. Did you update your Flink version recently? If so, could you try to downgrade and see if the problem disappears. Is it otherwise possible that it is caused by faulty hardware? 2016-05-20 18:05 GMT+02:00 Flavio Pompermaier : > This

Re: Performing Reduce on a group of datasets

2016-05-19 Thread Fabian Hueske
I think that sentence is misleading and refers to the internals of Flink. It should be removed, IMO. You can only union two DataSets. If you want to union more, you have to do it one by one. Btw. union does not cause additional processing overhead. Cheers, Fabian 2016-05-19 14:44 GMT+02:00
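A short sketch of chaining binary unions (the data sets are placeholders of the same type):

  import org.apache.flink.api.scala._

  val ds1, ds2, ds3: DataSet[(String, Int)] = ???
  // union is binary, so more than two data sets are combined one by one
  val all = ds1.union(ds2).union(ds3)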

Re: Performing Reduce on a group of datasets

2016-05-18 Thread Fabian Hueske
I think union is what you are looking for. Note that all data sets must be of the same type. 2016-05-18 16:15 GMT+02:00 Ritesh Kumar Singh : > Hi, > > How can I perform a reduce operation on a group of datasets using Flink? > Let's say my map function gives out n

Re: Flink recovery

2016-05-17 Thread Fabian Hueske
gt;> >> On Tue, May 17, 2016 at 11:27 AM, Stephan Ewen <se...@apache.org> wrote: >> >>> Hi Naveen! >>> >>> I assume you are using Hadoop 2.7+? Then you should not see the >>> ".valid-length" file. >>> >>> The fix you m

Re: Flink Kafka Streaming - Source [Bytes/records Received] and Sink [Bytes/records sent] show zero messages

2016-05-16 Thread Fabian Hueske
Hi Prateek, the missing numbers are an artifact from how the stats are collected. ATM, Flink only collects these metrics for data which is sent over connections *between* Flink operators. Since sources and sinks connect to external systems (and not Flink operators), the dashboard does not

Re: Flink recovery

2016-05-14 Thread Fabian Hueske
ansformations and rolling file sink > pipeline. > > > > Thanks, > Naveen > > From: Fabian Hueske <fhue...@gmail.com> > Reply-To: "user@flink.apache.org" <user@flink.apache.org> > Date: Friday, May 13, 2016 at 4:26 PM > > To: &qu

Re: Flink recovery

2016-05-13 Thread Fabian Hueske
Hi, Flink's exactly-once semantics do not mean that events are processed exactly-once but that events will contribute exactly-once to the state of an operator such as a counter. Roughly, the mechanism works as follows: - Flink periodically injects checkpoint markers into the data stream. This

Re: Flink + Avro GenericRecord - first field value overwrites all other fields

2016-05-12 Thread Fabian Hueske
Hi Tarandeep, the AvroInputFormat was recently extended to support GenericRecords. [1] You could also try to run the latest SNAPSHOT version and see if it works for you. Cheers, Fabian [1] https://issues.apache.org/jira/browse/FLINK-3691 2016-05-12 10:05 GMT+02:00 Tarandeep Singh

Re: get start and end time stamp from time window

2016-05-12 Thread Fabian Hueske
Hi Martin, You can use a FoldFunction and a WindowFunction to process the same window. The FoldFunction is eagerly applied, so the window state is only one element. When the window is closed, the aggregated element is given to the WindowFunction where you can add start and end time. The iterator
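A hedged sketch of the WindowFunction side (the pre-aggregated element type and key type are assumptions; the exact apply/fold overload that also takes the FoldFunction differs slightly between Flink versions):

  import org.apache.flink.streaming.api.scala.function.WindowFunction
  import org.apache.flink.streaming.api.windowing.windows.TimeWindow
  import org.apache.flink.util.Collector

  // the input Iterable holds the single element produced by the eagerly applied FoldFunction
  class AddWindowBounds extends WindowFunction[Long, (Long, Long, Long), String, TimeWindow] {
    override def apply(key: String, window: TimeWindow, counts: Iterable[Long],
                       out: Collector[(Long, Long, Long)]): Unit = {
      out.collect((window.getStart, window.getEnd, counts.head))   // attach window start and end time
    }
  }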

Re: Creating a custom operator

2016-05-11 Thread Fabian Hueske
port us in the development. We are waiting for them to start a > discussion and as soon as we will have a more clear idea on how to proceed, > we will validate it with the stuff you just said. Your confidence in > Flink's operators gives up hope to achieve a clean solution. > > Than

Re: Force triggering events on watermark

2016-05-10 Thread Fabian Hueske
Maybe the last example of this blog post is helpful [1]. Best, Fabian [1] https://www.mapr.com/blog/essential-guide-streaming-first-processing-apache-flink 2016-05-10 17:24 GMT+02:00 Srikanth : > Hi, > > I read the following in Flink doc "We can explicitly specify a

Re: Creating a custom operator

2016-05-09 Thread Fabian Hueske
if it is really required to implement a > custom runtime operator but given the complexity of the integration of two > distribute systems, we assumed that more control would allow more > flexibility and possibilities to achieve an ideal solution. > > > > > > 2016-05-03

Re: How to make Flink read all files in HDFS folder and do transformations on th e data

2016-05-07 Thread Fabian Hueske
Hi Palle, you can recursively read all files in a folder as explained in the "Recursive Traversal of the Input Path Directory" section of the Data Source documentation [1]. The easiest way to read line-wise JSON objects is to use ExecutionEnvironment.readTextFile() which reads text files
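A minimal sketch following that documentation section (the path is a placeholder; each line would then be parsed as a JSON object in a subsequent map):

  import org.apache.flink.api.scala._
  import org.apache.flink.configuration.Configuration

  val env = ExecutionEnvironment.getExecutionEnvironment

  val parameters = new Configuration()
  parameters.setBoolean("recursive.file.enumeration", true)   // read all files under the folder

  val jsonLines: DataSet[String] = env.readTextFile("hdfs:///path/to/folder")
    .withParameters(parameters)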

Re: Discussion about a Flink DataSource repository

2016-05-06 Thread Fabian Hueske
apFunctionWrapper(new MetaMapFunction(meta))) > > What do you think? Is there the possibility to open a broadcasted Dataset > as a Map instead of a List? > > Best, > Flavio > > > On Fri, May 6, 2016 at 12:06 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >&

Re: OutputFormat in streaming job

2016-05-06 Thread Fabian Hueske
6 Actor System (instantiate into > RichSinkFunction's open method) > > Am I wrong? > > Thanks again, > Andrea > > 2016-05-06 13:47 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > >> Hi Andrea, >> >> you can use any OutputFormat to emit data fr

Re: Where to put live model and business logic in Hadoop/Flink BigData system

2016-05-06 Thread Fabian Hueske
Hi Palle, this sounds indeed like a good use case for Flink. Depending on the complexity of the aggregated historical views, you can implement a Flink DataStream program which builds the views on the fly, i.e., you do not need to periodically trigger MR/Flink/Spark batch jobs to compute the

Re: OutputFormat in streaming job

2016-05-06 Thread Fabian Hueske
Hi Andrea, you can use any OutputFormat to emit data from a DataStream using the writeUsingOutputFormat() method. However, this method does not guarantee exactly-once processing. In case of a failure, it might emit some records a second time. Hence the results will be written at least once. Hope

Re: Discussion about a Flink DataSource repository

2016-05-06 Thread Fabian Hueske
at it could be easy > to define a link from operators to TableEnvironment and then to TableSource > (using the lineage tag/source-id you said) and, finally to its metadata. I > don't know whether this is specific only to us, I just wanted to share our > needs and see if the table AP

Re: Prevent job/operator from spilling to disk

2016-05-04 Thread Fabian Hueske
Hi Max, it is not possible to deactivate spilling to disk at the moment. It might be possible to implement, but this would require a few more changes to make it feasible. For instance, we would need to add more fine-grained control about how memory is distributed among operators. This is

Re: Any way for Flink Elasticsearch connector reflecting IP change of Elasticsearch cluster?

2016-05-04 Thread Fabian Hueske
Sorry, I confused the mail threads. We're already on the user list :-) Thanks for the suggestion. 2016-05-04 17:35 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > I'm not so much familiar with the Kafka connector. > Can you post your suggestion to the user or dev mailing list? > &

Re: Any way for Flink Elasticsearch connector reflecting IP change of Elasticsearch cluster?

2016-05-04 Thread Fabian Hueske
I'm not so much familiar with the Kafka connector. Can you post your suggestion to the user or dev mailing list? Thanks, Fabian 2016-05-04 16:53 GMT+02:00 Sendoh : > Glad to see it's developing. > Can I ask would the same feature (reconnect) be useful for Kafka
