Re: DataSet: combineGroup/reduceGroup with large number of groups

2017-06-16 Thread Urs Schoenenberger
Hi Fabian, thanks, that is very helpful indeed - I now understand why the DataSet drivers insist on sorting the buffers and then processing instead of keeping state. In our case, the state should easily fit into the heap of the cluster, though. In a quick&dirty example I tried just now, the MapPa

Re: Using Contextual Data

2017-06-16 Thread Doron Keller
Thank you for your response. Please let me know if this is doable in CEP, or if you need more information. Doron From: "Tzu-Li (Gordon) Tai" mailto:tzuli...@apache.org>> Date: Sunday, April 16, 2017 at 11:05 PM To: "user@flink.apache.org" mailto:user@flink.apache.o

confusing RocksDBStateBackend parameters

2017-06-16 Thread Bowen Li
Hello guys, I've been trying to figure out differences among several parameters of RocksDBStateBackend. The confusing parameters are: In flink-conf.yaml: 1. state.backend.fs.checkpointdir 2. state.backend.rocksdb.checkpointdir 3. state.checkpoints.dir and

Re: DataSet: combineGroup/reduceGroup with large number of groups

2017-06-16 Thread Fabian Hueske
Hi Urs, on the DataSet API, the only memory-safe way to do it is GroupReduceFunction. As you observed this requires a full sort of the dataset which can be quite expensive but after the sort the computation is streamed. You could also try to manually implement a hash-based combiner using a MapPart

DataSet: combineGroup/reduceGroup with large number of groups

2017-06-16 Thread Urs Schoenenberger
Hi, I'm working on a batch job (roughly 10 billion records of input, 10 million groups) that is essentially a 'fold' over each group, that is, I have a function AggregateT addToAggrate(AggregateT agg, RecordT record) {...} and want to fold this over each group in my DataSet. My understanding is

Re: Flink CEP not emitting timed out events properly

2017-06-16 Thread Biplob Biswas
Hi Kostas, Thanks for that suggestion, I would try that next, I have out of order events on one of my Kafka topics and that's why I am using BoundedOutOfOrdernessTimestampExtractor(), now that this doesn't work as expected I would try to work with the Base class as you suggested. Although this b

Re: Flink CEP not emitting timed out events properly

2017-06-16 Thread Kostas Kloudas
Hi Biplob, If you know what you want, you can always write your custom AssignerWithPeriodicWatermarks that does your job. If you want to just increase the watermark, you could simply check if you have received any elements and if not, emit a watermark with the timestamp of the previous watermark

Re: Fink: KafkaProducer Data Loss

2017-06-16 Thread ninad
Hi Aljoscha, I gather you guys aren't able to reproduce this. Here are the answers to your questions: How do you ensure that you only shut down the brokers once Flink has read all the data that you expect it to read Ninad: I am able to see the number of messages received on the Flink Job UI.

Re: How choose between YARN/Mesos/StandAlone Flink

2017-06-16 Thread Chesnay Schepler
Hello, just quickly chiming in for clarification/correction purpose: Flink can work in multi-node environments without yarn/mesos. If you are only starting out, or have short-lived workloads (i.e. a job that does not have to run for days/weeks), I would recommend standalone mode for easy-of-e

Re: Unable to use Flink RocksDB state backend due to endianness mismatch

2017-06-16 Thread Stefan Richter
Hi, we would also like to update to the latest RocksDB and drop FRocksDB altogether. But unfortunately, newer versions of RocksDB have issues with certain features in the Java API that we use in Flink, for example this one https://github.com/facebook/rocksdb/issues/1964

Re: Kafka and Flink integration

2017-06-16 Thread nragon
My custom object is used across all job, so it'll be part of checkpoints. Can you point me some references with some examples? -- View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kafka-and-Flink-integration-tp13792p13802.html Sent from the Apache

Re: How choose between YARN/Mesos/StandAlone Flink

2017-06-16 Thread Biplob Biswas
Hi Andrea, If you are using Flink for research and/or testing purpose, standalone Flink is more or less sufficient. Although if you have a huge amount of data, it may take forever to process data with only one node/machine and that's where a cluster would be needed. A yarn and mesos cluster could

Re: Flink CEP not emitting timed out events properly

2017-06-16 Thread Biplob Biswas
Hi Kostas, Thanks for the reply, makes things a bit more clear. Also, I went through this link and it is something similar I am trying to observe. http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Listening-to-timed-out-patterns-in-Flink-CEP-td9371.html I am checking for timed

Add custom configuration files to TMs classpath on YARN

2017-06-16 Thread Mikhail Pryakhin
Hi all, I run my flink job on yarn cluster and need to supply job configuration parameters via configuration file alongside with the job jar. (configuration file can't be packaged into jobs jar file). I tried to put the configuration file into the folder that is passed via --yarnship option to

Re: Latency and Throughput

2017-06-16 Thread Chesnay Schepler
Hello, You don't have to measure anything yourself, since Flink exposes throughput/latency metrics as described in the System metrics/latency tracking sections of the metrics documentation. You only have to setup a reporter that fetches these metrics (see the reporter section) and calculate

Latency and Throughput

2017-06-16 Thread Paolo Cristofanelli
Hi, it is my first question that I am asking in this mailing list, so I hope you would forgive me if I miss something. I have started using flink recently, and now I would like to compute some statistics, like throughput and latency, for my programs. I was reading this URL from the documentation (

Re: Flink CEP not emitting timed out events properly

2017-06-16 Thread Kostas Kloudas
Hi Biplob, With processing time there are no watermarks in the stream. The problem that you are seeing is because in processing time, the CEP library expects the “next” element to come, in order to investigate if some of the patterns have timed-out. Kostas > On Jun 16, 2017, at 1:29 PM, Biplob

Re: Kafka and Flink integration

2017-06-16 Thread Tzu-Li (Gordon) Tai
Hi! It’s usually always recommended to register your classes with Kryo, to avoid the somewhat inefficient classname writing. Also, depending on the case, to decrease serialization overhead, nothing really beats specific custom serialization. So, you can also register specific serializers for Kr

Flink CEP not emitting timed out events properly

2017-06-16 Thread Biplob Biswas
Hi, I am having some issues with FlinkCEP again. This time I am using processing time for my CEP job where I am reading from multiple kafka topics and using the pattern API to create a rule. I am outputting both, the matched events as well as timeout events. Now my problem is, I am sending some e

How choose between YARN/Mesos/StandAlone Flink

2017-06-16 Thread AndreaKinn
Hi, I browsed Flink documentation but I don't find a deep comparison between the feature of Flink in standalone deployment/YARN/Mesos except technical guides to setup them. I'm a newbie in cluster computing so I have never used YARN or Mesos. I've just learned something about their functionalitie

Kafka and Flink integration

2017-06-16 Thread nragon
I have to produce custom objects into kafka and read them with flink. Any tuning advices to use kryo? Such as class registration or something like that? Any examples? Thanks -- View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kafka-and-Flink-int

Re: Guava version conflict

2017-06-16 Thread Tzu-Li (Gordon) Tai
That’s actually what I’m trying to figure out right now, what had changed since then ... On 16 June 2017 at 12:23:57 PM, Flavio Pompermaier (pomperma...@okkam.it) wrote: I've never tested with flink 1.3.0, I have the problem with Flink 1.2.1. Didn't you say that you also had non-shaded guava d

Re: Guava version conflict

2017-06-16 Thread Flavio Pompermaier
I've never tested with flink 1.3.0, I have the problem with Flink 1.2.1. Didn't you say that you also had non-shaded guava dependencies in the flink dist jar some days ago? On Fri, Jun 16, 2017 at 12:19 PM, Tzu-Li (Gordon) Tai wrote: > Hi Flavio, > > I was just doing some end-to-end rebuild Flin

Re: Guava version conflict

2017-06-16 Thread Tzu-Li (Gordon) Tai
Hi Flavio, I was just doing some end-to-end rebuild Flink + cluster execution with ES sink tests, and it seems like the Guava shading problem isn’t there anymore in the flink-dist jar. On the `release-1.3` branch, built with Maven 3.0.5, the Guava dependencies in flink-dist are all properly sh

Re: Using Custom Partitioner in Streaming with parallelism 1 adding latency

2017-06-16 Thread Aljoscha Krettek
Hi, These two documentation pages might be interesting: - https://ci.apache.org/projects/flink/flink-docs-release-1.3/concepts/runtime.html - https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/da

Re: User self resource file.

2017-06-16 Thread Aljoscha Krettek
Yes, this is what I’m suggesting. I think you could clear the path when the operator/function shuts down, i.e. in the close() method. > On 15. Jun 2017, at 14:25, yunfan123 wrote: > > So your suggestion is I create an archive of all the file in the resources. > Then I get the distributed cache

Re: Streaming use case: Row enrichment

2017-06-16 Thread Flavio Pompermaier
Understood..Thanks anyway Aljoscha! On Fri, Jun 16, 2017 at 11:55 AM, Aljoscha Krettek wrote: > Hi, > > The problem with that is that the file is being read by (possibly, very > likely) multiple operators in parallel. The file source works like this: > there is a ContinuousFileMonitoringFunction

Re: Stateful streaming question

2017-06-16 Thread Aljoscha Krettek
I think it might be possible to do but I’m not aware of anyone working on that and I haven’t seen anyone on the mailing lists express interest in working on that. > On 16. Jun 2017, at 11:31, Flavio Pompermaier wrote: > > Ok thanks for the clarification. Do you think it could be possible (soon

Re: Streaming use case: Row enrichment

2017-06-16 Thread Aljoscha Krettek
Hi, The problem with that is that the file is being read by (possibly, very likely) multiple operators in parallel. The file source works like this: there is a ContinuousFileMonitoringFunction (this is an actual Flink source) that monitors a directory and when a new file appears sends several (

Re: Question about the custom partitioner

2017-06-16 Thread Xingcan Cui
Hi Aljoscha, Thanks for your explanation. I'll try what you suggests. Best, Xingcan On Fri, Jun 16, 2017 at 5:19 PM, Aljoscha Krettek wrote: > Hi, > > I’m afraid that’s not possible out-of-box with the current APIs. I > actually don’t know why the user-facing Partitioner only allows returning

Re: Stateful streaming question

2017-06-16 Thread Flavio Pompermaier
Ok thanks for the clarification. Do you think it could be possible (sooner or later) to have in Flink some sort of synchronization between jobs (as in this case where the input datastream should be "paused" until the second job finishes)? I know I coould use something like Oozie or Falcon to orches

Re: Streaming use case: Row enrichment

2017-06-16 Thread Flavio Pompermaier
Is it really necessary to wait for the file to reach the end of the pipeline? Isn't sufficient to know that it has been read and the source operator has been checkpointed (I don't know if I'm using this word correctly...I mean that all the file splits has been processed and Flink won't reprocess th

Re: Stateful streaming question

2017-06-16 Thread Aljoscha Krettek
Hi, I’m afraid not. You would have to wait for one job to finish before starting the next one. Best, Aljoscha > On 15. Jun 2017, at 20:11, Flavio Pompermaier wrote: > > Hi Aljoscha, > we're still investigating possible solutions here. Yes, as you correctly said > there are links between data

Re: Streaming use case: Row enrichment

2017-06-16 Thread Aljoscha Krettek
Hi, I was referring to StreamExecutionEnvironment.readFile( FileInputFormat inputFormat, String filePath, FileProcessingMode watchType, long interval) Where you can specify whether the source should shutdown once all files have been had (PROCESS_ONCE) or whether the source should contin

Re: Flink cluster : Client is not connected to any Elasticsearch nodes!

2017-06-16 Thread Flavio Pompermaier
When this connector was improved to be resilient to ES problems we used to use Logstash to index on ES and it was really cumbersome...this connector ease a lot the work of indexing to ES: it's much faster, it can index directly without persisting to file and it's a lot much easier to filter documen

Re: Question about the custom partitioner

2017-06-16 Thread Aljoscha Krettek
Hi, I’m afraid that’s not possible out-of-box with the current APIs. I actually don’t know why the user-facing Partitioner only allows returning one target because the internal StreamPartitioner (which extends ChannelSelector) allows returning multiple target partitions. You can hack around th

Re: Flink cluster : Client is not connected to any Elasticsearch nodes!

2017-06-16 Thread Tzu-Li (Gordon) Tai
Hi Dhinesh, One other thing that came to mind: the Elasticsearch 2 connector, by default, uses ES 2.3.5. If you’re using an Elasticsearch 2 with major version higher than that, you need to build the connector with the matching version. When running lower major version clients against a higher ma

Re: Painful KryoException: java.lang.IndexOutOfBoundsException on Flink Batch Api scala

2017-06-16 Thread Tzu-Li (Gordon) Tai
Hi Andrea, I’ve rallied back to this and wanted to check on the status. Have you managed to solve this in the end, or is this still a problem for you? If it’s still a problem, would you be able to provide a complete runnable example job that can reproduce the problem (ideally via a git branch I