Re: Sink not switched to "RUNUNG" even though a task slot is available

2016-11-23 Thread Yassine MARZOUGUI
Hi Robert, I see, so the join needs to consume all data first and process it. In my case, I couldn't wait long because the first join quickly generated a lot of data that can't fit in the memory or in the disk. The solution was then to manually specify a JoinHint and broadcast the small dataset, t
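For readers unfamiliar with the idea behind a broadcast join hint, here is a minimal sketch in plain Java with no Flink dependency. It assumes the small side fits in memory and has unique keys: the small side becomes a hash table that every worker would hold, and the large side is streamed past it one row at a time without being buffered. The class and method names are hypothetical. In Flink's DataSet API the corresponding hints are JoinHint.BROADCAST_HASH_FIRST and JoinHint.BROADCAST_HASH_SECOND; the preview above is cut off before it says which one was used.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BroadcastJoinSketch {

    /**
     * Joins the large side against a small side that is fully held in memory.
     * Only the small side is materialized; large rows are probed one by one.
     */
    static List<String> broadcastHashJoin(List<Map.Entry<Integer, String>> largeSide,
                                          Map<Integer, String> smallSide) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<Integer, String> row : largeSide) {
            String match = smallSide.get(row.getKey()); // hash lookup, no sorting or spilling
            if (match != null) {
                joined.add(row.getValue() + " -> " + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical data: the small side maps a key to a country code.
        Map<Integer, String> small = new HashMap<>();
        small.put(1, "DE");
        small.put(2, "FR");

        List<Map.Entry<Integer, String>> large = new ArrayList<>();
        large.add(new SimpleEntry<>(1, "order-a"));
        large.add(new SimpleEntry<>(3, "order-b")); // no match on the small side, dropped
        large.add(new SimpleEntry<>(2, "order-c"));

        System.out.println(broadcastHashJoin(large, small));
    }
}
```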

Re: IterativeStream seems to ignore maxWaitTimeMillis

2016-11-23 Thread Juan Rodríguez Hortalá
Thanks a lot for your suggestion Aljoscha, it has helped me discover the problem: I was using an Executor inside a RichFunction and I wasn't shutting down the executor. Now I call executor.shutdownNow() in RichFunction.close(), and the job stops when both the input and the loop are exhausted. G
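The fix described in this message can be sketched in plain Java. WorkerFunction below is a hypothetical stand-in for a Flink RichFunction and models only the open()/close() lifecycle; it is not the poster's code. The point it illustrates is that threads from Executors.newFixedThreadPool are non-daemon, so an executor that is never shut down can keep the process alive after the work is done.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorLifecycleSketch {

    /** Hypothetical stand-in for a RichFunction; only the lifecycle hooks are modeled. */
    static class WorkerFunction {
        private ExecutorService executor;

        void open() {
            // Pool threads are non-daemon by default and outlive the caller unless shut down.
            executor = Executors.newFixedThreadPool(2);
        }

        void process(Runnable work) {
            executor.submit(work);
        }

        void close() {
            // Interrupt running tasks and discard queued ones, then wait briefly for the threads to exit.
            executor.shutdownNow();
            try {
                executor.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            }
        }

        boolean isStopped() {
            return executor.isTerminated();
        }
    }

    public static void main(String[] args) {
        WorkerFunction fn = new WorkerFunction();
        fn.open();
        fn.process(() -> System.out.println("side work"));
        fn.close();
        System.out.println("executor terminated: " + fn.isStopped());
    }
}
```

Note that shutdownNow() does not wait for submitted tasks; if pending work must finish, shutdown() followed by awaitTermination() is the gentler variant.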

Cannot connect to the JobManager - Flink 1.1.3 cluster mode

2016-11-23 Thread Dominik Safaric
Hi all, As I’ve been setting up a cluster comprised of three worker nodes and a master node, I’ve encountered the problem that the JobManager although running is unreachable. The master instance has access using SSH to all worker nodes. The worker nodes do not however have access via SSH to t

Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Greg Hogan
EMRFS looks to *add* cost (and consistency). Storing an object to S3 costs "$0.005 per 1,000 requests", so $0.432/day at 1 Hz. Is the number of checkpoint files simply parallelism * number of operators? That could add up quickly. Is the recommendation to run HDFS on EBS? On Wed, Nov 23, 2016 at
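The arithmetic behind the $0.432/day figure can be checked with a short sketch. The price and the 1 Hz rate come from the message; the second scenario (32 parallel subtasks times 4 operators, one file each per second) is a hypothetical illustration of how the count "could add up quickly", not a statement about how many files Flink actually writes per checkpoint.

```java
public class S3RequestCost {
    // Price quoted in the message: $0.005 per 1,000 requests.
    static final double PRICE_PER_1000_REQUESTS = 0.005;
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Daily request cost in USD for a given number of object writes per second. */
    static double dailyCost(double requestsPerSecond) {
        double requestsPerDay = requestsPerSecond * SECONDS_PER_DAY;
        return requestsPerDay / 1000.0 * PRICE_PER_1000_REQUESTS;
    }

    public static void main(String[] args) {
        // 86,400 requests/day / 1,000 * $0.005 = $0.432
        System.out.println("1 request/s:    $" + dailyCost(1) + " per day");
        // Hypothetical: parallelism 32 * 4 operators = 128 files per second.
        System.out.println("128 requests/s: $" + dailyCost(32 * 4) + " per day");
    }
}
```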

Re: Reading files from an S3 folder

2016-11-23 Thread Steve Morin
Alex, We are working on the same thing, for the same exact reason. We are trying to avoid the complexities of running HDFS just for the file storage. We are also okay with the S3 limitations it introduces. We'll try and update the group if we find solutions for parallelizing the files consumpt

Problem - Negative currentWatermark if the watermark assignment is made before connecting the streams

2016-11-23 Thread PedroMrChaves
Hello, I have an application which has two different streams of data, one represents a set of events and the other a set of rules that need to be matched against the events. In order to do this I use a coFlatMapOperator. The problem is that if I assign the timestamps and watermarks after the strea

State Serializer/Deserializer between savepoints

2016-11-23 Thread Daniel Santos
Hello, Is it possible to change the object that is being serialized or deserialized? Let's say we have something like the following : stream .window() .fold(InitialValueA)(FoldFunction) Now, InitialValueA is a case class A(n1 : Int). Restoring the same job from a savepoint but wi

Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Jonathan Share
We're not running on EMR (running Flink as a standalone cluster on Kubernetes on EC2). I assume that it's not possible to use EMRFS if not running on Amazon's EMR images. On 23 November 2016 at 18:00, Foster, Craig wrote: > I would suggest using EMRFS anyway, which is the way to access the S3 f

Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Jonathan Share
Hi Scott, Thanks for the suggestion, it sounds like you and I think alike, going over to hdfs sounds to me like the simplest solution. There are no requirements to use S3, just another team member who is generally sceptical fearing that adding HDFS will add a new class of maintenance problems to

Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Jonathan Share
Hi Greg, Standard storage class, everything is on defaults, we've not done anything special with the bucket. Cloud Watch only appears to give me total billing for S3 in general, I don't see a breakdown unless that's something I can configure somewhere. Regards, Jonathan On 23 November 2016 at

Re: Reading files from an S3 folder

2016-11-23 Thread Alex Reid
Each file is ~1.8G compressed (and about 15G uncompressed, so a little over 300G total for all the files). In the Web Client UI, when I look at the Plan, I click on the subtask for reading in the files, I see a line for each host and the Bytes Sent for each host is like 350G. The job takes longer

Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Foster, Craig
I would suggest using EMRFS anyway, which is the way to access the S3 file system from EMR (using the same s3:// prefixes). That said, you will run into the same shading issues in our build until the next release—which is coming up relatively shortly. From: Robert Metzger Reply-To: "user@fl

Re: Error while running Yahoo Streaming Benchmarks on a single machine

2016-11-23 Thread Robert Metzger
Hi, this is not really a failure. It just means that the job has been cancelled by somebody (using the web interface or the ./bin/flink tool). On Sat, Nov 19, 2016 at 3:59 AM, Muhammad Haseeb Javed < 11besemja...@seecs.edu.pk> wrote: > I am trying to run the Yahoo Streaming Benchmarks on a singl

Re: Flink Streaming Data Source Node

2016-11-23 Thread Robert Metzger
Hi, I'm not sure if I fully understood your question. The number of input sources is always less than or equal to the number of slots in one node. Usually source instances are equally distributed among the parallel workers (TaskManagers). Maybe this document describing the deployment model is also help

Re: Reading files from an S3 folder

2016-11-23 Thread Robert Metzger
Hi, This is not the expected behavior. Each parallel instance should read only one file. The files should not be read multiple times by the different parallel instances. How did you check / find out that each node is reading all the data? Regards, Robert On Tue, Nov 22, 2016 at 7:42 PM, Alex Reid

Re: Sink not switched to "RUNUNG" even though a task slot is available

2016-11-23 Thread Robert Metzger
Hi Yassine, you don't necessarily need to set the parallelism of the last two operators to 31; the sink with parallelism 1 will still fit into the slots. A task slot can, by default, hold an entire "slice" or parallel instance of a job. The reason why the sink stays in state CREATED in the beginni

Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Robert Metzger
Hi Jonathan, have you tried using Amazon's latest EMR Hadoop distribution? Maybe they've fixed the issue in their for older Hadoop releases? On Wed, Nov 23, 2016 at 4:38 PM, Scott Kidder wrote: > Hi Jonathan, > > You might be better off creating a small Hadoop HDFS cluster just for the > purpos

Re: Tame Flink UI?

2016-11-23 Thread Chesnay Schepler
Hello, So there are 2 separate issues here: 1. The response when requesting the list of available metrics is pretty big. 2. The request for the values of these metrics is also pretty big, and the response even larger. For now I will modify the WebUI to only ask the value of selected metr

Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Scott Kidder
Hi Jonathan, You might be better off creating a small Hadoop HDFS cluster just for the purpose of storing Flink checkpoint & savepoint data. Like you, I tried using S3 to persist Flink state, but encountered AWS SDK issues and felt like I was going down an ill-advised path. I then created a small

Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Greg Hogan
Hi Jonathan, Which S3 storage class are you using? Do you have a breakdown of the S3 costs as storage / API calls / early deletes / data transfer? Greg On Wed, Nov 23, 2016 at 2:52 AM, Jonathan Share wrote: > Hi, > > I'm interested in hearing if anyone else has experience with using Amazon > S

Re: ContinuousEventTimeTrigger breaks coGrouped windowed streams?

2016-11-23 Thread Maximilian Michels
The problem here is that the ContinuousEventTimeTrigger is kind of broken. It relies on the first element to trigger a future timer but the time might not progress this far. It should additionally trigger at the end of the window. Here is a version with an improved continuous trigger: https://gist.

Sink not switched to "RUNUNG" even though a task slot is available

2016-11-23 Thread Yassine MARZOUGUI
Hi all, My batch job has the following plan in the end (figure attached): I have a total of 32 task slots, and I have set the parallelism of the last two operators before the sink to 31. The sink parallelism is 1. The last operator before the sink is a MapOperator, so it doesn't need to buffer

Re: Maintaining watermarks per key, instead of per operator instance

2016-11-23 Thread Aljoscha Krettek
I think you would need something like this: var hourlyDiscarding = stream .window(1.hour) .trigger(discarding) .apply(..) //write to cassandra hourlyDiscarding .window(1.hour) .trigger(accumulating) .apply(..) .addSink(cassandra) //forward to next acc step var daily = hourlyDiscard

Re: Many operations cause StackOverflowError with AWS EMR YARN cluster

2016-11-23 Thread Chesnay Schepler
Hello, implementing collect() in Python is not that trivial and the gain is questionable. There is an inherent size limit (think 10 MB), and it is a bit at odds with the deployment model of the Python API. Something easier would be to execute each iteration of the for-loop as a separate job an

Re: Maintaining watermarks per key, instead of per operator instance

2016-11-23 Thread kaelumania
Sounds good to me. But I still need to have some kind of side output (cassandra) that stores the accumulating aggregates on each time scale (minute, hour). Thus I would need to have something like this var hourly = stream.window(1.hour).apply(..) //write to cassandra hourly.trigger(accumulating)

Re: Maintaining watermarks per key, instead of per operator instance

2016-11-23 Thread Aljoscha Krettek
You can implement discarding behaviour by writing a custom trigger (based on EventTimeTrigger) that returns FIRE_AND_PURGE when firing. With this you could maybe implement a cascade of windows where the first aggregates for the smallest time interval and is discarding and where the other triggers t
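The difference between a discarding fire (FIRE_AND_PURGE) and an accumulating fire can be shown with a toy model in plain Java. This is not Flink's trigger API: WindowBuffer is a hypothetical class that only mimics what happens to the window contents when a trigger fires.

```java
import java.util.ArrayList;
import java.util.List;

public class TriggerModeSketch {

    /** Toy window buffer; purgeOnFire = true models a discarding (fire-and-purge) trigger. */
    static class WindowBuffer {
        private final boolean purgeOnFire;
        private final List<Integer> elements = new ArrayList<>();

        WindowBuffer(boolean purgeOnFire) {
            this.purgeOnFire = purgeOnFire;
        }

        void add(int value) {
            elements.add(value);
        }

        /** Emits the sum of the buffered elements; a purging fire also clears the buffer. */
        int fire() {
            int sum = 0;
            for (int v : elements) {
                sum += v;
            }
            if (purgeOnFire) {
                elements.clear();
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        WindowBuffer discarding = new WindowBuffer(true);
        WindowBuffer accumulating = new WindowBuffer(false);
        for (WindowBuffer w : new WindowBuffer[] {discarding, accumulating}) {
            w.add(1);
            w.add(2);
        }
        int d1 = discarding.fire();   // sum of {1, 2}
        int a1 = accumulating.fire(); // sum of {1, 2}
        discarding.add(4);
        accumulating.add(4);
        int d2 = discarding.fire();   // only the new element, earlier ones were purged
        int a2 = accumulating.fire(); // all elements seen so far
        System.out.println("discarding:   " + d1 + ", " + d2);
        System.out.println("accumulating: " + a1 + ", " + a2);
    }
}
```

With a discarding first stage, each firing emits only the delta since the last firing, so a downstream stage can add the results up without double counting.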

Re: IterativeStream seems to ignore maxWaitTimeMillis

2016-11-23 Thread Aljoscha Krettek
Ah, cancel() won't be called on the source if it is already stopped, I think. Could you try boiling it down to the very basics, i.e. have just the source and an iteration and check what happens. On Wed, 23 Nov 2016 at 05:08 Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com> wrote: > Thank