ElasticSearchSink - retrying doesn't work in ActionRequestFailureHandler

2019-02-08 Thread Averell
2019-02-09 04:12:35.678 [I/O dispatcher 25] WARN c.n.c..sink.MyElasticSearchSink$ - Retrying record: update {[idx-20190208][_doc][doc_id_154962270], doc_as_upsert[true], doc[index {*[null][null][null]*, source[{...}]}], scripted_upsert[false], detect_noop[true]} 2019-02-09 04:12:54.242 [Si

Re: Get nested Rows from Json string

2019-02-08 Thread Rong Rong
Hi François, I just did some research and it seems this is in fact a stringification issue. If you try running one of the AvroRowDeSerializationSchemaTests [1], you will find that only MAP and ARRAY are correctly stringified (Map using "{}" quotes and Array using "[]" quotes). However nested records are

Re: Running JobManager as Deployment instead of Job

2019-02-08 Thread Vishal Santoshi
In one case, however, we do want to retain the same cluster id (think ingress on k8s, and thus SLAs with external touch points), but it is essentially a new job (it adds an incompatible change, though at the interface level it retains the same contract). The only way seems to be to remove the

Per-workflow configurations for an S3-related property

2019-02-08 Thread Ken Krugler
Hi all, When running in EMR, we’re encountering the oh-so-common HTTP timeout that’s caused by the connection pool being too small (see below). I found one SO answer that said to bump fs.s3.maxConnections for the EMR S3 filesystem implementation.

Re: long lived standalone job session cluster in kubernetes

2019-02-08 Thread Heath Albritton
Has any progress been made on this? There are a number of folks in the community looking to help out. -H On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann wrote: > > Hi Derek, > > there is this issue [1] which tracks the active Kubernetes integration. Jin > Sun already started implementing some

Can an Aggregate the key from a WindowedStream.aggregate()

2019-02-08 Thread stephen . alan . connolly
If I write my aggregation logic as a WindowFunction, then I get access to the key as the first parameter in WindowFunction.apply(...); however, the Javadocs for WindowedStream.apply(WindowFunction) state: > Note that this function requires that all data in the windows is buffered > until
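In Flink, the usual answer is the two-argument WindowedStream.aggregate(AggregateFunction, WindowFunction) overload: the accumulator is maintained incrementally as elements arrive, and the window function is invoked at window end with the key plus the single accumulated value, so no per-window buffering is needed. A plain-Java sketch of that contract (no Flink dependency; class and method names here are made up for illustration):

```java
import java.util.*;

public class IncrementalWindowDemo {
    // Incremental phase: only one accumulator per key is kept, never the
    // buffered elements (the role of AggregateFunction in Flink).
    // Final phase: a function that receives the key alongside the
    // accumulator, mirroring the two-argument aggregate(...) overload.
    public static List<String> aggregate(List<Map.Entry<String, Integer>> events) {
        Map<String, Integer> acc = new TreeMap<>();
        for (Map.Entry<String, Integer> e : events) {
            acc.merge(e.getKey(), e.getValue(), Integer::sum); // incremental fold
        }
        List<String> out = new ArrayList<>();
        acc.forEach((key, sum) -> out.add(key + "=" + sum)); // key is available here
        return out;
    }

    public static void main(String[] args) {
        System.out.println(aggregate(List.of(
            Map.entry("a", 1), Map.entry("b", 5), Map.entry("a", 3))));
    }
}
```

In actual Flink code the same shape is `stream.keyBy(...).window(...).aggregate(new MyAggregateFunction(), new MyWindowFunction())`.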

[Table] Types of query result and tablesink do not match error

2019-02-08 Thread françois lacombe
Hi all, An error is currently raised when using table.insertInto("registeredSink") in Flink 1.7.0 when types of table and sink don't match. I've got the following : org.apache.flink.table.api.ValidationException: Field types of query result and registered TableSink null do not match. Query

Re: Get nested Rows from Json string

2019-02-08 Thread françois lacombe
Hi Rong, Thank you for this answer. I've changed Rows to Map, which eases the conversion process. Nevertheless I'm interested in any explanation of why row1.setField(i, row2) appends row2 at the end of row1. All the best François On Wed, Feb 6, 2019 at 19:33, Rong Rong wrote: > Hi

Re: Broadcast state before events stream consumption

2019-02-08 Thread Chirag Dewan
Hi Vadim, I would be interested in this too. Presently, I have to read my lookup source in the open method and keep it in a cache. By doing that I cannot make use of the broadcast state until, of course, the first emit comes on the broadcast stream. The problem with waiting on the event stream is

Help with a stream processing use case

2019-02-08 Thread Sandybayev, Turar (CAI - Atlanta)
Hi all, I wonder whether it’s possible to use Flink for the following requirement. We need to process a Kinesis stream and based on values in each record, route those records to different S3 buckets and keyspaces, with support for batching up of files and control over partitioning scheme (so
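Whichever sink ends up being used (in Flink 1.7+ a StreamingFileSink with a custom bucket assigner is one option), the routing decision itself reduces to deriving a bucket and partition path from record fields. A plain-Java sketch of such a rule — the bucket names, fields, and Hive-style layout below are all assumptions for illustration, not anything from the thread:

```java
import java.time.Instant;
import java.time.ZoneOffset;

public class S3Router {
    // Hypothetical rule: choose the bucket from a region field and build a
    // partitioned key prefix from the event type and the event day (UTC).
    public static String route(String region, String eventType, long epochMillis) {
        String bucket = region.startsWith("eu") ? "my-eu-bucket" : "my-us-bucket";
        String day = Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC).toLocalDate().toString();
        return bucket + "/type=" + eventType + "/dt=" + day;
    }

    public static void main(String[] args) {
        System.out.println(route("eu-west-1", "click", 0L));
    }
}
```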

Re: stream of large objects

2019-02-08 Thread Aggarwal, Ajay
Yes, another KeyBy will be used. The “small size” messages will be strings of length 500 to 1000. Is there a concept of “global” state in Flink? Is it possible to keep these lists in global state and only pass the list reference (by name?) in the LargeMessage? From: Chesnay Schepler Date:
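On the global-state question: Flink state (keyed or operator state) is local to each parallel task, and tasks may run in different JVMs, so a "list reference by name" would in practice have to resolve against an external store, or the lists would have to travel on a broadcast stream. A plain-Java sketch of the handle-passing idea, assuming a single shared in-process registry purely for illustration (in a real cluster this registry would be an external service):

```java
import java.util.*;
import java.util.concurrent.*;

public class ListRegistry {
    // Shared name -> list registry. Only valid within one JVM; a distributed
    // job would need an external cache/store playing this role.
    private static final ConcurrentMap<String, List<String>> lists =
            new ConcurrentHashMap<>();

    public static String register(List<String> payload) {
        String handle = UUID.randomUUID().toString();
        lists.put(handle, List.copyOf(payload));
        return handle; // ship only this small handle inside the LargeMessage
    }

    public static List<String> resolve(String handle) {
        return lists.get(handle); // downstream operator looks the list up again
    }
}
```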

Re: Running single Flink job in a job cluster, problem starting JobManager

2019-02-08 Thread Thomas Eckestad
Hi again, when removing Spring Boot from the application it works. I would really like to mix Spring Boot and Flink. It does work with Spring Boot when submitting jobs to a session cluster, as stated before. /Thomas From: Thomas Eckestad Sent: Friday, February

Broadcast state before events stream consumption

2019-02-08 Thread Vadim Vararu
Hi all, I need to use the broadcast state mechanism (https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html) for the following scenario. I have a reference data stream (slow) and an events stream (fast running) and I want to do a kind of lookup in the
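A common workaround for this ordering problem (there is no built-in way to block the event stream until broadcast state is populated) is to buffer fast-stream events in state until the first reference-data update arrives, then drain the buffer — typically inside a KeyedBroadcastProcessFunction. A plain-Java sketch of just that buffering logic, independent of any Flink API:

```java
import java.util.*;

public class BufferingEnricher {
    private Map<String, String> referenceData = null; // broadcast side, initially absent
    private final List<String> pending = new ArrayList<>(); // events seen too early
    private final List<String> output = new ArrayList<>();

    // Called for each update on the slow reference stream.
    public void onReference(Map<String, String> ref) {
        referenceData = new HashMap<>(ref);
        for (String e : pending) output.add(enrich(e)); // drain the buffer
        pending.clear();
    }

    // Called for each element of the fast event stream.
    public void onEvent(String event) {
        if (referenceData == null) { pending.add(event); return; } // not ready yet
        output.add(enrich(event));
    }

    private String enrich(String event) {
        return event + "->" + referenceData.getOrDefault(event, "unknown");
    }

    public List<String> results() { return output; }
}
```

In a real job the `pending` buffer would live in keyed state so it is checkpointed along with everything else.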

Flink Standalone cluster - logging problem

2019-02-08 Thread simpleusr
We are using a standalone cluster and submitting jobs through the command line client. As stated in https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/logging.html , we are editing log4j-cli.properties, but this does not have any effect. Has anybody seen this before? Regards -- Sent
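One thing worth noting here: log4j-cli.properties only configures the command-line client JVM. The job itself runs in the JobManager/TaskManager JVMs, which read conf/log4j.properties, so changes to job logging belong there (followed by a cluster restart). The default file looks roughly like this in the Flink 1.7 era (log4j 1.x):

```properties
# conf/log4j.properties — used by JobManager and TaskManager JVMs,
# not by the CLI client (the client reads log4j-cli.properties)
log4j.rootLogger=INFO, file

log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.file=${log.file}
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
```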

Flink Standalone cluster - production settings

2019-02-08 Thread simpleusr
I know this seems a silly question, but I am trying to figure out the optimal setup for our Flink jobs. We are using a standalone cluster with 5 jobs. Each job has 3 async operators with Executors with thread counts of 20, 20, and 100. Sources are Kafka and Cassandra, and REST sinks exist. Currently we are

Flink Standalone cluster - dumps

2019-02-08 Thread simpleusr
We are using a standalone cluster and submitting jobs through the command line client. As far as I understand, the job is executed in the task manager. A single task manager represents a single JVM? So the dump shows threads from all jobs bound to that task manager. Two

Re: stream of large objects

2019-02-08 Thread Chesnay Schepler
Whether a LargeMessage is serialized depends on how the job is structured. For example, if you were to only apply map/filter functions after the aggregation it is likely they wouldn't be serialized. If you were to apply another keyBy they will be serialized again. When you say "small size"

Re: Running JobManager as Deployment instead of Job

2019-02-08 Thread Till Rohrmann
If you keep the same cluster id, the upgraded job should pick up checkpoints from the completed checkpoint store. However, I would recommend taking a savepoint and resuming from it, because then you can also specify that you allow non-restored state, for example. Cheers, Till On Fri,

Reduce one event under multiple keys

2019-02-08 Thread Stephen Connolly
Ok, I'll try to map my problem onto something that should be familiar to most people. Consider a collection of PCs, each of which has a unique ID, e.g. ca:fe:ba:be, de:ad:be:ef, etc. Each PC has a tree of local files. Some of the file paths are coincidentally the same names, but there is no file
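One way to reduce a single event under multiple keys is to fan it out before the keyBy: a flatMap emits one record per key (for example, one per ancestor directory of the file path), and each copy is then reduced under its own key. A plain-Java sketch of the fan-out step — the composite-key encoding with "|" is just an illustration, not anything Flink prescribes:

```java
import java.util.*;

public class KeyFanOut {
    // Emit one composite key (pcId, directory prefix) per ancestor of a path,
    // so a single file event can be aggregated at every tree level.
    public static List<String> keysFor(String pcId, String path) {
        List<String> out = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue; // skip the leading empty segment
            prefix.append('/').append(part);
            out.add(pcId + "|" + prefix); // one keyed copy per prefix
        }
        return out;
    }
}
```

In a Flink job this would be the body of a FlatMapFunction, followed by keyBy on the composite key and the usual reduce/aggregate.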

Running single Flink job in a job cluster, problem starting JobManager

2019-02-08 Thread Thomas Eckestad
Hi, I am trying to run a Flink job cluster in K8s. As a first step I have created a Docker image according to: https://github.com/apache/flink/blob/release-1.7/flink-container/docker/README.md When I try to run the image: docker run --name=flink-job-manager flink-image:latest job-cluster

Re: Running JobManager as Deployment instead of Job

2019-02-08 Thread Vishal Santoshi
Is the rationale for using jobID 00* roughly the same? As in, a Flink job cluster is a single job and thus a single job id suffices? I am more wondering about the case where we are making compatible changes to a job and want to resume (given we are in HA mode and thus have a

Dataset statistics

2019-02-08 Thread Flavio Pompermaier
Hi to all, is there any effort to standardize descriptive statistics in Apache Flink? Is there any suggested way to achieve this? Best, Flavio

Re: Flink Job and Watermarking

2019-02-08 Thread Chesnay Schepler
Have you considered using the metric system to access the current watermarks for each operator? (see https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#io) On 08.02.2019 03:19, Kaustubh Rudrawar wrote: Hi, I'm writing a job that wants to make an HTTP request once

Sliding window buffering on restart without save point

2019-02-08 Thread shater93
Hello, I have a Flink pipeline processing data in several overlapping (sliding) windows such that they span [t_i, t_i + T], where t_i is the window starting time and T is the window size. The overlap is such that t_(i+1) - t_i = T/6 (i.e., every element falls into 6 overlapping windows).
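For reference, the set of windows an element falls into is pure arithmetic: with size T and slide T/6, each timestamp belongs to exactly size/slide = 6 windows. A plain-Java sketch of that window-start computation (simplified to offset 0, mirroring the kind of arithmetic a sliding-window assigner performs):

```java
import java.util.*;

public class SlidingWindows {
    // Return the start timestamps of all sliding windows [start, start + size)
    // that contain the given element timestamp, assuming offset 0.
    public static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts; // size/slide entries, newest window first
    }
}
```

With size 60 and slide 10 (T/6), a timestamp of 70 lands in the six windows starting at 70, 60, 50, 40, 30, 20 — which is why, on a restart without a savepoint, up to T of history must be replayed before the overlapping windows are fully re-populated.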