Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-08-31 Thread Fabian Hueske
Hi, this is a valid approach. It might suffer from unbalanced load if the reader tasks process the files at different speeds (or the files vary in size), because each task has to process the same number of files. An alternative would be to implement your own InputFormat. The input format would crea
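A rough sketch of the custom-InputFormat idea, assuming the Flink 1.3-era `FileInputFormat` API: marking the format unsplittable yields one input split per file, and since Flink hands splits out lazily to idle tasks, faster readers pull more files and the load balances itself. The length-prefixed record framing here is an assumption for illustration, not anything stated in the thread.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

// Sketch: one split per file, splits assigned lazily by the JobManager,
// so tasks that finish early simply request the next file.
public class ProtobufFileInputFormat extends FileInputFormat<byte[]> {

    private boolean end;

    public ProtobufFileInputFormat(Path filePath) {
        super(filePath);
        // protobuf files cannot be split mid-record
        this.unsplittable = true;
    }

    @Override
    public void open(FileInputSplit split) throws IOException {
        super.open(split);
        end = false;
    }

    @Override
    public boolean reachedEnd() {
        return end;
    }

    @Override
    public byte[] nextRecord(byte[] reuse) throws IOException {
        // Assumption: each record is prefixed by a 4-byte big-endian length.
        DataInputStream in = new DataInputStream(stream);
        int len;
        try {
            len = in.readInt();
        } catch (EOFException e) {
            end = true;
            return null;
        }
        byte[] record = new byte[len];
        in.readFully(record);
        return record;
    }
}
```

The emitted `byte[]` could then be parsed into the protobuf message type in a subsequent map operator.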

Re: Modify field topics (KafkaConsumer) during runtime

2017-08-31 Thread Piotr Nowojski
Hi, As far as I know it is not possible to do it on the fly. There is a planned feature for discovering topics using a regex: https://github.com/apache/flink/pull/3746 https://issues.apache.org/jira/browse/FLINK-5704

Rest API for Checkpoint Data

2017-08-31 Thread sohimankotia
Is there a way to read checkpoint data (if not configured to save to HDFS) through the REST API (or some other way) for monitoring purposes? -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: BlobCache and its functioning

2017-08-31 Thread Fabian Hueske
Hi Federico, Not sure what's going on there but Nico (in CC) is more familiar with the blob cache and might be able to help. Best, Fabian 2017-08-30 15:35 GMT+02:00 Federico D'Ambrosio : > Hi, > > I have a rather simple Flink job which has a KinesisConsumer as a source > and an HBase table as s

Re: Great number of jobs and numberOfBuffers

2017-08-31 Thread Nico Kruber
Hi Gwenhael, First of all, we should try getting your job to work without splitting it up as that should work. Also, the number of network buffers should not depend on your input but rather the job and parallelism. The IOException you reported may only come from either a) too few network buffer

Using local FS for checkpoint

2017-08-31 Thread Marchant, Hayden
Whether I use RocksDB or FS State backends, if my requirements are to have fault-tolerance and ability to recover with 'at-least once' semantics for my Flink job, is there still a valid case for using a backing local FS for storing states? i.e. If a Flink Node is invalidated, I would have though

dynamically partitioned stream

2017-08-31 Thread Martin Eden
Hi all, I am trying to implement the following using Flink: I have 2 input message streams: 1. Data Stream: KEY VALUE TIME . . . C V66 B V66 A V55 A V44 C V33 A V33 B V33 B V22 A V1

Re: BlobCache and its functioning

2017-08-31 Thread Nico Kruber
Hi Federico, 1) Which version of Flink are you using? 2) Can you also share the JobManager log? 3) Why do you think Flink is stuck at the BlobCache? Is it really blocked, or do you still have CPU load? Can you post stack traces of the TaskManager (TM) and JobManager processes when you think they

RE: Great number of jobs and numberOfBuffers

2017-08-31 Thread Gwenhael Pasquiers
Hi, Well yes, I could probably make it work with a constant number of operators (and consequently buffers) by developing specific input and output classes, and that way I'd have a workaround for that buffers issue. The size of my job is input-dependent mostly because my code creates one full p

Re: Exception with Avro Serialization on RocksDBStateBackend

2017-08-31 Thread Biplob Biswas
Hi, I am still stuck here, and I still couldn't find a way to make Avro accept null values. Any help here would be really appreciated. Thanks, Biplob

Re: dynamically partitioned stream

2017-08-31 Thread Tony Wei
Hi Martin, Let me understand your question first. You have two streams, a Data Stream and a Control Stream, and you want to select data in the Data Stream based on the key set obtained from the Control Stream. If I am not misunderstanding your question, I think SideInput is what you want. https://cwiki.apache.org

Re: Using local FS for checkpoint

2017-08-31 Thread Tony Wei
Hi Marchant, HDFS is not a must for storing checkpoints. S3 or NFS are both acceptable, as long as the location is accessible from the job manager and all task managers. For AWS S3 configuration, you can refer to this page ( https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/aws.html). Best, Tony Wei
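A minimal configuration sketch for the NFS/shared-mount option, assuming Flink 1.3 config keys; the mount path is a placeholder:

```yaml
# flink-conf.yaml: keep working state in memory/RocksDB, write checkpoints
# to a shared filesystem instead of HDFS.
state.backend: filesystem

# Must be a path that the JobManager and every TaskManager can reach,
# e.g. a common NFS mount (placeholder path).
state.backend.fs.checkpointdir: file:///mnt/shared-nfs/flink/checkpoints
```

The same key accepts `hdfs://` or `s3://` URIs, so switching the backing store later is a one-line change.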

Re: Serialization issues with DataStreamUtils

2017-08-31 Thread vinay patil
Hi, After adding the following two lines the serialization trace does not show the Schema related classes: env.getConfig().registerTypeWithKryoSerializer(GenericData.Array.class, Serializers.SpecificInstanceCollectionSerializerForArrayList.class); env.getConfig().addDefaultKryoSerializer(

Re: datastream.print() doesn't works

2017-08-31 Thread AndreaKinn
I call env.execute(). What do you mean by "configure the logger"? Timo Walther wrote > Don't forget to call env.execute() at the end and make sure you have > configured your logger correctly. > > Regards, > Timo > > Am 29.08.17 um 14:59 schrieb Chesnay Schepler: >> The easiest explanation is

Re: BlobCache and its functioning

2017-08-31 Thread Federico D'Ambrosio
Hi, 1) I'm using Flink 1.3.2 2) The JobManager log is pretty much the same concerning those lines: 2017-08-30 14:16:53,343 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server master-1.localdomain/10.0.0.55:2181 2017-08-30 14:16:53,344 INFO org.

RE: Using local FS for checkpoint

2017-08-31 Thread Marchant, Hayden
I didn’t think about NFS. That would save me the hassle of installing HDFS cluster just for that, especially if my organization already has an NFS ‘handy’. Thanks Hayden From: Tony Wei [mailto:tony19920...@gmail.com] Sent: Thursday, August 31, 2017 12:12 PM To: Marchant, Hayden [ICG-IT] Cc: user

Re: BlobCache and its functioning

2017-08-31 Thread Nico Kruber
to sum up: the lines you were seeing seem to be the down- and upload of the TaskManager logs from the web interface which go through the BlobServer and its components. Nico On Thursday, 31 August 2017 11:51:27 CEST Federico D'Ambrosio wrote: > Hi, > > 1) I'm using Flink 1.3.2 > > 2) Th JobMa

Re: Union limit

2017-08-31 Thread boci
Dear Fabian, Thanks for your answer (I think you said the same on StackOverflow), but as you can see in my code, your solution does not work: Here is the code; it splits the datasets into lists (each list contains at most 60 datasets). After that, I reduce the datasets using union and map with an IdMap

Re: BlobCache and its functioning

2017-08-31 Thread Federico D'Ambrosio
Ok, thank you very much! So that was nothing actually related to what I was trying to do. I guess I'll have to investigate further into the actual correctness of my OutputFormat implementation then, because the total lack of other log lines was the most striking thing about this whole is

Re: dynamically partitioned stream

2017-08-31 Thread Martin Eden
Thanks for your reply Tony. So there are actually 2 problems to solve: 1. All control stream msgs need to be broadcasted to all tasks. 2. The data stream messages with the same keys as those specified in the control message need to go to the same task as well, so that all the values required for

Re: dynamically partitioned stream

2017-08-31 Thread Tony Wei
Hi Martin, About problem 2. How were those lambda functions created? Pre-defined functions / operators, or automatically generated based on the messages from the Control Stream? For the former, you could give each function an id and use flatMap to duplicate data with multiple ids. Then, you could use

Re: datastream.print() doesn't works

2017-08-31 Thread Chesnay Schepler
If you call createLocalEnvironmentWithWebUI you don't need to start a cluster with the start-local.sh script, you can run it from the IDE and it will start the web UI. If you submit a job to a cluster that was started outside the IDE you can call getExecutionEnvironment as usual. Not sure why t

DataSet: CombineHint heuristics

2017-08-31 Thread Urs Schoenenberger
Hi all, I was wondering about the heuristics for CombineHint: Flink uses SORT by default, but the doc for HASH says that we should expect it to be faster if the number of keys is less than 1/10th of the number of records. HASH should be faster if it is able to combine a lot of records, which hap
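The documented 1/10th rule of thumb can be written down as a tiny helper to make the trade-off concrete. This is purely illustrative; the class and method names are ours, not part of Flink's API:

```java
// Illustrative encoding of the documented rule of thumb: prefer HASH when the
// number of distinct keys is below one tenth of the record count, i.e. when
// each key combines ten or more records on average; otherwise fall back to
// SORT, Flink's default.
public class CombineHintHeuristic {

    public static String choose(long numRecords, long numDistinctKeys) {
        if (numDistinctKeys * 10 < numRecords) {
            // heavy reduction per key: the hash table combines many records in place
            return "HASH";
        }
        // few duplicates per key: sorting is the safer default
        return "SORT";
    }
}
```

In a real job the hint is passed via `reduceGroup(...).setCombineHint(CombineHint.HASH)` style calls; the open question in the thread is whether the heuristic should also account for how many records fit into a single sorter/hash table.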

Re: datastream.print() doesn't works

2017-08-31 Thread AndreaKinn
I verified I use just one environment. Unfortunately, (also without using start-local.sh) calling createLocalEnvironmentWithWebUI() and running the program from the IDE, still no running jobs are listed in the dashboard at http://localhost:8081/#/overview. In the IDE it is correctly executed

Re: dynamically partitioned stream

2017-08-31 Thread Martin Eden
Thanks for your reply Tony, Yes we are in the latter case, where the functions/lambdas come in the control stream. Think of them as strings containing the logic of the function. The values for each of the arguments to the function come from the data stream. That is why we need to co-locate the dat

Very low-latency - is it possible?

2017-08-31 Thread Marchant, Hayden
We're about to get started on a 9-person-month PoC using Flink Streaming. Before we get started, I am interested to know how low a latency I can expect for my end-to-end flow for a single event (from source to sink). Here is a very high-level description of our Flink design: We need at least onc

Re: Very low-latency - is it possible?

2017-08-31 Thread Jörn Franke
If you really need to get that low, something else might be more suitable. Given those times, a custom solution might be necessary. Flink is a generic, powerful framework; hence it does not address these latencies. > On 31. Aug 2017, at 14:50, Marchant, Hayden wrote: > > We're about to get starte

Operator variables in memory scoped by key

2017-08-31 Thread gerardg
I'm using a trie tree to match prefixes efficiently in an operator applied to a KeyedStream. It can grow quite large so I'd like to store the contents of the tree in a keyed MapState so I benefit from incremental checkpoints. Then, I'd just need to recreate the tree in memory from the MapState in c

Re: Very low-latency - is it possible?

2017-08-31 Thread Piotr Nowojski
Achieving 1ms in any distributed system might be problematic, because even the simplest ping messages between worker nodes take ~0.2ms. However, as you stated, your desired throughput (40k records/s) and state are small, so maybe there is no need for a distributed system at all? You could try

Re: Operator variables in memory scoped by key

2017-08-31 Thread Aljoscha Krettek
Hi, I'm afraid that's not possible. One operator instance is in fact, as you suspected, responsible for processing elements with many different keys. Only if you use the keyed state abstractions (ValueState, MapState and such) is your state "per-key". Best, Aljoscha > On 31. Aug 2017, at 15:1
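A sketch of the keyed-state pattern Aljoscha points to, applied to the original trie question: the per-key `MapState` holds the durable node entries, and the in-memory trie is rebuilt lazily after restore. This assumes the Flink 1.3 keyed-state API; the types, state name, and trie logic are placeholders, not anything specified in the thread.

```java
import java.util.Map;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Sketch: MapState is scoped to the current key, so each key gets its own
// durable set of trie nodes and benefits from incremental checkpoints.
public class PrefixMatcher extends RichFlatMapFunction<String, String> {

    private transient MapState<String, byte[]> trieNodes; // node path -> payload

    @Override
    public void open(Configuration parameters) {
        trieNodes = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("trie-nodes", String.class, byte[].class));
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        // Rebuild or consult the in-memory trie for the current key from
        // MapState entries (trie construction itself omitted in this sketch).
        for (Map.Entry<String, byte[]> e : trieNodes.entries()) {
            if (value.startsWith(e.getKey())) {
                out.collect(e.getKey()); // emit the matching prefix
            }
        }
    }
}
```

The plain member variable, by contrast, is shared by every key routed to that operator instance, which is exactly the pitfall the answer describes.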

Re: dynamically partitioned stream

2017-08-31 Thread Tony Wei
Hi Martin, So the problem is that you want to group those arguments in Data Stream and pass them to the lambda function from Control Stream at the same time. Am I right? If right, then you could give each lambda function an id as well. Use these ids to tag those arguments to which they belong. Af

Re: dynamically partitioned stream

2017-08-31 Thread Aljoscha Krettek
Hi Martin, In your original example, what does this syntax mean exactly: f1[A, B, C]1 Does it mean that f1 needs one A, one B and one C from the main stream? If yes, which ones, because there are multiple As and Bs and so on. Or does it mean that f1 can apply to an A or

Re: DataSet: CombineHint heuristics

2017-08-31 Thread Aljoscha Krettek
Hi, I would say that your assumption is correct and that the COMBINE strategy does in fact also depend on the ratio "#total records / #records that fit into a single Sorter/Hashtable". I'm CC'ing Fabian, just to be sure. He knows that stuff better than I do. Best, Aljoscha > On 31. Aug 2017,

Sink -> Source

2017-08-31 Thread Philip Doctor
I have a few Flink jobs. Several of them share the same code. I was wondering if I could make those shared steps their own job and then specify that the sink for one process is the source for another process, stitching my jobs together. Is this possible? I didn't see it in the docs. It feel

Re: Rest API for Checkpoint Data

2017-08-31 Thread Aljoscha Krettek
Hi, No, I'm afraid this is not possible. Why can't you write the data to some file system? Best, Aljoscha > On 31. Aug 2017, at 09:33, sohimankotia wrote: > > Is there way to read checkpoint ( if not configured to save to hdfs ) data > through rest api (or some other way) for monitoring purpo

Re: dynamically partitioned stream

2017-08-31 Thread Martin Eden
Hi Aljoscha, Tony, Aljoscha: Yes it's the first option you mentioned. Yes, the stream has multiple values in flight for A, B, C. f1 needs to be applied each time a new value for either A, B or C comes in. So we need to use state to cache the latest values. So using the example data stream in my fi

Flink on DCOS on DigitalOcean

2017-08-31 Thread Alexandru Gutan
Dear all, Can somebody confirm to have successfully installed Flink on DCOS (running on DigitalOcean) ? I've raised an issue on StackOverflow but nobody has assisted me. I've described everything in detail here: https://stackoverflow.com/questions/45391980/error-installing-flink-in-dcos I depl

Re: Flink Mesos Outstanding Offers - trouble launching task managers

2017-08-31 Thread prashantnayak
Hi Eron No, unfortunately we did not directly resolve it... we work around it for now by ensuring that our Mesos slaves are set up to correctly support the JobManager with offers. Prashant

Re: dynamically partitioned stream

2017-08-31 Thread Tony Wei
Hi Martin, Yes, that is exactly what I thought. But the first step also needs to be fulfilled by SideInput. I'm not sure how to achieve this in the current release. Best, Tony Wei Martin Eden wrote on Thursday, August 31, 2017 at 11:32 PM: > Hi Aljoscha, Tony, > > Aljoscha: > Yes it's the first option you mentione

Help with table UDF

2017-08-31 Thread Flavio Pompermaier
Hi all, I'm using Flink 1.3.1 and I'm trying to register a UDF, but there's something wrong. I always get the following exception: java.lang.UnsupportedOperationException: org.apache.flink.table.expressions.TableFunctionCall cannot be transformed to RexNode at org.apache.flink.table.expressions.Ex

Bucketing/Rolling Sink: New timestamp appended to the part file name every time a new part file is rolled

2017-08-31 Thread Raja . Aravapalli
Hi, I have a Flink application that is streaming data into HDFS and I am using the Bucketing Sink for that. And I want to know whether it is possible to rename the part files that are being created in the base HDFS directory. Right now I am using the code below to include the timestamp in the part-fil

RE: Using local FS for checkpoint

2017-08-31 Thread prashantnayak
We ran into issues using EFS (which under the covers is an NFS-like filesystem)... details are in this post http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/External-checkpoints-not-getting-cleaned-up-discarded-potentially-causing-high-load-tp14073p14106.html

Re: Flink session on Yarn - ClassNotFoundException

2017-08-31 Thread Banias H
We had the same issue. Get the HDP version, from /usr/hdp/current/hadoop-client/hadoop-common-.jar for example. Then rebuild Flink from source: mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version= for example: mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=2.7.3.2.6.1.0-129

Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-08-31 Thread ShB
Hi Fabian, Thanks for your response. If I implemented my own InputFormat, how would I read a specific list of files from S3? Assuming I need to use readFile(), below would read all of the files from the specified S3 bucket or path: env.readFile(MyInputFormat, "s3://my-bucket/") Is there a way
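One way to restrict the read to an explicit list of files with the 1.3-era DataSet API is to create one source per path and union them. This is a sketch under assumptions: the bucket, keys, and the use of `readTextFile` are placeholders (a protobuf InputFormat would be swapped in via `env.createInput(...)`):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadFileList {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical explicit list of S3 objects to read (placeholder paths).
        String[] paths = {
            "s3://my-bucket/data/part-0001.pb",
            "s3://my-bucket/data/part-0007.pb",
            "s3://my-bucket/data/part-0042.pb"
        };

        // Build one source per file and union them into a single DataSet.
        DataSet<String> all = null;
        for (String p : paths) {
            DataSet<String> one = env.readTextFile(p); // swap in a custom InputFormat here
            all = (all == null) ? one : all.union(one);
        }

        all.print();
    }
}
```

This keeps each file as its own source; combined with a custom InputFormat that emits one split per file, splits are still assigned lazily across the parallel reader tasks.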

Re: Help with table UDF

2017-08-31 Thread Fabian Hueske
Hi Flavio, you're not using the TableFunction correctly. The documentation shows how to call it in a join() method. But I agree, the error message should be better. Best, Fabian 2017-08-31 18:53 GMT+02:00 Flavio Pompermaier : > Hi all, > I'm using Flink 1.3.1 and I'm trying to register an UDF
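A sketch of the join-style invocation Fabian refers to, assuming the Flink 1.3 Java Table API; the table, column name, and function are placeholders:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.table.functions.TableFunction;

// A TableFunction emits zero or more rows per input value via collect();
// unlike a scalar UDF, it cannot appear as a plain expression and must be
// applied through a join in the Table API.
public class Split extends TableFunction<Tuple2<String, Integer>> {

    public void eval(String str) {
        for (String s : str.split(" ")) {
            collect(new Tuple2<>(s, s.length()));
        }
    }
}

// Usage sketch (tableEnv and myTable are placeholders):
//   tableEnv.registerFunction("split", new Split());
//   Table result = myTable.join("split(line) as (word, length)");
```

Calling `split(line)` as if it were a scalar function is what produces the "TableFunctionCall cannot be transformed to RexNode" error from the question.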

Re: Flink Mesos Outstanding Offers - trouble launching task managers

2017-08-31 Thread Eron Wright
Please mail me more information, in particular the JM log and the information on the 'offers' tab on the Mesos UI. Also, are you using any Mesos roles? Thanks On Thu, Aug 31, 2017 at 9:02 AM, prashantnayak < prash...@intellifylearning.com> wrote: > Hi Eron > > No, unfortunately we did not dire

Re: Flink Mesos Outstanding Offers - trouble launching task managers

2017-08-31 Thread Prashant Nayak
Thanks… will do this early next week. Appreciate the follow-up. Prashant On Thu, Aug 31, 2017 at 5:45 PM, Eron Wright wrote: > Please mail me more information, in particular the JM log and the > information on the 'offers' tab on the Mesos UI. Also, are you using any > Mesos roles? > > Thank

Re: dynamically partitioned stream

2017-08-31 Thread Martin Eden
This might be a way forward but since side inputs are not there I will try and key the control stream by the keys in the first co flat map. I'll see how it goes. Thanks guys, M On Thu, Aug 31, 2017 at 5:16 PM, Tony Wei wrote: > Hi Martin, > > Yes, that is exactly what I thought. > But the firs
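The approach Martin settles on can be sketched roughly as follows, assuming the Flink 1.3 API: key both streams by the argument key, connect them, and cache the latest data value per key in keyed state so a control message can be matched against it. The tuple layouts and field meanings are our own placeholders, not anything defined in the thread:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Keyed co-flat-map: flatMap1 caches the newest data value for the current
// key; flatMap2 reacts to a control message for that key by emitting the
// cached value tagged with the function id it belongs to.
public class ControlDataJoin
        extends RichCoFlatMapFunction<Tuple2<String, Double>,   // (key, value)
                                      Tuple2<String, String>,   // (key, functionId)
                                      Tuple2<String, Double>> { // (functionId, value)

    private transient ValueState<Double> latestValue;

    @Override
    public void open(Configuration parameters) {
        latestValue = getRuntimeContext().getState(
            new ValueStateDescriptor<>("latest-value", Double.class));
    }

    @Override
    public void flatMap1(Tuple2<String, Double> data,
                         Collector<Tuple2<String, Double>> out) throws Exception {
        latestValue.update(data.f1); // remember the newest value for this key
    }

    @Override
    public void flatMap2(Tuple2<String, String> control,
                         Collector<Tuple2<String, Double>> out) throws Exception {
        Double v = latestValue.value();
        if (v != null) {
            out.collect(Tuple2.of(control.f1, v)); // route value to the function id
        }
    }
}
```

Wiring it up would look like `data.keyBy(0).connect(control.keyBy(0)).flatMap(new ControlDataJoin())`; a downstream `keyBy` on the function id then co-locates all arguments of one function.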

Re: Bucketing/Rolling Sink: New timestamp appended to the part file name every time a new part file is rolled

2017-08-31 Thread Piotr Nowojski
Hi, BucketingSink doesn't support the feature you are requesting; you cannot specify a dynamically generated prefix/suffix. Piotrek > On Aug 31, 2017, at 7:12 PM, Raja.Aravapalli > wrote: > > > Hi, > > I have a flink application that is streaming data into HDFS and I am using > Bu