Re: Error handling

2018-05-08 Thread Vishnu Viswanath
Was referring to the original email thread: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-handling-td3448.html On Tue, May 8, 2018 at 5:29 PM, vishnuviswanath < vishnu.viswanat...@gmail.com> wrote: > Hi, > > Wondering if any of these ideas were implemented after the

Re: Error handling

2018-05-08 Thread vishnuviswanath
Hi, Wondering if any of these ideas were implemented after the discussion? Thanks, Vishnu -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming and batch jobs together

2018-05-08 Thread Flavio Pompermaier
Thanks! Both solutions are reasonable, but what about max state size (per key)? Is there any suggested database/NoSQL store to use? On Tue, 8 May 2018, 18:09 TechnoMage, wrote: > If you use a KeyedStream you can group records by key (city) and then use > a RichFlatMap to

Re: Reading csv-files in parallel

2018-05-08 Thread Fabian Hueske
Hi, the Table API / SQL and the DataSet API can be used together in the same program. So you could read the data with a custom input format or a TextInputFormat and a custom MapFunction parser and hand it to SQL afterwards. The program would be a regular Scala DataSet program with an

Re: Lost JobManager

2018-05-08 Thread Fabian Hueske
Hi, I noticed that you configured the Akka framesize to 2GB (the default being 10MB). This appears like quite a lot to me and might be causing problems, since the exceptions indicate an Akka timeout issue. Did you configure the framesize that high for a particular reason? It seems that you are

Re: Processing Sorted Input Datasets

2018-05-08 Thread Fabian Hueske
Hi Helmut, In fact this is possible with the DataSet API. However, AFAIK it is an undocumented feature and probably not widely used. You can do this by specifying so-called SplitDataProperties on a DataSource as follows: DataSource src = env.createInput(...); SplitDataProperties splitProps =

Re: Streaming and batch jobs together

2018-05-08 Thread TechnoMage
If you use a KeyedStream you can group records by key (city) and then use a RichFlatMap to aggregate state in a MapState or ListState per key. You can then have that operator publish the updated results as a new aggregated record, or send them to a database or similar as you see fit. Michael > On
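The keyed-aggregation pattern Michael describes can be sketched in plain Java, with a HashMap standing in for Flink's MapState; the class and field names here are illustrative only, not Flink API:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the pattern from the thread: group records by key
// (e.g. city) and keep a running aggregate per key, emitting the updated
// value after each record. In Flink the map below would be MapState or
// ValueState inside a RichFlatMapFunction applied to a KeyedStream.
public class KeyedAggregator {
    private final Map<String, Long> countsByKey = new HashMap<>();

    // Process one record; returns the updated aggregate for that key,
    // mimicking the "publish the updated result" step from the thread.
    public long process(String key) {
        long updated = countsByKey.getOrDefault(key, 0L) + 1;
        countsByKey.put(key, updated);
        return updated;
    }
}
```

In a real job the per-key state would be managed by Flink's state backend, which also answers Flavio's follow-up about max state size: with RocksDB as the backend, state is spilled to disk rather than held entirely in heap.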

RE: Lost JobManager

2018-05-08 Thread Chan, Regina
There’s no collect() explicitly from me. It has a cogroup operator before writing to DataSink. From: Fabian Hueske [mailto:fhue...@gmail.com] Sent: Monday, May 07, 2018 6:31 AM To: Chan, Regina [Tech] Cc: user@flink.apache.org; Newport, Billy [Tech] Subject: Re: Lost JobManager Hi Regina, I

Fwd: Processing Sorted Input Datasets

2018-05-08 Thread Helmut Zechmann
Hi all, we want to use Flink batch to merge records from two or more datasets using groupBy. The input datasets are already sorted, since they have been written out sorted by some other job. Is it possible to

Re: jvm options for individual jobs / topologies

2018-05-08 Thread Kostas Kloudas
Hi Benjamin, I do not think you can set per job memory configuration in a shared cluster. The reason is that if different jobs share the same TM, there are resources that are shared between them, e.g, network buffers. If you are ok with having a separate cluster per job then this will allow you
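Since memory is configured per TaskManager rather than per job, the separate-cluster-per-job setup Kostas mentions means each cluster gets its own flink-conf.yaml. A hedged sketch; the key names are from the Flink 1.4/1.5 era and should be checked against your version's documentation:

```yaml
# flink-conf.yaml for a dedicated per-job cluster (Flink ~1.4/1.5 era keys;
# memory keys were renamed in later releases, so verify against your docs).
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 4096        # effectively the per-job Xmx
env.java.opts: "-XX:+UseG1GC"    # extra JVM options for all Flink processes
```

With one such cluster per job, each job's JVM heap is isolated, at the cost of duplicating the fixed overhead (network buffers, thread pools) per cluster.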

Re: Streaming and batch jobs together

2018-05-08 Thread Kostas Kloudas
Hi Flavio, If I understand correctly, you have a set of keys which evolves in two ways: keys may be added/deleted, and values associated with the keys can also be updated. If this is the case, you can use a streaming job that: 1. has as a source the stream of events
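The evolving keyed state Kostas describes (keys added/deleted, values updated, driven by an event stream) can be modeled with a plain-Java map keyed by entity id; the event kinds and names below are assumptions for illustration, not Flink API:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of evolving keyed state: an event stream adds, updates,
// or deletes entries keyed by entity id. In a Flink streaming job this map
// would be keyed state inside a ProcessFunction on a KeyedStream.
public class EntityStore {
    public enum Kind { ADD, UPDATE, DELETE }  // hypothetical event kinds

    private final Map<String, String> valuesById = new HashMap<>();

    public void apply(Kind kind, String id, String value) {
        switch (kind) {
            case ADD:
            case UPDATE:
                valuesById.put(id, value);  // upsert the latest value
                break;
            case DELETE:
                valuesById.remove(id);      // keys may also be deleted
                break;
        }
    }

    public String get(String id) { return valuesById.get(id); }
}
```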

RE: Reading csv-files in parallel

2018-05-08 Thread Esa Heikkinen
Hi, Would it be better to use the DataSet API, the Table (relational) API and readCsvFile(), because it is a slightly higher-level implementation? SQL also sounds very good in this (batch processing) case, but is it possible to use it (given the many different types of csv-files)? And does it understand

Re: Taskmanager with multiple slots vs running multiple taskmanagers with 1 slot each

2018-05-08 Thread Kostas Kloudas
Hi Andre, I cannot speak on behalf of everyone but I would recommend 1 TM with multiple slots. This way you pay the “fixed costs” of running a TM (like allocating memory for network buffers, launching thread pools, exchanging heartbeat messages etc) only once. On the flip-side, this means that

Re: Reading csv-files in parallel

2018-05-08 Thread Fabian Hueske
Hi, the easiest approach is to read the CSV files linewise as regular text files (ExecutionEnvironment.readTextFile()) and apply custom parse logic in a MapFunction. Then you have all freedom to deal with records of different schema. Best, Fabian 2018-05-08 12:35 GMT+02:00 Esa Heikkinen
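Fabian's suggestion (read lines with readTextFile() and parse in a MapFunction) comes down to per-line parse logic like the following; the record type, field names, and semicolon delimiter are assumptions for illustration, not taken from the thread:

```java
// Sketch of the custom parse logic Fabian suggests: each line of the CSV
// file is parsed into a typed record. In Flink this parse method would sit
// inside a MapFunction<String, Record> applied after readTextFile(), which
// leaves full freedom to branch on different schemas per file.
public class CsvLineParser {
    // Hypothetical record type; fields chosen for illustration only.
    public static final class Record {
        public final String name;
        public final int value;
        Record(String name, int value) { this.name = name; this.value = value; }
    }

    // Split on a semicolon delimiter (an assumption; adjust per file schema).
    public static Record parse(String line) {
        String[] fields = line.split(";");
        return new Record(fields[0].trim(), Integer.parseInt(fields[1].trim()));
    }
}
```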

RE: Reading csv-files in parallel

2018-05-08 Thread Esa Heikkinen
Hi, At this moment a batch query is ok. Do you know any good (Scala) examples of how to query batches (different types of csv-files) in parallel? Or do you have an example of a custom source function that reads csv-files in parallel? Best, Esa From: Fabian Hueske Sent: Monday,

Streaming and batch jobs together

2018-05-08 Thread Flavio Pompermaier
Hi all, I'd like to introduce in our pipeline an efficient way to aggregate incoming data around an entity. We have basically new incoming facts that are added (but also removed potentially) to an entity (by id). For example, when we receive a new name of a city we add this name to the known

jvm options for individual jobs / topologies

2018-05-08 Thread Benjamin Cuthbert
Hi all, How do I give different topologies / jobs different Xmx memory parameters? Regards

Re: Signal for End of Stream

2018-05-08 Thread Dhruv Kumar
Fabian, Thanks a lot for your continuous help! Really appreciate it. Sent from Phone. > On May 8, 2018, at 03:06, Fabian Hueske wrote: > > Hi Dhruv, > > The changes look good to me. > > Best, Fabian > > 2018-05-08 5:37 GMT+02:00 Dhruv Kumar : >>