Object reuse in DataStreams

2018-07-17 Thread Urs Schoenenberger
Hi all, we came across some interesting behaviour today. We enabled object reuse on a streaming job that looks like this: stream = env.addSource(source) stream.map(mapFnA).addSink(sinkA) stream.map(mapFnB).addSink(sinkB) Operator chaining is enabled, so the optimizer fuses all operations into a
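
A minimal Java sketch of the topology described in the post, with object reuse switched on; the source, map functions, and sinks here are placeholders, since the original job's types are not shown:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;

    public class ObjectReuseTopology {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Object reuse is enabled, as in the job under discussion.
            env.getConfig().enableObjectReuse();

            // Placeholder source standing in for env.addSource(source).
            DataStream<String> stream = env.fromElements("a", "b", "c");

            // Two branches off the same stream; with chaining enabled, both maps
            // can be handed the very same record object when object reuse is on.
            stream.map(new MapFunction<String, String>() {
                @Override
                public String map(String value) { return value + "-A"; }
            }).addSink(new PrintSinkFunction<>());

            stream.map(new MapFunction<String, String>() {
                @Override
                public String map(String value) { return value + "-B"; }
            }).addSink(new PrintSinkFunction<>());

            env.execute("object-reuse-example");
        }
    }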

RateLimit for Kinesis Producer

2018-04-27 Thread Urs Schoenenberger
Hi all, we are struggling with RateLimitExceededExceptions with the Kinesis Producer. The Flink documentation claims that the Flink Producer overrides the RateLimit setting from Amazon's default of 150 to 100. I am wondering whether we'd need 100/($sink_parallelism) in order for it to work
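
For reference, a hedged sketch of passing the KPL RateLimit through the connector's producer Properties; the region, stream name, and the value of 100 per subtask are illustrative, and whether 100/$sink_parallelism is actually needed is exactly the open question above:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
    import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

    Properties producerConfig = new Properties();
    producerConfig.put(AWSConfigConstants.AWS_REGION, "eu-west-1");  // placeholder region

    // KPL RateLimit is a per-KPL-instance percentage of the shard limit; each parallel
    // sink subtask runs its own KPL instance, so several subtasks writing to the same
    // shard may together exceed the limit (the hypothesis in the post).
    producerConfig.put("RateLimit", "100");

    FlinkKinesisProducer<String> kinesis =
            new FlinkKinesisProducer<>(new SimpleStringSchema(), producerConfig);
    kinesis.setDefaultStream("my-stream");    // placeholder stream name
    kinesis.setDefaultPartition("0");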

Re: Unable to run flink job after adding a new dependency (cannot find Main-class)

2017-09-25 Thread Urs Schoenenberger
should be enough (I sure hope so, since this is some of the worst issues I've come across). Federico D'Ambrosio. On 25 Sep 2017, 9:51 AM, "Urs Schoenenberger" <urs.schoenenber...@tngtech.com> wrote: >> Hi Federico,

Re: Unable to run flink job after adding a new dependency (cannot find Main-class)

2017-09-25 Thread Urs Schoenenberger
Hi Federico, just guessing, but are you explicitly setting the Main-Class manifest attribute for the jar that you are building? Should be something like mainClass in (Compile, packageBin) := Some("org.yourorg.YourFlinkJobMainClass") Best, Urs On 23.09.2017 17:53, Federico D'Ambrosio wrote: >

Re: DataSet: CombineHint heuristics

2017-09-05 Thread Urs Schoenenberger
...strategy does in fact also depend on the ratio "#total records / #records that fit into a single Sorter/Hashtable". I'm CC'ing Fabian, just to be sure. He knows that stuff better than I do. Best, Aljoscha. On

DataSet: partitionByHash without materializing/spilling the entire partition?

2017-09-05 Thread Urs Schoenenberger
Hi all, we have a DataSet pipeline which reads CSV input data and then essentially does a combinable GroupReduce via first(n). In our first iteration (readCsvFile -> groupBy(0) -> sortGroup(0) -> first(n)), we got a jobgraph like this: source --[Forward]--> combine --[Hash Partition on 0,
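
A rough Java sketch of the first iteration of the pipeline described above (placeholder input path, a made-up (String, Long) record type, and n = 10 picked arbitrarily):

    import org.apache.flink.api.common.operators.Order;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Placeholder path and schema; the original post does not show the concrete record type.
    DataSet<Tuple2<String, Long>> input = env
            .readCsvFile("hdfs:///path/to/input.csv")
            .types(String.class, Long.class);

    // readCsvFile -> groupBy(0) -> sortGroup(0) -> first(n): Flink plans this as a
    // combinable GroupReduce, i.e. combine -> hash-partition on 0 -> sort -> reduce.
    DataSet<Tuple2<String, Long>> firstN = input
            .groupBy(0)
            .sortGroup(0, Order.ASCENDING)
            .first(10);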

Re: part files written to HDFS with .pending extension

2017-09-02 Thread Urs Schoenenberger
Hi, you need to enable checkpointing for your job. Flink uses ".pending" extensions to mark parts that have been completely written, but are not included in a checkpoint yet. Once you enable checkpointing, the .pending extensions will be removed whenever a checkpoint completes. Regards, Urs On
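
A minimal sketch of the fix being suggested, enabling checkpointing on the StreamExecutionEnvironment (the 60 s interval is an arbitrary example):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Checkpoint every 60 seconds; once a checkpoint covering a finished part file
    // completes, the sink drops the ".pending" suffix and finalizes the file.
    env.enableCheckpointing(60_000);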

DataSet: CombineHint heuristics

2017-08-31 Thread Urs Schoenenberger
Hi all, I was wondering about the heuristics for CombineHint: Flink uses SORT by default, but the doc for HASH says that we should expect it to be faster if the number of keys is less than 1/10th of the number of records. HASH should be faster if it is able to combine a lot of records, which
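
For context, a hedged sketch of overriding the default hint on a grouped reduce (the sample data and the sum function are illustrative only):

    import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple2<String, Long>> counts = env.fromElements(
            Tuple2.of("a", 1L), Tuple2.of("b", 2L), Tuple2.of("a", 3L));

    DataSet<Tuple2<String, Long>> summed = counts
            .groupBy(0)
            .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
            // SORT is the default; HASH is recommended when the number of distinct
            // keys is small relative to the record count (roughly < 1/10th).
            .setCombineHint(CombineHint.HASH);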

Re: stream partitioning to avoid network overhead

2017-08-11 Thread Urs Schoenenberger
Hi Karthik, maybe I'm misunderstanding, but there are a few things in your description that seem strange to me: - Your "slow" operator seems to be slow not because it's compute-heavy, but because it's waiting for a response. Is AsyncIO (
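
A hedged sketch of the AsyncIO pattern being hinted at, written against a recent AsyncDataStream API (the interface has changed somewhat across Flink versions); `input` is assumed to be an existing DataStream<String>, and the external lookup is a stand-in:

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;
    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    // Hypothetical async enrichment: the operator waits for the response without
    // blocking the task thread, so slow replies need not be hidden behind parallelism.
    class AsyncLookup extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> externalLookup(key))   // placeholder remote call
                    .thenAccept(v -> resultFuture.complete(Collections.singleton(v)));
        }

        private String externalLookup(String key) {
            return key + "-enriched";                         // stand-in for the real service
        }
    }

    // Up to 100 requests in flight, 5 s timeout (example values).
    DataStream<String> enriched =
            AsyncDataStream.unorderedWait(input, new AsyncLookup(), 5, TimeUnit.SECONDS, 100);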

Re: Why would a kafka source checkpoint take so long?

2017-07-12 Thread Urs Schoenenberger
Hi Gyula, I don't know the cause unfortunately, but we observed a similar issue on Flink 1.1.3. The problem seems to be gone after upgrading to 1.2.1. Which version are you running on? Urs On 12.07.2017 09:48, Gyula Fóra wrote: > Hi, I have noticed a strange behavior in one of our jobs:

Partition index from partitionCustom vs getIndexOfThisSubtask downstream

2017-06-27 Thread Urs Schoenenberger
Hi, if I use DataStream::partitionCustom, will the partition number that my custom Partitioner returns always be equal to getIndexOfThisSubtask in the following operator? A test case with different parallelisms seems to suggest this is true, but the Javadoc seems ambiguous to me since the
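
A small hedged sketch of the setup in question (an assumed DataStream<Integer> called `input` and a simple modulo partitioner for illustration); the open question is whether the index the partitioner returns always matches getIndexOfThisSubtask() in the downstream operator:

    import org.apache.flink.api.common.functions.Partitioner;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.streaming.api.datastream.DataStream;

    DataStream<Integer> partitioned = input.partitionCustom(
            new Partitioner<Integer>() {
                @Override
                public int partition(Integer key, int numPartitions) {
                    return key % numPartitions;   // illustrative modulo partitioner
                }
            },
            new KeySelector<Integer, Integer>() {
                @Override
                public Integer getKey(Integer value) {
                    return value;                 // the record itself is the key
                }
            });

    partitioned.map(new RichMapFunction<Integer, String>() {
        @Override
        public String map(Integer value) {
            // The question above: is this always equal to the partition index
            // that the custom partitioner returned for this record?
            int subtask = getRuntimeContext().getIndexOfThisSubtask();
            return value + " handled by subtask " + subtask;
        }
    });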

Re: Kafka and Flink integration

2017-06-22 Thread Urs Schoenenberger
Hi Greg, do you have a link where I could read up on the rationale behind avoiding Kryo? I'm currently facing a similar decision and would like to get some more background on this. Thank you very much, Urs On 21.06.2017 12:10, Greg Hogan wrote: > The recommendation has been to avoid Kryo where
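
For context, one knob related to this discussion is ExecutionConfig.disableGenericTypes(), which makes a job fail fast wherever a type would fall back to Kryo instead of a dedicated serializer (a minimal sketch, not from the original thread):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Throws at job submission wherever a data type would be serialized with Kryo,
    // instead of silently falling back to it.
    env.getConfig().disableGenericTypes();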

Re: DataSet: combineGroup/reduceGroup with large number of groups

2017-06-16 Thread Urs Schoenenberger
...to consider is the state backend. You'll probably have to use the RocksDBStateBackend to be able to spill state to disk. Hope this helps, Fabian. 2017-06-16 17:00 GMT+02:00 Urs Schoenenberger <urs.schoenenber...@tngtech.com>: >> Hi, I'm working

DataSet: combineGroup/reduceGroup with large number of groups

2017-06-16 Thread Urs Schoenenberger
Hi, I'm working on a batch job (roughly 10 billion records of input, 10 million groups) that is essentially a 'fold' over each group, that is, I have a function AggregateT addToAggrate(AggregateT agg, RecordT record) {...} and want to fold this over each group in my DataSet. My understanding
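
A rough Java sketch of the fold-per-group pattern described above, written as a combinable GroupReduce; the (String, Long) record type and the sum stand in for the post's RecordT/AggregateT and addToAggrate():

    import org.apache.flink.api.common.functions.GroupCombineFunction;
    import org.apache.flink.api.common.functions.GroupReduceFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    // Placeholder record/aggregate type: (key, value) pairs folded into (key, sum).
    public class FoldPerGroup
            implements GroupCombineFunction<Tuple2<String, Long>, Tuple2<String, Long>>,
                       GroupReduceFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

        private Tuple2<String, Long> fold(Iterable<Tuple2<String, Long>> group) {
            Tuple2<String, Long> agg = null;
            for (Tuple2<String, Long> record : group) {
                // addToAggrate(agg, record) in the post's terminology
                agg = (agg == null) ? record : Tuple2.of(agg.f0, agg.f1 + record.f1);
            }
            return agg;
        }

        @Override
        public void combine(Iterable<Tuple2<String, Long>> group,
                            Collector<Tuple2<String, Long>> out) {
            out.collect(fold(group));
        }

        @Override
        public void reduce(Iterable<Tuple2<String, Long>> group,
                           Collector<Tuple2<String, Long>> out) {
            out.collect(fold(group));
        }
    }

    // Usage: because the function implements both interfaces, reduceGroup is combinable,
    // i.e. partial aggregates are built per partition before the shuffle.
    // DataSet<Tuple2<String, Long>> result = records.groupBy(0).reduceGroup(new FoldPerGroup());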