Re: [External] Re: From Kafka Stream to Flink

2019-08-16 Thread Fabian Hueske
Hi Ruben, Work on this feature has already started [1], but it stalled a bit (probably due to the effort of merging the new Blink query processor). Hequn (in CC) is the one working on upsert table ingestion; maybe he can share the status of this feature. Best, Fabian [1] https://github.com/

Re: Customize file assignments logic in flink application

2019-08-16 Thread Zhu Zhu
Hi Lu, I think it's OK to choose either way as long as it works, though I've no idea how you would extend SplittableIterator in your case. The underlying implementation is ParallelIteratorInputFormat, and its processing is not tied to a certain subtask index. Thanks, Zhu Zhu On Fri, Aug 16, 2019 at 12:48 AM, Lu Niu wrote: >
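
For reference, a sketch of the SplittableIterator path discussed above, using the built-in NumberSequenceIterator in place of a custom iterator (an assumed setup, not Lu's actual code):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.util.NumberSequenceIterator

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // fromParallelCollection wraps the iterator in a ParallelIteratorInputFormat;
    // each subtask consumes one split, but which split lands on which subtask
    // is not under your control, as Zhu Zhu points out
    val numbers = env.fromParallelCollection(new NumberSequenceIterator(1L, 10000L))
    numbers.print()
    env.execute("parallel-iterator-example")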

Re: processing avro data source using DataSet API and output to parquet

2019-08-16 Thread Zhenghua Gao
Flink allows using Hadoop (MapReduce) OutputFormats in Flink jobs [1]. You can try the Parquet OutputFormat [2]. And if you can switch to the DataStream API, StreamingFileSink + ParquetBulkWriter meets your requirement [3][4]. [1] https://github.com/apache/flink/blob/master/flink-connectors/flink-hadoop
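
For reference, a minimal sketch of the DataStream route, assuming an Avro-generated SpecificRecord class named MyRecord and an HDFS output path (both invented for illustration):

    import org.apache.flink.core.fs.Path
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // bulk formats roll part files on checkpoints, so checkpointing must be enabled
    env.enableCheckpointing(60000)

    val records: DataStream[MyRecord] = ??? // your Avro source goes here

    val sink = StreamingFileSink
      .forBulkFormat(new Path("hdfs:///tmp/parquet-out"),
        ParquetAvroWriters.forSpecificRecord(classOf[MyRecord]))
      .build()

    records.addSink(sink)
    env.execute("avro-to-parquet")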

Re: End of Window Marker

2019-08-16 Thread Fabian Hueske
Hi Padarn, What you describe is essentially publishing Flink's watermarks to an outside system. Flink processes time windows by waiting for a watermark that's past the window end time. When it receives such a WM, it processes and emits all ended windows and forwards the watermark. When a sink rece
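
Not the full proposal, but as a sketch of the building block involved: an operator can read its current watermark next to each element with the stock ProcessFunction API (the tuple output type is an assumption):

    import org.apache.flink.streaming.api.functions.ProcessFunction
    import org.apache.flink.util.Collector

    // emits each element together with the operator's current watermark;
    // a custom sink could use the watermark to publish end-of-window markers
    class WatermarkTagger[T] extends ProcessFunction[T, (T, Long)] {
      override def processElement(value: T,
          ctx: ProcessFunction[T, (T, Long)]#Context,
          out: Collector[(T, Long)]): Unit = {
        out.collect((value, ctx.timerService().currentWatermark()))
      }
    }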

Re: I'm not able to make a stream-stream Time windows JOIN in Flink SQL

2019-08-16 Thread Fabian Hueske
Hi Theo, The main problem is that the semantics of your join (join all events that happened on the same day) are not yet well supported by Flink. In terms of true streaming joins, Flink supports the time-windowed join (with the BETWEEN predicate) and the time-versioned table join (which does not
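
For readers following along, the supported time-windowed join looks roughly like this (table and column names are invented; tableEnv is assumed to be a StreamTableEnvironment with both tables registered and event-time attributes declared):

    val result = tableEnv.sqlQuery(
      """
        |SELECT o.orderId, s.shipTime
        |FROM Orders o, Shipments s
        |WHERE o.orderId = s.orderId
        |  AND s.shipTime BETWEEN o.orderTime AND o.orderTime + INTERVAL '1' HOUR
      """.stripMargin)

Note that the bound is relative to each row's timestamp, which is why exact "same calendar day" semantics do not map cleanly onto this join.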

Re: Kafka producer failed with InvalidTxnStateException when performing commit transaction

2019-08-16 Thread Fabian Hueske
Hi Tony, I'm sorry I cannot help you with this issue, but Becket (in CC) might have an idea what went wrong here. Best, Fabian On Wed., Aug 14, 2019 at 07:00, Tony Wei wrote: > Hi, > > Currently, I was trying to update our kafka cluster with a larger `transaction.max.timeout.ms`. The > o

Re: How to implement Multi-tenancy in Flink

2019-08-16 Thread Fabian Hueske
Hi Ahmad, The ProcessFunction should not rely on new records arriving (i.e., doing the processing in the processElement() method) but rather register a timer every 5 minutes and perform the processing when the timer fires in onTimer(). Essentially, you'd only collect the data in `processElement()` an
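
A minimal sketch of that timer pattern, assuming a keyed stream of String events and a 5-minute processing-time interval (all names are illustrative):

    import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction
    import org.apache.flink.util.Collector
    import scala.collection.JavaConverters._

    class PeriodicProcessor extends KeyedProcessFunction[String, String, String] {
      // lazily initialized so getRuntimeContext is available by first use
      private lazy val buffer: ListState[String] = getRuntimeContext.getListState(
        new ListStateDescriptor[String]("buffer", classOf[String]))

      override def processElement(value: String,
          ctx: KeyedProcessFunction[String, String, String]#Context,
          out: Collector[String]): Unit = {
        buffer.add(value) // only collect here, no processing
        // align to the next 5-minute boundary; a timer registered twice
        // for the same timestamp fires only once per key
        val interval = 5 * 60 * 1000L
        val next = (ctx.timerService().currentProcessingTime() / interval + 1) * interval
        ctx.timerService().registerProcessingTimeTimer(next)
      }

      override def onTimer(timestamp: Long,
          ctx: KeyedProcessFunction[String, String, String]#OnTimerContext,
          out: Collector[String]): Unit = {
        val batch = buffer.get().asScala.toList
        buffer.clear()
        out.collect(s"key=${ctx.getCurrentKey}: processed ${batch.size} events")
      }
    }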

Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer

2019-08-16 Thread Terry Wang
Congratulations Andrey! Best, Terry Wang > On Aug 15, 2019, at 9:27 PM, Hequn Cheng wrote: > > Congratulations Andrey! > > On Thu, Aug 15, 2019 at 3:30 PM Fabian Hueske > wrote: > Congrats Andrey! > > On Thu., Aug 15, 2019 at 07:58, Gary Yao wrote

Fwd: Full GC (System.gc())rmi.dgc

2019-08-16 Thread Andrew Lin
Regular full GC happens because of RMI DGC, which triggers once per hour by default (3600000 ms). We can change the default value by setting the following parameters; it is recommended to increase them: -Dsun.rmi.dgc.client.gcInterval=7200000 -Dsun.rmi.dgc.server.gcInterval=7200000. The Sun ONE Application Server uses RMI in the Administratio

Re: Flink job parallelism

2019-08-16 Thread Biao Liu
Hi Vishwas, Regardless of the available task managers, what does your job really look like? Is it running with parallelism 2 or 1? It's hard to say what happened based on your description. Could you reproduce the scenario? If the answer is yes, could you provide more details? Like a screenshot, logs, co

Re: Understanding job flow

2019-08-16 Thread Vishwas Siravara
I did not find this to be true. Here is my code snippet. object DruidStreamJob extends Job with SinkFn { private[flink] val druidConfig = DruidConfig.current private[flink] val decryptMap = ExecutionEnv.loadDecryptionDictionary // TODO: Add this to sbt jvm, this should be set in sbt fo

Re: processing avro data source using DataSet API and output to parquet

2019-08-16 Thread Lian Jiang
Thanks. Which API (DataSet or DataStream) is recommended for file handling (no window operation required)? We have a similar scenario for real-time processing. Might it make sense to use the DataStream API for both batch and real-time for uniformity? Sent from my iPhone > On Aug 16, 2019, at 00:38, Zh

Update Checkpoint and/or Savepoint Timeout of Running Job without restart?

2019-08-16 Thread Aaron Levin
Hello, Question: Is it possible to update the checkpoint and/or savepoint timeout of a running job without restarting it? If not, is this something that would be a welcome contribution (not sure how easy it would be)? Context: sometimes we have jobs that are making progress but get into a state

Window Function that releases when downstream work is completed

2019-08-16 Thread Steven Nelson
Hello! I think I know the answer to this, but I thought I would go ahead and ask. We have a process that emits messages to our stream. These messages can include duplicates based on a certain key (we'll call it TheKey). Our Flink job reads the messages, keys by TheKey, and enters a window function
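
For context, the pipeline described is roughly the following (class and field names assumed):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    case class Message(theKey: String, payload: String)

    val stream: DataStream[Message] = ??? // the upstream source

    // keep one record per TheKey per window to drop duplicates; the open
    // question above is when to release the result, not how to deduplicate
    val deduped = stream
      .keyBy(_.theKey)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
      .reduce((first, _) => first)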

Product Evaluation

2019-08-16 Thread Ajit Saluja
Hi, We have a client requirement to implement a Complex Event Processing tool. We would like to evaluate whether Apache Flink meets our requirements. If there is any client/group who has implemented it, we would like to have a discussion to understand the capabilities of this tool better. If you may prov

Re: Product Evaluation

2019-08-16 Thread Oytun Tez
Hi Ajit, I believe many of us use CEP – we use it extensively. If you can be more specific, we'll try to respond more directly. --- Oytun Tez *M O T A W O R D* The World's Fastest Human Translation Platform. oy...@motaword.com — www.motaword.com On Fri, Aug 16, 2019 at 5:06 PM Ajit Saluja wro
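
For a flavor of what Flink CEP usage looks like (a generic sketch, not MotaWord's code; the event type and thresholds are invented):

    import org.apache.flink.cep.scala.CEP
    import org.apache.flink.cep.scala.pattern.Pattern
    import org.apache.flink.streaming.api.scala._

    case class Reading(deviceId: String, temperature: Double)

    val readings: DataStream[Reading] = ??? // your source

    // a warning-level reading followed directly by a critical one, per device
    val pattern = Pattern.begin[Reading]("warning").where(_.temperature > 80)
      .next("critical").where(_.temperature > 100)

    val alerts = CEP.pattern(readings.keyBy(_.deviceId), pattern)
      .select(m => s"device ${m("critical").head.deviceId} went critical")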

Using S3 as a sink (StreamingFileSink)

2019-08-16 Thread Swapnil Kumar
Hello, We are using Flink to process input events, aggregate them, and write the output of our streaming job to S3 using StreamingFileSink, but whenever we try to restore the job from a savepoint, the restoration fails with a missing part files error. As per my understanding, S3 deletes those part (intermittent)
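
For reference, a minimal sketch of the setup being described, with the checkpointing that StreamingFileSink depends on (bucket name and interval are invented):

    import org.apache.flink.api.common.serialization.SimpleStringEncoder
    import org.apache.flink.core.fs.Path
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // part files move from in-progress/pending to finished on checkpoint completion
    env.enableCheckpointing(60000)

    val aggregated: DataStream[String] = ??? // the aggregated output of the job

    val sink: StreamingFileSink[String] = StreamingFileSink
      .forRowFormat(new Path("s3://my-bucket/output"), new SimpleStringEncoder[String]("UTF-8"))
      .build()

    aggregated.addSink(sink)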

Re: Using S3 as a sink (StreamingFileSink)

2019-08-16 Thread Oytun Tez
Hi Swapnil, I am not familiar with the StreamingFileSink; however, this sounds like a checkpointing issue to me. FileSink should keep its sink state, and remove from the state the files that it *really successfully* sinks (perhaps you may want to add a validation here with S3 to check file integrit

Stale watermark due to unconsumed Kafka partitions

2019-08-16 Thread Eduardo Winpenny Tejedor
Hi all, It was a bit tricky to figure out what was going wrong here; hopefully someone can add the missing piece to the puzzle. I have a Kafka source with a custom AssignerWithPeriodicWatermarks timestamp assigner. It's a copy of the AscendingTimestampExtractor with a log statement printing each
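
For concreteness, such an assigner looks roughly like this (modeled on AscendingTimestampExtractor; the event type and timestamp field are assumptions):

    import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
    import org.apache.flink.streaming.api.watermark.Watermark
    import org.slf4j.LoggerFactory

    case class MyEvent(ts: Long, value: String)

    class LoggingAscendingAssigner extends AssignerWithPeriodicWatermarks[MyEvent] {
      @transient private lazy val log = LoggerFactory.getLogger(classOf[LoggingAscendingAssigner])
      private var maxTs: Long = Long.MinValue + 1

      override def extractTimestamp(e: MyEvent, previousTs: Long): Long = {
        maxTs = math.max(maxTs, e.ts)
        log.info(s"extracted timestamp ${e.ts}") // the log statement mentioned above
        e.ts
      }

      // the watermark trails the max seen timestamp by 1 ms, as in AscendingTimestampExtractor
      override def getCurrentWatermark: Watermark = new Watermark(maxTs - 1)
    }

Worth keeping in mind: when watermarks are generated per Kafka partition, the source's watermark is the minimum across its partitions, so an empty or unconsumed partition holds the overall watermark back indefinitely.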

Re: Understanding job flow

2019-08-16 Thread Victor Wong
Hi Vishwas, Since `DruidStreamJob` is a Scala `object` and the initialization of your sds client is not within any method, it will run every time `DruidStreamJob` is loaded (like a static block in Java). Your taskmanagers are different JVM processes, and `DruidStreamJob` needs to be
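
To illustrate Victor's point (the client type is a hypothetical stand-in):

    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

    class SdsClient { def send(v: String): Unit = () } // hypothetical client

    // fields in a Scala object's body run on first class load, once per JVM,
    // i.e. once on each TaskManager, like a static initializer in Java
    object ClientHolder {
      val client = new SdsClient()
    }

    // safer: create per-subtask resources in open(), which runs on the TaskManager
    class DruidSink extends RichSinkFunction[String] {
      @transient private var client: SdsClient = _
      override def open(parameters: Configuration): Unit = {
        client = new SdsClient()
      }
      override def invoke(value: String): Unit = client.send(value)
    }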

Re: End of Window Marker

2019-08-16 Thread Padarn Wilson
Hi Fabian, thanks for your input. Exactly. Actually, my first instinct was to see if it was possible to publish the watermarks somehow; my initial idea was to insert regular watermark messages into each partition of the stream, but exposing this seemed quite troublesome. > In that case, you could