ClassNotFoundException with Flink 1.4.1 and Kafka connector

2018-02-27 Thread Debasish Ghosh
Hi - Facing a ClassNotFoundException while running a Flink application that reads from Kafka. This is a modified version of the NYC Taxi App that reads from Kafka. I am using Flink 1.4.1 .. The application runs ok with Flink 1.3 .. Here's the exception .. java.lang.ClassNotFoundException: >

Re: Flink Kafka reads too many bytes .... Very rarely

2018-02-27 Thread Philip Doctor
* The fact that I seem to get all of my data is currently leading me to discard and ignore this error. Please ignore this statement; I typed this email as I was testing a theory and meant to delete this line. This is still a very real issue for me. I was looking to try a workaround

Re: Flink Kafka reads too many bytes .... Very rarely

2018-02-27 Thread Philip Doctor
Hi Gordon and Fabian, I just re-ran my test case against Flink 1.3.2 and could not reproduce this error, so it does appear to be new to Flink 1.4.0 if my test is good. The difference between my local env and prod is mostly the scale: production has a multi-broker Kafka cluster with durable backups, etc.

Re: Reading csv-files

2018-02-27 Thread Fabian Hueske
Yes, that is mostly correct. You can of course read files in parallel, assign watermarks, and obtain a DataStream with correct timestamps and watermarks. If you do that, you should ensure that each parallel source task reads the files in the order of increasing timestamps. As I said before, you
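A minimal Scala sketch of the single-task, increasing-timestamp approach described above; the file path, record layout, and field names are assumptions made for illustration, not taken from the thread:

  import org.apache.flink.streaming.api.TimeCharacteristic
  import org.apache.flink.streaming.api.scala._

  // hypothetical record layout: "timestampMillis,sensorId,value"
  case class Reading(ts: Long, sensorId: String, value: Double)

  object CsvEventTimeSketch {
    def main(args: Array[String]): Unit = {
      val env = StreamExecutionEnvironment.getExecutionEnvironment
      env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
      env.setParallelism(1) // keep one reader so timestamps stay in increasing order

      val readings = env
        .readTextFile("file:///path/to/readings.csv") // hypothetical path
        .map { line =>
          val f = line.split(',')
          Reading(f(0).toLong, f(1), f(2).toDouble)
        }
        .assignAscendingTimestamps(_.ts)

      readings.print()
      env.execute("csv event-time ingestion sketch")
    }
  }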

Re: Reading csv-files

2018-02-27 Thread Esa Heikkinen
Hi, Thanks for the answer. All csv-files are already present and they will not change during the processing. Because Flink can read many streams in parallel, I think it is also possible to read many csv-files in parallel. From what I have understood, it is possible to convert csv-files to

Re: Reading csv-files

2018-02-27 Thread Fabian Hueske
Hi Esa, Reading records from files with timestamps that need watermarks can be tricky. If you are aware of Flink's watermark mechanism, you know that records should be ingested in (roughly) increasing timestamp order. This means that files usually cannot be split (i.e., need to be read by a single

Reading csv-files

2018-02-27 Thread Esa Heikkinen
I'd like to read csv-files, which include time series data and where one column is a timestamp. Is it better to use addSource() (like in the data Artisans RideCleansing exercise) or CsvTableSource()? I am not sure CsvTableSource() can understand timestamps? I have not found good examples about

Re: Fat jar fails deployment (streaming job too large)

2018-02-27 Thread Niels
In case it's useful, I've found how to enable a bit more debug logging on the jobmanager: flink_jobmanager_log.txt

Re: Unexpected hop start & end timestamps after stream SQL join

2018-02-27 Thread Fabian Hueske
Hi Juho, a query with an OVER aggregation should emit exactly one row for each input row. Does your comment that it "isn't catching all distinct values" mean that this is not the case? You can also combine tumbling windows and over aggregates by nesting queries as shown below: SELECT s_aid1,
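A hedged illustration of such an OVER aggregation, not the query from the thread: the table name, columns, and sample data are invented, and it assumes the sqlQuery() method of the 1.4 Table API.

  import org.apache.flink.streaming.api.TimeCharacteristic
  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.table.api.TableEnvironment
  import org.apache.flink.table.api.scala._
  import org.apache.flink.types.Row

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  val tEnv = TableEnvironment.getTableEnvironment(env)

  // hypothetical input: (account id, clicks, event timestamp in millis)
  val clicks = env
    .fromElements(("a1", 3L, 1519725600000L), ("a1", 5L, 1519727400000L))
    .assignAscendingTimestamps(_._3)

  // register the stream with an appended event-time attribute
  tEnv.registerDataStream("events", clicks, 'aid, 'clicks, 'ts, 'rowtime.rowtime)

  // emits one output row per input row, aggregating over the preceding hour
  val hourly = tEnv.sqlQuery(
    """
      |SELECT aid, SUM(clicks) OVER (
      |  PARTITION BY aid ORDER BY rowtime
      |  RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW) AS clicks_last_hour
      |FROM events
    """.stripMargin)

  hourly.toAppendStream[Row].print()
  env.execute("over aggregation sketch")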

Re: Using RowTypeInfo with Table API

2018-02-27 Thread Jens Grassel
Hi, On Tue, 27 Feb 2018 15:20:00 +0100 Timo Walther wrote: TW> you can always use TypeInformation.of() for all supported Flink TW> types. In [1] you can also find a list of all types. Thank you very much for your help. Regards, Jens -- CTO, Wegtam GmbH, 27. Hornung 2018,

Re: Using RowTypeInfo with Table API

2018-02-27 Thread Timo Walther
Hi Jens, you can always use TypeInformation.of() for all supported Flink types. In [1] you can also find a list of all types. Regards, Timo [1] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeinfo/Types.java Am 2/27/18 um 3:13 PM
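A tiny illustration of the TypeInformation.of() route; the case class here is hypothetical:

  import org.apache.flink.api.common.typeinfo.TypeInformation

  // hypothetical class whose type has to be stated explicitly somewhere
  case class Order(id: Long, item: String)

  // works for any supported Flink type; the Types class linked in [1]
  // lists the predefined TypeInformation constants
  val orderInfo: TypeInformation[Order] = TypeInformation.of(classOf[Order])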

Re: Unexpected hop start & end timestamps after stream SQL join

2018-02-27 Thread Juho Autio
Thanks for the hint! For some reason it isn't catching all distinct values (even though it's a much simpler way than what I initially tried and seems good in that sense). First of all, isn't this like a sliding window: "rowtime RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW"? My use

Re: Using RowTypeInfo with Table API

2018-02-27 Thread Jens Grassel
Hi, On Tue, 27 Feb 2018 14:43:06 +0100 Timo Walther wrote: TW> You can create it with org.apache.flink.table.api.Types.ROW(...). TW> You can check the type of your stream using ds.getType(). You can TW> pass information explicitly in Scala e.g. ds.map()(Types.ROW(...))

Re: SQL Table API: Naming operations done in query

2018-02-27 Thread Timo Walther
Hi Juan, usually the Flink operators contain the optimized expression that was defined in SQL. You can also name the entire job using env.execute("Your Name") if that would help to identify the query. Regarding checkpoints, it depends how you define "small changes". You must ensure that

Re: Using RowTypeInfo with Table API

2018-02-27 Thread Timo Walther
Hi Jens, usually Flink extracts the type information from the generic signature of a function, e.g. it knows the fields of a Tuple2. The Row type cannot be analyzed and therefore always needs explicit information. You can create it with

Using RowTypeInfo with Table API

2018-02-27 Thread Jens Grassel
Hi, I tried to create a table from a DataStream[Row] but got a (somewhat expected) error: <---snip---> An input of GenericTypeInfo cannot be converted to Table. Please specify the type of the input with a RowTypeInfo. <---snip---> Code looks like this: val ds: DataStream[Row] = ... val dT =
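Pulling the replies above together, a minimal sketch of one way past this error; the field names and types are invented for illustration:

  import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
  import org.apache.flink.api.java.typeutils.RowTypeInfo
  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.table.api.TableEnvironment
  import org.apache.flink.types.Row

  val env  = StreamExecutionEnvironment.getExecutionEnvironment
  val tEnv = TableEnvironment.getTableEnvironment(env)

  // hypothetical two-column rows: (id BIGINT, name VARCHAR)
  val rowInfo: TypeInformation[Row] = new RowTypeInfo(
    Array[TypeInformation[_]](BasicTypeInfo.LONG_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO),
    Array("id", "name"))

  // supplying the RowTypeInfo explicitly (second parameter list) keeps the
  // stream from falling back to GenericTypeInfo[Row]
  val ds: DataStream[Row] =
    env.fromElements(Row.of(Long.box(1L), "a"), Row.of(Long.box(2L), "b"))(rowInfo)

  // if the DataStream[Row] is produced elsewhere, an identity map can attach the type:
  //   val typed: DataStream[Row] = ds.map(r => r)(rowInfo)

  val dT = tEnv.fromDataStream(ds)
  dT.printSchema()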

Re: Fat jar fails deployment (streaming job too large)

2018-02-27 Thread Till Rohrmann
Hi Niels, the size of the jar does not play a role for Flink. What could be a problem is that the serialized `JobGraph` (user code with closures) is larger than 10 MB and, thus, exceeds the maximum default framesize of Akka. In such a case, it cannot be sent to the `JobMaster`. You can control
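For reference, the setting in question lives in flink-conf.yaml; the raised value below is only an example:

  # flink-conf.yaml
  # the default is 10485760b (10 MB); raise it if the serialized JobGraph is larger
  akka.framesize: 20971520b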

SQL Table API: Naming operations done in query

2018-02-27 Thread Juan Gentile
Hello, We are currently testing the SQL API using the 1.4.0 version of Flink and we would like to know if it’s possible to name a query or parts of it so we can easily recognize what it’s doing when we run it. An additional question: in case of small changes done to the query/ies, and assuming

Checkpointing Event Time Watermarks

2018-02-27 Thread vijay kansal
Hi All, Is there a way to checkpoint event time watermarks in Flink? I tried searching for this, but could not figure it out... Vijay Kansal Software Development Engineer LimeRoad

Re: Fat jar fails deployment (streaming job too large)

2018-02-27 Thread Fabian Hueske
Hi Niels, There should be no size constraints on the complexity of an application or the size of a JAR file. The problem that you describe sounds a bit strange and should be fixed. Apparently, it has to spend more time on planning / submitting the application than before. Have you tried to

Re: Suggested way to backfill for datastream

2018-02-27 Thread Tzu-Li (Gordon) Tai
Hi Chengzhi, Yes, generally speaking, you would launch a separate job to do the backfilling, and then shut down the job after the backfilling is completed. For this to work, you’ll also have to keep in mind that writes to the external sink must be idempotent. Are you using Kafka as the data

Re: Flink Kafka reads too many bytes .... Very rarely

2018-02-27 Thread Tzu-Li (Gordon) Tai
Hi Philip, Yes, I also have the question that Fabian mentioned. Did you start observing this only after upgrading to 1.4.0? Could you let me know what exactly your deserialization schema is doing? I don’t have any clues at the moment, but maybe there are hints there. Also, you mentioned that

Re: Flink Kafka reads too many bytes .... Very rarely

2018-02-27 Thread Fabian Hueske
Hi, Thanks for reporting the error and sharing the results of your detailed analysis! I don't have enough insight into the Kafka consumer, but noticed that you said you've been running your application for over a year and only noticed the faulty behavior recently. Flink 1.4.0 was released in mid

State based event's searching from files

2018-02-27 Thread Esa Heikkinen
Hi, I'd like to build an application which reads many different types of csv-files (with time series data), searches for certain events in the csv-files in the desired order, and stores the "attributes" of found events. The attributes can be used to search for the next event. This search process acts like

Fat jar fails deployment (streaming job too large)

2018-02-27 Thread Niels
Hi All, We've been using Flink 1.3.2 for a while now, but recently failed to deploy our fat jar to the cluster. The deployment only works when we remove 2 arbitrary operators, thus giving us the impression our job is too large. However, we only changed some case classes and serializers (to