Re: Hadoop FS when running standalone
Hi Lorenzo,

IIRC I had the same error message when trying to write snappified Parquet on HDFS with a standalone fat jar. Flink could not "find" the Hadoop native/binary libraries (specifically, I think my issue was related to Snappy), because my HADOOP_HOME was not (properly) set. I have never used S3, so I don't know whether what I mentioned could be the problem here too, but it is worth checking.

Best regards,
Alessandro

On Thu, 16 Jul 2020 at 12:59, Lorenzo Nicora wrote:
> Hi
>
> I need to run my streaming job as a *standalone* Java application, for testing.
> The job uses the Hadoop S3 FS and I need to test it (not a unit test).
>
> The job works fine when deployed (I am using AWS Kinesis Data Analytics,
> so Flink 1.8.2).
>
> I have *org.apache.flink:flink-s3-fs-hadoop* as a "compile" dependency.
>
> For running standalone, I have a Maven profile adding the dependencies that
> are normally provided (
> *org.apache.flink:flink-java*,
> *org.apache.flink:flink-streaming-java_2.11*,
> *org.apache.flink:flink-statebackend-rocksdb_2.11*,
> *org.apache.flink:flink-connector-filesystem_2.11*),
> but I keep getting the error "Hadoop is not in the classpath/dependencies"
> and it does not work.
> I tried adding *org.apache.flink:flink-hadoop-fs* with no luck.
>
> What dependencies am I missing?
>
> Cheers
> Lorenzo
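[Editor's note] One common cause of "Hadoop is not in the classpath/dependencies" with a standalone fat jar is that the shade plugin overwrites the `META-INF/services` files through which Flink discovers its file systems, so the factory from `flink-s3-fs-hadoop` never gets registered. A minimal sketch of the usual fix, assuming the standalone jar is built with `maven-shade-plugin` (the surrounding POM structure is illustrative):

```xml
<!-- Sketch: merge META-INF/services entries so Flink's S3 FileSystem
     factory registration survives shading.
     Assumption: you build the standalone jar with maven-shade-plugin. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Concatenates service-loader files instead of keeping only one -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Without the `ServicesResourceTransformer`, only one of the competing `org.apache.flink.core.fs.FileSystemFactory` service files survives in the fat jar, and the `s3://` scheme then falls back to the Hadoop-based factory, which requires Hadoop on the classpath.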
Re: Exactly-once ambiguities
> Regarding the event-time processing and watermarking, I have understood that
> if an event is received late, after the allowed lateness time, it will be
> dropped, even though I think that is an antithesis of exactly-once semantics.
>
> Yes, allowed lateness is a compromise between exactly-once semantics and an
> acceptable delay for a streaming application. Flink cannot ensure that all
> data sources generate data without lateness, and doing so is not within the
> scope of what a streaming system should do. If you really need exactly-once
> in event-time processing in this scenario, I suggest running a batch job
> later that consumes the whole data source and using that result as the
> credible one. For processing-time data, using Flink to generate a credible
> result is enough.

The default behavior is to drop late events, but you can tolerate as much lateness as you need via `allowedLateness()` (a window parameter) and re-trigger the window computation taking late events into account as well. Of course, memory consumption grows as the allowed lateness increases, so in streaming scenarios you usually go for a sensible trade-off, as Yun Tang mentioned.

To selectively store late events for further processing, you can use a custom `ProcessFunction` that sends late events to a side output, and store them somewhere (e.g., HDFS).

Best regards,
Alessandro
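[Editor's note] The mechanics described above can be sketched in the DataStream API. This is a hedged sketch, not code from the thread: the `Event` type, key field, window sizes, and output path are all hypothetical, and windowed streams also offer `sideOutputLateData()` as a built-in alternative to a custom `ProcessFunction` for capturing late events. Timestamp assignment and watermark generation are elided.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateEventsSketch {

    // Hypothetical event type for illustration only
    public static class Event {
        public String key;
        public long value;
    }

    // Tag identifying the side output that collects events too late
    // even for the allowed lateness
    static final OutputTag<Event> LATE = new OutputTag<Event>("late-events") {};

    static void wire(DataStream<Event> events) {
        SingleOutputStreamOperator<Event> results = events
            .keyBy(e -> e.key)
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            // Re-fire the window for events up to 10 minutes past the watermark
            .allowedLateness(Time.minutes(10))
            // Events later than watermark + allowed lateness land here
            // instead of being silently dropped
            .sideOutputLateData(LATE)
            // Window computation elided; this just keeps the last event per window
            .reduce((a, b) -> b);

        // Persist the very-late events for later (e.g. batch) reconciliation
        results.getSideOutput(LATE)
               .writeAsText("hdfs:///tmp/late-events"); // illustrative sink and path
    }
}
```

The side-output stream is only reachable from the operator that produced it (`getSideOutput()` on the `SingleOutputStreamOperator`), which is why the window result is kept in a local variable before wiring the late-event sink.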