Re: Hadoop FS when running standalone

2020-07-16 Thread Alessandro Solimando
Hi Lorenzo,
IIRC I had the same error message when trying to write Snappy-compressed
Parquet on HDFS with a standalone fat jar.

Flink could not "find" the Hadoop native/binary libraries (specifically, I
think in my case the issue was related to Snappy), because my HADOOP_HOME
was not (properly) set.

I have never used S3, so I don't know whether what I mentioned is the
problem here too, but it is worth checking.
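
In case it is the same classpath problem, here is a sketch of the kind of
dependency that solved it for me, assuming Flink 1.8+, where Hadoop is no
longer bundled with Flink. The version below is only an example (it pairs a
Hadoop version with a flink-shaded version) and must match your environment:

    <!-- shaded Hadoop uber jar, no longer bundled with Flink since 1.8;
         pick the version matching your Hadoop setup -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-shaded-hadoop-2-uber</artifactId>
      <version>2.8.3-7.0</version>
      <scope>runtime</scope>
    </dependency>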

Best regards,
Alessandro

On Thu, 16 Jul 2020 at 12:59, Lorenzo Nicora wrote:

> Hi
>
> I need to run my streaming job as a *standalone* Java application for
> testing.
> The job uses the Hadoop S3 FS, and I need to test it (not a unit test).
>
> The job works fine when deployed (I am using AWS Kinesis Data Analytics,
> so Flink 1.8.2).
>
> I have *org.apache.flink:flink-s3-fs-hadoop* as a "compile" dependency.
>
> For running standalone, I have a Maven profile adding dependencies that
> are normally provided (
> *org.apache.flink:flink-java*,
> *org.apache.flink:flink-streaming-java_2.11*,
> *org.apache.flink:flink-statebackend-rocksdb_2.11*,
> *org.apache.flink:flink-connector-filesystem_2.11*) but I keep getting
> the error "Hadoop is not in the classpath/dependencies" and it does not
> work.
> I tried adding *org.apache.flink:flink-hadoop-fs*, with no luck.
>
> What dependencies am I missing?
>
> Cheers
> Lorenzo
>


Re: Exactly-once ambiguities

2019-12-30 Thread Alessandro Solimando
> Regarding event-time processing and watermarking, I understand that if
> an event is received late, after the allowed lateness, it will be
> dropped, which I think is the antithesis of exactly-once semantics.
>
> Yes, allowed lateness is a compromise between exactly-once semantics and
> an acceptable delay for a streaming application. Flink cannot ensure that
> all data sources generate data without any lateness, and doing so is not
> in the scope of what a streaming system should do. If you really need
> exactly-once in event-time processing in this scenario, I suggest running
> a batch job later that consumes all the data sources and using its result
> as the credible one. For processing-time data, using Flink to generate a
> credible result is enough.
>

The default behavior is to drop late events, but you can tolerate as much
lateness as you need via `allowedLateness()` (a window parameter) and
re-trigger the window computation taking late events into account as well.
Of course, memory consumption increases as the allowed lateness grows, so
in streaming scenarios you usually go for a sensible trade-off, as Yun Tang
mentioned. To selectively store late events for further processing, you can
use a custom `ProcessFunction` that sends late events to a side output, and
store them somewhere (e.g., HDFS); see the sketch below.
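
For illustration, a minimal sketch of that pattern (the class name
`LateEventRouter` is made up, and the stream is assumed to have timestamps
assigned upstream):

    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    public class LateEventRouter<T> extends ProcessFunction<T, T> {

        private final OutputTag<T> lateOutput;

        public LateEventRouter(OutputTag<T> lateOutput) {
            this.lateOutput = lateOutput;
        }

        @Override
        public void processElement(T event, Context ctx, Collector<T> out) {
            Long ts = ctx.timestamp();
            // An element is late when its timestamp is behind the current watermark.
            if (ts != null && ts < ctx.timerService().currentWatermark()) {
                ctx.output(lateOutput, event); // divert late events to the side output
            } else {
                out.collect(event);            // on-time events continue downstream
            }
        }
    }

The diverted stream can then be obtained with `getSideOutput(lateOutput)` on
the operator's result and written to HDFS. Note that for windowed
aggregations, `WindowedStream` also offers this out of the box via
`sideOutputLateData(OutputTag)`.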

Best regards,
Alessandro