Hi

Thanks for the answer. All the CSV files are already present and they will not change during the processing.

Because Flink can read many streams in parallel, I think it should also be possible to read many CSV files in parallel.

From what I have understood, it is possible to convert CSV files to streams internally in Flink? But the problem may be how to synchronize the parallel reading of the CSV files based on their timestamps?

Maybe I should develop an external "replayer" of CSV files, which generates parallel streams of events (based on their timestamps) for Flink?

But I think the "replayer" could also be implemented in Flink itself, and it could even be run at an accelerated speed?

The RideCleansing example does something like that, but I don't know whether it is otherwise appropriate for my purpose.
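
Something like the sketch below is what I have in mind for the "replayer": a simple SourceFunction that reads one CSV file and emits the rows as timestamped events at an accelerated speed, similar to the TaxiRideSource in the RideCleansing exercise. The file layout (timestamp as epoch milliseconds in the first column, no header) and the speedup factor are just assumptions for the sketch:

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.watermark.Watermark;

import java.io.BufferedReader;
import java.io.FileReader;

// Rough sketch of a replayer source. Assumptions: the event timestamp is
// epoch milliseconds in the first CSV column, there is no header line, and
// the file is replayed SPEEDUP times faster than real time.
public class CsvReplayerSource implements SourceFunction<String> {

    private static final long SPEEDUP = 60; // replay 60x faster than real time

    private final String path;
    private volatile boolean running = true;

    public CsvReplayerSource(String path) {
        this.path = path;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            long previousTs = -1;
            String line;
            while (running && (line = reader.readLine()) != null) {
                long ts = Long.parseLong(line.split(",")[0]);
                if (previousTs >= 0 && ts > previousTs) {
                    // sleep for a fraction of the real-time gap between events
                    Thread.sleep((ts - previousTs) / SPEEDUP);
                }
                ctx.collectWithTimestamp(line, ts);
                ctx.emitWatermark(new Watermark(ts));
                previousTs = ts;
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

I guess I could run one such source per CSV file, but the synchronization between the parallel sources is still unclear to me.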

Best, Esa


Fabian Hueske wrote on 27.2.2018 at 22:32:
Hi Esa,

Reading records from files with timestamps that need watermarks can be tricky. If you are aware of Flink's watermark mechanism, you know that records should be ingested in (roughly) increasing timestamp order. This means that files usually cannot be split (i.e., each file needs to be read by a single task from start to end) and also need to be read in the right order (files with smaller timestamps first). Also, each file should contain records of a certain time interval that should not overlap (too much) with the time intervals of the other files.
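
Just to illustrate the watermark part, here is a minimal sketch of an assigner, assuming the rows are parsed into Flink's Row type, the event timestamp is epoch milliseconds in field 0, and the data is at most 10 seconds out of order:

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.types.Row;

// Assigns event-time timestamps from field 0 and emits watermarks that
// tolerate up to 10 seconds of out-of-orderness.
public class CsvRowTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<Row> {

    public CsvRowTimestampExtractor() {
        super(Time.seconds(10));
    }

    @Override
    public long extractTimestamp(Row row) {
        // field 0 is assumed to hold the event timestamp in milliseconds
        return (Long) row.getField(0);
    }
}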

Unfortunately, Flink does not provide good built-in support for reading files in a specific order. If all files that you want to process are already present, you can implement a custom InputFormat by extending a CsvInputFormat, setting unsplittable to true, and overriding getInputSplitAssigner() to return an assigner that hands out the splits in the correct order.

If you want to process files as they appear, things might be a bit easier, given that the timestamps in each new file are larger than the timestamps of the previous files. In this case, you can use StreamExecutionEnvironment.readFile() with the interval and FileProcessingMode parameters. With a correctly configured watermark assigner, it should be possible to get valid watermarks.
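
A rough sketch of how this could be wired up (the directory, the schema, and the 60 second scan interval are placeholders; CsvRowTimestampExtractor is the assigner sketched above):

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.io.RowCsvInputFormat;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.types.Row;

public class CsvFileStreamJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // assumed schema: timestamp (millis), key, value
        TypeInformation<?>[] fieldTypes = {Types.LONG, Types.STRING, Types.DOUBLE};
        String dir = "file:///data/csv";
        RowCsvInputFormat format = new RowCsvInputFormat(new Path(dir), fieldTypes);

        // scan the directory every 60 seconds for new files
        DataStream<Row> rows = env
            .readFile(format, dir, FileProcessingMode.PROCESS_CONTINUOUSLY, 60_000L,
                      new RowTypeInfo(fieldTypes))
            .assignTimestampsAndWatermarks(new CsvRowTimestampExtractor());

        rows.print();
        env.execute("CSV file stream");
    }
}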

In any case, reading timestamped data from files is much more tricky than ingesting data from an event log which provides the events in the same order in which they were written.

Best, Fabian

2018-02-27 20:13 GMT+01:00 Esa Heikkinen <heikk...@student.tut.fi>:


    I'd like to read CSV files, which contain time series data where
    one column is a timestamp.

    Is it better to use addSource() (like in the data Artisans
    RideCleansing exercise) or CsvTableSource()?

    I am not sure whether CsvTableSource() can understand timestamps? I
    have not found good examples of that.

    Maybe it is a little more work to write a CSV parser in the addSource() case?

    Best, Esa


