Nice to see this finally!

On Mon, Oct 1, 2018 at 1:53 AM Fabian Hueske <fhue...@gmail.com> wrote:
> Hi Bill,
>
> Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via
> the previously mentioned StreamingFileSink [1], [2].
>
> Best,
> Fabian
>
> [1] https://issues.apache.org/jira/browse/FLINK-9753
> [2] https://issues.apache.org/jira/browse/FLINK-9750
>
> On Fri, Sep 28, 2018 at 11:36 PM hao gao <hao.x....@gmail.com> wrote:
>
>> Hi Bill,
>>
>> I wrote the two Medium posts you mentioned above, but the techlab one
>> is clearly much better.
>> I would suggest just "closing the file when checkpointing," which is
>> the easiest way. If you use BucketingSink, you can modify the code to
>> make it work: just replace the code from lines 691 to 693 with
>> closeCurrentPartFile()
>>
>> https://github.com/apache/flink/blob/release-1.3.2-rc1/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L691
>>
>> This should guarantee exactly-once. You may end up with some files with
>> an underscore prefix when the Flink job fails, but those files are
>> usually ignored by query engines/readers, for example Presto.
>>
>> If you use 1.6 or later, I think the issue is already addressed:
>> https://issues.apache.org/jira/browse/FLINK-9750
>>
>> Thanks
>> Hao
>>
>> On Fri, Sep 28, 2018 at 1:57 PM William Speirs <wspe...@apache.org>
>> wrote:
>>
>>> I'm trying to stream log messages (syslog fed into Kafka) into Parquet
>>> files on HDFS via Flink. I'm able to read, parse, and construct
>>> objects for my messages in Flink; however, writing to Parquet is
>>> tripping me up. I do *not* need this to be real-time; a delay of a few
>>> minutes, even up to an hour, is fine.
>>>
>>> I've found the following articles talking about this being very
>>> difficult:
>>> * https://medium.com/hadoop-noob/a-realtime-flink-parquet-data-warehouse-df8c3bd7401
>>> * https://medium.com/hadoop-noob/flink-parquet-writer-d127f745b519
>>> * https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/
>>> * http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Rolling-sink-parquet-Avro-output-td11123.html
>>>
>>> All of these posts speak of trouble with the checkpointing mechanism
>>> and Parquet's need to perform batch writes. I'm not experienced enough
>>> with Flink's checkpointing or Parquet's file format to completely
>>> understand the issue. So my questions are as follows:
>>>
>>> 1) Is this possible in Flink in an exactly-once way? If not, is it
>>> possible in a way that _might_ cause duplicates during an error?
>>>
>>> 2) Is there another/better format to use other than Parquet that
>>> offers compression and the ability to be queried by something like
>>> Drill or Impala?
>>>
>>> 3) Any further recommendations for solving the overall problem:
>>> ingesting syslogs and writing them to file(s) searchable by an
>>> SQL(-like) framework?
>>>
>>> Thanks!
>>>
>>> Bill-
>>
>>
>> --
>> Thanks
>> - Hao
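For anyone finding this thread later, here is a minimal sketch of the
StreamingFileSink approach Fabian mentions (Flink 1.6+, flink-parquet
module). LogEvent, the output path, and the checkpoint interval are
assumptions for illustration: it presumes an Avro-generated
(SpecificRecord) class and an already-parsed stream.

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Bulk formats like Parquet roll the in-progress part file on every
// checkpoint, so the checkpoint interval is effectively the file-roll
// interval. Since a delay of up to an hour is acceptable here, a long
// interval keeps the files from getting too small.
env.enableCheckpointing(15 * 60 * 1000L);

// LogEvent is a hypothetical Avro-generated (SpecificRecord) class;
// `events` stands in for the Kafka source + parsing Bill already has.
DataStream<LogEvent> events = ...;

StreamingFileSink<LogEvent> sink = StreamingFileSink
        .forBulkFormat(
                new Path("hdfs:///data/logs"),  // hypothetical output directory
                ParquetAvroWriters.forSpecificRecord(LogEvent.class))
        .build();

events.addSink(sink);

Part files only become visible once the checkpoint that closed them
completes, which is what gives the exactly-once guarantee end to end.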
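And a rough sketch of the BucketingSink change Hao describes for 1.3.x.
The surrounding snapshotState loop and the exact field/method names
(bucketState.isWriterOpen, closeCurrentPartFile(bucketState)) are
assumptions based on the 1.3.2 sources linked above, so treat this as
illustrative rather than a drop-in patch:

// Inside BucketingSink#snapshotState(...), for each open bucket.
// The stock code around lines 691-693 only flushes the in-progress part
// file and records how many bytes of it are valid:
//
//     bucketState.currentFileValidLength = bucketState.writer.flush();
//
// The suggested change closes the part file at every checkpoint instead,
// so the Parquet footer is written and the file is complete on its own:
if (bucketState.isWriterOpen) {
    closeCurrentPartFile(bucketState);  // assumed private helper in BucketingSink
}

The trade-off is one part file per bucket per checkpoint interval, so a
short interval produces many small files; with an hour-scale delay
budget that is usually acceptable.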