Nice to see this finally!

On Mon, Oct 1, 2018 at 1:53 AM Fabian Hueske <fhue...@gmail.com> wrote:
> Hi Bill,
>
> Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via
> the previously mentioned StreamingFileSink [1], [2].
>
> Best,
> Fabian
>
> [1] https://issues.apache.org/jira/browse/FLINK-9753
> [2] https://issues.apache.org/jira/browse/FLINK-9750
>
> On Fri, Sep 28, 2018 at 11:36 PM hao gao <hao.x....@gmail.com> wrote:
>
>> Hi Bill,
>>
>> I wrote the two Medium posts you mentioned above, but the techlab one
>> is clearly much better.
>> I would suggest just "closing the file when checkpointing," which is
>> the easiest way. If you use BucketingSink, you can modify the code to
>> make it work: just replace the code from lines 691 to 693 with
>> closeCurrentPartFile()
>>
>> https://github.com/apache/flink/blob/release-1.3.2-rc1/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L691
>>
>> This should guarantee exactly-once. You may end up with some files with
>> an underscore prefix when the Flink job fails, but those files are
>> usually ignored by query engines/readers, for example Presto.
>>
>> If you use 1.6 or later, I think the issue is already addressed:
>> https://issues.apache.org/jira/browse/FLINK-9750
>>
>> Thanks
>> Hao
>>
>> On Fri, Sep 28, 2018 at 1:57 PM William Speirs <wspe...@apache.org>
>> wrote:
>>
>>> I'm trying to stream log messages (syslog fed into Kafka) into Parquet
>>> files on HDFS via Flink. I'm able to read, parse, and construct
>>> objects for my messages in Flink; however, writing to Parquet is
>>> tripping me up. I do *not* need this to be real-time; a delay of a few
>>> minutes, even up to an hour, is fine.
>>>
>>> I've found the following articles talking about this being very
>>> difficult:
>>> * https://medium.com/hadoop-noob/a-realtime-flink-parquet-data-warehouse-df8c3bd7401
>>> * https://medium.com/hadoop-noob/flink-parquet-writer-d127f745b519
>>> * https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/
>>> * http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Rolling-sink-parquet-Avro-output-td11123.html
>>>
>>> All of these posts speak of trouble with the checkpointing mechanism
>>> and Parquet's need to perform batch writes. I'm not experienced enough
>>> with Flink's checkpointing or Parquet's file format to completely
>>> understand the issue. So my questions are as follows:
>>>
>>> 1) Is this possible in Flink in an exactly-once way? If not, is it
>>> possible in a way that _might_ cause duplicates during an error?
>>>
>>> 2) Is there another/better format to use other than Parquet that
>>> offers compression and the ability to be queried by something like
>>> Drill or Impala?
>>>
>>> 3) Any further recommendations for solving the overall problem:
>>> ingesting syslogs and writing them to file(s) searchable by an
>>> SQL(-like) framework?
>>>
>>> Thanks!
>>>
>>> Bill-
>>
>>
>> --
>> Thanks
>> - Hao
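For anyone finding this thread later, here is a minimal sketch of the
StreamingFileSink approach Fabian mentions (Flink 1.6+, flink-parquet
module). LogEvent, the output path, and the checkpoint interval are
assumptions for illustration: it presumes an Avro-generated
(SpecificRecord) class and an already-parsed stream.

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Bulk formats like Parquet roll the in-progress part file on every
// checkpoint, so the checkpoint interval is effectively the file-roll
// interval. Since a delay of up to an hour is acceptable here, a long
// interval keeps the files from getting too small.
env.enableCheckpointing(15 * 60 * 1000L);

// LogEvent is a hypothetical Avro-generated (SpecificRecord) class;
// `events` stands in for the Kafka source + parsing Bill already has.
DataStream<LogEvent> events = ...;

StreamingFileSink<LogEvent> sink = StreamingFileSink
        .forBulkFormat(
                new Path("hdfs:///data/logs"),  // hypothetical output directory
                ParquetAvroWriters.forSpecificRecord(LogEvent.class))
        .build();

events.addSink(sink);

Part files only become visible once the checkpoint that closed them
completes, which is what gives the exactly-once guarantee end to end.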
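And a rough sketch of the BucketingSink change Hao describes for 1.3.x.
The surrounding snapshotState loop and the exact field/method names
(bucketState.isWriterOpen, closeCurrentPartFile(bucketState)) are
assumptions based on the 1.3.2 sources linked above, so treat this as
illustrative rather than a drop-in patch:

// Inside BucketingSink#snapshotState(...), for each open bucket.
// The stock code around lines 691-693 only flushes the in-progress part
// file and records how many bytes of it are valid:
//
//     bucketState.currentFileValidLength = bucketState.writer.flush();
//
// The suggested change closes the part file at every checkpoint instead,
// so the Parquet footer is written and the file is complete on its own:
if (bucketState.isWriterOpen) {
    closeCurrentPartFile(bucketState);  // assumed private helper in BucketingSink
}

The trade-off is one part file per bucket per checkpoint interval, so a
short interval produces many small files; with an hour-scale delay
budget that is usually acceptable.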