[ https://issues.apache.org/jira/browse/FLINK-26301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496715#comment-17496715 ]
Dawid Wysakowicz edited comment on FLINK-26301 at 2/23/22, 1:41 PM:
--------------------------------------------------------------------
# Personally I find it strange that the Parquet documentation uses {{RowData}} and {{LogicalType}}, which are table-specific classes. {{LogicalType}} in particular comes from the {{flink-table-common}} package and thus requires an additional dependency. I am not sure this is a good usage example, at least not as the main one. I would rather see it further down, along with some description of its relation to the Table API.
# It might be just personal taste, but I found the format documentation a bit cluttered by having examples for both the bounded and unbounded case. As far as I can tell, boundedness is irrelevant from the point of view of the format, and the FileSource documentation already covers those two modes. I can be convinced otherwise, though.
# I'd expect some explanation of the differences between forBulkFormat/forStreamRecordFormat in the docs, preferably with a compatibility matrix. E.g. can I use AvroParquet with a bulk format? Can I somehow read into POJOs using a bulk format? (Even a prominent cross-link to some common place would be good.)
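For context, a DataStream-oriented usage example that avoids the table-specific classes might look roughly like the following sketch. The schema and file path are illustrative, and it assumes the flink-parquet and flink-avro dependencies are on the classpath; it is not taken from the documentation under discussion.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AvroParquetReadSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative Avro schema; any GenericRecord-compatible schema works.
        Schema schema = SchemaBuilder.record("User").fields()
                .requiredString("name")
                .requiredInt("age")
                .endRecord();

        // AvroParquetReaders produces a StreamFormat, so the source is built
        // via the record-stream-format entry point, not the bulk-format one.
        FileSource<GenericRecord> source = FileSource
                .forRecordStreamFormat(
                        AvroParquetReaders.forGenericRecord(schema),
                        new Path("/tmp/parquet-input"))   // hypothetical path
                .build();

        DataStream<GenericRecord> stream = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "avro-parquet-source");
        stream.print();
        env.execute("AvroParquet read sketch");
    }
}
```

Analogous {{AvroParquetReaders.forSpecificRecord}} and {{forReflectRecord}} variants exist for the specific/reflect record cases mentioned in the test scenarios.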
> Test AvroParquet format
> -----------------------
>
>                 Key: FLINK-26301
>                 URL: https://issues.apache.org/jira/browse/FLINK-26301
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>            Reporter: Jing Ge
>            Assignee: Dawid Wysakowicz
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
> The following scenarios are worth testing:
> * Start a simple job with a none/at-least-once/exactly-once delivery guarantee, read Avro Generic/Specific/Reflect records, and write them to an arbitrary sink.
> * Start the above job with bounded/unbounded data.
> * Start the above job with streaming/batch execution mode.
>
> This format works with FileSource [2] and can only be used with DataStream. Normal Parquet files can be used as test files. The schema introduced at [1] could be used.
>
> References:
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/formats/parquet/
> [2] https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/
>
-- 
This message was sent by Atlassian Jira
(v8.20.1#820001)
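The bounded/unbounded scenario in the description is toggled on the FileSource builder rather than on the format itself. A minimal sketch, assuming a `schema` built as in the example from the Parquet docs [1] and a hypothetical input path:

```java
// Bounded by default: the source reads the files present at start and finishes.
// monitorContinuously(...) switches it to an unbounded, continuously monitoring
// source, which is what the streaming/unbounded test scenario exercises.
FileSource<GenericRecord> unbounded = FileSource
        .forRecordStreamFormat(
                AvroParquetReaders.forGenericRecord(schema),   // schema as in [1]
                new Path("/tmp/parquet-input"))                // hypothetical path
        .monitorContinuously(Duration.ofSeconds(10))
        .build();
```

Omitting the `monitorContinuously` call yields the bounded variant, which pairs naturally with batch execution mode.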