Hi Jin,

I have no experience with your combination. Did you check whether you can
read the file back in a standalone Java program? That may give you some
meaningful logs.
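
For example, something like the following (a minimal sketch, assuming a
generated protobuf class MyEvent and the parquet-protobuf and hadoop-client
jars on the classpath; untested):

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.proto.ProtoParquetReader;

    public class ReadParquetProto {
        public static void main(String[] args) throws Exception {
            // read each row back as a protobuf builder and print it
            try (ParquetReader<MyEvent.Builder> reader =
                    ProtoParquetReader.<MyEvent.Builder>builder(new Path(args[0])).build()) {
                MyEvent.Builder row;
                while ((row = reader.read()) != null) {
                    System.out.println(row.build());
                }
            }
        }
    }

If that read fails, the stack trace should tell you more than the crawler does.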

On Mon, Mar 15, 2021 at 8:51 PM Jin Yi <j...@promoted.ai> wrote:

> using ParquetProtoWriters
> <https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/formats/parquet/protobuf/ParquetProtoWriters.html>,
> does anyone have this working with aws athena ingestion via aws glue crawlers?
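>
> (for context, the write path looks roughly like this; MyEvent is a
> stand-in for our generated protobuf class, and details may differ from
> the actual job):
>
>     import org.apache.flink.core.fs.Path;
>     import org.apache.flink.formats.parquet.protobuf.ParquetProtoWriters;
>     import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
>
>     // bulk-format sink that rolls a new parquet part file on each checkpoint
>     StreamingFileSink<MyEvent> sink = StreamingFileSink
>         .forBulkFormat(new Path("s3://bucket/events"),
>                        ParquetProtoWriters.forType(MyEvent.class))
>         .build();
>     stream.addSink(sink);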
>
> the parquet files being generated by our flink job look fine at a binary
> level, but the aws glue crawler crawling these files in s3 doesn't seem to
> be able to deserialize the row data properly.  the schema is correctly
> picked up, but the actual unmarshalling of the rows seems to fail (with no
> helpful logs).
>
> likewise, using parquet-tools or pqrs
> <https://github.com/manojkarthick/pqrs> locally shows the same behavior:
> the metadata reads perfectly fine, but the actual data does not.
>
> i'd like to verify that this is just a relatively atypical combination of
> formats (parquet and protos) that doesn't have widespread tooling support
> vs something i'm overlooking on my end.  for example, must i define the
> table manually in athena using a create table statement (most examples of
> parquet/protobuf use this approach) instead of relying on the schema
> inferred by the aws glue crawler?  i didn't go this route because it
> seemed counter to the spirit of parquet embedding its own schema.
>
> thanks!
>
