Re: Selecting format_version=2.6 ?

Stephen Colebourne Fri, 15 Mar 2024 16:19:06 -0700

We are writing a file with parquet-mr, but AWS Redshift COPY fails to read
it. When the same file is loaded and re-saved in Python, it imports without
problems. The first visible difference between the Java and Python files is
the format version, so I had hoped there would be a simple way to rule out
the format version as the cause.


That said, the problem seems to be with dates and times. We write them as
micros, but it isn't clear what the import needs or expects.

     optional int64 calculationStartTime (TIMESTAMP(MICROS,true))
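
For reference, a minimal sketch of how the int64 value for such a column
can be derived, assuming the usual reading of TIMESTAMP(MICROS,true) as
microseconds since the Unix epoch in UTC. The class and method names here
are hypothetical and are not part of parquet-mr; this only illustrates the
value itself and says nothing about how the column is encoded on disk.

```java
import java.time.Instant;

// Hypothetical helper, not a parquet-mr class.
public final class TimestampMicros {
    private TimestampMicros() {}

    /** Microseconds since 1970-01-01T00:00:00Z; sub-microsecond digits are dropped. */
    public static long toMicros(Instant instant) {
        return Math.addExact(
            Math.multiplyExact(instant.getEpochSecond(), 1_000_000L),
            instant.getNano() / 1_000L);
    }

    /** Inverse of toMicros; floorDiv/floorMod keep pre-1970 values correct. */
    public static Instant fromMicros(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        long microOfSecond = Math.floorMod(micros, 1_000_000L);
        return Instant.ofEpochSecond(seconds, microOfSecond * 1_000L);
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2024-03-15T16:19:06.123456Z");
        long micros = toMicros(start);
        System.out.println(micros);
        System.out.println(fromMicros(micros));
    }
}
```

The exact arithmetic methods are used so that an out-of-range instant
throws rather than silently wrapping around.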

COPY of the file into Redshift:

 redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000',
'M': 'Spectrum Scan Error', 'D': "\n
-----------------------------------------------\n  error:  Spectrum Scan
Error\n  code:      15001\n  context:   File '
https://s3.eu-west-1.amazonaws.com/bucket/results__calculation.parquet' is
corrupt: error decoding delta-binary-packed-encoded value of type TIMESTAMP
at offset 84\n  query:     8738249[child_sequence:4]\n  location:
dory_util.cpp:1579\n  process:   worker_thread [pid=28293]\n
-----------------------------------------------\n", 'F':
'../src/sys/xen_execute.cpp', 'L': '12414', 'R': 'pg_throw'}

Is there any documentation on the configuration you mention below? Could
that have any impact on date columns?

Any other suggestions welcome.

Stephen




On Fri, 15 Mar 2024, 16:07 Gang Wu, <[email protected]> wrote:

> Hi Stephen,
>
> Thanks for raising the issue! You are right that the version is always
> 1 written by parquet-mr. This is something we need to fix. However,
> IMHO, the community does not have a clear answer on the definition
> of parquet format v2. Which feature are you referring to specifically in
> the version 2.6? It seems that you don't have to bother with the version
> and just set the config to enable it.
>
> Best,
> Gang
>
> On Fri, Mar 15, 2024 at 6:02 PM Stephen Colebourne <[email protected]>
> wrote:
>
> > Hi all,
> > I'm trying to use the parquet-mr library to set format_version=2.6 (or
> > higher).
> >
> > When I review a file that is produced by the library, it appears that
> > the version is set to 1.0. Looking at the code in
> > org.apache.parquet.hadoop.ParquetFileWriter CURRENT_VERSION is hard
> > coded to 1.0.
> >
> > Is it a bug to hard code the version there? Am I missing something
> > obvious to select the format_version?
> >
> > thanks
> > Stephen
> >
>
