From the error message, it seems that the Parquet reader in AWS
Redshift is having trouble decoding delta-binary-packed-encoded
values. Have you tried other Parquet readers (e.g. the Python one)
to read the "corrupted" file? To work around it, you may need to
set parquet.writer.version to PARQUET_1_0 based on this doc [1]

[1]
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
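For reference, a minimal sketch of what I mean (untested; assumes you
write through a parquet-mr writer builder such as AvroParquetWriter --
adapt to whichever writer you actually use):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.column.ParquetProperties;

Configuration conf = new Configuration();
// Option 1: set the key on the Hadoop configuration passed to the writer.
conf.set("parquet.writer.version", "PARQUET_1_0");

// Option 2: set it directly on the writer builder:
// AvroParquetWriter.builder(path)
//     .withConf(conf)
//     .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_1_0)
//     .build();
```

With PARQUET_1_0 the writer should fall back to v1 page encodings
(e.g. plain/RLE-dictionary) instead of DELTA_BINARY_PACKED, which is
the encoding Redshift is complaining about.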

Best,
Gang

On Sat, Mar 16, 2024 at 7:19 AM Stephen Colebourne <[email protected]>
wrote:

> We are writing a file out with parquet-mr, but it fails to be read by
> AWS Redshift COPY. When the file is loaded and re-saved in Python, it works
> fine and imports OK. The first visible difference between the Java
> and Python files is the format version. I had hoped there would be a
> simple way to rule out the format version as the problematic thing.
>
> That said, the problem seems to be with dates and times; we write them as
> micros, but it isn't clear what the import needs/expects.
>
>      optional int64 calculationStartTime (TIMESTAMP(MICROS,true))
>
> COPY of the file into Redshift:
>
>  redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000',
> 'M': 'Spectrum Scan Error', 'D': "\n
> -----------------------------------------------\n  error:  Spectrum Scan
> Error\n  code:      15001\n  context:   File '
> https://s3.eu-west-1.amazonaws.com/bucket/results__calculation.parquet' is
> corrupt: error decoding delta-binary-packed-encoded value of type TIMESTAMP
> at offset 84\n  query:     8738249[child_sequence:4]\n  location:
> dory_util.cpp:1579\n  process:   worker_thread [pid=28293]\n
> -----------------------------------------------\n", 'F':
> '../src/sys/xen_execute.cpp', 'L': '12414', 'R': 'pg_throw'}
>
> Is there any documentation on the configuration you mention below? Could
> that have any impact on date columns?
>
> Any other suggestions welcome.
>
> Stephen
>
>
>
>
> On Fri, 15 Mar 2024, 16:07 Gang Wu, <[email protected]> wrote:
>
> > Hi Stephen,
> >
> > Thanks for raising the issue! You are right that the version written
> > by parquet-mr is always 1. This is something we need to fix. However,
> > IMHO, the community does not have a clear answer on the definition
> > of parquet format v2. Which feature in version 2.6 are you referring
> > to specifically? It seems that you don't have to bother with the
> > version; just set the config to enable it.
> >
> > Best,
> > Gang
> >
> > On Fri, Mar 15, 2024 at 6:02 PM Stephen Colebourne <[email protected]>
> > wrote:
> >
> > > Hi all,
> > > I'm trying to use the parquet-mr library to set format_version=2.6 (or
> > > higher).
> > >
> > > When I review a file that is produced by the library, it appears that
> > > the version is set to 1.0. Looking at the code in
> > > org.apache.parquet.hadoop.ParquetFileWriter CURRENT_VERSION is hard
> > > coded to 1.0.
> > >
> > > Is it a bug to hard code the version there? Am I missing something
> > > obvious to select the format_version?
> > >
> > > thanks
> > > Stephen
> > >
> >
>