Hi,

Just to complete the thread: you were indeed correct that downgrading
the version to PARQUET_1_0 was the solution that works with Redshift
COPY. Thanks!

Stephen
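P.S. For the archives, a minimal sketch of what the downgrade looks
like in code. This assumes the Avro binding (AvroParquetWriter); the
schema and output path are placeholders, and the same setting is also
exposed as the Hadoop config key parquet.writer.version:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.ParquetProperties.WriterVersion;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class WriteV1File {
        public static void main(String[] args) throws Exception {
            // Placeholder schema mirroring the column from this thread.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Row\",\"fields\":[{"
                + "\"name\":\"calculationStartTime\",\"type\":{"
                + "\"type\":\"long\",\"logicalType\":\"timestamp-micros\"}}]}");
            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("out.parquet"))
                    .withSchema(schema)
                    // PARQUET_1_0 sticks to v1 data pages and encodings,
                    // so the int64 timestamp column does not get
                    // delta-binary-packed encoded.
                    .withWriterVersion(WriterVersion.PARQUET_1_0)
                    .build()) {
                // ... write GenericRecord instances here ...
            }
        }
    }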
On Mon, 18 Mar 2024 at 01:40, Gang Wu <[email protected]> wrote:
>
> From the error message, it seems that the parquet reader in AWS
> Redshift was having trouble decoding delta-binary-packed-encoded
> values. Have you tried using other parquet readers (e.g. the Python
> one) to read the "corrupted" file? To work around it, you may need to
> set parquet.writer.version to PARQUET_1_0, as described in this doc [1].
>
> [1] https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
>
> Best,
> Gang
>
> On Sat, Mar 16, 2024 at 7:19 AM Stephen Colebourne <[email protected]> wrote:
>
> > We are writing a file out with parquet-mr, but it fails to be read by
> > AWS Redshift COPY. When the file is loaded and re-saved in Python, it
> > works just fine and imports OK. The first visible difference between
> > the Java and Python files is the format version. I had hoped there
> > would be a simple way to eliminate the format version as the
> > problematic thing.
> >
> > That said, the problem seems to be with dates and times. We write
> > them as micros, but it isn't clear what the import needs/expects.
> >
> > optional int64 calculationStartTime (TIMESTAMP(MICROS,true))
> >
> > COPY of the file into Redshift:
> >
> > redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000',
> > 'M': 'Spectrum Scan Error', 'D': "\n
> > -----------------------------------------------\n error: Spectrum Scan
> > Error\n code: 15001\n context: File '
> > https://s3.eu-west-1.amazonaws.com/bucket/results__calculation.parquet' is
> > corrupt: error decoding delta-binary-packed-encoded value of type TIMESTAMP
> > at offset 84\n query: 8738249[child_sequence:4]\n location:
> > dory_util.cpp:1579\n process: worker_thread [pid=28293]\n
> > -----------------------------------------------\n", 'F':
> > '../src/sys/xen_execute.cpp', 'L': '12414', 'R': 'pg_throw'}
> >
> > Is there any documentation on the configuration you mention below?
> > Could it have any impact on date columns?
> >
> > Any other suggestions welcome.
> >
> > Stephen
> >
> > On Fri, 15 Mar 2024, 16:07 Gang Wu, <[email protected]> wrote:
> >
> > > Hi Stephen,
> > >
> > > Thanks for raising the issue! You are right that the version
> > > written by parquet-mr is always 1. This is something we need to
> > > fix. However, IMHO, the community does not have a clear answer on
> > > the definition of parquet format v2. Which feature are you
> > > referring to specifically in version 2.6? It seems that you don't
> > > need to bother with the version; just set the config to enable the
> > > feature you want.
> > >
> > > Best,
> > > Gang
> > >
> > > On Fri, Mar 15, 2024 at 6:02 PM Stephen Colebourne <[email protected]> wrote:
> > >
> > > > Hi all,
> > > > I'm trying to use the parquet-mr library to set format_version=2.6
> > > > (or higher).
> > > >
> > > > When I review a file produced by the library, it appears that the
> > > > version is set to 1.0. Looking at the code in
> > > > org.apache.parquet.hadoop.ParquetFileWriter, CURRENT_VERSION is
> > > > hard-coded to 1.0.
> > > >
> > > > Is it a bug to hard-code the version there? Am I missing something
> > > > obvious to select the format_version?
> > > >
> > > > thanks
> > > > Stephen
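P.P.S. Following Gang's suggestion to check the file with another
reader: one way to confirm which encodings actually ended up in a file
is to dump the footer metadata with parquet-mr itself. A minimal sketch
(the local file path is a placeholder); on a v2-written file, the
timestamp column lists DELTA_BINARY_PACKED, the encoding that Redshift
failed on:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class DumpEncodings {
        public static void main(String[] args) throws Exception {
            // Placeholder path: a local copy of the file uploaded to S3.
            Path path = new Path("results__calculation.parquet");
            try (ParquetFileReader reader = ParquetFileReader.open(
                    HadoopInputFile.fromPath(path, new Configuration()))) {
                ParquetMetadata footer = reader.getFooter();
                System.out.println("created_by: "
                    + footer.getFileMetaData().getCreatedBy());
                for (BlockMetaData block : footer.getBlocks()) {
                    for (ColumnChunkMetaData column : block.getColumns()) {
                        // A v2-written int64 column typically shows
                        // DELTA_BINARY_PACKED among its encodings.
                        System.out.println(column.getPath() + " -> "
                            + column.getEncodings());
                    }
                }
            }
        }
    }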
