Hi,

Just to complete the thread: you were indeed correct that downgrading
the version to PARQUET_1_0 was the solution that works with Redshift
COPY. Thanks!

Stephen
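P.S. For the archives, a minimal sketch of what the downgrade looks
like in code. This assumes the Avro binding (AvroParquetWriter); the
schema and output path are placeholders, and the same setting is also
exposed as the Hadoop config key parquet.writer.version:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.ParquetProperties.WriterVersion;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class WriteV1File {
        public static void main(String[] args) throws Exception {
            // Placeholder schema mirroring the column from this thread.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Row\",\"fields\":[{"
                + "\"name\":\"calculationStartTime\",\"type\":{"
                + "\"type\":\"long\",\"logicalType\":\"timestamp-micros\"}}]}");
            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("out.parquet"))
                    .withSchema(schema)
                    // PARQUET_1_0 sticks to v1 data pages and encodings,
                    // so the int64 timestamp column does not get
                    // delta-binary-packed encoded.
                    .withWriterVersion(WriterVersion.PARQUET_1_0)
                    .build()) {
                // ... write GenericRecord instances here ...
            }
        }
    }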
On Mon, 18 Mar 2024 at 01:40, Gang Wu <[email protected]> wrote:
>
> From the error message, it seems that the parquet reader in AWS
> Redshift was having trouble decoding delta-binary-packed-encoded
> values. Have you tried using other parquet readers (e.g. the Python
> one) to read the "corrupted" file? To work around it, you may need to
> set parquet.writer.version to PARQUET_1_0, as described in this doc [1].
>
> [1] https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
>
> Best,
> Gang
>
> On Sat, Mar 16, 2024 at 7:19 AM Stephen Colebourne <[email protected]> wrote:
>
> > We are writing a file out with parquet-mr, but it fails to be read by
> > AWS Redshift COPY. When the file is loaded and re-saved in Python, it
> > works just fine and imports OK. The first visible difference between
> > the Java and Python files is the format version. I had hoped there
> > would be a simple way to eliminate the format version as the
> > problematic thing.
> >
> > That said, the problem seems to be with dates and times. We write
> > them as micros, but it isn't clear what the import needs/expects.
> >
> > optional int64 calculationStartTime (TIMESTAMP(MICROS,true))
> >
> > COPY of the file into Redshift:
> >
> > redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000',
> > 'M': 'Spectrum Scan Error', 'D': "\n
> > -----------------------------------------------\n error: Spectrum Scan
> > Error\n code: 15001\n context: File '
> > https://s3.eu-west-1.amazonaws.com/bucket/results__calculation.parquet' is
> > corrupt: error decoding delta-binary-packed-encoded value of type TIMESTAMP
> > at offset 84\n query: 8738249[child_sequence:4]\n location:
> > dory_util.cpp:1579\n process: worker_thread [pid=28293]\n
> > -----------------------------------------------\n", 'F':
> > '../src/sys/xen_execute.cpp', 'L': '12414', 'R': 'pg_throw'}
> >
> > Is there any documentation on the configuration you mention below?
> > Could it have any impact on date columns?
> >
> > Any other suggestions welcome.
> >
> > Stephen
> >
> > On Fri, 15 Mar 2024, 16:07 Gang Wu, <[email protected]> wrote:
> >
> > > Hi Stephen,
> > >
> > > Thanks for raising the issue! You are right that the version
> > > written by parquet-mr is always 1. This is something we need to
> > > fix. However, IMHO, the community does not have a clear answer on
> > > the definition of parquet format v2. Which feature are you
> > > referring to specifically in version 2.6? It seems that you don't
> > > need to bother with the version; just set the config to enable the
> > > feature you want.
> > >
> > > Best,
> > > Gang
> > >
> > > On Fri, Mar 15, 2024 at 6:02 PM Stephen Colebourne <[email protected]> wrote:
> > >
> > > > Hi all,
> > > > I'm trying to use the parquet-mr library to set format_version=2.6
> > > > (or higher).
> > > >
> > > > When I review a file produced by the library, it appears that the
> > > > version is set to 1.0. Looking at the code in
> > > > org.apache.parquet.hadoop.ParquetFileWriter, CURRENT_VERSION is
> > > > hard-coded to 1.0.
> > > >
> > > > Is it a bug to hard-code the version there? Am I missing something
> > > > obvious to select the format_version?
> > > >
> > > > thanks
> > > > Stephen
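P.P.S. Following Gang's suggestion to check the file with another
reader: one way to confirm which encodings actually ended up in a file
is to dump the footer metadata with parquet-mr itself. A minimal sketch
(the local file path is a placeholder); on a v2-written file, the
timestamp column lists DELTA_BINARY_PACKED, the encoding that Redshift
failed on:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class DumpEncodings {
        public static void main(String[] args) throws Exception {
            // Placeholder path: a local copy of the file uploaded to S3.
            Path path = new Path("results__calculation.parquet");
            try (ParquetFileReader reader = ParquetFileReader.open(
                    HadoopInputFile.fromPath(path, new Configuration()))) {
                ParquetMetadata footer = reader.getFooter();
                System.out.println("created_by: "
                    + footer.getFileMetaData().getCreatedBy());
                for (BlockMetaData block : footer.getBlocks()) {
                    for (ColumnChunkMetaData column : block.getColumns()) {
                        // A v2-written int64 column typically shows
                        // DELTA_BINARY_PACKED among its encodings.
                        System.out.println(column.getPath() + " -> "
                            + column.getEncodings());
                    }
                }
            }
        }
    }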
