[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688029#comment-17688029 ]
ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

shangxinli commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104767997

##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }

+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+                HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   Do we do dedup?

> ParquetRewriter supports more than one input file
> -------------------------------------------------
>
>                 Key: PARQUET-2228
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2228
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-mr
>            Reporter: Gang Wu
>            Assignee: Gang Wu
>            Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this
> task is to support multiple input files and the rewriter merges them into a
> single one w/o some rewrite options specified.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
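
On the dedup question in the review comment above: if allOriginalCreatedBys is a plain list, merging several files written by the same writer would repeat the same created-by string once per input file. Below is a minimal sketch of one way to drop duplicates while keeping first-seen order before joining. The class CreatedByDedup and the method joinDistinct are hypothetical names for illustration and are not part of ParquetRewriter; the version strings are made-up sample values, and whether the PR ended up deduplicating this way is not stated in this message.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of parquet-mr.
class CreatedByDedup {

  // Joins created-by strings with "\n", keeping only the first occurrence
  // of each value. LinkedHashSet preserves insertion order, so the output
  // order follows the order of the input files.
  static String joinDistinct(List<String> createdBys) {
    Set<String> distinct = new LinkedHashSet<>(createdBys);
    return String.join("\n", distinct);
  }

  public static void main(String[] args) {
    // Made-up sample values standing in for getCreatedBy() results.
    List<String> createdBys = Arrays.asList(
        "parquet-mr version 1.12.3",
        "parquet-mr version 1.13.0",
        "parquet-mr version 1.12.3");
    // The third entry repeats the first, so only two lines are printed.
    System.out.println(joinDistinct(createdBys));
  }
}
```

An alternative under the same assumption would be to declare the collecting field itself as a LinkedHashSet, so that add() ignores repeats and the existing String.join call needs no change.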