rdblue commented on a change in pull request #3273:
URL: https://github.com/apache/iceberg/pull/3273#discussion_r732191990
##########
File path:
spark/v3.0/spark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java
##########
@@ -106,6 +119,68 @@ public void addDataUnpartitionedOrc() {
sql("SELECT * FROM %s ORDER BY id", tableName));
}
+
+  @Test
+  public void addAvroFile() throws Exception {
+    // The Spark session catalog cannot load metadata tables; it fails with:
+    // "The namespace in session catalog must have exactly one name part"
+    Assume.assumeFalse(catalogName.equals("spark_catalog"));
+
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "id", Types.LongType.get()),
+        Types.NestedField.optional(2, "data", Types.StringType.get()));
+
+    GenericRecord baseRecord = GenericRecord.create(schema);
+
+    ImmutableList.Builder<Record> builder = ImmutableList.builder();
+    builder.add(baseRecord.copy(ImmutableMap.of("id", 1L, "data", "a")));
+    builder.add(baseRecord.copy(ImmutableMap.of("id", 2L, "data", "b")));
+    List<Record> records = builder.build();
+
+    OutputFile file = Files.localOutput(temp.newFile());
+
+    DataWriter<Record> dataWriter = Avro.writeData(file)
+        .schema(schema)
+        .createWriterFunc(org.apache.iceberg.data.avro.DataWriter::create)
+        .overwrite()
+        .withSpec(PartitionSpec.unpartitioned())
+        .build();
+
+    try {
+      for (Record record : records) {
+        dataWriter.add(record);
+      }
+    } finally {
+      dataWriter.close();
+    }
+
+    String path = dataWriter.toDataFile().path().toString();
+
+    String createIceberg =
+        "CREATE TABLE %s (id Long, data String) USING iceberg";
+    sql(createIceberg, tableName);
+
+    Object result = scalarSql("CALL %s.system.add_files('%s', '`avro`.`%s`')",
+        catalogName, tableName, path);
Review comment:
This is actually dangerous, and we probably want to disallow it. We
should only import data files that do not have field IDs; otherwise the
field IDs may not match the table's and you could get strange behavior.
I'd prefer that the test use an Avro file written without Iceberg
support, to ensure it doesn't carry field IDs. Not a huge problem, but
eventually I think we should detect that the imported file has field
IDs and fail if they don't match the table schema's IDs.
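
For reference, a minimal sketch of writing such a file with the plain
Avro API, with no Iceberg involvement so the schema carries no field
IDs. It reuses the test's temp folder; the record and variable names
here are illustrative, not from this PR:

    import java.io.File;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;

    // Build a plain Avro schema: no Iceberg "field-id" properties attached.
    org.apache.avro.Schema avroSchema = SchemaBuilder.record("test_record")
        .fields()
        .requiredLong("id")
        .optionalString("data")
        .endRecord();

    // Write the same two rows with the stock Avro file writer.
    File avroFile = temp.newFile("unpartitioned.avro");
    DataFileWriter<GenericData.Record> writer =
        new DataFileWriter<>(new GenericDatumWriter<>(avroSchema));
    try {
      writer.create(avroSchema, avroFile);
      for (long id = 1; id <= 2; id++) {
        GenericData.Record record = new GenericData.Record(avroSchema);
        record.put("id", id);
        record.put("data", id == 1 ? "a" : "b");
        writer.append(record);
      }
    } finally {
      writer.close();
    }

Since Iceberg records its field IDs as a "field-id" property on each
Avro field, the eventual check could read that property from the
imported file's schema and fail when IDs are present but don't line up
with the table schema.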
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]