nastra commented on code in PR #4325:
URL: https://github.com/apache/iceberg/pull/4325#discussion_r1129265912


##########
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java:
##########
@@ -406,6 +408,76 @@ public void testMigrateUnpartitioned() throws Exception {
     assertMigratedFileCount(SparkActions.get().migrateTable(source), source, 
dest);
   }
 
+  @Test
+  public void testMigrateSkipOnError() throws Exception {
+    Assume.assumeTrue("Cannot migrate to a hadoop based catalog", 
!type.equals("hadoop"));
+    Assume.assumeTrue(
+        "Can only migrate from Spark Session Catalog", 
catalog.name().equals("spark_catalog"));
+    String source = sourceName("test_migrate_skip_on_error_table");
+    String dest = source;
+
+    File location = temp.newFolder();
+    spark.sql(String.format(CREATE_PARQUET, source, location));
+    CatalogTable table = loadSessionTable(source);
+    Seq<String> partitionColumns = table.partitionColumnNames();
+    String format = table.provider().get();
+
+    spark
+        .table(baseTableName)
+        .write()
+        .mode(SaveMode.Append)
+        .format(format)
+        .partitionBy(partitionColumns.toSeq())
+        .saveAsTable(source);
+
+    spark
+        .table(baseTableName)
+        .write()
+        .mode(SaveMode.Append)
+        .format(format)
+        .partitionBy(partitionColumns.toSeq())
+        .saveAsTable(source);
+
+    List<File> expectedFiles = 
expectedFiles(source).collect(Collectors.toList());
+
+    Assert.assertEquals("Expected number of source files", 2, 
expectedFiles.size());
+
+    // Corrupt the second file
+    File file = expectedFiles.get(1);
+    Assume.assumeTrue("Delete source file!", file.delete());
+    Assume.assumeTrue("Create a empty source file!", file.createNewFile());
+
+    MigrateTable migrateAction = SparkActions.get().migrateTable(source);
+
+    AssertHelpers.assertThrows(
+        "Expected an exception",
+        RuntimeException.class,
+        "not a Parquet file (length is too low: 0)",
+        migrateAction::execute);
+
+    // skip files which cannot be imported into Iceberg
+    migrateAction = SparkActions.get().migrateTable(source).skipOnError();
+
+    MigrateTable.Result migratedFiles = migrateAction.execute();
+    validateTables(source, dest);
+
+    SparkTable destTable = loadTable(dest);
+    Assert.assertEquals(
+        "Provider should be iceberg",
+        "iceberg",
+        destTable.properties().get(TableCatalog.PROP_PROVIDER));
+    List<Row> actual = spark.table(dest).collectAsList();
+
+    Assert.assertEquals(

Review Comment:
   it would also be good to apply this in all the new tests, as it makes 
understanding test failures much easier



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to