RussellSpitzer commented on a change in pull request #3757:
URL: https://github.com/apache/iceberg/pull/3757#discussion_r772401871
##########
File path: spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java
##########
@@ -174,6 +174,80 @@ public void testRewriteDataFilesWithFilterOnPartitionTable() {
assertEquals("Data after compaction should not change", expectedRecords,
actualRecords);
}
+ @Test
+ public void testRewriteDataFilesWithInFilterOnPartitionTable() {
+ createPartitionTable();
+ // create 5 files for each partition (c2 = 'foo' and c2 = 'bar')
+ insertData(10);
+ List<Object[]> expectedRecords = currentData();
+
+ // select only 5 files for compaction (files in the partition c2 in ('bar'))
+ List<Object[]> output = sql(
+ "CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c2 in (\"bar\")')", catalogName, tableIdent);
+
+ assertEquals("Action should rewrite 5 data files from single matching
partition" +
+ "(containing c2 = bar) and add 1 data files",
+ ImmutableList.of(row(5, 1)),
+ output);
+
+ List<Object[]> actualRecords = currentData();
+ assertEquals("Data after compaction should not change", expectedRecords,
actualRecords);
+ }
+
+ @Test
+ public void testRewriteDataFilesWithAllPossibleFilters() {
+ createPartitionTable();
+ // create 5 files for each partition (c2 = 'foo' and c2 = 'bar')
+ insertData(10);
+
+ // Pass literal values that are not present in the data files,
+ // so that parsing can be tested on the same dataset without actually compacting the files.
+
+ // EqualTo
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c1 = 3')", catalogName, tableIdent);
+ // GreaterThan
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c1 > 3')", catalogName, tableIdent);
+ // GreaterThanOrEqual
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c1 >= 3')", catalogName, tableIdent);
+ // LessThan
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c1 < 0')", catalogName, tableIdent);
+ // LessThanOrEqual
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c1 <= 0')", catalogName, tableIdent);
+ // In
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c1 in (3,4,5)')", catalogName, tableIdent);
+ // IsNull
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c1 is null')", catalogName, tableIdent);
+ // IsNotNull
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c3 is not null')", catalogName, tableIdent);
+ // And
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c1 = 3 and c2 = \"bar\"')", catalogName, tableIdent);
+ // Or
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c1 = 3 or c1 = 5')", catalogName, tableIdent);
+ // Not
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
+ " where => 'c1 not in (1,2)')", catalogName, tableIdent);
+ // StringStartsWith
+ sql("CALL %s.system.rewrite_data_files(table => '%s'," +
Review comment:
We chatted about this online:
a) This requires us to use optimization rules in Spark, since the `LikeSimplification` rule is what converts `like` into `startsWith`.
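For illustration only (this call is hypothetical and not part of the diff above; it assumes the same `sql(...)` helper and fixtures as the tests), a trailing-wildcard LIKE predicate would only reach the scan as a `StringStartsWith` filter after `LikeSimplification` rewrites it:

```java
// Hypothetical test call, not part of this PR's diff. Spark's
// LikeSimplification optimizer rule rewrites a LIKE with only a trailing
// wildcard into startsWith, so this predicate is pushed down as
// StringStartsWith only when the optimizer rules are applied.
sql("CALL %s.system.rewrite_data_files(table => '%s'," +
    " where => 'c2 like \"b%%\"')", catalogName, tableIdent);
// %% escapes to a literal % because the sql(...) helper formats the query
```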
b) There is a bug in our Spark code that causes unknown filters to crash our pushdown. It comes up whenever a Spark `Filter` is passed in that has no mapping in `SparkFilters`. I meant to document this last week but ran out of time.
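As a minimal sketch of the defensive handling (b) suggests (this is a hypothetical `SafeFilterConversion` helper, not the actual `SparkFilters` code), unknown filters could be skipped and left for Spark to evaluate after the scan rather than crashing the pushdown:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.spark.sql.sources.EqualTo;
import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.sources.IsNull;

// Hypothetical helper, not the real SparkFilters: convert the Filter types we
// recognize and skip the rest, so an unknown Spark Filter degrades to
// post-scan evaluation instead of failing the whole pushdown.
class SafeFilterConversion {
  static List<Expression> convertKnown(Filter[] filters) {
    List<Expression> converted = new ArrayList<>();
    for (Filter filter : filters) {
      if (filter instanceof EqualTo) {
        EqualTo eq = (EqualTo) filter;
        converted.add(Expressions.equal(eq.attribute(), eq.value()));
      } else if (filter instanceof IsNull) {
        converted.add(Expressions.isNull(((IsNull) filter).attribute()));
      }
      // Unknown filter types are intentionally left unconverted here.
    }
    return converted;
  }
}
```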