the-other-tim-brown commented on code in PR #518:
URL: https://github.com/apache/incubator-xtable/pull/518#discussion_r1765955623


##########
xtable-core/src/test/java/org/apache/xtable/ITConversionController.java:
##########
@@ -797,15 +883,45 @@ private void checkDatasetEquivalence(
             // if count is not known ahead of time, ensure datasets are 
non-empty
             assertFalse(dataset1Rows.isEmpty());
           }
+          // Process UUID for the diff in Iceberg and Delta to make test smooth
+          // Iceberg supports UUIDs directly, while Delta represents them as 
binary
+          List<String> processedDataset1Rows = processUUID(dataset1Rows);
+          List<String> processedDataset2Rows = processUUID(dataset2Rows);
           assertEquals(
-              dataset1Rows,
-              dataset2Rows,
+              processedDataset1Rows,
+              processedDataset2Rows,
               String.format(
                   "Datasets are not equivalent when reading from Spark. 
Source: %s, Target: %s",
                   sourceFormat, format));
         });
   }
 
+  // converting Base64-encoded UUID fields from Delta to their UUID string 
representations for
+  // testing
+  private List<String> processUUID(List<String> rows) {
+    Gson gson = new Gson();
+    JsonParser parser = new JsonParser();

Review Comment:
   We use Jackson in other places for JSON; can we use that here instead, to avoid 
requiring another library in case our dependencies change in the future?



##########
xtable-core/src/test/java/org/apache/xtable/ITConversionController.java:
##########
@@ -797,15 +883,45 @@ private void checkDatasetEquivalence(
             // if count is not known ahead of time, ensure datasets are 
non-empty
             assertFalse(dataset1Rows.isEmpty());
           }
+          // Process UUID for the diff in Iceberg and Delta to make test smooth
+          // Iceberg supports UUIDs directly, while Delta represents them as 
binary
+          List<String> processedDataset1Rows = processUUID(dataset1Rows);
+          List<String> processedDataset2Rows = processUUID(dataset2Rows);
           assertEquals(
-              dataset1Rows,
-              dataset2Rows,
+              processedDataset1Rows,
+              processedDataset2Rows,
               String.format(
                   "Datasets are not equivalent when reading from Spark. 
Source: %s, Target: %s",
                   sourceFormat, format));
         });
   }
 
+  // converting Base64-encoded UUID fields from Delta to their UUID string 
representations for
+  // testing
+  private List<String> processUUID(List<String> rows) {
+    Gson gson = new Gson();
+    JsonParser parser = new JsonParser();
+    return rows.stream()
+        .map(
+            row -> {
+              JsonObject jsonObject = parser.parse(row).getAsJsonObject();
+              for (String key : jsonObject.keySet()) {
+                JsonElement value = jsonObject.get(key);
+                // Check if the key is "uuid_field" and if the value is Base64 
encoded
+                if (key.equals("uuid_field")
+                    && value.getAsString().matches("^[A-Za-z0-9+/]+={0,2}$")) {

Review Comment:
   `matches` will keep recompiling the regex here. We can compile the `Pattern` 
once and define it as a `static final` constant for reuse, which is slightly more 
efficient.



##########
xtable-core/src/test/java/org/apache/xtable/ITConversionController.java:
##########
@@ -797,15 +883,45 @@ private void checkDatasetEquivalence(
             // if count is not known ahead of time, ensure datasets are 
non-empty
             assertFalse(dataset1Rows.isEmpty());
           }
+          // Process UUID for the diff in Iceberg and Delta to make test smooth
+          // Iceberg supports UUIDs directly, while Delta represents them as 
binary
+          List<String> processedDataset1Rows = processUUID(dataset1Rows);
+          List<String> processedDataset2Rows = processUUID(dataset2Rows);
           assertEquals(
-              dataset1Rows,
-              dataset2Rows,
+              processedDataset1Rows,
+              processedDataset2Rows,
               String.format(
                   "Datasets are not equivalent when reading from Spark. 
Source: %s, Target: %s",
                   sourceFormat, format));
         });
   }
 
+  // converting Base64-encoded UUID fields from Delta to their UUID string 
representations for
+  // testing
+  private List<String> processUUID(List<String> rows) {
+    Gson gson = new Gson();
+    JsonParser parser = new JsonParser();
+    return rows.stream()
+        .map(
+            row -> {
+              JsonObject jsonObject = parser.parse(row).getAsJsonObject();
+              for (String key : jsonObject.keySet()) {
+                JsonElement value = jsonObject.get(key);
+                // Check if the key is "uuid_field" and if the value is Base64 
encoded
+                if (key.equals("uuid_field")
+                    && value.getAsString().matches("^[A-Za-z0-9+/]+={0,2}$")) {
+                  byte[] bytes = 
Base64.getDecoder().decode(value.getAsString());
+                  ByteBuffer bb = ByteBuffer.wrap(bytes);
+                  UUID uuid = new UUID(bb.getLong(), bb.getLong());
+                  // Replace the Base64 encoded value with the UUID string 
representation
+                  jsonObject.addProperty(key, uuid.toString());
+                }
+              }
+              return gson.toJson(jsonObject);

Review Comment:
   Just want to sanity-check that this will maintain the field ordering. If 
not, we could get some flaky tests in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to