paul-rogers commented on a change in pull request #1500: DRILL-6820: Msgpack format reader
URL: https://github.com/apache/drill/pull/1500#discussion_r231000081
 
 

 ##########
 File path: contrib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackSchema.java
 ##########
 @@ -0,0 +1,114 @@
+package org.apache.drill.exec.store.msgpack;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.exception.SchemaChangeRuntimeException;
+import org.apache.drill.exec.proto.UserBitShared.SerializedField;
+import org.apache.drill.exec.proto.UserBitShared.SerializedField.Builder;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.shaded.guava.com.google.common.base.Preconditions;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.security.AccessControlException;
+
+import com.google.protobuf.TextFormat;
+import com.google.protobuf.TextFormat.ParseException;
+
+public class MsgpackSchema {
+  public static final String SCHEMA_FILE_NAME = ".schema.proto";
+
+  @SuppressWarnings("unused")
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(MsgpackSchema.class);
+
+  private DrillFileSystem fileSystem;
+
+  public MsgpackSchema(DrillFileSystem fileSystem) {
+    this.fileSystem = fileSystem;
+  }
+
+  public MaterializedField load(Path schemaLocation) throws AccessControlException, FileNotFoundException, IOException {
+    MaterializedField previousMapField = null;
+    if (schemaLocation != null && fileSystem.exists(schemaLocation)) {
+      try (FSDataInputStream in = fileSystem.open(schemaLocation)) {
+        String schemaData = IOUtils.toString(in);
+        Builder newBuilder = SerializedField.newBuilder();
+        try {
+          TextFormat.merge(schemaData, newBuilder);
+        } catch (ParseException e) {
+          throw new DrillRuntimeException("Failed to read schema file: " + schemaLocation, e);
+        }
+        SerializedField read = newBuilder.build();
+        previousMapField = MaterializedField.create(read);
+      }
+    }
+    return previousMapField;
+  }
+
+  public void save(MaterializedField mapField, Path schemaLocation) throws IOException {
+    try (FSDataOutputStream out = fileSystem.create(schemaLocation, true)) {
+      SerializedField serializedMapField = mapField.getSerializedField();
+      String data = TextFormat.printToString(serializedMapField);
+      IOUtils.write(data, out);
+    }
+  }
+
+  public MaterializedField merge(MaterializedField existingField, MaterializedField newField) {
 
 Review comment:
   This merges field c into map m {a, b}.
   
   Does this code handle merging of types? Suppose c is a BIGINT in file 1 and 
a DOUBLE in file 2. What is the merged type? DOUBLE?
   
   More broadly, does this mechanism handle cross-file merges? A scan operator 
can scan file 1, file 2, and file 3 as three distinct readers, but it must 
return a consistent schema. (Indeed, file 2 must return the same vectors as 
file 1.)
   
   The ResultSetLoader has a large amount of code to handle all this; I'm a bit 
skeptical that all those cases are handled here. Nor am I suggesting that you 
rewrite all that code.
   
   What you may want to do is:
   
   1. Understand the complex requirements when a single scan handles multiple 
files, when a query has multiple scans, and when those scans are distributed.
   2. Understand the kinds of schema evolution that could occur.
   3. Look at the implementation of the ResultSetLoader and decide if you want 
to rewrite that code (maybe you can find a simpler solution), or if you want to 
leverage the existing code.
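
   To make the BIGINT/DOUBLE question concrete, here is a minimal sketch of one 
possible widening rule applied across files. It is plain Java, not Drill code: 
the ColType enum, the mergeType and mergeSchema helpers, and the rule that any 
two numeric types widen to DOUBLE are all assumptions made for illustration. 
It does not use MinorType, MaterializedField, or the ResultSetLoader.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Drill.
final class SchemaMergeSketch {

  // Stand-in for a column type; deliberately NOT Drill's MinorType.
  enum ColType { BIGINT, DOUBLE, VARCHAR }

  // Assumed rule: equal types merge to themselves, two different numeric
  // types widen to DOUBLE, and anything else is reported as a conflict.
  static ColType mergeType(ColType a, ColType b) {
    if (a == b) {
      return a;
    }
    boolean bothNumeric = a != ColType.VARCHAR && b != ColType.VARCHAR;
    if (bothNumeric) {
      return ColType.DOUBLE;
    }
    throw new IllegalStateException("Incompatible types: " + a + " vs " + b);
  }

  // Merges the columns seen in a new file into the schema seen so far.
  // New columns are appended; existing columns go through mergeType.
  static Map<String, ColType> mergeSchema(Map<String, ColType> existing,
                                          Map<String, ColType> incoming) {
    Map<String, ColType> merged = new LinkedHashMap<>(existing);
    for (Map.Entry<String, ColType> e : incoming.entrySet()) {
      merged.merge(e.getKey(), e.getValue(), SchemaMergeSketch::mergeType);
    }
    return merged;
  }
}
```

   Even this toy version shows the hard part: once file 2 widens c to DOUBLE, 
any batch already produced from file 1 with c as BIGINT no longer matches the 
merged schema, which is the consistency problem described above.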

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
