Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

via GitHub Tue, 08 Aug 2023 08:31:28 -0700


cgivre commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1287295957



##########
common/src/main/java/org/apache/drill/common/Typifier.java:
##########
@@ -88,6 +96,40 @@ public class Typifier {
   // If a String contains any of these, try to evaluate it as an equation
   private static final char[] MathCharacters = new char[]{'+', '-', '/', '*', '='};
 
+  /**
+   * This function infers the Drill data type of unknown data.
+   * @param data The input text of unknown data type.
+   * @return A {@link MinorType} of the Drill data type.
+   */
+  public static MinorType typifyToDrill (String data) {
+    Entry<Class, String> result = Typifier.typify(data);
+    String dataType = result.getKey().getSimpleName();
+
+    // If the string is empty, return UNKNOWN

Review Comment:
   @mbeckerle Drill doesn't really have an `UNKNOWN` data type. When the 
typifier can't determine the data type, it falls back to a string (`VARCHAR`), 
which can hold any value.
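   
   To make the fallback concrete, here is a minimal sketch in plain Java. It is not Drill's `Typifier`: the `SketchType` enum and the `infer` method are hypothetical stand-ins, and the set of types it tries is an assumption. The only point is the shape of the logic: try the narrower types first, and return a string type when nothing else parses, including for empty input.
   
   ```java
   import java.math.BigDecimal;

   // Hypothetical sketch, NOT Drill's Typifier implementation.
   final class InferenceFallbackSketch {

     // Stand-in for a few Drill minor types (assumed subset).
     enum SketchType { BIT, BIGINT, FLOAT8, VARCHAR }

     static SketchType infer(String data) {
       if (data == null) {
         return SketchType.VARCHAR;
       }
       String s = data.trim();
       if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false")) {
         return SketchType.BIT;
       }
       try {
         Long.parseLong(s);
         return SketchType.BIGINT;
       } catch (NumberFormatException ignored) {
         // Not an integer; try the next candidate type.
       }
       try {
         new BigDecimal(s);
         return SketchType.FLOAT8;
       } catch (NumberFormatException ignored) {
         // Not a decimal number either.
       }
       // Fallback: a string can hold anything, including "".
       return SketchType.VARCHAR;
     }

     public static void main(String[] args) {
       System.out.println(infer("42"));
       System.out.println(infer("3.14"));
       System.out.println(infer(""));
     }
   }
   ```
   
   Because the string type is the final branch, there is never a need for an `UNKNOWN` result: empty or unparseable input simply ends up as `VARCHAR`.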
   
   Regarding the lists: to create a list, you have to set the data mode to 
`REPEATED`. The problem with XML is that there's no reliable way to know 
whether a field is repeated. Consider this:
   
   ```xml
   <book>
     <author>a</author>
   </book>
   <book>
     <author>a1</author>
     <author>a2</author>
   </book>
   ```
   
   Since Drill uses a streaming reader, when it first encounters the `author` 
field, it adds an entry for a `VARCHAR` field. However, when it gets to the 
second `book` record, `author` should be a list, but there's no way to know 
that without a schema.
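   
   As a rough illustration of the ordering problem, the sketch below walks the same two `book` records with the JDK's StAX `XMLStreamReader` and counts the `author` elements in each. This is not the XML format plugin's reader: the class name, the `authorCountsPerBook` helper, and the wrapping `<books>` root element (added only so the fragment parses as a single document) are all assumptions made for the example.
   
   ```java
   import java.io.StringReader;
   import java.util.ArrayList;
   import java.util.List;
   import javax.xml.stream.XMLInputFactory;
   import javax.xml.stream.XMLStreamConstants;
   import javax.xml.stream.XMLStreamException;
   import javax.xml.stream.XMLStreamReader;

   // Hypothetical sketch, NOT the Drill XML reader.
   final class RepeatedFieldSketch {

     // The fragment from above, wrapped in an assumed <books> root.
     static final String SAMPLE =
         "<books>"
         + "<book><author>a</author></book>"
         + "<book><author>a1</author><author>a2</author></book>"
         + "</books>";

     // Returns the number of <author> elements seen in each <book>.
     static List<Integer> authorCountsPerBook(String xml) {
       List<Integer> counts = new ArrayList<>();
       try {
         XMLStreamReader reader = XMLInputFactory.newInstance()
             .createXMLStreamReader(new StringReader(xml));
         int current = 0;
         while (reader.hasNext()) {
           int event = reader.next();
           if (event == XMLStreamConstants.START_ELEMENT) {
             String name = reader.getLocalName();
             if ("book".equals(name)) {
               current = 0;          // a new record starts
             } else if ("author".equals(name)) {
               current++;            // only now do we learn of repeats
             }
           } else if (event == XMLStreamConstants.END_ELEMENT
               && "book".equals(reader.getLocalName())) {
             counts.add(current);
           }
         }
         reader.close();
       } catch (XMLStreamException e) {
         throw new IllegalStateException("Could not parse XML", e);
       }
       return counts;
     }

     public static void main(String[] args) {
       // prints [1, 2]
       System.out.println(authorCountsPerBook(SAMPLE));
     }
   }
   ```
   
   After the first record, a streaming reader has seen `author` exactly once, so nothing distinguishes it from a scalar field. The second occurrence only shows up in the second record, after the column has already been created.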
   
   With JSON we don't have this problem because JSON marks lists explicitly with `[`.
    
   Does that make sense?
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
