mbeckerle commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1287322034


##########
common/src/main/java/org/apache/drill/common/Typifier.java:
##########
@@ -88,6 +96,40 @@ public class Typifier {
   // If a String contains any of these, try to evaluate it as an equation
   private static final char[] MathCharacters = new char[]{'+', '-', '/', '*', '='};
 
+  /**
+   * This function infers the Drill data type of unknown data.
+   * @param data The input text of unknown data type.
+   * @return A {@link MinorType} of the Drill data type.
+   */
+  public static MinorType typifyToDrill (String data) {
+    Entry<Class, String> result = Typifier.typify(data);
+    String dataType = result.getKey().getSimpleName();
+
+    // If the string is empty, return UNKNOWN

Review Comment:
   Makes perfect sense. 
   
   For XML you need XSD to know what's potentially repeating. 
   
   Sometimes that is easy because of minOccurs/maxOccurs.
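   
   As a rough sketch of that easy case: XSD defaults `maxOccurs` to 1, and the attribute is otherwise a non-negative integer or `unbounded`. The class and method names here are hypothetical, not part of Drill or Daffodil:
   
   ```java
   // Hypothetical helper for illustration only.
   final class OccursCheck {
     private OccursCheck() {}

     /** Returns true if an element with this maxOccurs value may repeat. */
     static boolean isPotentiallyRepeating(String maxOccurs) {
       if (maxOccurs == null) {
         return false; // attribute absent: XSD default is maxOccurs="1"
       }
       if ("unbounded".equals(maxOccurs)) {
         return true;
       }
       return Integer.parseInt(maxOccurs) > 1;
     }
   }
   ```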
   
   But there are also these "implied arrays".
   ```
   <element name="a" type="xs:int"/><!-- this is a[1] -->
   <element name="b" type="xs:int"/>
   <element name="a" type="xs:int"/><!-- this is a[2] -->
   ```
   That's allowed in both XSD and DFDL schemas (though I want to change Daffodil to issue a warning if you do this, because it is such a bad idea when representing structured data).
   
   The element 'a' looks like an array, in that you can index it. 
   
   I think for Drill there are just two columns, 'a' and 'b', but because 'a' is declared more than once, it is an implied array.
   
   Even just detecting this (and disallowing it for now) requires a more sophisticated metadata builder, which is what I'm working on now.
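   
   A minimal sketch of only the detection step, assuming the child element names of one sequence have already been collected in declaration order. `ImpliedArrayDetector` and `findImpliedArrays` are hypothetical names, not part of Drill or Daffodil, and a real metadata builder would have to walk the schema itself:
   
   ```java
   import java.util.HashSet;
   import java.util.LinkedHashSet;
   import java.util.List;
   import java.util.Set;

   // Hypothetical helper for illustration only.
   final class ImpliedArrayDetector {
     private ImpliedArrayDetector() {}

     /**
      * Returns the element names declared more than once among the
      * children of a single sequence, in order of first repetition.
      */
     static Set<String> findImpliedArrays(List<String> childNames) {
       Set<String> seen = new HashSet<>();
       Set<String> repeated = new LinkedHashSet<>();
       for (String name : childNames) {
         // add() returns false when the name was already declared earlier
         if (!seen.add(name)) {
           repeated.add(name);
         }
       }
       return repeated;
     }
   }
   ```
   
   For the a, b, a declarations above, `findImpliedArrays(List.of("a", "b", "a"))` would report 'a', so the builder could reject the schema for now.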
    
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
