Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-21 Thread via GitHub


cgivre merged PR #2819:
URL: https://github.com/apache/drill/pull/2819


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-21 Thread via GitHub


jnturton commented on PR #2819:
URL: https://github.com/apache/drill/pull/2819#issuecomment-1686562600

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-21 Thread via GitHub


cgivre commented on PR #2819:
URL: https://github.com/apache/drill/pull/2819#issuecomment-1686494732

   @mbeckerle @jnturton Are we ok to merge this?  I'll add support for arrays 
in a separate PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-19 Thread via GitHub


cgivre commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1299285670


##
contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpXmlOptions.java:
##
@@ -111,7 +111,7 @@ public String toString() {
   public static class HttpXmlOptionsBuilder {
 
 private int dataLevel;
-private boolean allTextMode;
+private Boolean allTextMode;

Review Comment:
   @mbeckerle 
   In the JSON reader there are two parameters: `allTextMode` and 
`readAllNumbersAsDouble`.  Both are boolean.For the XML reader, I chose not 
to implement the `readAllNumbersAsDouble` parameter because in practice, it 
requires very clean data.   From using Drill with clients, I can tell you from 
a lot of personal experience that this was one of the biggest data challenges.  
 For instance, you'd get data where there was an DOUBLE field and then there 
would be a row with zero denoted as `0`.   This would then cause schema change 
exceptions. 
   
   We have actually made significant improvements in Drill's implicit casting 
rules which do prevent a lot of schema change exceptions and as a result, IMHO, 
it makes distinguishing between INTs and DOUBLES a lot less important.  So.. 
out of laziness I decided it wasn't worth it.  I can be convinced otherwise.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-18 Thread via GitHub


mbeckerle commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1298815764


##
contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpXmlOptions.java:
##
@@ -111,7 +111,7 @@ public String toString() {
   public static class HttpXmlOptionsBuilder {
 
 private int dataLevel;
-private boolean allTextMode;
+private Boolean allTextMode;

Review Comment:
   I thought there were 3 modes: allTextMode, allNumbersAreDouble mode, and 
infer-types mode. 
   
   So why is this a boolean vs am enum?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-18 Thread via GitHub


cgivre commented on PR #2819:
URL: https://github.com/apache/drill/pull/2819#issuecomment-1684011222

   @mbeckerle Could you please take another look.  I had to fix a few things 
for a unit test.  Thx!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-13 Thread via GitHub


cgivre commented on PR #2819:
URL: https://github.com/apache/drill/pull/2819#issuecomment-1676572517

   @mbeckerle Unit tests fixed.  I also added the data type inference for APIs 
that generate XML.  
   @jnturton, The CI is still failing with that Kerberos issue. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-10 Thread via GitHub


cgivre commented on PR #2819:
URL: https://github.com/apache/drill/pull/2819#issuecomment-1673416654

   Converting to draft.  There's a unit test failing in the HTTP plugin.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-08 Thread via GitHub


mbeckerle commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1287322034


##
common/src/main/java/org/apache/drill/common/Typifier.java:
##
@@ -88,6 +96,40 @@ public class Typifier {
   // If a String contains any of these, try to evaluate it as an equation
   private static final char[] MathCharacters = new char[]{'+', '-', '/', '*', 
'='};
 
+  /**
+   * This function infers the Drill data type of unknown data.
+   * @param data The input text of unknown data type.
+   * @return A {@link MinorType} of the Drill data type.
+   */
+  public static MinorType typifyToDrill (String data) {
+Entry result = Typifier.typify(data);
+String dataType = result.getKey().getSimpleName();
+
+// If the string is empty, return UNKNOWN

Review Comment:
   Makes perfect sense. 
   
   For XML you need XSD to know what's potentially repeating. 
   
   Sometimes that is easy because of minOccurs/maxOccurs.
   
   But there's also these "implied arrays".
   ```
   
   
   
   ```
   That's allowed in both XSD and DFDL schemas (though I want to change 
Daffodil to issue a warning if you do this, because it is such a bad idea when 
representing structured data.)
   
   The element 'a' looks like an array, in that you can index it. 
   
   I think for drill there are just 2 columns: 'a', 'b',  but as there is more 
than one declaration for 'a', it is an implied array. 
   
   Even just detecting this (and disallowing it for now) requires a more 
sophisticated metadata builder which is what I'm working on now. 

   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-08 Thread via GitHub


cgivre commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1287295957


##
common/src/main/java/org/apache/drill/common/Typifier.java:
##
@@ -88,6 +96,40 @@ public class Typifier {
   // If a String contains any of these, try to evaluate it as an equation
   private static final char[] MathCharacters = new char[]{'+', '-', '/', '*', 
'='};
 
+  /**
+   * This function infers the Drill data type of unknown data.
+   * @param data The input text of unknown data type.
+   * @return A {@link MinorType} of the Drill data type.
+   */
+  public static MinorType typifyToDrill (String data) {
+Entry result = Typifier.typify(data);
+String dataType = result.getKey().getSimpleName();
+
+// If the string is empty, return UNKNOWN

Review Comment:
   @mbeckerle Drill doesn't really have an `UNKNOWN` data type.   The way the 
typifier works is that if it can't determine the datatype, it falls back to 
string which can basically accept anything.
   
   Regarding the lists...  The issue is that to create a list, you have to set 
the data mode to `REPEATED`.  The problem with XML is that there's no real way 
to know if a field is repeated or not.  Consider this:
   
   ```xml
   
   
 a
   
   
   a1
   a2
   
   ```
   
   Since Drill uses the streaming reader, when it first encounters the `author` 
field, it would add an entry for a VARCHAR field.  However, when it gets to the 
next author record, it should be list, but there's no way to really know that 
w/o a schema.  
   
   With JSON we don't have this problem because it uses `[` to denote lists. 

   Does that make sense?
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] DRILL-8450: Add Data Type Inference to XML Format Plugin (drill)

2023-08-08 Thread via GitHub


mbeckerle commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1287251884


##
contrib/format-xml/README.md:
##
@@ -15,12 +15,15 @@ The default configuration is shown below:
   "extensions": [
 "xml"
   ],
+  "allTextMode": true,
   "dataLevel": 2
 }
 ```
 
 ## Data Types
-All fields are read as strings.  Nested fields are read as maps.  Future 
functionality could include support for lists.
+The XML reader has an `allTextMode` which, when set to `true` reads all data 
fields as strings.
+When set to `false`, Drill will attempt to infer data types.
+Nested fields are read as maps.  Future functionality could include support 
for lists.

Review Comment:
   Not really part of this change set, but I don't know what you are suggesting 
by "future functionality could include support for lists." I'd like to 
understand that plan/idea just as part of grokking all of this XML mapping. 



##
common/src/main/java/org/apache/drill/common/Typifier.java:
##
@@ -88,6 +96,40 @@ public class Typifier {
   // If a String contains any of these, try to evaluate it as an equation
   private static final char[] MathCharacters = new char[]{'+', '-', '/', '*', 
'='};
 
+  /**
+   * This function infers the Drill data type of unknown data.
+   * @param data The input text of unknown data type.
+   * @return A {@link MinorType} of the Drill data type.
+   */
+  public static MinorType typifyToDrill (String data) {
+Entry result = Typifier.typify(data);
+String dataType = result.getKey().getSimpleName();
+
+// If the string is empty, return UNKNOWN

Review Comment:
   The next line of code contradicts this comment by returning VARCHAR. 
   (Unless VARCHAR == UNKNOWN, which is news to me.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org