[jira] [Created] (DRILL-7954) XML ability to not concatenate fields and attribute - change presentation of data
benj created DRILL-7954:
---
Summary: XML ability to not concatenate fields and attributes - change presentation of data
Key: DRILL-7954
URL: https://issues.apache.org/jira/browse/DRILL-7954
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.19.0
Reporter: benj

With an XML file containing this data:
{noformat}
<root>
  <attr>
    <set num="0" val="1">x</set>
    <set num="1" val="2">y</set>
  </attr>
  <attr>
    <set num="2" val="a">z</set>
    <set num="3" val="b">a</set>
  </attr>
</root>
{noformat}
{noformat}
apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>1)) as x;
+-----------------------------------------------+----------------+
| attributes                                    | attr           |
+-----------------------------------------------+----------------+
| {"attr_set_num":"0123","attr_set_val":"12ab"} | {"set":"xyza"} |
+-----------------------------------------------+----------------+

apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>2)) as x;
+---------------------------------+-----+
| attributes                      | set |
+---------------------------------+-----+
| {"set_num":"01","set_val":"12"} | xy  |
| {"set_num":"23","set_val":"ab"} | za  |
+---------------------------------+-----+

apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>3)) as x;
+------------+
| attributes |
+------------+
| {}         |
| {}         |
| {}         |
| {}         |
+------------+
{noformat}
Attribute and field values that share a name are concatenated, which leaves them unusable _(maybe the possibility of adding a separator would help, but that is not the point here)_.
What we really need is the ability to obtain something like the following _(depending on the dataLevel)_:
{noformat}
+---------------------------------------------------------------------------------------------------+
| attr                                                                                              |
+---------------------------------------------------------------------------------------------------+
| [{"set":"x","_attributes":{"num":"0","val":"1"}},{"set":"y","_attributes":{"num":"1","val":"2"}}] |
| [{"set":"z","_attributes":{"num":"2","val":"a"}},{"set":"a","_attributes":{"num":"3","val":"b"}}] |
+---------------------------------------------------------------------------------------------------+

+-------------------------------------------------+
| set                                             |
+-------------------------------------------------+
| {"set":"x","_attributes":{"num":"0","val":"1"}} |
| {"set":"y","_attributes":{"num":"1","val":"2"}} |
| {"set":"z","_attributes":{"num":"2","val":"a"}} |
| {"set":"a","_attributes":{"num":"3","val":"b"}} |
+-------------------------------------------------+
{noformat}
The _attributes fields could be generated at each level instead of being generated with the path from the top level; that would allow working with the data of each level without losing information.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
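The requested behaviour can be sketched outside Drill: keep each element's own attributes in an `_attributes` map on the element itself, instead of flattening them into a concatenated path from the top. A minimal Python sketch, assuming the XML shape shown above (the file content and `rows_at_level` helper are illustrative, not part of Drill's XML reader):

```python
# Sketch of per-level attribute nesting: one row per element at the
# requested level, with that element's own attributes kept alongside
# its text instead of being concatenated at the top level.
import xml.etree.ElementTree as ET

XML = """<root>
  <attr>
    <set num="0" val="1">x</set>
    <set num="1" val="2">y</set>
  </attr>
  <attr>
    <set num="2" val="a">z</set>
    <set num="3" val="b">a</set>
  </attr>
</root>"""

def rows_at_level(xml_text, data_level):
    """Return one row per element at data_level; each row keeps the
    element's text plus its own attributes under "_attributes"."""
    root = ET.fromstring(xml_text)
    level = [root]
    for _ in range(data_level - 1):
        level = [child for elem in level for child in elem]
    return [
        {e.tag: (e.text or "").strip(), "_attributes": dict(e.attrib)}
        for e in level
    ]

print(rows_at_level(XML, 3))
```

With `data_level=3` this yields one row per `set` element, e.g. `{"set": "x", "_attributes": {"num": "0", "val": "1"}}`, matching the second table requested above.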
[jira] [Commented] (DRILL-7935) Error: Hash join does not support schema changes in probe side.
[ https://issues.apache.org/jira/browse/DRILL-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367097#comment-17367097 ]

Raúl commented on DRILL-7935:
-

Hello [~Nisd]

With SELECT CAST(COMPANY_CODE AS VARCHAR) it keeps giving the same error, since Drill still detects that COMPANY_CODE is INT in the prior schema. Because the field is null in so many leading documents, Drill infers it as INT, and then when it finally reads a real value it sees VARCHAR and rejects the schema change.

Thanks

> Error: Hash join does not support schema changes in probe side.
> ---
>
> Key: DRILL-7935
> URL: https://issues.apache.org/jira/browse/DRILL-7935
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.18.0
> Reporter: Raúl
> Priority: Major
>
> Let's see if someone can help us, we have a problem:
> We have a collection in Mongo with 58,948 documents, in which a field of the
> collection is null in 58,270 of them and non-null in 714. When we join on that
> collection in a SELECT, it returns the following error:
> {code:java}
> SQL Error: UNSUPPORTED_OPERATION ERROR: Hash join does not support schema
> changes in probe side.
> Prior schema :
> BatchSchema [fields=[[`CITY_CODE` (VARCHAR:OPTIONAL)], [`COMPANY_CODE`
> (INT:OPTIONAL)], [`LOCATION_CODE0` (VARCHAR(65535):OPTIONAL)]],
> selectionVector=NONE]
> New schema :
> BatchSchema [fields=[[`CITY_CODE` (VARCHAR:OPTIONAL)], [`COMPANY_CODE`
> (VARCHAR:OPTIONAL)], [`LOCATION_CODE0` (VARCHAR(65535):OPTIONAL)]],
> selectionVector=NONE]
>
> Unsupported schema change
> Fragment: 0:0
> {code}
>
> The problem I observe is that because the field has so many nulls, Drill
> considers it INT, and when it joins, it sees that a value is VARCHAR and
> that is when it gives the error.
> 1) How can I solve this problem? I have changed different parameters
> related to Mongo and nothing makes it work.
> 2) Why does Drill consider null to be an INT? Should it not consider it
> VARCHAR instead?
>
> If I make a SELECT of only the COMPANY_CODE field and try to sort it, it
> gives me the following error:
> {code:java}
> SQL Error: INTERNAL_ERROR ERROR: You tried to write a VarChar type when you
> are using a ValueWriter of type NullableIntWriterImpl.
> Fragment: 0:0
> Please, refer to logs for more information.
> {code}
> Thank you!!
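The failure mode described above can be illustrated outside Drill: a schema-on-read engine that infers a column type per batch will assign an all-NULL column a default type (INT in Drill), and a later batch carrying a real string value produces a conflicting schema. A minimal sketch of that inference behaviour (Python, illustrative only; the names and batch sizes are assumptions, not Drill's actual code):

```python
# Sketch of per-batch type inference: an all-NULL column gets a default
# type (nullable INT in Drill); when a later batch carries a string
# value, the inferred type flips to VARCHAR and the two batch schemas
# no longer match -- the condition the hash join rejects.

DEFAULT_TYPE = "INT"  # fallback for a column seen only as NULL

def infer_batch_type(values):
    """Infer a column type for one batch of values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return DEFAULT_TYPE
    return "VARCHAR" if all(isinstance(v, str) for v in non_null) else "INT"

batch1 = [None] * 4            # leading documents: COMPANY_CODE is null
batch2 = [None, "ACME", None]  # later documents carry a string value

t1, t2 = infer_batch_type(batch1), infer_batch_type(batch2)
if t1 != t2:
    print(f"schema change: prior={t1}, new={t2}")
```

This mirrors the reported error: the "prior schema" typed COMPANY_CODE from null-only batches, and the "new schema" typed it from batches containing real values.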
[jira] [Commented] (DRILL-7931) Rowtype mismatch in DrillReduceAggregatesRule
[ https://issues.apache.org/jira/browse/DRILL-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367078#comment-17367078 ]

ASF GitHub Bot commented on DRILL-7931:
---

Leon-WTF edited a comment on pull request #2259:
URL: https://github.com/apache/drill/pull/2259#issuecomment-864843790

Thanks Paul for your detailed note. I reviewed the code more deeply, and below are my findings for your questions; please correct me if something is wrong.

1. There are two sum functions: SqlSumAggFunction and SqlSumEmptyIsZeroAggFunction. From TypeInferenceUtils.DrillSumSqlReturnTypeInference we can see that the return type of SqlSumAggFunction is non-nullable only when there is a GROUP BY and the input type is non-nullable; otherwise it is nullable (I don't know why). SqlSumAggFunction can be converted to SqlSumEmptyIsZeroAggFunction when the return type is non-nullable, or to "case count(c) when 0 then null else sum0(c) end" when the return type is nullable. The return type of SqlSumEmptyIsZeroAggFunction should always be non-nullable.
2. In RexBuilder.addAggCall we can see that it uses the equals and hashCode methods of AggregateCall to decide whether the call already exists in aggCallMapping, but the return type is not checked in those methods. That is why the code finds the nullable sum operator and throws away the non-nullable one.
3. When we call RelOptRuleCall.transformTo, it performs a row-type equality check between the new and the old RelNode; that is where the exception is thrown. The return type of the "bare" sum operation is non-nullable because there is a GROUP BY and the input type is non-nullable. The return type of the sum function derived from stddev is nullable because it is changed to nullable at line 516 of DrillReduceAggregatesRule (I don't know why).
I made some new modifications based on the discoveries above. I also think we should use DrillCalciteSqlAggFunctionWrapper consistently for Calcite's SqlAggFunction, to avoid inequality of the class type when comparing AggregateCall. We now have two choices to fix the bug: one solution, as in my previous modification, is to compare the return type of AggregateCall when checking whether it already exists; this results in two otherwise-identical functions that differ only in return type (which could actually be combined), but this solution has fewer side effects. The other solution, as in my current modification, is to make the sum0 operator's return type nullability always false (i.e. always non-nullable). Which one do you prefer? In addition, with GROUP BY sum returns no rows for empty input, while without GROUP BY it returns null when there are no rows.

> Thanks for finding and fixing this bug. It is a very difficult one. This area of code is complex.
>
> Revision to my earlier comment: I had assumed that the problem was in the runtime part of the code, since that is where this stuff usually fails. However, I double-checked the code that was changed. It turns out the fix is in the _planner_, not the runtime. So most of what I say below may not actually be relevant to this bug. (Going to leave it anyway as background.)
>
> My knowledge of the Calcite planner is not as deep as it should be, so I can't tell if the fix is correct. I'm inclined to approve the fix because it works. Also, a deeper investigation would require quite a bit of work.
>
> What I find confusing is that the type conflict can _only_ occur at runtime, since Drill is unique in that the planner does not know about data types (that is what "schema on read" is all about). So why would the planner do something that would cause the runtime (which does know about types) to throw an error? And why are the aggregate accumulators nullable (see below)?
>
> _Original note, somewhat edited:_
>
> The description suggests that the code is attempting to combine the sum operation for `stddev()` with the sum operation for `sum()`. I believe that doing so is a bug unless there was some specific optimization attempt to share common expressions. Normally the planner would perform this kind of optimization. The trick, for Drill, is that the planner does not know the data type, so it must be the runtime that handles such details. I rather doubt that the original developers had gotten to the level of optimization in code generation where they attempted to share subexpressions.
>
> However, since the fix is in the planner, I wonder if Calcite is attempting to do what it would normally do (combine subexpressions), but doing so is wrong for Drill because we don't yet know the types? The only thing we do know is that if two expressions both work on column `c`, then their types are the same, even if the planner does not know what the actual type will turn out to be.
>
> Then there is the issue of the nullable `bigint` data type. According to the [SQL Server docs](https://docs.microsoft.com/en-us/sql/t-sql/functions/stde
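The SUM / SUM0 distinction from point 1 above can be illustrated outside Drill: SUM yields NULL over zero input rows, SUM0 (SqlSumEmptyIsZeroAggFunction) yields 0, and the planner's rewrite "case count(c) when 0 then null else sum0(c) end" recovers SUM's nullable behaviour from the non-nullable SUM0. A minimal Python sketch of these semantics (the function names are illustrative, not Drill code):

```python
# SUM vs SUM0 semantics, and the planner's CASE rewrite that expresses
# nullable SUM in terms of the non-nullable SUM0 plus COUNT.

def sql_sum(values):
    """SUM semantics: NULL (None) over an empty or all-NULL input."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None

def sql_sum0(values):
    """SUM0 semantics: 0 over an empty or all-NULL input, never NULL."""
    return sum(v for v in values if v is not None)

def sum_via_sum0(values):
    """CASE COUNT(c) WHEN 0 THEN NULL ELSE SUM0(c) END."""
    count = sum(1 for v in values if v is not None)
    return None if count == 0 else sql_sum0(values)

# The rewrite is equivalent to plain SUM on every input.
for rows in ([], [None, None], [1, 2, None]):
    assert sql_sum(rows) == sum_via_sum0(rows)
```

This is why the two operators cannot simply share an AggregateCall: their return types differ in nullability even though they agree on every non-empty, non-all-NULL input.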