[ https://issues.apache.org/jira/browse/DRILL-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115064#comment-16115064 ]
Jinfeng Ni edited comment on DRILL-5464 at 8/4/17 10:52 PM:
------------------------------------------------------------

I ran the above query with the patch for DRILL-5546, the umbrella JIRA for schema change issues related to NULL datasets. The query finished successfully in multiple runs.

{code}
select stars, count(*) as cnt from dfs.tmp.yelp group by stars;
+--------+---------+
| stars  |   cnt   |
+--------+---------+
| 2      | 102737  |
| 1      | 110772  |
| 4      | 342143  |
| 5      | 406045  |
| 3      | 163761  |
+--------+---------+
{code}

Physical plan for the query:

{code}
00-00    Screen
00-01      Project(stars=[$0], cnt=[$1])
00-02        UnionExchange
01-01          HashAgg(group=[{0}], cnt=[$SUM0($1)])
01-02            Project(stars=[$0], cnt=[$1])
01-03              HashToRandomExchange(dist0=[[$0]])
02-01                UnorderedMuxExchange
03-01                  Project(stars=[$0], cnt=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
03-02                    HashAgg(group=[{0}], cnt=[COUNT()])
03-03                      Scan(groupscan=[EasyGroupScan [selectionRoot=file:/tmp/yelp, numFiles=2, columns=[`stars`], files=[file:/tmp/yelp/empty.json, file:/tmp/yelp/yelp_academic_dataset_review.json]]])
{code}


> Fix JSON reader when it deals with empty file
> ---------------------------------------------
>
>                 Key: DRILL-5464
>                 URL: https://issues.apache.org/jira/browse/DRILL-5464
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>
> An empty JSON file is one that contains no JSON objects. If we query an
> empty JSON file and ask it to return column 'A', Drill's JSON record reader
> returns a batch with 0 rows and reports column 'A' as a nullable int column.
> A better name for such a column might be "phantom column", as the record
> reader has no knowledge of the column's schema, and the nullable int type
> is just a guess.
>
> However, that processing can introduce many issues. Consider a directory
> consisting of multiple JSON files, at least one of which is empty.
> If column 'A' is returned as a nullable int column by the reader over the
> empty file, while the other JSON files contain a real typed column 'A',
> the query can hit several issues, including 1) SchemaChangeException,
> 2) failure in an operator that does not detect the schema change, or
> 3) an incorrect query result, since the run-time code is generated over a
> phantom column type, not the real type.
>
> For instance, the following query against the yelp JSON file runs
> successfully.
> {code}
> select count(*), stars from
> dfs.`/tmp/yelp/yelp_academic_dataset_review.json` group by stars;
> {code}
>
> If an empty JSON file is added to the directory, the query fails with the
> following error (which falls into the 2nd category: PartitionSender did
> not detect the schema change properly).
> {code}
> select count(*), stars from dfs.`/tmp/yelp` group by stars;
> Error: SYSTEM ERROR: IllegalStateException: Failure while reading vector.
> Expected vector class of org.apache.drill.exec.vector.NullableIntVector but
> was holding vector class org.apache.drill.exec.vector.NullableBigIntVector,
> field= stars(BIGINT:OPTIONAL)[$bits$(UINT1:REQUIRED), stars(BIGINT:OPTIONAL)]
> {code}
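
For readers who want to see the failure mode from the issue description in isolation, here is a toy Python model of it. This is only a sketch and not Drill code: `Batch`, `read_json_file`, `find_schema_conflict`, and `drop_phantom_batches` are hypothetical names made up for illustration. The only details taken from the issue are that the reader guesses a nullable int type for an empty file and that the real `stars` column is held in a nullable BIGINT vector. The last helper shows one conceivable mitigation; it is an assumption and may not match what the DRILL-5546 patch actually does.

```python
from dataclasses import dataclass
from typing import List, Optional


# Toy model only: none of these names exist in Drill.
@dataclass(frozen=True)
class Batch:
    source: str        # file the batch was read from
    column: str        # requested column name
    vector_type: str   # type the reader reports for that column
    row_count: int


def read_json_file(source: str, column: str, values: List[int]) -> Batch:
    """Mimic the behaviour described in the issue: a file with no JSON
    objects yields a 0-row batch whose column type is only a guess."""
    if not values:
        # Phantom column: the reader knows nothing, so it guesses nullable int.
        return Batch(source, column, "NullableIntVector", 0)
    # Real data: per the error message, the column is BIGINT:OPTIONAL.
    return Batch(source, column, "NullableBigIntVector", len(values))


def find_schema_conflict(batches: List[Batch]) -> Optional[str]:
    """Return a message for the first batch whose vector type differs
    from the first batch's type, or None if all batches agree."""
    if not batches:
        return None
    expected = batches[0].vector_type
    for batch in batches[1:]:
        if batch.vector_type != expected:
            return (f"Expected vector class {expected} but "
                    f"{batch.source} holds {batch.vector_type}")
    return None


def drop_phantom_batches(batches: List[Batch]) -> List[Batch]:
    """Assumed mitigation (hypothetical): ignore 0-row batches so that
    guessed types never reach downstream operators."""
    return [b for b in batches if b.row_count > 0]


if __name__ == "__main__":
    batches = [
        read_json_file("empty.json", "stars", []),
        read_json_file("yelp_academic_dataset_review.json", "stars", [5, 4, 3]),
    ]
    print(find_schema_conflict(batches))
    print(find_schema_conflict(drop_phantom_batches(batches)))
```

The first call reports a mismatch between the guessed and the real vector type, which mirrors the category-2 error above. Once the 0-row batch is filtered out, the second call finds no conflict.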