[ 
https://issues.apache.org/jira/browse/DRILL-4255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091746#comment-15091746
 ] 

Khurram Faraaz edited comment on DRILL-4255 at 1/11/16 10:34 AM:
-----------------------------------------------------------------

In another similar scenario, the query below FAILS with an UNSUPPORTED_OPERATION error 
when there is an empty JSON file in the directory.

When that empty JSON file is removed and the same query is submitted again, 
it executes fine. So it seems to be related to how we handle a directory that 
contains many non-empty JSON files along with one or more empty JSON files.

{noformat}

[root@centos-01 ~]# hadoop fs -put empty.json /tmp/MD_332/

0: jdbc:drill:schema=dfs.tmp> select DISTINCT charKey FROM `MD_332`;
Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema 
changes

Fragment 3:0

[Error Id: 78dd06c7-1914-4ec1-88f0-7cb42234b357 on centos-01.qa.lab:31010] 
(state=,code=0)

# Remove empty.json file
[root@centos-01 ~]# hadoop fs -rmr /tmp/MD_332/empty.json

# Re-run the query; it runs fine once the empty.json file is removed.
0: jdbc:drill:schema=dfs.tmp> select DISTINCT charKey FROM `MD_332`;
+----------+
| charKey  |
+----------+
| MA       |
| WA       |
| WI       |
| AL       |
| AZ       |
| MD       |
| NV       |
| IN       |
| GA       |
| MO       |
| VT       |
| CA       |
| KY       |
| OH       |
| ND       |
| OK       |
| OR       |
| FL       |
| NM       |
| MS       |
| UT       |
| CT       |
| DE       |
| TN       |
| SC       |
| NH       |
| RI       |
| NJ       |
| ME       |
| MI       |
| LA       |
| CO       |
| ID       |
| VA       |
| PA       |
| KS       |
| NC       |
| MN       |
| WY       |
| IA       |
| NY       |
| NE       |
| MT       |
| WV       |
| IL       |
| TX       |
| SD       |
| HI       |
| AK       |
+----------+
49 rows selected (3.595 seconds)

{noformat}
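For reference, here is a minimal sketch of how to reproduce the same pattern from scratch (illustrative only; the directory name MD_repro and the single-row data are made up, and the failing output is not repeated here):

{noformat}
# Assumed setup: a directory under dfs.tmp with at least one non-empty
# JSON file plus one zero-byte JSON file.
echo '{"charKey":"MA"}' > nonempty.json
touch empty.json
hadoop fs -mkdir /tmp/MD_repro
hadoop fs -put nonempty.json empty.json /tmp/MD_repro/

# A DISTINCT (hash aggregate) over that directory should then hit the same
# "Hash aggregate does not support schema changes" error:
# 0: jdbc:drill:schema=dfs.tmp> select DISTINCT charKey FROM `MD_repro`;
{noformat}

It may also be worth checking whether the behavior changes with hash aggregation disabled (alter session set `planner.enable_hashagg` = false), since the error is raised from HashAggBatch; I have not verified whether the streaming aggregate path handles the empty file any better.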



> SELECT DISTINCT query over JSON data returns UNSUPPORTED OPERATION
> ------------------------------------------------------------------
>
>                 Key: DRILL-4255
>                 URL: https://issues.apache.org/jira/browse/DRILL-4255
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.4.0
>         Environment: CentOS
>            Reporter: Khurram Faraaz
>
> SELECT DISTINCT over MapR-FS generated audit logs (JSON files) results in an 
> unsupported operation error. An identical query over another set of JSON data 
> returns correct results.
> MapR Drill 1.4.0, commit ID : 9627a80f
> MapRBuildVersion : 5.1.0.36488.GA
> OS : CentOS x86_64 GNU/Linux
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select distinct t.operation from `auditlogs` t;
> Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema 
> changes
> Fragment 3:3
> [Error Id: 1233bf68-13da-4043-a162-cf6d98c07ec9 on example.com:31010] 
> (state=,code=0)
> {noformat}
> Stack trace from drillbit.log
> {noformat}
> 2016-01-08 11:35:35,093 [297060f9-1c7a-b32c-09e8-24b5ad863e73:frag:3:3] INFO  
> o.a.d.e.p.i.aggregate.HashAggBatch - User Error Occurred
> org.apache.drill.common.exceptions.UserException: UNSUPPORTED_OPERATION 
> ERROR: Hash aggregate does not support schema changes
> [Error Id: 1233bf68-13da-4043-a162-cf6d98c07ec9 ]
>         at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
>  ~[drill-common-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:144)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:132)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) 
> [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:93)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:94) 
> [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:256)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:250)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at java.security.AccessController.doPrivileged(Native Method) 
> [na:1.7.0_65]
>         at javax.security.auth.Subject.doAs(Subject.java:415) [na:1.7.0_65]
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>  [hadoop-common-2.7.0-mapr-1506.jar:na]
>          at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:250)
>  [drill-java-exec-1.4.0.jar:1.4.0]
>         at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.4.0.jar:1.4.0]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_65]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_65]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
> {noformat}
> Query plan for above query.
> {noformat}
> 00-00    Screen : rowType = RecordType(ANY operation): rowcount = 141437.16, 
> cumulative cost = {3.4100499276E7 rows, 1.69455861396E8 cpu, 0.0 io, 
> 1.2165858754560001E10 network, 2.7382234176000005E8 memory}, id = 7572
> 00-01      UnionExchange : rowType = RecordType(ANY operation): rowcount = 
> 141437.16, cumulative cost = {3.408635556E7 rows, 1.6944171768E8 cpu, 0.0 io, 
> 1.2165858754560001E10 network, 2.7382234176000005E8 memory}, id = 7571
> 01-01        Project(operation=[$0]) : rowType = RecordType(ANY operation): 
> rowcount = 141437.16, cumulative cost = {3.3944918400000006E7 rows, 
> 1.683102204E8 cpu, 0.0 io, 1.15865321472E10 network, 2.7382234176000005E8 
> memory}, id = 7570
> 01-02          HashAgg(group=[{0}]) : rowType = RecordType(ANY operation): 
> rowcount = 141437.16, cumulative cost = {3.3944918400000006E7 rows, 
> 1.683102204E8 cpu, 0.0 io, 1.15865321472E10 network, 2.7382234176000005E8 
> memory}, id = 7569
> 01-03            Project(operation=[$0]) : rowType = RecordType(ANY 
> operation): rowcount = 1414371.6, cumulative cost = {3.2530546800000004E7 
> rows, 1.569952476E8 cpu, 0.0 io, 1.15865321472E10 network, 
> 2.4892940160000002E8 memory}, id = 7568
> 01-04              HashToRandomExchange(dist0=[[$0]]) : rowType = 
> RecordType(ANY operation, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
> 1414371.6, cumulative cost = {3.2530546800000004E7 rows, 1.569952476E8 cpu, 
> 0.0 io, 1.15865321472E10 network, 2.4892940160000002E8 memory}, id = 7567
> 02-01                UnorderedMuxExchange : rowType = RecordType(ANY 
> operation, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 1414371.6, cumulative 
> cost = {3.1116175200000003E7 rows, 1.34365302E8 cpu, 0.0 io, 0.0 network, 
> 2.4892940160000002E8 memory}, id = 7566
> 03-01                  Project(operation=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)]) : rowType = RecordType(ANY 
> operation, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 1414371.6, cumulative 
> cost = {2.97018036E7 rows, 1.329509304E8 cpu, 0.0 io, 0.0 network, 
> 2.4892940160000002E8 memory}, id = 7565
> 03-02                    HashAgg(group=[{0}]) : rowType = RecordType(ANY 
> operation): rowcount = 1414371.6, cumulative cost = {2.8287432E7 rows, 
> 1.27293444E8 cpu, 0.0 io, 0.0 network, 2.4892940160000002E8 memory}, id = 7564
> 03-03                      Scan(groupscan=[EasyGroupScan 
> [selectionRoot=maprfs:/tmp/auditlogs, numFiles=31, columns=[`operation`], 
> files=[maprfs:/tmp/auditlogs/DBAudit.log-2015-12-30-001.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2015-12-28-002.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2015-12-31-001.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2016-01-06-003.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2015-12-28-002.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2015-12-28-001.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2015-12-30-001.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2015-12-28-003.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2015-12-31-002.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2016-01-04-001.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2016-01-06-001.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2015-12-28-003.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2015-12-31-002.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2016-01-06-003.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2015-12-31-003.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2016-01-06-001.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2016-01-03-001.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2015-12-31-001.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2015-12-29-001.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2015-12-28-004.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2016-01-01-001.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2015-12-28-004.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2015-12-29-001.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2015-12-28-001.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2016-01-01-001.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2016-01-06-004.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2016-01-06-004.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2016-01-06-002.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2016-01-07-001.json, 
> maprfs:/tmp/auditlogs/DBAudit.log-2016-01-06-002.json, 
> maprfs:/tmp/auditlogs/FSAudit.log-2016-01-08-001.json]]]) : rowType = 
> RecordType(ANY operation): rowcount = 1.4143716E7, cumulative cost = 
> {1.4143716E7 rows, 1.4143716E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 
> 7563
> {noformat}
> Here is another query that is exactly like the failing query reported above; 
> this one returns correct results, though.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select distinct t.key2 from `twoKeyJsn.json` t;
> +-------+
> | key2  |
> +-------+
> | d     |
> | c     |
> | b     |
> | 1     |
> | a     |
> | 0     |
> | k     |
> | m     |
> | j     |
> | h     |
> | e     |
> | n     |
> | g     |
> | f     |
> | l     |
> | i     |
> +-------+
> 16 rows selected (27.097 seconds)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
