[ https://issues.apache.org/jira/browse/DRILL-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Altekruse updated DRILL-2677:
-----------------------------------
    Fix Version/s:     (was: 1.2.0)
                       1.3.0

> Query does not go beyond 4096 lines in small JSON files
> -------------------------------------------------------
>
>                 Key: DRILL-2677
>                 URL: https://issues.apache.org/jira/browse/DRILL-2677
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - JSON
>         Environment: drill 0.8 official build
>            Reporter: Alexander Reshetov
>            Assignee: Jason Altekruse
>             Fix For: 1.3.0
>
>         Attachments: dataset_4095_and_1.json, dataset_4096_and_1.json, dataset_sample.json.gz.part-aa, dataset_sample.json.gz.part-ab, dataset_sample.json.gz.part-ac, dataset_sample.json.gz.part-ad, dataset_sample.json.gz.part-ae, dataset_sample.json.gz.part-af
>
> Hello,
> I'm trying to execute the following query:
> {code}
> select * from (select source.pck, source.`timestamp`,
>     flatten(source.HostUpdateTypeNW.Transfers) as entry
>   from dfs.`/mnt/data/dataset_4095_and_1.json` as source) as parsed;
> {code}
> It works as expected and I get this result:
> {code}
> +------------+------------------+------------+
> |    pck     |    timestamp     |   entry    |
> +------------+------------------+------------+
> | 3547       | 1419807470286356 | {"TransferingPurpose":"8","TransferingImpact":"88","TransferingKind":"8","TransferingTime":"888888888","PackageOrigSenderID":"8","TransferingID":"88888","TransitCN":"888","PackageChkPnt":"8888","PackageFullSize":"8","TransferingSessionID":"8","SubpackagesCounter":"8"} |
> +------------+------------------+------------+
> 1 row selected (0.188 seconds)
> {code}
> This file contains 4095 identical lines of one JSON record, plus one different JSON record at the end (see the attached file dataset_4095_and_1.json).
> The problem appears when the first record repeats more than 4095 times: the query then fails with an exception. Here is the same query against a file with 4096 records of the first type plus 1 record of the other type (see the attached file dataset_4096_and_1.json).
> {code}
> select * from (select source.pck, source.`timestamp`,
>     flatten(source.HostUpdateTypeNW.Transfers) as entry
>   from dfs.`/mnt/data/dataset_4096_and_1.json` as source) as parsed;
> Exception in thread "2ae108ff-b7ea-8f07-054e-84875815d856:frag:0:0" java.lang.RuntimeException: Error closing fragment context.
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources(FragmentExecutor.java:224)
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:187)
> 	at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: org.apache.drill.exec.vector.NullableIntVector cannot be cast to org.apache.drill.exec.vector.RepeatedVector
> 	at org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.getFlattenFieldTransferPair(FlattenRecordBatch.java:274)
> 	at org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.setupNewSchema(FlattenRecordBatch.java:296)
> 	at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:78)
> 	at org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext(FlattenRecordBatch.java:122)
> 	at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
> 	at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
> 	at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:99)
> 	at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:89)
> 	at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
> 	at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
> 	at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
> 	at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
> 	at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:68)
> 	at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:96)
> 	at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:58)
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:163)
> 	... 4 more
> Query failed: RemoteRpcException: Failure while running fragment., org.apache.drill.exec.vector.NullableIntVector cannot be cast to org.apache.drill.exec.vector.RepeatedVector [ cb6c7914-438f-440a-9c74-fe39130feca9 on testlab-broker:31010 ]
> [ cb6c7914-438f-440a-9c74-fe39130feca9 on testlab-broker:31010 ]
> Error: exception while executing query: Failure while executing query. (state=,code=0)
> {code}
> It means that Drill stops analyzing the schema after exactly 4096 lines, which is why my query fails.
> I assume this behavior also causes another issue, from which I traced this one. It shows up on large files: perhaps Drill splits the file into smaller chunks, and one of them contains a similar sequence of lines (4096 of the same type from Drill's point of view), so Drill stops the query, which leads to another exception.
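For reference, a file with the failing layout can be regenerated without the attachments. This is a hypothetical reconstruction, not the attachment itself: the field values are illustrative, and the assumption (inferred from the ClassCastException) is that `HostUpdateTypeNW.Transfers` is null in the first 4096 records and an array only in the last one, so the schema change lands exactly on the JSON reader's 4096-record batch boundary.

```python
import json

# Hypothetical reconstruction of dataset_4096_and_1.json.
# First 4096 records: Transfers is null, which Drill types as a nullable int.
# Last record: Transfers is an array, which needs a repeated vector --
# the type change crosses the 4096-record batch boundary and triggers the bug.
filler = {"pck": 3547, "timestamp": 1419807470286356,
          "HostUpdateTypeNW": {"Transfers": None}}
tail = {"pck": 3547, "timestamp": 1419807470286356,
        "HostUpdateTypeNW": {"Transfers": [
            {"TransferingID": "88888", "TransferingKind": "8"}]}}

with open("dataset_4096_and_1.json", "w") as f:
    for _ in range(4096):
        f.write(json.dumps(filler) + "\n")
    f.write(json.dumps(tail) + "\n")
```

Changing `range(4096)` to `range(4095)` should produce the working dataset_4095_and_1.json variant described above.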
> The large file is attached as dataset_sample.json.gz.
> Here is the view (dataset_sample.view.drill) I use to query the large file:
> {code}
> {
>   "name" : "dataset_sample",
>   "sql" : "SELECT `Message`.`timestamp`, `flatten`(`Message`.`HostUpdateTypeCR`['Transfers']) AS `entries`\nFROM `dfs`.`/mnt/data/dataset_sample.json.gz` AS `Message`",
>   "fields" : [ {
>     "name" : "timestamp",
>     "type" : "ANY"
>   }, {
>     "name" : "transfers",
>     "type" : "ANY"
>   } ],
>   "workspaceSchemaPath" : [ "dfs", "mnt" ]
> }
> {code}
> And here is the query I'm trying to execute:
> {code}
> 0: jdbc:drill:zk=local> create table dataset_tbl as
> . . . . . . . . . . . > select dataset_sample.transfers.TransferingID as id, dataset_sample.transfers.TransferingKind as type from dataset_sample;
> Query failed: Query stopped., index: 9502, length: 1 (expected: range(0, 1024)) [ c5eac3ee-0266-4645-b6b5-2a1b58df4821 on testlab-broker:31010 ]
> Error: exception while executing query: Failure while executing query. (state=,code=0)
> 0: jdbc:drill:zk=local> Exception in thread "WorkManager-19" java.lang.IllegalStateException
> 	at com.google.common.base.Preconditions.checkState(Preconditions.java:133)
> 	at org.apache.drill.common.DeferredException.addException(DeferredException.java:47)
> 	at org.apache.drill.common.DeferredException.addThrowable(DeferredException.java:61)
> 	at org.apache.drill.exec.ops.FragmentContext.fail(FragmentContext.java:133)
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:181)
> 	at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> Please let me know if I should split this issue into two separate issues or if you need any additional info.
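The chunking hypothesis for the large file could be checked outside Drill by scanning the gzipped input for the first record where the field's JSON type changes and noting where that falls relative to a batch. This is a rough diagnostic sketch, not part of the reported reproduction: the field name is taken from the view above, and the 4096-record batch size is assumed from the behavior described in this issue.

```python
import gzip
import json

def first_type_change(path, batch=4096):
    """Return (record index, offset within a batch) of the first record whose
    HostUpdateTypeCR.Transfers field has a different JSON type than the
    previous record's, or None if the type never changes."""
    prev = None
    with gzip.open(path, "rt") as f:
        for i, line in enumerate(f):
            rec = json.loads(line)
            t = type(rec.get("HostUpdateTypeCR", {}).get("Transfers")).__name__
            if prev is not None and t != prev:
                return i, i % batch
            prev = t
    return None
```

If the reported offset within a batch is 0, the type change coincides with a batch boundary, which would support the hypothesis above.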
-- This message was sent by Atlassian JIRA (v6.3.4#6332)