[ https://issues.apache.org/jira/browse/DRILL-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Altekruse updated DRILL-2677:
-----------------------------------
    Fix Version/s:     (was: 1.2.0)
                   1.3.0

> Query does not go beyond 4096 lines in small JSON files
> -------------------------------------------------------
>
>                 Key: DRILL-2677
>                 URL: https://issues.apache.org/jira/browse/DRILL-2677
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - JSON
>         Environment: drill 0.8 official build
>            Reporter: Alexander Reshetov
>            Assignee: Jason Altekruse
>             Fix For: 1.3.0
>
>         Attachments: dataset_4095_and_1.json, dataset_4096_and_1.json, 
> dataset_sample.json.gz.part-aa, dataset_sample.json.gz.part-ab, 
> dataset_sample.json.gz.part-ac, dataset_sample.json.gz.part-ad, 
> dataset_sample.json.gz.part-ae, dataset_sample.json.gz.part-af
>
>
> Hello,
> I'm trying to execute the following query:
> {code}
> select * from (select source.pck, source.`timestamp`, 
> flatten(source.HostUpdateTypeNW.Transfers) as entry from 
> dfs.`/mnt/data/dataset_4095_and_1.json` as source) as parsed;
> {code}
> It works as expected and returns this result:
> {code}
> +------------+------------+------------+
> |    pck     | timestamp  |   entry    |
> +------------+------------+------------+
> | 3547       | 1419807470286356 | 
> {"TransferingPurpose":"8","TransferingImpact":"88","TransferingKind":"8","TransferingTime":"888888888","PackageOrigSenderID":"8","TransferingID":"88888","TransitCN":"888","PackageChkPnt":"8888","PackageFullSize":"8","TransferingSessionID":"8","SubpackagesCounter":"8"}
>  |
> +------------+------------+------------+
> 1 row selected (0.188 seconds)
> {code}
> This file contains 4095 identical lines of one JSON record, plus one 
> different JSON record at the end (see the attached file 
> dataset_4095_and_1.json).
> The problem is that when the first record is repeated more than 4095 times, 
> the query throws an exception. Here is the same query against a file with 
> 4096 records of the first type plus 1 record of the other type (see the 
> attached file dataset_4096_and_1.json):
> {code}
> select * from (select source.pck, source.`timestamp`, 
> flatten(source.HostUpdateTypeNW.Transfers) as entry from 
> dfs.`/mnt/data/dataset_4096_and_1.json` as source) as parsed;
> Exception in thread "2ae108ff-b7ea-8f07-054e-84875815d856:frag:0:0" 
> java.lang.RuntimeException: Error closing fragment context.
>       at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources(FragmentExecutor.java:224)
>       at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:187)
>       at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: 
> org.apache.drill.exec.vector.NullableIntVector cannot be cast to 
> org.apache.drill.exec.vector.RepeatedVector
>       at 
> org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.getFlattenFieldTransferPair(FlattenRecordBatch.java:274)
>       at 
> org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.setupNewSchema(FlattenRecordBatch.java:296)
>       at 
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:78)
>       at 
> org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext(FlattenRecordBatch.java:122)
>       at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
>       at 
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
>       at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:99)
>       at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:89)
>       at 
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>       at 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
>       at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
>       at 
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
>       at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:68)
>       at 
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:96)
>       at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:58)
>       at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:163)
>       ... 4 more
> Query failed: RemoteRpcException: Failure while running fragment., 
> org.apache.drill.exec.vector.NullableIntVector cannot be cast to 
> org.apache.drill.exec.vector.RepeatedVector [ 
> cb6c7914-438f-440a-9c74-fe39130feca9 on testlab-broker:31010 ]
> [ cb6c7914-438f-440a-9c74-fe39130feca9 on testlab-broker:31010 ]
> Error: exception while executing query: Failure while executing query. 
> (state=,code=0)
> {code}
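> For reference, the two attached datasets can be approximated with a script 
> like the one below. This is only a sketch: the repeated record is modeled on 
> the sample output row above, and the shape of the trailing record is an 
> assumption (the real attachments may differ).

```python
import json

# One record of the repeated type, modeled on the sample row in the
# query output above; the field values are placeholders.
transfer = {
    "TransferingPurpose": "8", "TransferingImpact": "88",
    "TransferingKind": "8", "TransferingTime": "888888888",
    "PackageOrigSenderID": "8", "TransferingID": "88888",
    "TransitCN": "888", "PackageChkPnt": "8888",
    "PackageFullSize": "8", "TransferingSessionID": "8",
    "SubpackagesCounter": "8",
}
repeated = {"pck": 3547, "timestamp": 1419807470286356,
            "HostUpdateTypeNW": {"Transfers": [transfer]}}

# The trailing record is assumed to differ in shape (no Transfers array),
# which would be consistent with the NullableIntVector cast error above.
trailing = {"pck": 3548, "timestamp": 1419807470286357,
            "HostUpdateTypeNW": {"Transfers": None}}

def write_dataset(path, n_repeats):
    """Write n_repeats copies of the repeated record plus one trailing record,
    one JSON object per line."""
    with open(path, "w") as f:
        for _ in range(n_repeats):
            f.write(json.dumps(repeated) + "\n")
        f.write(json.dumps(trailing) + "\n")

write_dataset("dataset_4095_and_1.json", 4095)  # reported to work
write_dataset("dataset_4096_and_1.json", 4096)  # reported to trigger the exception
```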
> This suggests that Drill stops analyzing the schema after exactly 4096 
> lines, which is why my query fails.
> I assume this behavior also causes another issue, which is the one I was 
> originally investigating. It shows up on large files: perhaps Drill splits 
> the file into smaller chunks, and one of them contains a similar sequence of 
> lines (4096 of the same type from Drill's point of view), which stops the 
> query and leads to another exception. The large file is attached as 
> dataset_sample.json.gz.
> Here is the view (dataset_sample.view.drill) that I use to query the large 
> file:
> {code}
> {
>   "name" : "dataset_sample",
>   "sql" : "SELECT `Message`.`timestamp`, 
> `flatten`(`Message`.`HostUpdateTypeCR`['Transfers']) AS `entries`\nFROM 
> `dfs`.`/mnt/data/dataset_sample.json.gz` AS `Message`",
>   "fields" : [ {
>     "name" : "timestamp",
>     "type" : "ANY"
>   }, {
>     "name" : "transfers",
>     "type" : "ANY"
>   } ],
>   "workspaceSchemaPath" : [ "dfs", "mnt" ]
> }
> {code}
> And here is the query I'm trying to execute:
> {code}
> 0: jdbc:drill:zk=local> create table dataset_tbl as
> . . . . . . . . . . . > select dataset_sample.transfers.TransferingID as id, 
> dataset_sample.transfers.TransferingKind as type from dataset_sample;
> Query failed: Query stopped., index: 9502, length: 1 (expected: range(0, 
> 1024)) [ c5eac3ee-0266-4645-b6b5-2a1b58df4821 on testlab-broker:31010 ]
> Error: exception while executing query: Failure while executing query. 
> (state=,code=0)
> 0: jdbc:drill:zk=local> Exception in thread "WorkManager-19" 
> java.lang.IllegalStateException
>       at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:133)
>       at 
> org.apache.drill.common.DeferredException.addException(DeferredException.java:47)
>       at 
> org.apache.drill.common.DeferredException.addThrowable(DeferredException.java:61)
>       at 
> org.apache.drill.exec.ops.FragmentContext.fail(FragmentContext.java:133)
>       at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:181)
>       at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> Please let me know if I should split this into two separate issues or if you 
> need any additional information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
