[jira] [Commented] (DRILL-2835) IndexOutOfBoundsException in partition sender when doing streaming aggregate with LIMIT
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841535#comment-17841535 ]

ASF GitHub Bot commented on DRILL-2835:
---
mbeckerle commented on PR #2909:
URL: https://github.com/apache/drill/pull/2909#issuecomment-2081179778

This fails its tests due to a Maven checkstyle failure. It is complaining about Drill:Exec:Vectors, which my code does not change. Can someone advise on what is wrong here?

> IndexOutOfBoundsException in partition sender when doing streaming aggregate with LIMIT
>
> Key: DRILL-2835
> URL: https://issues.apache.org/jira/browse/DRILL-2835
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - RPC
> Affects Versions: 0.8.0
> Reporter: Aman Sinha
> Assignee: Venki Korukanti
> Priority: Major
> Fix For: 0.9.0
> Attachments: DRILL-2835-1.patch, DRILL-2835-2.patch
>
> Following CTAS run on a TPC-DS 100GB scale factor on a 10-node cluster:
> {code}
> alter session set `planner.enable_hashagg` = false;
> alter session set `planner.enable_multiphase_agg` = true;
> create table dfs.tmp.stream9 as
> select cr_call_center_sk, cr_catalog_page_sk, cr_item_sk, cr_reason_sk,
>   cr_refunded_addr_sk, count(*) from catalog_returns_dri100
> group by cr_call_center_sk, cr_catalog_page_sk, cr_item_sk, cr_reason_sk,
>   cr_refunded_addr_sk
> limit 100;
> {code}
> {code}
> Caused by: java.lang.IndexOutOfBoundsException: index: 1023, length: 1 (expected: range(0, 0))
>   at io.netty.buffer.DrillBuf.checkIndexD(DrillBuf.java:200) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:4.0.24.Final]
>   at io.netty.buffer.DrillBuf.chk(DrillBuf.java:222) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:4.0.24.Final]
>   at io.netty.buffer.DrillBuf.setByte(DrillBuf.java:621) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:4.0.24.Final]
>   at org.apache.drill.exec.vector.UInt1Vector$Mutator.set(UInt1Vector.java:342) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
>   at org.apache.drill.exec.vector.NullableBigIntVector$Mutator.set(NullableBigIntVector.java:372) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
>   at org.apache.drill.exec.vector.NullableBigIntVector.copyFrom(NullableBigIntVector.java:284) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
>   at org.apache.drill.exec.test.generated.PartitionerGen4$OutgoingRecordBatch.doEval(PartitionerTemplate.java:370) ~[na:na]
>   at org.apache.drill.exec.test.generated.PartitionerGen4$OutgoingRecordBatch.copy(PartitionerTemplate.java:249) ~[na:na]
>   at org.apache.drill.exec.test.generated.PartitionerGen4.doCopy(PartitionerTemplate.java:208) ~[na:na]
>   at org.apache.drill.exec.test.generated.PartitionerGen4.partitionBatch(PartitionerTemplate.java:176) ~[na:na]
> {code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802012#comment-17802012 ]

ASF GitHub Bot commented on DRILL-2835:
---
paul-rogers commented on PR #2836:
URL: https://github.com/apache/drill/pull/2836#issuecomment-1874845274

Hi Mike,

Just jumping in with a random thought. Drill has accumulated a number of schema systems: the Parquet metadata cache, HMS, Drill's own metastore, "provided schema", and now DFDL. All provide ways of defining data, be it Parquet, JSON, CSV, or whatever. One can't help but wonder: should some future version try to reduce this variation somewhat? Maybe map all the variations to DFDL? Or map DFDL to Drill's own mechanisms?

Drill uses two kinds of metadata: schema definitions, and file metadata used for scan pruning. Schema information could be used at plan time (to provide column types), but certainly at scan time (to "discover" the defined schema). File metadata is used primarily at plan time to work out how to distribute work.

A bit of background on scan pruning. Back in the day, it was common to have thousands or millions of files in Hadoop to scan: this is why tools like Drill were distributed, to divide and conquer. And, of course, the fastest scan is to skip files that we know can't contain the information we want. File metadata captures this information outside of the files themselves. HMS was the standard solution in the Hadoop days. (Amazon Glue, for S3, is evidently based on HMS.)

For example, Drill's Parquet metadata cache, the Drill metastore, and HMS all provide both schema and file metadata. The schema information mainly helps with schema evolution: over time, different files have different sets of columns. File metadata provides information *about* each file, such as the data ranges stored in it. For Parquet, we might track that '2023-01-Boston.parquet' holds data in the office='Boston' range, so there is no use scanning that file for office='Austin'. And so on.

With Hadoop HDFS, it was customary to use the directory structure as a partial primary index: our file above would live in the /sales/2023/01 directory, for example, and logic chooses the proper set of directories to scan. In Drill, it is up to the user to add crufty conditionals on the path name. In Impala, and other HMS-aware tools, the user just says WHERE order_year = 2023 AND order_month = 1, and HMS tells the tool that the order_year and order_month columns translate to such-and-so directory paths. It would be nice if Drill could provide that feature as well, given the proper file metadata: in this case, the mapping of column names to directory paths and file names.

Does DFDL provide only schema information? Does it support versioning, so that we know that "old.csv" lacks the "version" column while "new.csv" includes it? Does it also include the kinds of file metadata mentioned above? Or is DFDL perhaps used in a different context, in which the files have a fixed schema and are small in number? That would fit well with the "desktop analytics" model that Charles and James suggested is where Drill is now most commonly used.

The answers might suggest whether DFDL can be the universal data description, or whether DFDL applies just to individual file schemas and Drill would still need a second system to track schema evolution and file metadata for large deployments.

Further, if DFDL is kind of a stand-alone thing with its own reader, then we end up with more complexity: the Drill JSON reader and the DFDL JSON reader. Same for CSV, etc. JSON is so complex that we'd find ourselves telling people that the quirks work one way with the native reader and another way with DFDL. Plus, the DFDL readers might not handle file splits the same way, or support the same set of formats that Drill's other readers support, and so on.

It would be nice to separate the idea of schema description from reader implementation, so that DFDL can be used as a source of schema for any arbitrary reader, at both plan and scan times. If DFDL uses its own readers, then we'd need DFDL reader representations in Calcite, which would pick up DFDL schemas so that the schemas are reliably serialized out to each node as part of the physical plan. This is possible, but it does send us down the two-readers-for-every-format path. On the other hand, if DFDL mapped to Drill's existing schema description, then DFDL could be used with our existing readers, and there would be just one schema description sent to readers: Drill's existing provided schema format, which EVF can already consume. At present, just a few formats support provided schema in the Calcite layer: CSV for sure, maybe JSON?
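The directory-as-partition pruning described above (Impala-style WHERE order_year = 2023 AND order_month = 1 mapping to directory paths) can be sketched in plain Java. This is an illustrative stand-alone sketch, not Drill's or Impala's actual planner code; the paths, column names, and positional path-to-column mapping are all hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PartitionPruning {
  // A file's partition values are encoded in its directory path,
  // e.g. /sales/2023/01/boston.parquet -> {order_year=2023, order_month=01}.
  // Directory levels after the table root map positionally to partition columns.
  static Map<String, String> partitionValues(String path, List<String> partitionCols) {
    String[] parts = path.split("/"); // parts[0]="" (leading /), parts[1]=table root
    Map<String, String> values = new HashMap<>();
    for (int i = 0; i < partitionCols.size(); i++) {
      values.put(partitionCols.get(i), parts[2 + i]);
    }
    return values;
  }

  // Keep only files whose directory-derived values satisfy the equality
  // predicate, so non-matching files are never opened at all.
  static List<String> prune(List<String> files, List<String> cols, Map<String, String> predicate) {
    return files.stream()
        .filter(f -> partitionValues(f, cols).entrySet().containsAll(predicate.entrySet()))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> files = List.of(
        "/sales/2023/01/boston.parquet",
        "/sales/2023/02/boston.parquet",
        "/sales/2022/01/austin.parquet");
    List<String> cols = List.of("order_year", "order_month");
    // WHERE order_year = 2023 AND order_month = 01
    Map<String, String> pred = Map.of("order_year", "2023", "order_month", "01");
    System.out.println(prune(files, cols, pred)); // only the /sales/2023/01 file survives
  }
}
```

The point of the sketch is the division of labor: the metadata system owns the column-to-directory mapping, so the user writes an ordinary WHERE clause instead of crufty path conditionals.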
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801835#comment-17801835 ]

ASF GitHub Bot commented on DRILL-2835:
---
mbeckerle commented on code in PR #2836:
URL: https://github.com/apache/drill/pull/2836#discussion_r1439542636

## contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilBatchReader.java:
@@ -0,0 +1,184 @@
{code}
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.drill.exec.store.daffodil;

import java.io.InputStream;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Objects;

import org.apache.daffodil.japi.DataProcessor;
import org.apache.drill.common.AutoCloseables;
import org.apache.drill.common.exceptions.CustomErrorContext;
import org.apache.drill.common.exceptions.UserException;
import org.apache.drill.exec.physical.impl.scan.v3.ManagedReader;
import org.apache.drill.exec.physical.impl.scan.v3.file.FileDescrip;
import org.apache.drill.exec.physical.impl.scan.v3.file.FileSchemaNegotiator;
import org.apache.drill.exec.physical.resultSet.RowSetLoader;
import org.apache.drill.exec.record.metadata.TupleMetadata;
import org.apache.drill.exec.store.daffodil.schema.DaffodilDataProcessorFactory;
import org.apache.drill.exec.store.dfs.DrillFileSystem;
import org.apache.drill.exec.store.dfs.easy.EasySubScan;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static org.apache.drill.exec.store.daffodil.schema.DrillDaffodilSchemaUtils.daffodilDataProcessorToDrillSchema;

public class DaffodilBatchReader implements ManagedReader {

  private static final Logger logger = LoggerFactory.getLogger(DaffodilBatchReader.class);
  private final DaffodilFormatConfig dafConfig;
  private final RowSetLoader rowSetLoader;
  private final CustomErrorContext errorContext;
  private final DaffodilMessageParser dafParser;
  private final InputStream dataInputStream;

  static class DaffodilReaderConfig {
    final DaffodilFormatPlugin plugin;
    DaffodilReaderConfig(DaffodilFormatPlugin plugin) {
      this.plugin = plugin;
    }
  }

  public DaffodilBatchReader(DaffodilReaderConfig readerConfig, EasySubScan scan, FileSchemaNegotiator negotiator) {

    errorContext = negotiator.parentErrorContext();
    this.dafConfig = readerConfig.plugin.getConfig();

    String schemaURIString = dafConfig.getSchemaURI(); // "schema/complexArray1.dfdl.xsd";
    String rootName = dafConfig.getRootName();
    String rootNamespace = dafConfig.getRootNamespace();
    boolean validationMode = dafConfig.getValidationMode();

    URI dfdlSchemaURI;
    try {
      dfdlSchemaURI = new URI(schemaURIString);
    } catch (URISyntaxException e) {
      throw UserException.validationError(e)
          .build(logger);
    }

    FileDescrip file = negotiator.file();
    DrillFileSystem fs = file.fileSystem();
    URI fsSchemaURI = fs.getUri().resolve(dfdlSchemaURI);

    DaffodilDataProcessorFactory dpf = new DaffodilDataProcessorFactory();
    DataProcessor dp;
    try {
      dp = dpf.getDataProcessor(fsSchemaURI, validationMode, rootName, rootNamespace);
    } catch (Exception e) {
      throw UserException.dataReadError(e)
          .message(String.format("Failed to get Daffodil DFDL processor for: %s", fsSchemaURI))
          .addContext(errorContext).addContext(e.getMessage()).build(logger);
    }
    // Create the corresponding Drill schema.
    // Note: this could be a very large schema. Think of a large complex RDBMS schema,
    // all of it, hundreds of tables, but all part of the same metadata tree.
    TupleMetadata drillSchema = daffodilDataProcessorToDrillSchema(dp);
    // Inform Drill about the schema
    negotiator.tableSchema(drillSchema, true);

    //
    // DATA TIME: Next we construct the runtime objects, and open files.
    //
    // We get the DaffodilMessageParser, which is a stateful driver for daffodil that
    // actually does the parsing.
    rowSetLoader = negotiator.build().writer();
{code}
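The schema-location handling in the constructor above hinges on java.net.URI resolution: a relative schema URI from the format config is anchored at the filesystem's base URI, while an absolute one passes through unchanged. A stand-alone illustration (the hostnames and paths here are made up, not from the patch):

```java
import java.net.URI;

public class SchemaUriResolution {
  public static void main(String[] args) throws Exception {
    // A relative schema reference, as it might appear in a format config.
    URI relative = new URI("schema/complexArray1.dfdl.xsd");

    // Base URI of a filesystem, as DrillFileSystem.getUri() might return
    // for HDFS (hostname is hypothetical).
    URI fsBase = new URI("hdfs://namenode:8020/");

    // resolve() anchors a relative URI at the base...
    System.out.println(fsBase.resolve(relative));
    // hdfs://namenode:8020/schema/complexArray1.dfdl.xsd

    // ...and leaves an absolute URI alone.
    URI absolute = new URI("file:///opt/schemas/complexArray1.dfdl.xsd");
    System.out.println(fsBase.resolve(absolute));
    // file:///opt/schemas/complexArray1.dfdl.xsd
  }
}
```

This is why the same config string can name either a file local to the workspace filesystem or an explicit file: or hdfs: URI.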
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801834#comment-17801834 ]

ASF GitHub Bot commented on DRILL-2835:
---
mbeckerle commented on PR #2836:
URL: https://github.com/apache/drill/pull/2836#issuecomment-1874213780

@cgivre yes, the next architectural-level issue is how to get a compiled DFDL schema out to every place Drill will run a Daffodil parse. Every one of those JVMs needs to reload it.

I'll do the various cleanups and such. The one issue I don't know how to fix is the "typed setter" vs. (set-object) issue, so if you could steer me in the right direction on that, it would help.
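The distribution problem raised here is commonly solved as compile once, ship bytes with the plan, reload in each worker JVM (Daffodil itself exposes DataProcessor.save and Compiler.reload for the save/reload steps). A stdlib-only sketch of the pattern, with a serializable stand-in for the compiled schema rather than Daffodil's real DataProcessor:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SchemaDistribution {
  // Stand-in for a compiled DFDL schema; the real object would be Daffodil's
  // DataProcessor, saved and reloaded through Daffodil's own API instead of
  // Java serialization.
  static class CompiledSchema implements Serializable {
    final String rootElement;
    CompiledSchema(String rootElement) { this.rootElement = rootElement; }
  }

  // On the foreman: compile once, serialize to bytes that travel with the plan.
  static byte[] save(CompiledSchema schema) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
      out.writeObject(schema);
    }
    return bos.toByteArray();
  }

  // On each worker JVM: reload the shipped bytes instead of recompiling.
  static CompiledSchema reload(byte[] bytes) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return (CompiledSchema) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    byte[] shipped = save(new CompiledSchema("complexArray1"));
    CompiledSchema onWorker = reload(shipped);
    System.out.println(onWorker.rootElement); // complexArray1
  }
}
```

The expensive step (schema compilation) happens once; each fragment pays only the reload cost, which is the same trade-off Drill makes when it serializes the physical plan to each node.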
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801647#comment-17801647 ]

ASF GitHub Bot commented on DRILL-2835:
---
cgivre commented on code in PR #2836:
URL: https://github.com/apache/drill/pull/2836#discussion_r1439055155

## contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilBatchReader.java:
@@ -0,0 +1,184 @@
(quotes the same DaffodilBatchReader.java diff as the earlier review comment)
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799653#comment-17799653 ]

ASF GitHub Bot commented on DRILL-2835:
---
cgivre commented on PR #2836:
URL: https://github.com/apache/drill/pull/2836#issuecomment-1867184695

> > > Rebased onto latest Drill master as of 2023-12-21 (force pushed one more time)
> > >
> > > Note that this is never going to pass automated tests until the Daffodil release this depends on is official (currently it needs a locally built Daffodil 3.7.0-SNAPSHOT, though the main Daffodil branch has the changes integrated, so any 3.7.0-SNAPSHOT build will work).
> >
> > @mbeckerle This is really great work! Thanks for your persistence on this. Do you have an ETA on the next Daffodil release?
>
> We could have a Daffodil release in Jan or Feb. There are some Daffodil API cleanups that need to be discussed that would provide better stability for this Drill integration ... we may want to wait for those and update this to use them.

@mbeckerle So is the next step really to figure out how to access the Daffodil files from a potentially distributed environment?
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799635#comment-17799635 ]

ASF GitHub Bot commented on DRILL-2835:
---
mbeckerle commented on PR #2836:
URL: https://github.com/apache/drill/pull/2836#issuecomment-1867120954

> > Rebased onto latest Drill master as of 2023-12-21 (force pushed one more time)
> >
> > Note that this is never going to pass automated tests until the Daffodil release this depends on is official (currently it needs a locally built Daffodil 3.7.0-SNAPSHOT, though the main Daffodil branch has the changes integrated, so any 3.7.0-SNAPSHOT build will work).
>
> @mbeckerle This is really great work! Thanks for your persistence on this. Do you have an ETA on the next Daffodil release?

We could have a Daffodil release in Jan or Feb. There are some Daffodil API cleanups that need to be discussed that would provide better stability for this Drill integration ... we may want to wait for those and update this to use them.
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799630#comment-17799630 ]

ASF GitHub Bot commented on DRILL-2835:
---
cgivre commented on PR #2836:
URL: https://github.com/apache/drill/pull/2836#issuecomment-1867113844

> Rebased onto latest Drill master as of 2023-12-21 (force pushed one more time)
>
> Note that this is never going to pass automated tests until the Daffodil release this depends on is official (currently it needs a locally built Daffodil 3.7.0-SNAPSHOT, though the main Daffodil branch has the changes integrated, so any 3.7.0-SNAPSHOT build will work).

@mbeckerle This is really great work! Thanks for your persistence on this. Do you have an ETA on the next Daffodil release?
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799622#comment-17799622 ] ASF GitHub Bot commented on DRILL-2835: --- mbeckerle commented on PR #2836: URL: https://github.com/apache/drill/pull/2836#issuecomment-1867102615

Rebased onto the latest Drill master as of 2023-12-21 (force-pushed one more time).

Note that this is never going to pass automated tests until the Daffodil release it depends on is official. Currently it needs a locally built Daffodil 3.7.0-SNAPSHOT, though the main Daffodil branch has the changes integrated, so any 3.7.0-SNAPSHOT build will work.

> IndexOutOfBoundsException in partition sender when doing streaming aggregate
> with LIMIT
>
> Key: DRILL-2835
> URL: https://issues.apache.org/jira/browse/DRILL-2835
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - RPC
> Affects Versions: 0.8.0
> Reporter: Aman Sinha
> Assignee: Venki Korukanti
> Priority: Major
> Fix For: 0.9.0
>
> Attachments: DRILL-2835-1.patch, DRILL-2835-2.patch
>
> The following CTAS was run on a TPC-DS 100GB scale factor on a 10-node cluster:
> {code}
> alter session set `planner.enable_hashagg` = false;
> alter session set `planner.enable_multiphase_agg` = true;
> create table dfs.tmp.stream9 as
> select cr_call_center_sk, cr_catalog_page_sk, cr_item_sk, cr_reason_sk,
> cr_refunded_addr_sk, count(*) from catalog_returns_dri100
> group by cr_call_center_sk, cr_catalog_page_sk, cr_item_sk, cr_reason_sk,
> cr_refunded_addr_sk
> limit 100;
> {code}
> {code}
> Caused by: java.lang.IndexOutOfBoundsException: index: 1023, length: 1 (expected: range(0, 0))
>   at io.netty.buffer.DrillBuf.checkIndexD(DrillBuf.java:200) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:4.0.24.Final]
>   at io.netty.buffer.DrillBuf.chk(DrillBuf.java:222) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:4.0.24.Final]
>   at io.netty.buffer.DrillBuf.setByte(DrillBuf.java:621) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:4.0.24.Final]
>   at org.apache.drill.exec.vector.UInt1Vector$Mutator.set(UInt1Vector.java:342) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
>   at org.apache.drill.exec.vector.NullableBigIntVector$Mutator.set(NullableBigIntVector.java:372) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
>   at org.apache.drill.exec.vector.NullableBigIntVector.copyFrom(NullableBigIntVector.java:284) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
>   at org.apache.drill.exec.test.generated.PartitionerGen4$OutgoingRecordBatch.doEval(PartitionerTemplate.java:370) ~[na:na]
>   at org.apache.drill.exec.test.generated.PartitionerGen4$OutgoingRecordBatch.copy(PartitionerTemplate.java:249) ~[na:na]
>   at org.apache.drill.exec.test.generated.PartitionerGen4.doCopy(PartitionerTemplate.java:208) ~[na:na]
>   at org.apache.drill.exec.test.generated.PartitionerGen4.partitionBatch(PartitionerTemplate.java:176) ~[na:na]
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
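The exception message above comes from a buffer bounds check: a write of 1 byte at index 1023 is rejected because the outgoing batch's buffer has a writable range of (0, 0), i.e. zero capacity. The following sketch (not Drill source; the class and method names are illustrative) reproduces the shape of that check and its error message:

```java
// Illustrative sketch of a DrillBuf-style bounds check. A write into a
// buffer whose capacity has dropped to 0 (e.g. after the partition sender's
// outgoing batch is terminated by LIMIT) fails exactly like the stack trace.
public class BoundsCheckSketch {

    // Reject any access where [index, index + length) falls outside [0, capacity).
    static void checkIndex(int index, int length, int capacity) {
        if (index < 0 || length < 0 || index + length > capacity) {
            throw new IndexOutOfBoundsException(String.format(
                "index: %d, length: %d (expected: range(0, %d))",
                index, length, capacity));
        }
    }

    public static void main(String[] args) {
        try {
            // 1023 is the last slot of a 1024-record batch; capacity is 0.
            checkIndex(1023, 1, 0);
            System.out.println("no error");
        } catch (IndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running it prints `index: 1023, length: 1 (expected: range(0, 0))`, matching the exception in the report: the copy into the outgoing vector is attempted after the batch's buffers were released.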
[jira] [Commented] (DRILL-2835) IndexOutOfBoundsException in partition sender when doing streaming aggregate with LIMIT
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799618#comment-17799618 ] ASF GitHub Bot commented on DRILL-2835: --- mbeckerle commented on PR #2836: URL: https://github.com/apache/drill/pull/2836#issuecomment-1867092071

This is pretty much working now, in terms of constructing Drill metadata from DFDL schemas and of Daffodil delivering data to Drill. There were dozens of commits to get here, so I squashed them, as they were no longer helpful. Obviously more tests are needed, but the ones there show nested subrecords working. Issues such as how schemas get distributed, and how Daffodil gets invoked in parallel by Drill, are still open.
[jira] [Commented] (DRILL-2835) IndexOutOfBoundsException in partition sender when doing streaming aggregate with LIMIT
[ https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505429#comment-14505429 ] Aman Sinha commented on DRILL-2835: ---

Patch looks good to me. +1. I had a question about the role of dropAll vs. terminate and clarified it with Venki. I suggest adding some of that discussion to the code comments; it would be useful for future reference.