[GitHub] drill issue #850: DRILL-5541: C++ Client Crashes During Simple "Man in the M...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/850

+1
[GitHub] drill issue #849: DRILL-5568: Include hadoop-common jars inside drill-jdbc-a...
Github user sohami commented on the issue: https://github.com/apache/drill/pull/849

@paul-rogers - To help review this PR, in case JIRA didn't send out the email.
[GitHub] drill issue #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/808

Rebased onto latest master and resolved merge conflicts.
[jira] [Created] (DRILL-5572) Redundant user name property in AbstractBase, EasySubScan
Paul Rogers created DRILL-5572:
--
Summary: Redundant user name property in AbstractBase, EasySubScan
Key: DRILL-5572
URL: https://issues.apache.org/jira/browse/DRILL-5572
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor

In Drill physical operators:
[GitHub] drill pull request #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/808#discussion_r120498921

--- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/memory/BaseAllocator.java ---

@@ -793,20 +797,20 @@ public static boolean isDebug() {
     return DEBUG;
   }

-  /**
-   * Disk I/O buffer used for all reads and writes of DrillBufs.
-   */
-
-  private byte ioBuffer[];
-
   public byte[] getIOBuffer() {
     if (ioBuffer == null) {
+      // Length chosen to be the smallest size that maximizes
+      // disk I/O performance. Smaller sizes slow I/O. Larger
+      // sizes provide no increase in performance.
+      // Revisit from time to time.

--- End diff --

Fixed. It is a constant because no one outside Drill can observe the performance effect of changing the number; arriving at it required intense, focused testing.
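For illustration, a minimal sketch of the pattern this review settles on: a lazily allocated buffer whose size is a named, measurement-tuned constant. The class, constant name, and 32K value are illustrative assumptions, not Drill's actual figures.

import org.apache.drill.exec.memory.BufferAllocator; // context only; not used below

public class IoBufferHolder {
  // Assumed value for illustration; the comment above says only that the
  // real constant is "the smallest size that maximizes disk I/O performance".
  public static final int IO_BUFFER_SIZE = 32 * 1024;

  private byte[] ioBuffer;

  // Lazily create the buffer so allocators that never touch disk pay nothing.
  public byte[] getIOBuffer() {
    if (ioBuffer == null) {
      ioBuffer = new byte[IO_BUFFER_SIZE];
    }
    return ioBuffer;
  }
}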
[GitHub] drill pull request #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/808#discussion_r120497872

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortMemoryManager.java ---

@@ -0,0 +1,506 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.xsort.managed;
+
+public class SortMemoryManager {
+
+  /**
+   * Maximum memory this operator may use. Usually comes from the
+   * operator definition, but may be overridden by a configuration
+   * parameter for unit testing.
+   */
+  private final long memoryLimit;
+
+  /**
+   * Estimated size of the records for this query, updated on each
+   * new batch received from upstream.
+   */
+  private int estimatedRowWidth;
+
+  /**
+   * Size of the merge batches that this operator produces. Generally
+   * the same as the merge batch size, unless low memory forces a smaller
+   * value.
+   */
+  private int expectedMergeBatchSize;
+
+  /**
+   * Estimate of the input batch size based on the largest batch seen
+   * thus far.
+   */
+  private int estimatedInputBatchSize;
+
+  /**
+   * Maximum memory level before spilling occurs. That is, we can buffer input
+   * batches in memory until we reach the level given by the buffer memory pool.
+   */
+  private long bufferMemoryLimit;
+
+  /**
+   * Maximum memory that can hold batches during the merge
+   * phase.
+   */
+  private long mergeMemoryLimit;
+
+  /**
+   * The target size for merge batches sent downstream.
+   */
+  private int preferredMergeBatchSize;
+
+  /**
+   * The configured size for each spill batch.
+   */
+  private int preferredSpillBatchSize;
+
+  /**
+   * Estimated number of rows that fit into a single spill batch.
+   */
+  private int spillBatchRowCount;
+
+  /**
+   * The estimated actual spill batch size which depends on the
+   * details of the data rows for any particular query.
+   */
+  private int expectedSpillBatchSize;
+
+  /**
+   * The number of records to add to each output batch sent to the
+   * downstream operator or spilled to disk.
+   */
+  private int mergeBatchRowCount;
+
+  private SortConfig config;
+
+//  private long spillPoint;
+
+//  private long minMergeMemory;
+
+  private int estimatedInputSize;
+
+  private boolean potentialOverflow;
+
+  public SortMemoryManager(SortConfig config, long memoryLimit) {
+    this.config = config;
+
+    // The maximum memory this operator can use as set by the
+    // operator definition (propagated to the allocator.)
+    if (config.maxMemory() > 0) {
+      this.memoryLimit = Math.min(memoryLimit, config.maxMemory());
+    } else {
+      this.memoryLimit = memoryLimit;
+    }
+
+    preferredSpillBatchSize = config.spillBatchSize();
+    preferredMergeBatchSize = config.mergeBatchSize();
+  }
+
+  /**
+   * Update the data-driven memory use numbers including:
+   *
+   * The average size of incoming records.
+   * The estimated spill and output batch size.
+   * The estimated number of average-size records per
+   * spill and output batch.
+   * The amount of memory set aside to hold the incoming
+   * batches before spilling starts.
+   *
+   * Under normal circumstances, the amount of memory available is much
+   * larger than the input, spill or merge batch sizes. The primary question
+   * is to determine how many input batches we can buffer during the load
+   * phase, and how many spill batches we can merge during the merge
+   * phase.
+   *
+   * @param batchSize the overall size of the current batch received from
+   * upstream
+   * @param batchRo
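The message is truncated before the method body. As a hedged sketch only, the update the javadoc describes might look like the following; every name and rounding choice is an assumption derived from the field comments above, not the PR's actual code.

public void updateEstimates(int batchSize, int batchRowCount) {
  // Track the widest average row seen so far across all input batches.
  int batchRowWidth = (int) Math.ceil((double) batchSize / batchRowCount);
  estimatedRowWidth = Math.max(estimatedRowWidth, batchRowWidth);

  // The input batch estimate is the largest batch seen thus far.
  estimatedInputBatchSize = Math.max(estimatedInputBatchSize, batchSize);

  // Rows per spill batch, and the spill batch size those rows imply.
  spillBatchRowCount = Math.max(1, preferredSpillBatchSize / estimatedRowWidth);
  expectedSpillBatchSize = spillBatchRowCount * estimatedRowWidth;
}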
[GitHub] drill pull request #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/808#discussion_r120496924

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortImpl.java ---

@@ -0,0 +1,495 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.xsort.managed;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.ops.OperExecContext;
+import org.apache.drill.exec.physical.impl.spill.RecordBatchSizer;
+import org.apache.drill.exec.physical.impl.xsort.MSortTemplate;
+import org.apache.drill.exec.physical.impl.xsort.managed.BatchGroup.InputBatch;
+import org.apache.drill.exec.physical.impl.xsort.managed.SortMemoryManager.MergeTask;
+import org.apache.drill.exec.record.BatchSchema;
+import org.apache.drill.exec.record.VectorAccessible;
+import org.apache.drill.exec.record.VectorAccessibleUtilities;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+
+/**
+ * Implementation of the external sort which is wrapped into the Drill
+ * "next" protocol by the {@link ExternalSortBatch} class.
+ *
+ * Accepts incoming batches. Sorts each and will spill to disk as needed.
+ * When all input is delivered, can either do an in-memory merge or a
+ * merge from disk. If runs spilled, may have to do one or more "consolidation"
+ * passes to reduce the number of runs to the level that will fit in memory.
+ */
+public class SortImpl {
+  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ExternalSortBatch.class);
+
+  /**
+   * Iterates over the final sorted results. Implemented differently
+   * depending on whether the results are in-memory or spilled to
+   * disk.
+   */
+  public interface SortResults {
+    /**
+     * Container into which results are delivered. May be the
+     * original operator container, or may be a different
+     * one. This is the container that should be sent
+     * downstream. This is a fixed value for all returned
+     * results.
+     * @return
+     */
+    VectorContainer getContainer();
+    boolean next();
+    void close();
+    int getBatchCount();
+    int getRecordCount();
+    SelectionVector2 getSv2();
+    SelectionVector4 getSv4();
+  }
+
+  private final SortConfig config;
+  private final SortMetrics metrics;
+  private final SortMemoryManager memManager;
+  private VectorContainer outputBatch;
+  private OperExecContext context;
+
+  /**
+   * Memory allocator for this operator itself. Incoming batches are
+   * transferred into this allocator. Intermediate batches used during
+   * merge also reside here.
+   */
+  private final BufferAllocator allocator;
+
+  private final SpilledRuns spilledRuns;
+
+  private final BufferedBatches bufferedBatches;
+
+  public SortImpl(OperExecContext opContext, SortConfig sortConfig, SpilledRuns spilledRuns, VectorContainer batch) {
+    this.context = opContext;
+    outputBatch = batch;
+    this.spilledRuns = spilledRuns;
+    allocator = opContext.getAllocator();
+    config = sortConfig;
+    memManager = new SortMemoryManager(config, allocator.getLimit());
+    metrics = new SortMetrics(opContext.getStats());
+    bufferedBatches = new BufferedBatches(opContext);
+
+    // Reset the allocator to allow a 10% safety margin. This is done because
+    // the memory manager will enforce the original limit. Changing the hard
+    // limit will reduce the probability that random chance causes the allocator
+    // to kill the query because of a small, spurious over-allocation.
+
+    allocator.setLim
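The quote is cut off at "allocator.setLim". A plausible completion of the 10% safety margin the comment describes (an assumption, not necessarily the PR's actual line) would be:

allocator.setLimit((long) (allocator.getLimit() * 1.10));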
[GitHub] drill pull request #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/808#discussion_r120497990

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortMemoryManager.java ---

(quotes the same SortMemoryManager.java diff shown above; the message is truncated before the review comment)
[GitHub] drill pull request #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/808#discussion_r120485024

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/ExternalSortBatch.java ---

@@ -346,9 +346,7 @@ public IterOutcome innerNext() {
       if (unionTypeEnabled) {
         this.schema = SchemaUtil.mergeSchemas(schema, incoming.getSchema());
       } else {
-        throw SchemaChangeException.schemaChanged("Schema changes not supported in External Sort. Please enable Union type",
-            schema,
-            incoming.getSchema());
+        throw new SchemaChangeException("Schema changes not supported in External Sort. Please enable Union type");

--- End diff --

Fixed. Converted to a UserException; otherwise the fragment executor wraps the schema change exception in a generic message.
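For reference, a minimal sketch of the conversion described. The message text comes from the original diff; the builder chain is Drill's standard UserException API, though which variant the fix actually uses is an assumption.

import org.apache.drill.common.exceptions.UserException;

// Raise a user-facing error directly so the fragment executor does not
// wrap a SchemaChangeException in a generic message.
throw UserException.unsupportedError()
    .message("Schema changes not supported in External Sort. Please enable Union type.")
    .build(logger);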
[GitHub] drill pull request #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/808#discussion_r120485596

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortImpl.java ---

(quotes the same SortImpl.java diff shown above; the message is truncated before the review comment)
[GitHub] drill pull request #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/808#discussion_r119226233

--- Diff: common/src/main/java/org/apache/drill/test/SecondaryTest.java ---

@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.test;
+
+/**
+ * Label for Drill secondary tests. A secondary test is one that is omitted from
+ * the normal Drill build because:
+ *
+ * It is slow.
+ * It tests particular functionality which need not be tested on every
+ * build.
+ * It is old, but still worth running some times.
+ * It requires specialized setup and/or runs on specific platforms.
+ *
+ * To mark a test as secondary, do either:
+ *
+ * {@literal @}Category(SecondaryTest.class)
+ * class MyTest {
+ *   ...
+ * }
+ *
+ * Or:
+ *
+ * class MyTest {
+ *   {@literal @}Category(SecondaryTest.class)
+ *   public void slowTest() { ... }
+ * }
+ *
+ * Maven is configured as follows:
+ *
+ * <plugin>
+ *   <artifactId>maven-surefire-plugin</artifactId>
+ *   ...
+ *   <configuration>
+ *     ...
+ *     <excludedGroups>org.apache.drill.test.SecondaryTest</excludedGroups>
+ *   </configuration>
+ * </plugin>
+ *
+ * To run the secondary tests (preliminary):
+ *
+ * > mvn surefire:test -Dgroups=org.apache.drill.test.SecondaryTest -DexcludedGroups=
+ *
+ * The above says to run the secondary tests and exclude nothing. The exclusion
+ * is required to override the default exclusions: skip that parameter and Maven will
+ * blindly try to run all tests annotated with the SecondaryTest category except
+ * for those annotated with the SecondaryTest category, which is not very helpful...
+ *
+ * Note that java-exec (only) provides a named execution to run large tests:
+ *
+ * mvn surefire:test@include-large-tests
+ *
+ * However, the above does not work. Nor did it work to include the category in
+ * a profile as described earlier. At present, there is no known way to run just

--- End diff --

Sorry, I'll make the comment clearer. I'm trying to say that we claim to have a way to run secondary (slower) tests separately from the bulk of our unit tests. But when I tried to use it, it turned out that the solution does not work. As a result, I was forced to include certain (slow) tests in the bulk of our unit tests.
[GitHub] drill pull request #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/808#discussion_r120498177

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortMemoryManager.java ---

(quotes the same SortMemoryManager.java diff shown above; the message is truncated before the review comment)
[GitHub] drill pull request #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/808#discussion_r120496720

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortImpl.java ---

(quotes the same SortImpl.java diff shown above; the message is truncated before the review comment)
[GitHub] drill pull request #808: DRILL-5325: Unit tests for the managed sort
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/808#discussion_r120497564

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortMemoryManager.java ---

@@ -0,0 +1,506 @@
(license header elided; same as quoted earlier)
+public class SortMemoryManager {

--- End diff --

Not sure how transferable the logic here is. This is, essentially, a mathematical model of how memory is used by the sort/merge algorithm. Each operator needs its own model, else we're back to using made-up logic that does not map to the actual code behavior.

For the most part, the numbers and algorithms are no longer "fuzzy." Instead, they reflect actual code behavior as determined first by inspection, then by relentless QA testing. Any places where the estimates don't match reality have resulted in bugs filed against the sort.
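As a small illustration of the decisions such a model supports, sketched under assumed names taken from the field comments quoted earlier in this thread:

// Load phase: spill once accepting the incoming batch would push buffered
// input past the budget reserved for buffering.
public boolean isSpillNeeded(long allocatedMemory, int incomingBatchSize) {
  return allocatedMemory + incomingBatchSize >= bufferMemoryLimit;
}

// Merge phase: how many spilled runs can be merged at once within the
// merge memory budget. The floor of 2 is an illustrative assumption.
public int mergeWidth() {
  return (int) Math.max(2, mergeMemoryLimit / expectedSpillBatchSize);
}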
[GitHub] drill pull request #851: DRILL-5518: Test framework enhancements
GitHub user paul-rogers opened a pull request: https://github.com/apache/drill/pull/851

DRILL-5518: Test framework enhancements

* Create a SubOperatorTest base class to do routine setup and shutdown.
* Additional methods to simplify creating complex schemas with field widths.
* Define a test workspace with plugin-specific options (as for the CSV storage plugin).
* When verifying row sets, add methods to verify and release just the "actual" batch, in addition to the existing method that verifies and frees both the actual and expected batches.
* Allow reading of row set values as objects for generic comparisons.
* "Column builder" within the schema builder to simplify building a single MaterializedField for tests.
* Misc. code cleanup.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5518

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/851.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #851

commit e95ed08915fa04dfdc1977ce899ef34114c96ff1
Author: Paul Rogers
Date: 2017-05-16T22:55:41Z

    DRILL-5518: Test framework enhancements (commit message repeats the summary above)
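A hedged sketch of what the first item, the SubOperatorTest base class, might look like. OperatorFixture, its standardFixture() factory, and the DrillTest parent are assumptions based on the description, not confirmed contents of the PR.

import org.junit.AfterClass;
import org.junit.BeforeClass;

public class SubOperatorTest extends DrillTest {
  protected static OperatorFixture fixture;

  // Routine setup: build one operator-level test fixture per test class.
  @BeforeClass
  public static void classSetup() throws Exception {
    fixture = OperatorFixture.standardFixture();
  }

  // Routine shutdown: release the fixture's allocator and services.
  @AfterClass
  public static void classTeardown() throws Exception {
    fixture.close();
  }
}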
[GitHub] drill pull request #847: Drill 5545: Update POM to add support for running f...
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/847#discussion_r120474915

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java ---

@@ -462,7 +431,10 @@ public Void call() throws IOException {
       readStatus.setDiskScanTime(timeToRead);
       assert (totalValuesRead <= totalValuesCount);
     }
-    synchronized (queue) {
+    // FindBugs reports this is a possible bug, but it is not. You do need the synchronized block

--- End diff --

I'd initially intended to suppress the warning, but then got rid of it by synchronizing on a different object. I also fixed the comment so it no longer refers to FindBugs.
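The fix described, synchronizing on a dedicated object rather than on the queue itself, follows a standard Java pattern. A self-contained sketch with illustrative names (not the PR's actual code):

import java.util.ArrayDeque;
import java.util.Queue;

class PageQueue<T> {
  private final Queue<T> queue = new ArrayDeque<>();
  // Dedicated lock: provides the same mutual exclusion as locking the queue,
  // but avoids static-analysis warnings about locking a field that other
  // code might reassign or expose.
  private final Object queueLock = new Object();

  void put(T readStatus) {
    synchronized (queueLock) {
      queue.add(readStatus);
      queueLock.notifyAll(); // wake any consumer blocked in take()
    }
  }

  T take() throws InterruptedException {
    synchronized (queueLock) {
      while (queue.isEmpty()) {
        queueLock.wait();
      }
      return queue.remove();
    }
  }
}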
[GitHub] drill pull request #846: DRILL-5544: Out of heap running CTAS against text d...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/846#discussion_r120453820

--- Diff: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java ---

@@ -0,0 +1,269 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop;
+
+import static org.apache.parquet.column.statistics.Statistics.getStatsBasedOnType;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Sets;
+import org.apache.drill.exec.store.parquet.ParquetDirectByteBufferAllocator;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.bytes.CapacityByteArrayOutputStream;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.Encoding;
+import org.apache.parquet.column.page.DictionaryPage;
+import org.apache.parquet.column.page.PageWriteStore;
+import org.apache.parquet.column.page.PageWriter;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.CodecFactory.BytesCompressor;
+import org.apache.parquet.io.ParquetEncodingException;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.bytes.ByteBufferAllocator;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This is a copy of ColumnChunkPageWriteStore from the parquet library, except for the OutputStream used here.
+ * Using CapacityByteArrayOutputStream allows the use of different ByteBuffer allocators.

--- End diff --

I'm not sure that 20 GB is "OK" to write to a file, whether on the heap or otherwise. But if we accept the poor design that needs this much, then I agree it is better to have uncontrolled use of direct memory than of the heap. In either case, though, uncontrolled memory use is a very bad idea.

Note that the only buffering needed is 4K: the disk block size. Experiments with spilling show that there is zero benefit to buffering above 32K. So, for 10 runs * 32K = 320K. This is FAR less than the 20 GB proposed, and plenty for the heap, which avoids the cost of copying data into, then out of, direct memory.

Further, what is wrong with Java's own BufferedOutputStream? Was this tried? It works well for simple cases; I'm not sure about more complex ones.

Is the problem that Parquet must somehow buffer an entire page so it can "backpatch" the first bytes after writing all the rest? If so, that part of the Parquet design is incompatible with a write-once file system such as HDFS: it forces unnecessary load on the memory system.

(Note, however, that an updatable file system, such as Linux or MFS, would allow us to write data to disk, then go back and write the header, without buffering.)
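The buffering suggested here needs only standard JDK classes. A minimal sketch using the 32K figure cited above; the class name and path handling are placeholders:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class SpillStreams {
  private static final int BUF_SIZE = 32 * 1024; // cited sweet spot for spill I/O

  public static OutputStream openSpillFile(String path) throws IOException {
    // One 32K buffer per run keeps memory at runs * 32K, e.g. 10 * 32K = 320K.
    return new BufferedOutputStream(new FileOutputStream(path), BUF_SIZE);
  }
}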
[jira] [Created] (DRILL-5571) Unable to cancel running queries from Web UI
Kedar Sankar Behera created DRILL-5571:
--
Summary: Unable to cancel running queries from Web UI
Key: DRILL-5571
URL: https://issues.apache.org/jira/browse/DRILL-5571
Project: Apache Drill
Issue Type: Bug
Components: Client - HTTP
Affects Versions: 1.11.0
Reporter: Kedar Sankar Behera

We are unable to access profiles of some running queries. Hit the following error on the Web UI:
{code}
{
  "errorMessage" : "VALIDATION ERROR: No profile with given query id '26c90b95-928b-15e3-bedc-bfb4a046cc8b' exists. Please verify the query id.\n\n\n[Error Id: e6896a23-6932-469d-9968-d315fdd06dd4 ]"
}
{code}
And we cannot cancel the running queries whose profile page can be accessed:
{code}
Failure attempting to cancel query 26c90b33-cf7e-0495-8f76-55220f71f809. Unable to find information about where query is actively running.
{code}
[GitHub] drill pull request #846: DRILL-5544: Out of heap running CTAS against text d...
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/846#discussion_r119834619

--- Diff: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java ---

@@ -0,0 +1,269 @@
+/*

--- End diff --

See my answers below regarding the parquet library's use of memory.
[GitHub] drill pull request #846: DRILL-5544: Out of heap running CTAS against text d...
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/846#discussion_r119835458

--- Diff: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java ---

(quotes the same file header and class javadoc as above)

--- End diff --

To avoid frequent I/O writes to disk, the parquet library buffers every page in an `OutputStream` and writes the data only when the `ColumnWriteStore` size reaches the block size [[link](https://github.com/apache/drill/blob/874bf6296dcd1a42c7cf7f097c1a6b5458010cbb/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java#L288)]. With a large block size and a big number of runs, a lot of memory can be used before the data is flushed. Using ByteBuffers lets us choose which type of memory to use. `ColumnChunkPageWriter` used `CapacityByteArrayOutputStream` instead of `ByteArrayOutputStream` and `ConcatenatingByteArrayCollector` before the [PARQUET-160 fix](https://github.com/apache/parquet-mr/pull/98/commits/8b54667650873c03ea66721d0f06bfad0b968f19). Note that `ByteBufferAllocator` was introduced in PARQUET-77 but was never used in `ColumnChunkPageWriter`. See the explanation of why direct memory is more suitable than heap space for this buffering in the next comment.
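Concretely, the wiring described above might look like the following sketch. The CapacityByteArrayOutputStream constructor (initial slab size, max capacity hint, allocator) is the Parquet API named in this thread; the sizes and factory shape are assumptions.

import org.apache.drill.exec.memory.BufferAllocator;
import org.apache.drill.exec.store.parquet.ParquetDirectByteBufferAllocator;
import org.apache.parquet.bytes.CapacityByteArrayOutputStream;

class PageBuffers {
  // Buffer pages in Drill-managed direct memory instead of the JVM heap.
  static CapacityByteArrayOutputStream newPageBuffer(BufferAllocator drillAllocator) {
    int initialSlabSize = 64 * 1024;          // placeholder
    int maxCapacityHint = 512 * 1024 * 1024;  // roughly the default Parquet block size
    return new CapacityByteArrayOutputStream(initialSlabSize, maxCapacityHint,
        new ParquetDirectByteBufferAllocator(drillAllocator));
  }
}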
[GitHub] drill pull request #846: DRILL-5544: Out of heap running CTAS against text d...
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/846#discussion_r119836892

--- Diff: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java ---

(license header and imports elided; same as quoted above)

+/**
+ * This is a copy of ColumnChunkPageWriteStore from the parquet library, except for the OutputStream used here.
+ * Using CapacityByteArrayOutputStream allows the use of different ByteBuffer allocators.
+ * This class will no longer be needed once PARQUET-1006 is resolved.
+ */
+public class ParquetColumnChunkPageWriteStore implements PageWriteStore, Closeable {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ParquetDirectByteBufferAllocator.class);
+
+  private static ParquetMetadataConverter parquetMetadataConverter = new ParquetMetadataConverter();
+
+  private final Map<ColumnDescriptor, ColumnChunkPageWriter> writers = Maps.newHashMap();
+  private final MessageType schema;
+
+  public ParquetColumnChunkPageWriteStore(BytesCompressor compressor,
+                                          MessageType schema,
+                                          int initialSlabSize,
+                                          int maxCapacityHint,
+                                          ByteBufferAllocator allocator) {
+    this.schema = schema;
+    for (ColumnDescriptor path : schema.getColumns()) {
+      writers.put(path, new ColumnChunkPageWriter(path, compressor, initialSlabSize, maxCapacityHint, allocator));
+    }
+  }
+
+  @Override
+  public PageWriter getPageWriter(ColumnDescriptor path) {
+    return writers.get(path);
+  }
+
+  public void flushToFileWriter(ParquetFileWriter writer) throws IOException {
+    for (ColumnDescriptor path : schema.getColumns()) {
+      ColumnChunkPageWriter pageWriter = writers.get(path);
+      pageWriter.writeToFileWriter(writer);
+    }
+  }
+
+  @Override
+  public void close() {
+    for (ColumnChunkPageWriter pageWriter : writers.values()) {
+      pageWriter.close();
+    }
+  }
+
+  private static final class ColumnChunkPageWriter implements PageWriter, Closeable {
+
+    private final ColumnDescriptor path;
+    private final BytesCompressor compressor;
+
+    private final CapacityByteArrayOutputStream buf;
+    private DictionaryPage dictionaryPage;
+
+    private long uncompressedLength;
+    private long compressedLength;
+    private long totalValueCount;
+    private int pageCount;
+
+    // repetition and definition level encodings are used only for v1 pages a
[GitHub] drill pull request #846: DRILL-5544: Out of heap running CTAS against text d...
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/846#discussion_r119836279

--- Diff: exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java ---

(quotes the same file header and class javadoc as above)

--- End diff --

Your first case is correct: 10 runs * 512 MB (the default parquet block size in Drill) = 5 GB, and for 40 runs we need 20 GB of memory. That is too much for heap memory, but acceptable for direct memory. And we can control how much data is buffered before writing to disk with the `store.parquet.block-size` option.
[GitHub] drill pull request #846: DRILL-5544: Out of heap running CTAS against text d...
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/846#discussion_r119833975

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java ---
@@ -268,8 +278,7 @@ private void flush() throws IOException {
     }
     store.close();
-    // TODO(jaltekruse) - review this close method should no longer be necessary
-    //ColumnChunkPageWriteStoreExposer.close(pageStore);
+    pageStore.close();
--- End diff --

Agree. Fixed
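Since the new ParquetColumnChunkPageWriteStore implements Closeable (see the class above), the same cleanup could also be written with try-with-resources so the buffers are released even when flushing fails. A minimal sketch, assuming the construction shown earlier; this is an illustration, not the actual ParquetRecordWriter code:

{noformat}
// Hypothetical variant of the flush/cleanup path that relies on
// Closeable directly instead of the removed Exposer helper.
try (ParquetColumnChunkPageWriteStore pageStore =
         new ParquetColumnChunkPageWriteStore(compressor, schema,
             initialSlabSize, maxCapacityHint, allocator)) {
  // ... route column pages through pageStore while writing the row group ...
  pageStore.flushToFileWriter(parquetFileWriter);
}  // close() always runs here, releasing the direct-memory slabs
{noformat}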
[jira] [Created] (DRILL-5569) NullPointerException
Khurram Faraaz created DRILL-5569:
-------------------------------------

Summary: NullPointerException
Key: DRILL-5569
URL: https://issues.apache.org/jira/browse/DRILL-5569
Project: Apache Drill
Issue Type: Bug
Components: Storage - Parquet
Affects Versions: 1.11.0
Reporter: Khurram Faraaz

The exception below was seen when TPC-DS query 4 was executed against Drill 1.11.0.

Drill 1.11.0 git commit ID: d11aba2

[root@centos-01 mapr]# cat MapRBuildVersion
5.2.1.42646.GA

Stack trace from drillbit.log
{noformat}
2017-06-06 07:46:43,160 [Drillbit-ShutdownHook#0] WARN o.apache.drill.exec.work.WorkManager - Closing WorkManager but there are 80 running fragments.
2017-06-06 07:46:43,207 [Drillbit-ShutdownHook#0] INFO o.a.drill.exec.compile.CodeCompiler - Stats: code gen count: 959, cache miss count: 12, hit rate: 99%
2017-06-06 07:46:43,504 [scan-3] ERROR o.a.d.e.u.f.BufferedDirectBufInputStream - Error reading from stream 1_1_0.parquet. Error was : Error reading out of an FSDataInputStream using the Hadoop 2 ByteBuffer based read method.
2017-06-06 07:46:43,510 [scan-8] ERROR o.a.d.e.u.f.BufferedDirectBufInputStream - Error reading from stream 1_1_0.parquet. Error was : Error reading out of an FSDataInputStream using the Hadoop 2 ByteBuffer based read method.
2017-06-06 07:46:43,514 [scan-8] INFO o.a.d.e.s.p.c.AsyncPageReader - User Error Occurred: Exception occurred while reading from disk. (java.io.IOException: Error reading out of an FSDataInputStream using the Hadoop 2 ByteBuffer based read method.)
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: Exception occurred while reading from disk.

File: /drill/testdata/tpcds_sf1/parquet/store_sales/1_1_0.parquet
Column: ss_ext_list_price
Row Group Start: 75660513

[Error Id: 3a758095-fcc4-4364-a50b-33a027c1beb6 ]
    at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544) ~[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.handleAndThrowException(AsyncPageReader.java:199) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.access$600(AsyncPageReader.java:81) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader$AsyncPageReaderTask.call(AsyncPageReader.java:483) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader$AsyncPageReaderTask.call(AsyncPageReader.java:392) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_65]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_65]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_65]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_65]
Caused by: java.io.IOException: java.io.IOException: Error reading out of an FSDataInputStream using the Hadoop 2 ByteBuffer based read method.
    at org.apache.drill.exec.util.filereader.BufferedDirectBufInputStream.getNextBlock(BufferedDirectBufInputStream.java:185) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.util.filereader.BufferedDirectBufInputStream.readInternal(BufferedDirectBufInputStream.java:212) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.util.filereader.BufferedDirectBufInputStream.read(BufferedDirectBufInputStream.java:277) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.util.filereader.DirectBufInputStream.getNext(DirectBufInputStream.java:111) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader$AsyncPageReaderTask.call(AsyncPageReader.java:437) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    ... 5 common frames omitted
Caused by: java.io.IOException: Error reading out of an FSDataInputStream using the Hadoop 2 ByteBuffer based read method.
    at org.apache.parquet.hadoop.util.CompatibilityUtil.getBuf(CompatibilityUtil.java:99) ~[parquet-hadoop-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.drill.exec.util.filereader.BufferedDirectBufInputStream.getNextBlock(BufferedDirectBufInputStream.java:182) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    ... 9 common frames omitted
Caused by: java.lang.NullPointerException: null
    at com.mapr.fs.MapRFsInStream.readIntoDirectByteBuffer(MapRFsInStream.java:219) ~[maprfs-5.2.1-mapr.jar:5.2.1-mapr]
    at com.mapr.fs.MapRFsInStream.read(MapRFsInStream.java:333) ~[maprfs-5.
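The failing call chain is Drill asking Hadoop's FSDataInputStream for a ByteBuffer-based read (filling a direct buffer without a heap copy), which the MapR client NPEs on. A minimal sketch of that read pattern with a defensive fallback to the plain byte[] API; this illustrates the general pattern only and is not Drill's actual CompatibilityUtil code:

{noformat}
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.ByteBufferReadable;
import org.apache.hadoop.fs.FSDataInputStream;

public class DirectReadSketch {
  // Reads into 'buf' (typically a direct ByteBuffer) when the wrapped
  // stream supports the Hadoop 2 ByteBuffer read path; otherwise reads
  // into a heap array and copies. Returns bytes read, or -1 at EOF.
  static int readBlock(FSDataInputStream in, ByteBuffer buf) throws IOException {
    if (in.getWrappedStream() instanceof ByteBufferReadable) {
      // The path that fails inside the MapR client in DRILL-5569.
      return in.read(buf);
    }
    // Fallback: plain byte[] read plus a copy into the buffer.
    byte[] heap = new byte[buf.remaining()];
    int n = in.read(heap, 0, heap.length);
    if (n > 0) {
      buf.put(heap, 0, n);
    }
    return n;
  }
}
{noformat}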
[jira] [Created] (DRILL-5570) InterruptedException: null
Khurram Faraaz created DRILL-5570:
-------------------------------------

Summary: InterruptedException: null
Key: DRILL-5570
URL: https://issues.apache.org/jira/browse/DRILL-5570
Project: Apache Drill
Issue Type: Bug
Components: Execution - Flow
Affects Versions: 1.11.0
Environment: 3 node CentOS cluster
Reporter: Khurram Faraaz

When TPC-DS query 11 was executed concurrently and one of the non-foreman Drillbits was stopped (./bin/drillbit.sh stop), the system error InterruptedException below appeared in the drillbit.log of the non-foreman node.

Drill 1.11.0 git commit ID: d11aba2

[root@centos-01 mapr]# cat MapRBuildVersion
5.2.1.42646.GA

{noformat}
2017-06-06 07:46:44,288 [26c9a242-dfa1-35be-b5f1-ff6b4fa66086:frag:11:0] ERROR o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: InterruptedException

Fragment 11:0

[Error Id: 40723399-8983-4777-a2bb-dc9d55ae338e on centos-02.qa.lab:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: InterruptedException

Fragment 11:0

[Error Id: 40723399-8983-4777-a2bb-dc9d55ae338e on centos-02.qa.lab:31010]
    at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544) ~[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:295) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:264) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_65]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_65]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_65]
Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Interrupted but context.shouldContinue() is true
    at org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:178) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.getNextBatch(UnorderedReceiverBatch.java:141) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.next(UnorderedReceiverBatch.java:159) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:215) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:144) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_65]
    at javax.security.auth.Subject.doAs(Subject.java:422) ~[na:1.8.0_65]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595) ~[hadoop-common-2.7.0-mapr-1607.jar:na]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    ... 4 common frames omitted
Caused by: java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) ~[na:1.8.0_65]
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) ~[na:1.8.0_65]
    at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) ~[na:1.8.0_65]
    at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) ~[na:1.8.0_65]
    at org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(Un
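The root-cause chain is a blocking LinkedBlockingDeque.take() interrupted while the fragment context still reports that work should continue. A minimal sketch of that pattern under illustrative names; the queue and flag here stand in for Drill's BaseRawBatchBuffer and FragmentContext.shouldContinue() and are not the actual implementation:

{noformat}
import java.util.concurrent.LinkedBlockingDeque;

public class BatchBufferSketch<T> {
  private final LinkedBlockingDeque<T> queue = new LinkedBlockingDeque<>();

  // Stand-in for FragmentContext.shouldContinue().
  private volatile boolean shouldContinue = true;

  // Blocks for the next batch; mirrors the failure mode in DRILL-5570.
  public T getNext() {
    try {
      return queue.take();  // where the InterruptedException in the trace originates
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();  // preserve the interrupt status
      if (shouldContinue) {
        // An interrupt arrived even though the context was never told to
        // stop: the "Interrupted but context.shouldContinue() is true"
        // error seen in the log.
        throw new RuntimeException("Interrupted but shouldContinue is true", e);
      }
      return null;  // orderly shutdown: caller treats null as end-of-stream
    }
  }
}
{noformat}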
[jira] [Resolved] (DRILL-5164) Equi-join query results in CompileException when inputs have large number of columns
[ https://issues.apache.org/jira/browse/DRILL-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Volodymyr Vysotskyi resolved DRILL-5164.
----------------------------------------
Resolution: Fixed

Fixed in [b14e30b|https://github.com/apache/drill/commit/b14e30b3df9803fef418d083837c91d57d7a5fe3]

> Equi-join query results in CompileException when inputs have large number of
> columns
>
>                 Key: DRILL-5164
>                 URL: https://issues.apache.org/jira/browse/DRILL-5164
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen
>    Affects Versions: 1.9.0
>            Reporter: Khurram Faraaz
>            Assignee: Volodymyr Vysotskyi
>            Priority: Critical
>             Fix For: 1.11.0
>
>         Attachments: manyColsInJson.json
>
>
> Drill 1.9.0
> git commit ID : 4c1b420b
> 4 node CentOS cluster
> JSON file has 4095 keys (columns)
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select * from `manyColsInJson.json` t1,
> `manyColsInJson.json` t2 where t1.key2000 = t2.key2000;
> Error: SYSTEM ERROR: CompileException: File
> 'org.apache.drill.exec.compile.DrillJavaFileObject[HashJoinProbeGen294.java]',
> Line 16397, Column 17: HashJoinProbeGen294.java:16397: error: code too large
> public void doSetup(FragmentContext context, VectorContainer buildBatch,
> RecordBatch probeBatch, RecordBatch outgoing)
> ^ (compiler.err.limit.code)
> Fragment 0:0
> [Error Id: 7d0efa7e-e183-4c40-939a-4908699f94bf on centos-01.qa.lab:31010]
> (state=,code=0)
> {noformat}
> Stack trace from drillbit.log
> {noformat}
> 2016-12-26 09:52:11,321 [279f17fd-c8f0-5d18-1124-76099f0a5cc8:frag:0:0] ERROR
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: CompileException: File
> 'org.apache.drill.exec.compile.DrillJavaFileObject[HashJoinProbeGen294.java]',
> Line 16397, Column 17: HashJoinProbeGen294.java:16397: error: code too large
> public void doSetup(FragmentContext context, VectorContainer buildBatch,
> RecordBatch probeBatch, RecordBatch outgoing)
> ^ (compiler.err.limit.code)
> Fragment 0:0
> [Error Id: 7d0efa7e-e183-4c40-939a-4908699f94bf on centos-01.qa.lab:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> CompileException: File
> 'org.apache.drill.exec.compile.DrillJavaFileObject[HashJoinProbeGen294.java]',
> Line 16397, Column 17: HashJoinProbeGen294.java:16397: error: code too large
> public void doSetup(FragmentContext context, VectorContainer buildBatch,
> RecordBatch probeBatch, RecordBatch outgoing)
> ^ (compiler.err.limit.code)
> Fragment 0:0
> [Error Id: 7d0efa7e-e183-4c40-939a-4908699f94bf on centos-01.qa.lab:31010]
> at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:543) ~[drill-common-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:293) [drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160) [drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:262) [drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.9.0.jar:1.9.0]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException:
> org.apache.drill.exec.exception.SchemaChangeException:
> org.apache.drill.exec.exception.ClassTransformationException:
> java.util.concurrent.ExecutionException:
> org.apache.drill.exec.exception.ClassTransformationException: Failure
> generating transformation classes for value:
> package org.apache.drill.exec.test.generated;
> ...
> public class HashJoinProbeGen294 {
> NullableVarCharVector[] vv0;
> NullableVarCharVector vv3;
> NullableVarCharVector[] vv6;
> ...
> vv49137 .copyFromSafe((probeIndex), (outIndex), vv49134);
> vv49143 .copyFromSafe((probeIndex), (outIndex), vv49140);
> vv49149 .copyFromSafe((probeIndex), (outIndex), vv49146);
> }
> }
>
> public void __DRILL_INIT__()
> throws SchemaChangeException
> {
> }
> }
> at org.apache.drill.exec.compile.ClassTransformer.getImplementationClass(ClassTransformer.java:302) ~[drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.compile.CodeCompiler$Loader.load(CodeCompiler.java:78) ~[drill-java-ex
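The "code too large" error comes from the JVM's hard limit of 64KB of bytecode per method: with 4095 columns on each side of the join, the generated doSetup() holds tens of thousands of straight-line statements and cannot compile. One common codegen remedy is to split that work across many small methods behind a thin dispatcher; the sketch below shows the shape of such generated code with invented identifiers, and the referenced commit b14e30b should be consulted for what was actually done:

{noformat}
// Schematic shape of generated code that stays under the JVM's
// 64KB-per-method bytecode limit: instead of one method containing
// thousands of vv.copyFromSafe(...) statements, the generator emits
// bounded helper methods plus a tiny dispatcher.
// All identifiers here are invented for illustration.
public class HashJoinProbeGenSketch {

  public void doCopy(int probeIndex, int outIndex) {
    // The dispatcher compiles to just a handful of call instructions.
    doCopy0(probeIndex, outIndex);
    doCopy1(probeIndex, outIndex);
    // ... one helper per fixed-size block of columns ...
  }

  private void doCopy0(int probeIndex, int outIndex) {
    // Columns 0..99 as straight-line generated statements, e.g.:
    // vv0.copyFromSafe(probeIndex, outIndex, vvIn0);
    // vv3.copyFromSafe(probeIndex, outIndex, vvIn3);
  }

  private void doCopy1(int probeIndex, int outIndex) {
    // Columns 100..199, and so on.
  }
}
{noformat}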