[ https://issues.apache.org/jira/browse/DRILL-5080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847987#comment-15847987 ]
ASF GitHub Bot commented on DRILL-5080:
---------------------------------------

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/717#discussion_r98804964

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/ExternalSortBatch.java ---
@@ -0,0 +1,1321 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.xsort.managed;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.LinkedList;
+import java.util.List;
+
+import org.apache.drill.common.AutoCloseables;
+import org.apache.drill.common.config.DrillConfig;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.MetricDef;
+import org.apache.drill.exec.physical.config.ExternalSort;
+import org.apache.drill.exec.physical.impl.sort.RecordBatchData;
+import org.apache.drill.exec.physical.impl.xsort.MSortTemplate;
+import org.apache.drill.exec.physical.impl.xsort.SingleBatchSorter;
+import org.apache.drill.exec.physical.impl.xsort.managed.BatchGroup.InputBatch;
+import org.apache.drill.exec.record.AbstractRecordBatch;
+import org.apache.drill.exec.record.BatchSchema;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.SchemaUtil;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.record.WritableBatch;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.testing.ControlsInjector;
+import org.apache.drill.exec.testing.ControlsInjectorFactory;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.AbstractContainerVector;
+
+import com.google.common.collect.Lists;
+
+/**
+ * External sort batch: a sort batch which can spill to disk in
+ * order to operate within a defined memory footprint.
+ * <p>
+ * <h4>Basic Operation</h4>
+ * The operator has three key phases:
+ * <p>
+ * <ul>
+ * <li>The load phase in which batches are read from upstream.</li>
+ * <li>The merge phase in which spilled batches are combined to
+ * reduce the number of files below the configured limit. (Best
+ * practice is to configure the system to avoid this phase.)
+ * <li>The delivery phase in which batches are combined to produce
+ * the final output.</li>
+ * </ul>
+ * During the load phase:
+ * <p>
+ * <ul>
+ * <li>The incoming (upstream) operator provides a series of batches.</li>
+ * <li>This operator sorts each batch, and accumulates them in an in-memory
+ * buffer.</li>
+ * <li>If the in-memory buffer becomes too large, this operator selects
+ * a subset of the buffered batches to spill.</li>
+ * <li>Each spill set is merged to create a new, sorted collection of
+ * batches, and each is spilled to disk.</li>
+ * <li>To allow the use of multiple disk storage, each spill group is written
+ * round-robin to a set of spill directories.</li>
+ * </ul>
+ * <p>
+ * During the sort/merge phase:
+ * <p>
+ * <ul>
+ * <li>When the input operator is complete, this operator merges the accumulated
+ * batches (which may be all in memory or partially on disk), and returns
+ * them to the output (downstream) operator in chunks of no more than
+ * 32K records.</li>
--- End diff --

Isn't it 64K?
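The arithmetic behind the 64K figure can be sketched as below. This is only an illustration, not Drill code: the class and method names are hypothetical, and it assumes the 2-byte SV2 entry is read as an unsigned value, as the RecordBatch comment implies. A signed 2-byte index would give the 32K in the Javadoc; an unsigned one gives 64K.

```java
// Hypothetical illustration only; not part of Drill.
final class Sv2Limit {

    // Distinct values of an unsigned 2-byte index: 2^16.
    // Java's char is an unsigned 16-bit type, so its max value is 65535.
    static int unsignedTwoByteRange() {
        return Character.MAX_VALUE + 1;
    }

    // Non-negative values of a signed 2-byte index: 2^15.
    static int signedTwoByteRange() {
        return Short.MAX_VALUE + 1;
    }

    public static void main(String[] args) {
        System.out.println(unsignedTwoByteRange()); // prints 65536
        System.out.println(signedTwoByteRange());   // prints 32768
    }
}
```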
See exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java:
<code>/** max batch size, limited by 2-byte length in SV2: 65536 = 2^16 */</code>
<code>public static final int MAX_BATCH_SIZE = 65536;</code>

> Create a memory-managed version of the External Sort operator
> -------------------------------------------------------------
>
> Key: DRILL-5080
> URL: https://issues.apache.org/jira/browse/DRILL-5080
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.8.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Fix For: 1.10
>
> Attachments: ManagedExternalSortDesign.pdf
>
>
> We propose to create a "managed" version of the external sort operator that
> works to a clearly-defined memory limit. Attached is a design specification
> for the work.
> The project will include fixing a number of bugs related to the external
> sort, included as sub-tasks of this umbrella task.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)