[jira] [Commented] (DRILL-5542) Scan unnecessarily adds implicit columns to ScanRecordBatch for select * query
[ https://issues.apache.org/jira/browse/DRILL-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025745#comment-16025745 ] Paul Rogers commented on DRILL-5542:

Thanks for tracking this down! I wonder: how does the downstream operator know to remove the implicit columns? Nothing in the column name or (it seems) the physical plan identifies those columns as implicit. In the CSV example, say, how would the downstream operator know that "columns" is OK, but "fqn" is not? Is this hard-coded somewhere? If it is hard-coded, how does it know to pass along "fqn" when it is requested? In any event, for the readers that DRILL-5211 touches, I will address the issue in the revised scan batch code. The rest will need attention from others.

> Scan unnecessarily adds implicit columns to ScanRecordBatch for select * query
>
> Key: DRILL-5542
> URL: https://issues.apache.org/jira/browse/DRILL-5542
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Reporter: Jinfeng Ni
>
> It seems that Drill adds several implicit columns (`fqn`, `filepath`,
> `filename`, `suffix`) to ScanBatch even when they are not required by any
> downstream operator. Although those implicit columns are dropped off later
> on, they increase both memory and CPU overhead.
>
> 1. JSON
> {code}
> {a: 100}
> {code}
> {code}
> select * from dfs.tmp.`1.json`;
> +------+
> |  a   |
> +------+
> | 100  |
> +------+
> {code}
> The schema from ScanRecordBatch is:
> {code}
> schema:
> BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL),
> filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)],
> selectionVector=NONE],
> {code}
>
> 2. Parquet
> {code}
> select * from cp.`tpch/nation.parquet`;
> +--------------+----------+--------------+------------------------------------------------------+
> | n_nationkey  | n_name   | n_regionkey  | n_comment                                            |
> +--------------+----------+--------------+------------------------------------------------------+
> | 0            | ALGERIA  | 0            | haggle. carefully final deposits detect slyly agai   |
> ...
> {code}
> The schema of ScanRecordBatch:
> {code}
> schema:
> BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED),
> n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED),
> fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL),
> filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
> {code}
>
> 3. Text
> {code}
> cat 1.csv
> a, b, c
> select * from dfs.tmp.`1.csv`;
> +----------------+
> |    columns     |
> +----------------+
> | ["a","b","c"]  |
> +----------------+
> {code}
> Schema of ScanRecordBatch:
> {code}
> schema:
> BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)],
> fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL),
> filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
> {code}
> If implicit columns are not part of the query result of a `select *` query,
> then the Scan operator should not populate those implicit columns.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
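The fix this issue asks for amounts to a guard in the scan's column-population step: materialize an implicit column only when the projection list explicitly names it, so a `select *` query never pays for them. A minimal, self-contained sketch of that guard (the class and method names below are illustrative, not Drill's actual API):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: populate implicit file columns solely when
// the projection list names them, never for a plain "select *".
public class ImplicitColumnFilter {
    // The four implicit file columns the issue describes.
    static final List<String> IMPLICIT =
        Arrays.asList("fqn", "filepath", "filename", "suffix");

    /**
     * Return only the implicit columns that the projection explicitly
     * requested, with their values. A star query requests none of them.
     */
    public static Map<String, String> selectImplicit(List<String> projected,
                                                     Map<String, String> values) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String col : projected) {
            if (IMPLICIT.contains(col)) {
                result.put(col, values.get(col));
            }
        }
        return result;
    }
}
```

With this shape, `select a from ...` adds no implicit columns to the batch, while `select a, fqn from ...` adds exactly the one requested.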
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025664#comment-16025664 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118614384 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapDrillTable.java --- @@ -0,0 +1,73 @@ +/* --- End diff --

It would be very helpful if this PR could include a package-info.java file describing this work. For example: what is pcap? Links to good sources? Which features of Drill does it use (push-downs)? Etc.

> Want a memory format for PCAP files
>
> Key: DRILL-5432
> URL: https://issues.apache.org/jira/browse/DRILL-5432
> Project: Apache Drill
> Issue Type: New Feature
> Reporter: Ted Dunning
>
> PCAP files [1] are the de facto standard for storing network capture data.
> In security and protocol applications, it is very common to want to extract
> particular packets from a capture for further analysis.
> At a first level, it is desirable to query and filter by source and
> destination IP and port, or by protocol. Beyond that, however, it would be
> very useful to be able to group packets by TCP session and eventually to
> look at packet contents. For now, however, the most critical requirement is
> that we should be able to scan captures at very high speed.
> I previously wrote a (kind of working) proof of concept for a PCAP decoder
> that did lazy deserialization and could traverse hundreds of MB of PCAP
> data per second per core. This compares to roughly 2-3 MB/s for widely
> available Apache-compatible open source PCAP decoders.
> This JIRA covers the integration and extension of that proof of concept as
> a Drill file format.
> Initial work is available at https://github.com/mapr-demos/drill-pcap-format
> [1] https://en.wikipedia.org/wiki/Pcap
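For reference, the kind of package-info.java the reviewer is asking for might look roughly like this. This is an illustrative sketch of the requested documentation, not the file that was eventually committed:

```java
/**
 * PCAP format plugin for Apache Drill.
 *
 * <p>PCAP (packet capture) is the de facto file format for network
 * capture data; see https://en.wikipedia.org/wiki/Pcap for background.
 * This package contains a record reader and a lazy packet decoder
 * aimed at scanning captures at very high speed.</p>
 *
 * <p>Note: this sketch of the documentation the review requested is
 * hypothetical; consult the committed source for the actual text.</p>
 */
package org.apache.drill.exec.store.pcap;
```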
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025656#comment-16025656 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118616554 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java ---

@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList<ProjectedColumnInfo> projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map<PcapTypes, MinorType> TYPES;
+
+  private static class ProjectedColumnInfo {
+    ValueVector vv;
+    ColumnDto pcapColumn;
+  }
+
+  static {
+    TYPES = ImmutableMap.<PcapTypes, MinorType>builder()
+        .put(PcapTypes.STRING, MinorType.VARCHAR)
+        .put(PcapTypes.INTEGER, MinorType.INT)
+        .put(PcapTypes.LONG, MinorType.BIGINT)
+        .put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+        .build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+                          final List<SchemaPath> projectedColumns) {
+    try {
+      this.in = new FileInputStream(inputPath);
+      this.decoder = getPacketDecoder();
+      validBytes = in.read(buffer);
+    } catch (IOException e) {
+      throw new RuntimeException("File " + inputPath + " not Found");
+    }
+    setColumns(projectedColumns);
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+    this.output = output;
+  }
+
+  @Override
+  public int next() {
+    projectedCols = getProjectedColsIfItNull();
+    try {
+      return parsePcapFilesAndPutItToTable();
+    } catch (IOException io) {
+      throw new RuntimeException("Trouble with reading packets in file!");
+    }
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025663#comment-16025663 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118619851 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Packet.java ---

@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap.decoder;
+
+import com.google.common.base.Preconditions;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
+
+import static org.apache.drill.exec.store.pcap.Utils.convertInt;
+import static org.apache.drill.exec.store.pcap.Utils.convertShort;
+import static org.apache.drill.exec.store.pcap.Utils.getByte;
+import static org.apache.drill.exec.store.pcap.Utils.getIntFileOrder;
+import static org.apache.drill.exec.store.pcap.Utils.getShort;
+
+public class Packet {
--- End diff --

Would it have been possible to use one of the existing pcap Java libraries here?
Four are listed [here](https://en.wikipedia.org/wiki/Pcap).
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025667#comment-16025667 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118620276 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java ---
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025661#comment-16025661 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118616240 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java ---

+  private byte[] buffer = new byte[10];
--- End diff --

Do you want to do this at construct time?
If you scan 1000 pcap files in a single fragment, Drill will create 1000 record readers at the start of execution. Each will allocate a 100K buffer. You'll have 100MB of heap in buffers, of which only one will ever be used. Suggestion: allocate the buffer in setup, clear it in close, so that only one buffer is used per fragment.
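The lazy-allocation pattern the reviewer suggests can be sketched as follows. The class and method names are illustrative, not Drill's actual reader API; the point is only that the buffer's lifetime is bounded by setup() and close() rather than by the object's:

```java
// Illustrative sketch: defer the read buffer from the constructor to
// setup(), and release it in close(), so that 1000 queued readers do
// not each hold a large heap buffer before they ever run.
public class LazyBufferReader {
    static final int BUFFER_SIZE = 100_000;  // ~100K, per the review comment
    private byte[] buffer;                   // intentionally NOT allocated here

    public void setup() {
        buffer = new byte[BUFFER_SIZE];      // allocate only when execution starts
    }

    public void close() {
        buffer = null;                       // let the GC reclaim it between readers
    }

    public boolean isAllocated() {
        return buffer != null;
    }
}
```

With this shape, at most one reader per fragment holds a live buffer at any time.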
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025665#comment-16025665 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118619502 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java ---
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025658#comment-16025658 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118616406 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java ---

+      this.in = new FileInputStream(inputPath);
--- End diff --

As noted above, by opening the file here, if you are scanning 1000 files, you'll have 1000 open file handles at the start of the fragment. Better to postpone opening files until setup.
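Deferring the open to setup() follows the same pattern as the buffer suggestion above: the constructor records the path but does no I/O, so a fragment scanning many files holds at most one open handle at a time. A sketch with illustrative names (not Drill's actual reader API):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch: remember the path in the constructor, open the
// stream only in setup(), and close it in close().
public class LazyOpenReader implements AutoCloseable {
    private final String inputPath;  // no file handle is taken here
    private InputStream in;

    public LazyOpenReader(String inputPath) {
        this.inputPath = inputPath;  // constructor does no I/O
    }

    public void setup() throws IOException {
        in = new FileInputStream(inputPath);  // open when execution starts
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();   // release the handle as soon as the reader is done
            in = null;
        }
    }
}
```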
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025662#comment-16025662 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118617482 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java ---
private static final Map TYPES; + + private static class ProjectedColumnInfo { +ValueVector vv; +ColumnDto pcapColumn; + } + + static { +TYPES = ImmutableMap.builder() +.put(PcapTypes.STRING, MinorType.VARCHAR) +.put(PcapTypes.INTEGER, MinorType.INT) +.put(PcapTypes.LONG, MinorType.BIGINT) +.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP) +.build(); + } + + public PcapRecordReader(final String inputPath, + final List projectedColumns) { +try { + this.in = new FileInputStream(inputPath); + this.decoder = getPacketDecoder(); + validBytes = in.read(buffer); +} catch (IOException e) { + throw new RuntimeException("File " + inputPath + " not Found"); +} +setColumns(projectedColumns); + } + + @Override + public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException { +this.output = output; + } + + @Override + public int next() { +projectedCols = getProjectedColsIfItNull(); +try { + return parsePcapFilesAndPutItToTable(); --- End diff -- Drill has certain protocols that are not entirely obvious, but that are needed here. Each call to `ne
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025655#comment-16025655 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118615907 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapFormatPlugin.java --- @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to you under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.pcap; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.Lists; +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.logical.StoragePluginConfig; +import org.apache.drill.exec.ops.FragmentContext; +import org.apache.drill.exec.planner.logical.DrillTable; +import org.apache.drill.exec.server.DrillbitContext; +import org.apache.drill.exec.store.RecordReader; +import org.apache.drill.exec.store.RecordWriter; +import org.apache.drill.exec.store.dfs.BasicFormatMatcher; +import org.apache.drill.exec.store.dfs.DrillFileSystem; +import org.apache.drill.exec.store.dfs.FileSelection; +import org.apache.drill.exec.store.dfs.FileSystemPlugin; +import org.apache.drill.exec.store.dfs.FormatMatcher; +import org.apache.drill.exec.store.dfs.FormatSelection; +import org.apache.drill.exec.store.dfs.MagicString; +import org.apache.drill.exec.store.dfs.NamedFormatPluginConfig; +import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin; +import org.apache.drill.exec.store.dfs.easy.EasyWriter; +import org.apache.drill.exec.store.dfs.easy.FileWork; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.Path; + +import java.io.IOException; +import java.util.List; +import java.util.regex.Pattern; + +public class PcapFormatPlugin extends EasyFormatPlugin { + + private final PcapFormatMatcher matcher; + + public PcapFormatPlugin(String name, DrillbitContext context, Configuration fsConf, + StoragePluginConfig storagePluginConfig) { +this(name, context, fsConf, storagePluginConfig, new PcapFormatConfig()); + } + + public PcapFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, PcapFormatConfig formatPluginConfig) { +super(name, context, fsConf, config, formatPluginConfig, true, false, true, false, Lists.newArrayList("pcap"), "pcap"); +this.matcher = 
new PcapFormatMatcher(this); + } + + @Override + public boolean supportsPushDown() { +return true; + } + + @Override + public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork, List columns, String userName) throws ExecutionSetupException { +String path = dfs.makeQualified(new Path(fileWork.getPath())).toUri().getPath(); +return new PcapRecordReader(path, columns); + } + + @Override + public RecordWriter getRecordWriter(FragmentContext context, EasyWriter writer) throws IOException { +return null; + } + + @Override + public int getReaderOperatorType() { +return 0; --- End diff -- Seems awkward, but it seems that other format plugins add a type to a protobuf, then return that here: ``` return CoreOperatorType.JSON_SUB_SCAN_VALUE; ``` And `UserBitShared.proto`: ``` JSON_SUB_SCAN = 29; ``` The next available number is 37. This seems rather brittle. Seems we should have a more general solution. But, until we do, I'd guess you'll need to add the enum value. As an alternative, `SequenceFileFormatPlugin` just makes up a number: ``` public int getReaderOperatorType() { return 4001; } ``` > Want
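Until a more general registry exists, the two workarounds described above look roughly like this sketch. The `PCAP_SUB_SCAN_VALUE` constant and its value are assumptions — 37 was only cited above as the next free number in `UserBitShared.proto`:

```java
// Hypothetical sketch of the two options discussed above. Option 1: add a
// PCAP_SUB_SCAN = 37 entry to the CoreOperatorType enum in UserBitShared.proto
// and return its value here. Option 2 (the SequenceFileFormatPlugin approach):
// return an arbitrary constant chosen to avoid collisions with registered values.
class OperatorTypeSketch {
  // Assumed value; must match whatever enum number is actually reserved.
  static final int PCAP_SUB_SCAN_VALUE = 37;

  public int getReaderOperatorType() {
    return PCAP_SUB_SCAN_VALUE;
  }
}
```

The made-up-number approach avoids touching the protobuf, but risks a silent collision if a later enum value reuses the same integer, which is why adding the enum entry is the safer of the two.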
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025666#comment-16025666 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118618438 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java --- @@ -0,0 +1,295 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to you under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.pcap; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableMap; +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.TypeProtos.MajorType; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.common.types.Types; +import org.apache.drill.exec.exception.SchemaChangeException; +import org.apache.drill.exec.expr.TypeHelper; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.physical.impl.OutputMutator; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.store.AbstractRecordReader; +import org.apache.drill.exec.store.pcap.decoder.Packet; +import org.apache.drill.exec.store.pcap.decoder.PacketDecoder; +import org.apache.drill.exec.store.pcap.dto.ColumnDto; +import org.apache.drill.exec.store.pcap.schema.PcapTypes; +import org.apache.drill.exec.store.pcap.schema.Schema; +import org.apache.drill.exec.vector.NullableBigIntVector; +import org.apache.drill.exec.vector.NullableIntVector; +import org.apache.drill.exec.vector.NullableTimeStampVector; +import org.apache.drill.exec.vector.NullableVarCharVector; +import org.apache.drill.exec.vector.ValueVector; + +import java.io.FileInputStream; +import java.io.IOException; +import java.io.InputStream; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; + +import static java.nio.charset.StandardCharsets.UTF_8; +import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII; + +public class PcapRecordReader extends AbstractRecordReader { + + private OutputMutator output; + + private final PacketDecoder decoder; + private ImmutableList projectedCols; + + private byte[] buffer = new byte[10]; + private int offset = 0; + private InputStream in; + private int validBytes; + + 
private static final Map TYPES; + + private static class ProjectedColumnInfo { +ValueVector vv; +ColumnDto pcapColumn; + } + + static { +TYPES = ImmutableMap.builder() +.put(PcapTypes.STRING, MinorType.VARCHAR) +.put(PcapTypes.INTEGER, MinorType.INT) +.put(PcapTypes.LONG, MinorType.BIGINT) +.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP) +.build(); + } + + public PcapRecordReader(final String inputPath, + final List projectedColumns) { +try { + this.in = new FileInputStream(inputPath); + this.decoder = getPacketDecoder(); + validBytes = in.read(buffer); +} catch (IOException e) { + throw new RuntimeException("File " + inputPath + " not Found"); +} +setColumns(projectedColumns); + } + + @Override + public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException { +this.output = output; + } + + @Override + public int next() { +projectedCols = getProjectedColsIfItNull(); +try { + return parsePcapFilesAndPutItToTable(); +} catch (IOException io) { + throw new RuntimeException("Trouble with reading packets in file!"); +}
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025659#comment-16025659 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118615528 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapFormatPlugin.java --- @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to you under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.pcap; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.Lists; +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.logical.StoragePluginConfig; +import org.apache.drill.exec.ops.FragmentContext; +import org.apache.drill.exec.planner.logical.DrillTable; +import org.apache.drill.exec.server.DrillbitContext; +import org.apache.drill.exec.store.RecordReader; +import org.apache.drill.exec.store.RecordWriter; +import org.apache.drill.exec.store.dfs.BasicFormatMatcher; +import org.apache.drill.exec.store.dfs.DrillFileSystem; +import org.apache.drill.exec.store.dfs.FileSelection; +import org.apache.drill.exec.store.dfs.FileSystemPlugin; +import org.apache.drill.exec.store.dfs.FormatMatcher; +import org.apache.drill.exec.store.dfs.FormatSelection; +import org.apache.drill.exec.store.dfs.MagicString; +import org.apache.drill.exec.store.dfs.NamedFormatPluginConfig; +import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin; +import org.apache.drill.exec.store.dfs.easy.EasyWriter; +import org.apache.drill.exec.store.dfs.easy.FileWork; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.Path; + +import java.io.IOException; +import java.util.List; +import java.util.regex.Pattern; + +public class PcapFormatPlugin extends EasyFormatPlugin { + + private final PcapFormatMatcher matcher; + + public PcapFormatPlugin(String name, DrillbitContext context, Configuration fsConf, + StoragePluginConfig storagePluginConfig) { +this(name, context, fsConf, storagePluginConfig, new PcapFormatConfig()); + } + + public PcapFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, PcapFormatConfig formatPluginConfig) { +super(name, context, fsConf, config, formatPluginConfig, true, false, true, false, Lists.newArrayList("pcap"), "pcap"); +this.matcher = 
new PcapFormatMatcher(this); + } + + @Override + public boolean supportsPushDown() { +return true; + } + + @Override + public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork, List columns, String userName) throws ExecutionSetupException { +String path = dfs.makeQualified(new Path(fileWork.getPath())).toUri().getPath(); +return new PcapRecordReader(path, columns); + } + + @Override + public RecordWriter getRecordWriter(FragmentContext context, EasyWriter writer) throws IOException { +return null; + } + + @Override + public int getReaderOperatorType() { +return 0; + } + + @Override + public int getWriterOperatorType() { +return 0; --- End diff -- Other format plugins do the following when a writer is not supported: ``` throw new UnsupportedOperationException("unimplemented"); ``` > Want a memory format for PCAP files > --- > > Key: DRILL-5432 > URL: https://issues.apache.org/jira/browse/DRILL-5432 > Project: Apache Drill > Issue Type: New Feature >Reporter: Ted Dunning > > PCAP files [1] are the de facto standard for storing network capture data. In >
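The convention mentioned above — failing fast rather than returning `null` or `0` from unsupported writer methods — might look like this hypothetical sketch (simplified signatures, not the real `EasyFormatPlugin` ones):

```java
// Hypothetical sketch: an unsupported writer path should throw immediately,
// matching what other format plugins do, instead of returning null/0 and
// deferring the failure to some later, harder-to-diagnose point.
class NoWriterSketch {
  public Object getRecordWriter() {
    throw new UnsupportedOperationException("unimplemented");
  }

  public int getWriterOperatorType() {
    throw new UnsupportedOperationException("unimplemented");
  }
}
```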
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025660#comment-16025660 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118617811 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java --- @@ -0,0 +1,295 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to you under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.pcap; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableMap; +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.TypeProtos.MajorType; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.common.types.Types; +import org.apache.drill.exec.exception.SchemaChangeException; +import org.apache.drill.exec.expr.TypeHelper; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.physical.impl.OutputMutator; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.store.AbstractRecordReader; +import org.apache.drill.exec.store.pcap.decoder.Packet; +import org.apache.drill.exec.store.pcap.decoder.PacketDecoder; +import org.apache.drill.exec.store.pcap.dto.ColumnDto; +import org.apache.drill.exec.store.pcap.schema.PcapTypes; +import org.apache.drill.exec.store.pcap.schema.Schema; +import org.apache.drill.exec.vector.NullableBigIntVector; +import org.apache.drill.exec.vector.NullableIntVector; +import org.apache.drill.exec.vector.NullableTimeStampVector; +import org.apache.drill.exec.vector.NullableVarCharVector; +import org.apache.drill.exec.vector.ValueVector; + +import java.io.FileInputStream; +import java.io.IOException; +import java.io.InputStream; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; + +import static java.nio.charset.StandardCharsets.UTF_8; +import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII; + +public class PcapRecordReader extends AbstractRecordReader { + + private OutputMutator output; + + private final PacketDecoder decoder; + private ImmutableList projectedCols; + + private byte[] buffer = new byte[10]; + private int offset = 0; + private InputStream in; + private int validBytes; + + 
private static final Map TYPES; + + private static class ProjectedColumnInfo { +ValueVector vv; +ColumnDto pcapColumn; + } + + static { +TYPES = ImmutableMap.builder() +.put(PcapTypes.STRING, MinorType.VARCHAR) +.put(PcapTypes.INTEGER, MinorType.INT) +.put(PcapTypes.LONG, MinorType.BIGINT) +.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP) +.build(); + } + + public PcapRecordReader(final String inputPath, + final List projectedColumns) { +try { + this.in = new FileInputStream(inputPath); + this.decoder = getPacketDecoder(); + validBytes = in.read(buffer); +} catch (IOException e) { + throw new RuntimeException("File " + inputPath + " not Found"); +} +setColumns(projectedColumns); + } + + @Override + public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException { +this.output = output; + } + + @Override + public int next() { +projectedCols = getProjectedColsIfItNull(); +try { + return parsePcapFilesAndPutItToTable(); +} catch (IOException io) { + throw new RuntimeException("Trouble with reading packets in file!"); --- End dif
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025657#comment-16025657 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118619596 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/Utils.java --- @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to you under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.store.pcap; + +import com.google.common.primitives.Ints; +import com.google.common.primitives.Shorts; + +public class Utils { + + public static int getIntFileOrder(boolean byteOrder, final byte[] buf, final int offset) { +if (byteOrder) { --- End diff -- Maybe an explanation of mapping byte order to booleans? true/false = which/which endian? > Want a memory format for PCAP files > --- > > Key: DRILL-5432 > URL: https://issues.apache.org/jira/browse/DRILL-5432 > Project: Apache Drill > Issue Type: New Feature >Reporter: Ted Dunning > > PCAP files [1] are the de facto standard for storing network capture data. 
In > security and protocol applications, it is very common to want to extract > particular packets from a capture for further analysis. > At a first level, it is desirable to query and filter by source and > destination IP and port or by protocol. Beyond that, however, it would be > very useful to be able to group packets by TCP session and eventually to look > at packet contents. For now, however, the most critical requirement is that > we should be able to scan captures at very high speed. > I previously wrote a (kind of working) proof of concept for a PCAP decoder > that did lazy deserialization and could traverse hundreds of MB of PCAP data > per second per core. This compares to roughly 2-3 MB/s for widely available > Apache-compatible open source PCAP decoders. > This JIRA covers the integration and extension of that proof of concept as a > Drill file format. > Initial work is available at https://github.com/mapr-demos/drill-pcap-format > [1] https://en.wikipedia.org/wiki/Pcap -- This message was sent by Atlassian JIRA (v6.3.15#6346)
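One way to answer the reviewer's question about the boolean-to-endianness mapping is to encode it in the parameter name and a comment. A hedged sketch — the actual `Utils` code may map the boolean the other way around:

```java
// Hypothetical sketch: make the boolean's meaning explicit. Here true means
// the pcap file header declared big-endian byte order; false means little-endian.
class ByteOrderSketch {
  static int getIntFileOrder(boolean bigEndian, byte[] buf, int offset) {
    if (bigEndian) {
      // most significant byte first
      return ((buf[offset] & 0xff) << 24)
           | ((buf[offset + 1] & 0xff) << 16)
           | ((buf[offset + 2] & 0xff) << 8)
           |  (buf[offset + 3] & 0xff);
    }
    // little-endian: least significant byte first
    return ((buf[offset + 3] & 0xff) << 24)
         | ((buf[offset + 2] & 0xff) << 16)
         | ((buf[offset + 1] & 0xff) << 8)
         |  (buf[offset] & 0xff);
  }
}
```

For example, the bytes `{0x00, 0x00, 0x01, 0x00}` decode to 256 when read big-endian, while the reversed sequence `{0x00, 0x01, 0x00, 0x00}` decodes to 256 when read little-endian.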
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025668#comment-16025668 ] ASF GitHub Bot commented on DRILL-5432: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/831#discussion_r118616911 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java --- @@ -0,0 +1,295 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to you under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.pcap; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableMap; +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.TypeProtos.MajorType; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.common.types.Types; +import org.apache.drill.exec.exception.SchemaChangeException; +import org.apache.drill.exec.expr.TypeHelper; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.physical.impl.OutputMutator; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.store.AbstractRecordReader; +import org.apache.drill.exec.store.pcap.decoder.Packet; +import org.apache.drill.exec.store.pcap.decoder.PacketDecoder; +import org.apache.drill.exec.store.pcap.dto.ColumnDto; +import org.apache.drill.exec.store.pcap.schema.PcapTypes; +import org.apache.drill.exec.store.pcap.schema.Schema; +import org.apache.drill.exec.vector.NullableBigIntVector; +import org.apache.drill.exec.vector.NullableIntVector; +import org.apache.drill.exec.vector.NullableTimeStampVector; +import org.apache.drill.exec.vector.NullableVarCharVector; +import org.apache.drill.exec.vector.ValueVector; + +import java.io.FileInputStream; +import java.io.IOException; +import java.io.InputStream; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; + +import static java.nio.charset.StandardCharsets.UTF_8; +import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII; + +public class PcapRecordReader extends AbstractRecordReader { + + private OutputMutator output; + + private final PacketDecoder decoder; + private ImmutableList projectedCols; + + private byte[] buffer = new byte[10]; + private int offset = 0; + private InputStream in; + private int validBytes; + + 
private static final Map TYPES; + + private static class ProjectedColumnInfo { +ValueVector vv; +ColumnDto pcapColumn; + } + + static { +TYPES = ImmutableMap.builder() +.put(PcapTypes.STRING, MinorType.VARCHAR) +.put(PcapTypes.INTEGER, MinorType.INT) +.put(PcapTypes.LONG, MinorType.BIGINT) +.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP) +.build(); + } + + public PcapRecordReader(final String inputPath, + final List projectedColumns) { +try { + this.in = new FileInputStream(inputPath); + this.decoder = getPacketDecoder(); + validBytes = in.read(buffer); +} catch (IOException e) { + throw new RuntimeException("File " + inputPath + " not Found"); +} +setColumns(projectedColumns); + } + + @Override + public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException { +this.output = output; + } + + @Override + public int next() { +projectedCols = getProjectedColsIfItNull(); +try { + return parsePcapFilesAndPutItToTable(); +} catch (IOException io) { + throw new RuntimeException("Trouble with reading packets in file!"); +}
[jira] [Commented] (DRILL-5457) Support Spill to Disk for the Hash Aggregate Operator
[ https://issues.apache.org/jira/browse/DRILL-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025621#comment-16025621 ] ASF GitHub Bot commented on DRILL-5457: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/822#discussion_r118616162 --- Diff: exec/java-exec/src/main/resources/drill-module.conf --- @@ -205,10 +225,10 @@ drill.exec: { // Deprecated for managed xsort; used only by legacy xsort threshold: 4, // File system to use. Local file system by default. -fs: "file:///" +fs: ${drill.exec.spill.fs}, --- End diff -- Done. Added: // -- The two options below can be used to override the options common // -- for all spilling operators (see "spill" above). // -- This is done for backward compatibility; in the future they // -- would be deprecated (you should be using only the common ones) > Support Spill to Disk for the Hash Aggregate Operator > - > > Key: DRILL-5457 > URL: https://issues.apache.org/jira/browse/DRILL-5457 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Relational Operators >Affects Versions: 1.10.0 >Reporter: Boaz Ben-Zvi >Assignee: Boaz Ben-Zvi > Fix For: 1.11.0 > > > Support gradual spilling memory to disk as the available memory gets too > small to allow in memory work for the Hash Aggregate Operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
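The override scheme described in the comment above — common options shared by all spilling operators, with backward-compatible per-operator keys that default to the common values — can be sketched in HOCON roughly as follows. Only `${drill.exec.spill.fs}` appears in the quoted diff; the other key names and values here are illustrative assumptions:

```hocon
drill.exec: {
  // Common options for all spilling operators (assumed layout).
  spill: {
    fs: "file:///",
    directories: [ "/tmp/drill/spill" ]
  },
  sort.external: {
    // Backward-compatible override: by default it simply references the
    // common value above via HOCON substitution, so old configs that set
    // this key directly keep working, and new configs need only set spill.fs.
    fs: ${drill.exec.spill.fs}
  }
}
```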
[jira] [Created] (DRILL-5542) Scan unnecessary adds implicit columns to ScanRecordBatch for select * query
Jinfeng Ni created DRILL-5542: - Summary: Scan unnecessary adds implicit columns to ScanRecordBatch for select * query Key: DRILL-5542 URL: https://issues.apache.org/jira/browse/DRILL-5542 Project: Apache Drill Issue Type: Bug Components: Execution - Relational Operators Reporter: Jinfeng Ni It seems that Drill would add several implicit columns (`fqn`, `filepath`, `filename`, `suffix`) to ScanBatch, where they are actually not required by the downstream operator. Although those implicit columns would be dropped off later on, it increases both memory and CPU overhead. 1. JSON ``` {a: 100} ``` {code} select * from dfs.tmp.`1.json`; +--+ | a | +--+ | 100 | +--+ {code} The schema from ScanRecordBatch is: {code} [ schema: BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)], selectionVector=NONE], {code} 2. Parquet {code} select * from cp.`tpch/nation.parquet`; +--+-+--+-+ | n_nationkey | n_name | n_regionkey | n_comment | +--+-+--+-+ | 0| ALGERIA | 0| haggle. carefully final deposits detect slyly agai | ... {code} The schema of ScanRecordBatch: {code} schema: BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED), n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED), fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE], {code} 3. Text {code} cat 1.csv a, b, c select * from dfs.tmp.`1.csv`; ++ |columns | ++ | ["a","b","c"] | ++ {code} Schema of ScanRecordBatch {code} schema: BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)], fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE], {code} If implicit columns are not part of the query result of a `select *` query, then the Scan operator should not populate those implicit columns. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5504) Vector validator to diagnose offset vector issues
[ https://issues.apache.org/jira/browse/DRILL-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025573#comment-16025573 ] ASF GitHub Bot commented on DRILL-5504: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/832 Fixed typo in log message and rebased onto latest master. > Vector validator to diagnose offset vector issues > - > > Key: DRILL-5504 > URL: https://issues.apache.org/jira/browse/DRILL-5504 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.11.0 > > > DRILL-5470 describes a case in which an offset vector appears to have become > corrupted, yielding a bogus field-length value that is orders of magnitude > larger than the vector that contains the data. > Debugging such cases is slow and tedious. To help, we propose to create a > "vector validator" that spins through vectors looking for problems. > Then, to allow the validator to be used in the field, extend the "iterator > validator batch iterator" to optionally allow vector validation on each batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (DRILL-5541) C++ Client Crashes During Simple "Man in the Middle" Attack Test with Exploitable Write AV
Rob Wu created DRILL-5541: - Summary: C++ Client Crashes During Simple "Man in the Middle" Attack Test with Exploitable Write AV Key: DRILL-5541 URL: https://issues.apache.org/jira/browse/DRILL-5541 Project: Apache Drill Issue Type: Bug Components: Client - C++ Affects Versions: 1.10.0 Reporter: Rob Wu Priority: Critical drillClient!boost_sb::shared_ptr::reset+0xa7: 07fe`c292f827 f0ff4b08lock dec dword ptr [rbx+8] ds:07fe`c2b3de78=c29e6060 Exploitability Classification: EXPLOITABLE Recommended Bug Title: Exploitable - User Mode Write AV starting at drillClient!boost_sb::shared_ptr::reset+0x00a7 (Hash=0x4ae7fdff.0xb15af658) User mode write access violations that are not near NULL are exploitable. == Stack Trace: Child-SP RetAddr Call Site `030df630 07fe`c295bca1 drillClient!boost_sb::shared_ptr::reset+0xa7 [c:\users\bamboo\desktop\make_win_drill\sb_boost\include\boost-1_57\boost\smart_ptr\shared_ptr.hpp @ 620] `030df680 07fe`c295433c drillClient!Drill::DrillClientImpl::processSchemasResult+0x281 [c:\users\bamboo\desktop\make_win_drill\drill-1.10.0.1\drill-1.10.0.1\contrib\native\client\src\clientlib\drillclientimpl.cpp @ 1227] `030df7a0 07fe`c294cbf6 drillClient!Drill::DrillClientImpl::handleRead+0x75c [c:\users\bamboo\desktop\make_win_drill\drill-1.10.0.1\drill-1.10.0.1\contrib\native\client\src\clientlib\drillclientimpl.cpp @ 1555] `030df9c0 07fe`c294ce9f drillClient!boost_sb::asio::detail::win_iocp_socket_recv_op >,boost_sb::asio::mutable_buffers_1,boost_sb::asio::detail::transfer_all_t,boost_sb::_bi::bind_t,boost_sb::_bi::list4,boost_sb::_bi::value,boost_sb::arg<1>,boost_sb::arg<2> > > > >::do_complete+0x166 [c:\users\bamboo\desktop\make_win_drill\sb_boost\include\boost-1_57\boost\asio\detail\win_iocp_socket_recv_op.hpp @ 97] `030dfa90 07fe`c296009d drillClient!boost_sb::asio::detail::win_iocp_io_service::do_one+0x27f [c:\users\bamboo\desktop\make_win_drill\sb_boost\include\boost-1_57\boost\asio\detail\impl\win_iocp_io_service.ipp @ 406] `030dfb70 07fe`c295ffc9 
drillClient!boost_sb::asio::detail::win_iocp_io_service::run+0xad [c:\users\bamboo\desktop\make_win_drill\sb_boost\include\boost-1_57\boost\asio\detail\impl\win_iocp_io_service.ipp @ 164] `030dfbd0 07fe`c2aa5b53 drillClient!boost_sb::asio::io_service::run+0x29 [c:\users\bamboo\desktop\make_win_drill\sb_boost\include\boost-1_57\boost\asio\impl\io_service.ipp @ 60] `030dfc10 07fe`c2ad3e03 drillClient!boost_sb::`anonymous namespace'::thread_start_function+0x43 `030dfc50 07fe`c2ad404e drillClient!_callthreadstartex+0x17 [f:\dd\vctools\crt\crtw32\startup\threadex.c @ 376] `030dfc80 `779e59cd drillClient!_threadstartex+0x102 [f:\dd\vctools\crt\crtw32\startup\threadex.c @ 354] `030dfcb0 `77c1a561 kernel32!BaseThreadInitThunk+0xd `030dfce0 ` ntdll!RtlUserThreadStart+0x1d == Register: rax=0284bae0 rbx=07fec2b3de70 rcx=027ec210 rdx=027ec210 rsi=027f2638 rdi=027f25d0 rip=07fec292f827 rsp=030df630 rbp=027ec210 r8=027ec210 r9= r10=027d32fc r11=27eb001b0003 r12= r13=028035a0 r14=027ec210 r15= iopl=0 nv up ei pl nz na pe nc cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010200 drillClient!boost_sb::shared_ptr::reset+0xa7: 07fe`c292f827 f0ff4b08lock dec dword ptr [rbx+8] ds:07fe`c2b3de78=c29e6060 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5457) Support Spill to Disk for the Hash Aggregate Operator
[ https://issues.apache.org/jira/browse/DRILL-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025568#comment-16025568 ] ASF GitHub Bot commented on DRILL-5457: --- Github user rchallapalli commented on the issue: https://github.com/apache/drill/pull/822 Based on the current design, if the code senses that there is not sufficient memory, it falls back to the old code. Now I have encountered a case where this happened, and the old agg did not respect the memory constraints I imposed: I gave 116MB of memory and the old hash agg code consumed ~130MB and completed the query. This doesn't play well with the overall resource management plan. > Support Spill to Disk for the Hash Aggregate Operator > - > > Key: DRILL-5457 > URL: https://issues.apache.org/jira/browse/DRILL-5457 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Relational Operators >Affects Versions: 1.10.0 >Reporter: Boaz Ben-Zvi >Assignee: Boaz Ben-Zvi > Fix For: 1.11.0 > > > Support gradual spilling memory to disk as the available memory gets too > small to allow in memory work for the Hash Aggregate Operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025488#comment-16025488 ] ASF GitHub Bot commented on DRILL-5356: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/789 Cleaned up the multi-commit mess, rebased on the latest master, and fixed minor issues raised in code review comments. Should be ready to commit. > Refactor Parquet Record Reader > -- > > Key: DRILL-5356 > URL: https://issues.apache.org/jira/browse/DRILL-5356 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.10.0, 1.11.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.11.0 > > > The Parquet record reader class is a key part of Drill that has evolved over > time to become somewhat hard to follow. > A number of us are working on Parquet-related tasks and find we have to spend > an uncomfortable amount of time trying to understand the code. In particular, > this writer needs to figure out how to convince the reader to provide > higher-density record batches. > Rather than continue to decipher the complex code multiple times, this ticket > requests to refactor the code to make it functionally identical, but > structurally cleaner. The result will be faster time to value when working > with this code. > This is a lower-priority change and will be coordinated with others working > on this code base. This ticket is only for the record reader class itself; it > does not include the various readers and writers that Parquet uses since > another project is actively modifying those classes. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5457) Support Spill to Disk for the Hash Aggregate Operator
[ https://issues.apache.org/jira/browse/DRILL-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025441#comment-16025441 ] ASF GitHub Bot commented on DRILL-5457: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/822#discussion_r118592786 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java --- @@ -133,6 +133,9 @@ the need to turn off join optimization may go away. */ public static final BooleanValidator JOIN_OPTIMIZATION = new BooleanValidator("planner.enable_join_optimization", true); + // for testing purpose --- End diff -- @VisibleForTesting annotates methods; but this is a session option. Also (hidden) is the possibility that this option may be used in production in case some query yields a single phase hashagg but still has too much data to handle. > Support Spill to Disk for the Hash Aggregate Operator > - > > Key: DRILL-5457 > URL: https://issues.apache.org/jira/browse/DRILL-5457 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Relational Operators >Affects Versions: 1.10.0 >Reporter: Boaz Ben-Zvi >Assignee: Boaz Ben-Zvi > Fix For: 1.11.0 > > > Support gradual spilling memory to disk as the available memory gets too > small to allow in memory work for the Hash Aggregate Operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025437#comment-16025437 ] ASF GitHub Bot commented on DRILL-5356: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/789#discussion_r118591297 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetSchema.java --- @@ -0,0 +1,262 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.parquet.columnreaders; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.Types; +import org.apache.drill.common.types.TypeProtos.DataMode; +import org.apache.drill.exec.exception.SchemaChangeException; +import org.apache.drill.exec.expr.TypeHelper; +import org.apache.drill.exec.physical.impl.OutputMutator; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.server.options.OptionManager; +import org.apache.drill.exec.store.parquet.ParquetReaderUtility; +import org.apache.drill.exec.vector.NullableIntVector; +import org.apache.parquet.column.ColumnDescriptor; +import org.apache.parquet.format.SchemaElement; +import org.apache.parquet.hadoop.metadata.BlockMetaData; +import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData; +import org.apache.parquet.hadoop.metadata.ParquetMetadata; + +import com.google.common.collect.Lists; + +/** + * Mapping from the schema of the Parquet file to that of the record reader + * to the schema that Drill and the Parquet reader uses. + */ + +public class ParquetSchema { + /** + * Set of columns specified in the SELECT clause. Will be null for + * a SELECT * query. + */ + private final Collection selectedCols; + /** + * Parallel list to the columns list above, it is used to determine the subset of the project + * pushdown columns that do not appear in this file. + */ + private final boolean[] columnsFound; + private final OptionManager options; + private final int rowGroupIndex; + private ParquetMetadata footer; + /** + * List of metadata for selected columns. This list does two things. + * First, it identifies the Parquet columns we wish to select. Second, it + * provides metadata for those columns. 
Note that null columns (columns + * in the SELECT clause but not in the file) appear elsewhere. + */ + private List selectedColumnMetadata = new ArrayList<>(); + private int bitWidthAllFixedFields; + private boolean allFieldsFixedLength; + private long groupRecordCount; + private int recordsPerBatch; + + /** + * Build the Parquet schema. The schema can be based on a "SELECT *", + * meaning we want all columns defined in the Parquet file. In this case, + * the list of selected columns is null. Or, the query can be based on + * an explicit list of selected columns. In this case, the + * columns need not exist in the Parquet file. If a column does not exist, + * the reader returns null for that column. If no selected column exists + * in the file, then we return "mock" records: records with only null + * values, but repeated for the number of rows in the Parquet file. + * + * @param options session options + * @param rowGroupIndex row group to read + * @param selectedCols columns specified in the SELECT clause, or null if + * this is a SELECT * query + */ + + public ParquetSchema(OptionManager options, int rowGroupIndex, Collection selectedCols) { +this.options = options; +this.rowGroupInd
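The javadoc above distinguishes three cases: a column present in both the SELECT list and the Parquet file (read normally), a column selected but missing from the file (returned as null), and a SELECT list matching nothing in the file (return "mock" records of all nulls, repeated for the file's row count). A hypothetical sketch of that resolution step, with invented names rather than the classes from the diff:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ColumnResolver {
  /** Selected columns absent from the file: these must be null-filled by the reader. */
  static List<String> nullFilled(Set<String> fileColumns, List<String> selected) {
    List<String> missing = new ArrayList<>();
    for (String col : selected) {
      if (!fileColumns.contains(col)) {
        missing.add(col);
      }
    }
    return missing;
  }

  /** True when no selected column exists in the file: emit all-null "mock" records. */
  static boolean isMockScan(Set<String> fileColumns, List<String> selected) {
    return nullFilled(fileColumns, selected).size() == selected.size();
  }
}
```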
[jira] [Commented] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025432#comment-16025432 ] ASF GitHub Bot commented on DRILL-5356: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/789#discussion_r118590127 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/BatchReader.java --- @@ -0,0 +1,164 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.store.parquet.columnreaders; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.concurrent.Future; +import java.util.concurrent.TimeUnit; + +import com.google.common.base.Stopwatch; +import com.google.common.collect.Lists; + +/** + * Base strategy for reading a batch of Parquet records. 
+ */ +public abstract class BatchReader { + + protected final ReadState readState; + + public BatchReader(ReadState readState) { +this.readState = readState; + } + + public int readBatch() throws Exception { +ColumnReader firstColumnStatus = readState.getFirstColumnReader(); +long recordsToRead = Math.min(getReadCount(firstColumnStatus), readState.getRecordsToRead()); +int readCount = readRecords(firstColumnStatus, recordsToRead); +readState.fillNullVectors(readCount); +return readCount; + } + + protected abstract long getReadCount(ColumnReader firstColumnStatus); + + protected abstract int readRecords(ColumnReader firstColumnStatus, long recordsToRead) throws Exception; + + protected void readAllFixedFields(long recordsToRead) throws Exception { +Stopwatch timer = Stopwatch.createStarted(); +if(readState.useAsyncColReader()){ + readAllFixedFieldsParallel(recordsToRead); +} else { + readAllFixedFieldsSerial(recordsToRead); +} + readState.parquetReaderStats().timeFixedColumnRead.addAndGet(timer.elapsed(TimeUnit.NANOSECONDS)); + } + + protected void readAllFixedFieldsSerial(long recordsToRead) throws IOException { +for (ColumnReader crs : readState.getColumnReaders()) { + crs.processPages(recordsToRead); +} + } + + protected void readAllFixedFieldsParallel(long recordsToRead) throws Exception { +ArrayList> futures = Lists.newArrayList(); +for (ColumnReader crs : readState.getColumnReaders()) { + Future f = crs.processPagesAsync(recordsToRead); + futures.add(f); +} +Exception exception = null; +for(Future f: futures){ + if (exception != null) { +f.cancel(true); + } else { +try { + f.get(); +} catch (Exception e) { + f.cancel(true); + exception = e; +} + } +} +if (exception != null) { + throw exception; +} + } + + /** + * Strategy for reading mock records. (What are these?) + */ --- End diff -- Fixed. Finally found out what this means. Thanks Jinfeng! 
> Refactor Parquet Record Reader > -- > > Key: DRILL-5356 > URL: https://issues.apache.org/jira/browse/DRILL-5356 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.10.0, 1.11.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.11.0 > > > The Parquet record reader class is a key part of Drill that has evolved over > time to become somewhat hard to follow. > A number of us are working on Parquet-related tasks and find we have to spend > an uncomfortable amount of time trying to understand the code. In particular, > this writer needs to figure out how to convince the reader to provide > higher-densit
[jira] [Commented] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025433#comment-16025433 ] ASF GitHub Bot commented on DRILL-5356: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/789#discussion_r118590427 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetColumnMetadata.java --- @@ -0,0 +1,151 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.parquet.columnreaders; + +import java.util.Map; + +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.TypeProtos.DataMode; +import org.apache.drill.common.types.TypeProtos.MajorType; +import org.apache.drill.exec.exception.SchemaChangeException; +import org.apache.drill.exec.expr.TypeHelper; +import org.apache.drill.exec.physical.impl.OutputMutator; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.server.options.OptionManager; +import org.apache.drill.exec.vector.ValueVector; +import org.apache.drill.exec.vector.complex.RepeatedValueVector; +import org.apache.parquet.column.ColumnDescriptor; +import org.apache.parquet.format.SchemaElement; +import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData; +import org.apache.parquet.schema.PrimitiveType; +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName; + +/** + * Represents a single column read from the Parquet file by the record reader. 
+ */ + +public class ParquetColumnMetadata { + + ColumnDescriptor column; + private SchemaElement se; + MaterializedField field; + int length; + private MajorType type; + ColumnChunkMetaData columnChunkMetaData; + private ValueVector vector; + + public ParquetColumnMetadata(ColumnDescriptor column) { +this.column = column; + } + + public void resolveDrillType(Map schemaElements, OptionManager options) { +se = schemaElements.get(column.getPath()[0]); +type = ParquetToDrillTypeConverter.toMajorType(column.getType(), se.getType_length(), +getDataMode(column), se, options); +field = MaterializedField.create(toFieldName(column.getPath()), type); +length = getDataTypeLength(); + } + + private String toFieldName(String[] paths) { +return SchemaPath.getCompoundPath(paths).getAsUnescapedPath(); + } + + private TypeProtos.DataMode getDataMode(ColumnDescriptor column) { +if (isRepeated()) { + return DataMode.REPEATED; +} else if (column.getMaxDefinitionLevel() == 0) { + return TypeProtos.DataMode.REQUIRED; +} else { + return TypeProtos.DataMode.OPTIONAL; +} + } + + /** + * @param type + * @param type a fixed length type from the parquet library enum + * @return the length in pageDataByteArray of the type + */ + public static int getTypeLengthInBits(PrimitiveTypeName type) { +switch (type) { + case INT64: return 64; + case INT32: return 32; + case BOOLEAN: return 1; + case FLOAT: return 32; + case DOUBLE: return 64; + case INT96: return 96; + // binary and fixed length byte array + default: +throw new IllegalStateException("Length cannot be determined for type " + type); +} + } + + /** + * Returns data type length for a given {@see ColumnDescriptor} and it's corresponding + * {@see SchemaElement}. Neither is enough information alone as the max + * repetition level (indicating if it is an array type) is in the ColumnDescriptor and + * the length of a fixed width field is stored at the schema
[jira] [Commented] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025435#comment-16025435 ] ASF GitHub Bot commented on DRILL-5356: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/789#discussion_r118591078 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java --- @@ -308,163 +232,50 @@ public FragmentContext getFragmentContext() { } /** - * Returns data type length for a given {@see ColumnDescriptor} and it's corresponding - * {@see SchemaElement}. Neither is enough information alone as the max - * repetition level (indicating if it is an array type) is in the ColumnDescriptor and - * the length of a fixed width field is stored at the schema level. - * - * @return the length if fixed width, else -1 + * Prepare the Parquet reader. First determine the set of columns to read (the schema + * for this read.) Then, create a state object to track the read across calls to + * the reader next() method. Finally, create one of three readers to + * read batches depending on whether this scan is for only fixed-width fields, + * contains at least one variable-width field, or is a "mock" scan consisting + * only of null fields (fields in the SELECT clause but not in the Parquet file.) 
*/ - private int getDataTypeLength(ColumnDescriptor column, SchemaElement se) { -if (column.getType() != PrimitiveType.PrimitiveTypeName.BINARY) { - if (column.getMaxRepetitionLevel() > 0) { -return -1; - } - if (column.getType() == PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY) { -return se.getType_length() * 8; - } else { -return getTypeLengthInBits(column.getType()); - } -} else { - return -1; -} - } - @SuppressWarnings({ "resource", "unchecked" }) @Override public void setup(OperatorContext operatorContext, OutputMutator output) throws ExecutionSetupException { this.operatorContext = operatorContext; -if (!isStarQuery()) { - columnsFound = new boolean[getColumns().size()]; - nullFilledVectors = new ArrayList<>(); -} -columnStatuses = new ArrayList<>(); -List columns = footer.getFileMetaData().getSchema().getColumns(); -allFieldsFixedLength = true; -ColumnDescriptor column; -ColumnChunkMetaData columnChunkMetaData; -int columnsToScan = 0; -mockRecordsRead = 0; - -MaterializedField field; +schema = new ParquetSchema(fragmentContext.getOptions(), rowGroupIndex, isStarQuery() ? 
null : getColumns()); logger.debug("Reading row group({}) with {} records in file {}.", rowGroupIndex, footer.getBlocks().get(rowGroupIndex).getRowCount(), hadoopPath.toUri().getPath()); -totalRecordsRead = 0; - -// TODO - figure out how to deal with this better once we add nested reading, note also look where this map is used below -// store a map from column name to converted types if they are non-null -Map schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer); - -// loop to add up the length of the fixed width columns and build the schema -for (int i = 0; i < columns.size(); ++i) { - column = columns.get(i); - SchemaElement se = schemaElements.get(column.getPath()[0]); - MajorType mt = ParquetToDrillTypeConverter.toMajorType(column.getType(), se.getType_length(), - getDataMode(column), se, fragmentContext.getOptions()); - field = MaterializedField.create(toFieldName(column.getPath()), mt); - if ( ! fieldSelected(field)) { -continue; - } - columnsToScan++; - int dataTypeLength = getDataTypeLength(column, se); - if (dataTypeLength == -1) { -allFieldsFixedLength = false; - } else { -bitWidthAllFixedFields += dataTypeLength; - } -} - -if (columnsToScan != 0 && allFieldsFixedLength) { - recordsPerBatch = (int) Math.min(Math.min(batchSize / bitWidthAllFixedFields, - footer.getBlocks().get(0).getColumns().get(0).getValueCount()), DEFAULT_RECORDS_TO_READ_IF_FIXED_WIDTH); -} -else { - recordsPerBatch = DEFAULT_RECORDS_TO_READ_IF_VARIABLE_WIDTH; -} try { - ValueVector vector; - SchemaElement schemaElement; - final ArrayList> varLengthColumns = new ArrayList<>(); - // initialize all of the column read status objects - boolean fieldFixedLength; -
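The batch-sizing logic being moved in this diff takes the minimum of three limits for an all-fixed-width row group: the batch's size budget divided by the combined per-record bit width, the number of values actually present in the row group, and a fixed default cap. A standalone sketch of that computation (invented names; the cap constant is an assumption standing in for `DEFAULT_RECORDS_TO_READ_IF_FIXED_WIDTH`, not Drill's actual value):

```java
class FixedWidthBatchSizer {
  // Assumed cap on records per batch for fixed-width reads (hypothetical value).
  static final int DEFAULT_CAP = 32 * 1024;

  /**
   * Records per batch for an all-fixed-width scan: bounded by the batch
   * budget divided by the per-record bit width, by the number of values
   * in the row group, and by the default cap.
   */
  static int recordsPerBatch(long batchSizeBits, int bitWidthAllFixedFields, long valueCount) {
    long byWidth = batchSizeBits / bitWidthAllFixedFields;
    return (int) Math.min(Math.min(byWidth, valueCount), DEFAULT_CAP);
  }
}
```

For variable-width fields no per-record width is known up front, which is why the original code falls back to a separate default in that branch.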
[jira] [Commented] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025436#comment-16025436 ] ASF GitHub Bot commented on DRILL-5356: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/789#discussion_r118590602 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetColumnMetadata.java --- @@ -0,0 +1,151 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.parquet.columnreaders; + +import java.util.Map; + +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.TypeProtos.DataMode; +import org.apache.drill.common.types.TypeProtos.MajorType; +import org.apache.drill.exec.exception.SchemaChangeException; +import org.apache.drill.exec.expr.TypeHelper; +import org.apache.drill.exec.physical.impl.OutputMutator; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.server.options.OptionManager; +import org.apache.drill.exec.vector.ValueVector; +import org.apache.drill.exec.vector.complex.RepeatedValueVector; +import org.apache.parquet.column.ColumnDescriptor; +import org.apache.parquet.format.SchemaElement; +import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData; +import org.apache.parquet.schema.PrimitiveType; +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName; + +/** + * Represents a single column read from the Parquet file by the record reader. 
+ */ + +public class ParquetColumnMetadata { + + ColumnDescriptor column; + private SchemaElement se; + MaterializedField field; + int length; + private MajorType type; + ColumnChunkMetaData columnChunkMetaData; + private ValueVector vector; + + public ParquetColumnMetadata(ColumnDescriptor column) { +this.column = column; + } + + public void resolveDrillType(Map schemaElements, OptionManager options) { +se = schemaElements.get(column.getPath()[0]); +type = ParquetToDrillTypeConverter.toMajorType(column.getType(), se.getType_length(), +getDataMode(column), se, options); +field = MaterializedField.create(toFieldName(column.getPath()), type); +length = getDataTypeLength(); + } + + private String toFieldName(String[] paths) { +return SchemaPath.getCompoundPath(paths).getAsUnescapedPath(); + } + + private TypeProtos.DataMode getDataMode(ColumnDescriptor column) { +if (isRepeated()) { + return DataMode.REPEATED; +} else if (column.getMaxDefinitionLevel() == 0) { + return TypeProtos.DataMode.REQUIRED; +} else { + return TypeProtos.DataMode.OPTIONAL; +} + } + + /** + * @param type + * @param type a fixed length type from the parquet library enum + * @return the length in pageDataByteArray of the type + */ + public static int getTypeLengthInBits(PrimitiveTypeName type) { +switch (type) { + case INT64: return 64; + case INT32: return 32; + case BOOLEAN: return 1; + case FLOAT: return 32; + case DOUBLE: return 64; + case INT96: return 96; + // binary and fixed length byte array + default: +throw new IllegalStateException("Length cannot be determined for type " + type); +} + } + + /** + * Returns data type length for a given {@see ColumnDescriptor} and it's corresponding + * {@see SchemaElement}. Neither is enough information alone as the max + * repetition level (indicating if it is an array type) is in the ColumnDescriptor and + * the length of a fixed width field is stored at the schema
[jira] [Commented] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025434#comment-16025434 ] ASF GitHub Bot commented on DRILL-5356: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/789#discussion_r118591443 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/ParquetInternalsTest.java --- @@ -0,0 +1,161 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.store.parquet; + +import static org.junit.Assert.*; + +import java.util.HashMap; +import java.util.Map; + +import org.apache.drill.TestBuilder; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.Types; +import org.apache.drill.test.ClusterFixture; +import org.apache.drill.test.ClusterTest; +import org.apache.drill.test.FixtureBuilder; +import org.junit.BeforeClass; +import org.junit.Test; + +public class ParquetInternalsTest extends ClusterTest { + + @BeforeClass + public static void setup( ) throws Exception { +FixtureBuilder builder = ClusterFixture.builder() + // Set options, etc. 
+ ; +startCluster(builder); + } + + @Test + public void testFixedWidth() throws Exception { +String sql = "SELECT l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity\n" + + "FROM `cp`.`tpch/lineitem.parquet` LIMIT 20"; +//client.queryBuilder().sql(sql).printCsv(); + +Map typeMap = new HashMap<>(); +typeMap.put(TestBuilder.parsePath("l_orderkey"), Types.required(TypeProtos.MinorType.INT)); +typeMap.put(TestBuilder.parsePath("l_partkey"), Types.required(TypeProtos.MinorType.INT)); +typeMap.put(TestBuilder.parsePath("l_suppkey"), Types.required(TypeProtos.MinorType.INT)); +typeMap.put(TestBuilder.parsePath("l_linenumber"), Types.required(TypeProtos.MinorType.INT)); +typeMap.put(TestBuilder.parsePath("l_quantity"), Types.required(TypeProtos.MinorType.FLOAT8)); +client.testBuilder() + .sqlQuery(sql) + .unOrdered() + .csvBaselineFile("parquet/expected/fixedWidth.csv") + .baselineColumns("l_orderkey", "l_partkey", "l_suppkey", "l_linenumber", "l_quantity") + .baselineTypes(typeMap) + .build() + .run(); + } + + --- End diff -- Fixed. > Refactor Parquet Record Reader > -- > > Key: DRILL-5356 > URL: https://issues.apache.org/jira/browse/DRILL-5356 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.10.0, 1.11.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.11.0 > > > The Parquet record reader class is a key part of Drill that has evolved over > time to become somewhat hard to follow. > A number of us are working on Parquet-related tasks and find we have to spend > an uncomfortable amount of time trying to understand the code. In particular, > this writer needs to figure out how to convince the reader to provide > higher-density record batches. > Rather than continue to decypher the complex code multiple times, this ticket > requests to refactor the code to make it functionally identical, but > structurally cleaner. The result will be faster time to value when working > with this code. 
> This is a lower-priority change and will be coordinated with others working > on this code base. This ticket is only for the record reader class itself; it > does not include the various readers and writers that Parquet uses since > another project is actively modifying those classes. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5485) Remove WebServer dependency on DrillClient
[ https://issues.apache.org/jira/browse/DRILL-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025417#comment-16025417 ] ASF GitHub Bot commented on DRILL-5485: --- Github user sudheeshkatkam commented on a diff in the pull request: https://github.com/apache/drill/pull/829#discussion_r118587783 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/WebServer.java --- @@ -219,12 +232,43 @@ public void sessionDestroyed(HttpSessionEvent se) { securityHandler.logout(sessionAuth); session.removeAttribute(SessionAuthentication.__J_AUTHENTICATED); } + +// Clear all the custom attributes set as part of session +clearSessionCustomAttributes(session); } }); return new SessionHandler(sessionManager); } + private void clearSessionCustomAttributes(HttpSession session) { --- End diff -- (I somehow managed to delete this in my comment..) The life cycle of those resources could be placed together in one class, as resources are being initialized in one place but closed in different places. > Remove WebServer dependency on DrillClient > -- > > Key: DRILL-5485 > URL: https://issues.apache.org/jira/browse/DRILL-5485 > Project: Apache Drill > Issue Type: Improvement > Components: Web Server >Reporter: Sorabh Hamirwasia > Fix For: 1.11.0 > > > With encryption support using SASL, clients won't be able to authenticate > using the PLAIN mechanism when encryption is enabled on the cluster. Today the > WebServer, which is embedded inside the Drillbit, creates a DrillClient instance > for each WebClient session, and the WebUser is authenticated as part of the > authentication between the DrillClient instance and the Drillbit using the PLAIN > mechanism. But with encryption enabled this will fail, since encryption > doesn't support authentication using the PLAIN mechanism; hence no WebClient can > connect to a Drillbit. 
There are other issues with this approach as well: > 1) Since a DrillClient is used per WebUser session, this is expensive: it carries > the heavyweight RPC layer for DrillClient and all its dependencies. > 2) If the Foreman selected for a WebUser is a different node, there will be an > extra hop transferring data back to the WebClient. > To resolve all of the above issues it would be better to authenticate the WebUser > locally, using the Drillbit on which the WebServer is running, without creating a > DrillClient instance. We can use the local PAMAuthenticator to authenticate > the user. After authentication succeeds, the local Drillbit can also > serve as the Foreman for all queries submitted by the WebUser. This can be > achieved by submitting the query to the local Drillbit Foreman work queue. > This also removes the requirement to encrypt the channel opened between > the WebServer (DrillClient) and the selected Drillbit, since with this approach there > won't be any physical channel opened between them.
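The flow proposed above — authenticate the web user with a local authenticator, then hand the query straight to the local Foreman work queue instead of opening a DrillClient RPC connection — can be reduced to a small sketch. This is illustrative only: `LocalWebQuerySubmitter`, the nested `Authenticator` interface, and the String-based queue are hypothetical stand-ins, not Drill's actual classes.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the proposed flow: authenticate locally, then
// enqueue the query for the local Foreman. No RPC channel is ever opened,
// so there is nothing to encrypt between the web server and the Foreman.
class LocalWebQuerySubmitter {
    interface Authenticator {  // stands in for the local PAM authenticator
        boolean authenticate(String user, String password);
    }

    private final Authenticator authenticator;
    private final BlockingQueue<String> foremanWorkQueue = new LinkedBlockingQueue<>();

    LocalWebQuerySubmitter(Authenticator authenticator) {
        this.authenticator = authenticator;
    }

    // Reject unauthenticated users before any work is queued.
    boolean submit(String user, String password, String sql) {
        if (!authenticator.authenticate(user, password)) {
            return false;
        }
        return foremanWorkQueue.offer(sql);  // local hand-off, no RPC
    }

    // The local Foreman would drain this queue; returns null if nothing is pending.
    String nextQuery() {
        return foremanWorkQueue.poll();
    }
}
```

The design point is the hand-off: because the query never leaves the process, the per-session DrillClient (and its RPC stack) disappears entirely.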
[jira] [Commented] (DRILL-5485) Remove WebServer dependency on DrillClient
[ https://issues.apache.org/jira/browse/DRILL-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025383#comment-16025383 ] ASF GitHub Bot commented on DRILL-5485: --- Github user sudheeshkatkam commented on a diff in the pull request: https://github.com/apache/drill/pull/829#discussion_r118582764 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/rpc/UserClientConnection.java --- @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.rpc; + +import io.netty.channel.ChannelFuture; +import org.apache.drill.exec.physical.impl.materialize.QueryWritableBatch; +import org.apache.drill.exec.proto.GeneralRPCProtos; +import org.apache.drill.exec.proto.UserBitShared; +import org.apache.drill.exec.rpc.user.UserSession; + +import java.net.SocketAddress; + +/** + * Interface for getting user session properties and interacting with user connection. Separating this interface from + * {@link AbstractRemoteConnection} implementation for user connection: + * + * Connection is passed to Foreman and Screen operators. Instead passing this interface exposes few details. 
+ * Makes it easy to have wrappers around user connection which can be helpful to tap the messages and data + * going to the actual client. + * + */ +public interface UserClientConnection { + /** + * @return User session object. + */ + UserSession getSession(); + + /** + * Send query result outcome to client. Outcome is returned through listener + * + * @param listener + * @param result + */ + void sendResult(RpcOutcomeListener listener, UserBitShared.QueryResult result); --- End diff -- Not fixed?
[jira] [Commented] (DRILL-5485) Remove WebServer dependency on DrillClient
[ https://issues.apache.org/jira/browse/DRILL-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025369#comment-16025369 ] ASF GitHub Bot commented on DRILL-5485: --- Github user sudheeshkatkam commented on a diff in the pull request: https://github.com/apache/drill/pull/829#discussion_r118580820 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/rpc/AbstractUserClientConnectionWrapper.java --- @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.rpc; + +import com.google.common.base.Preconditions; +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.exceptions.UserRemoteException; +import org.apache.drill.exec.proto.GeneralRPCProtos; +import org.apache.drill.exec.proto.UserBitShared.DrillPBError; +import org.apache.drill.exec.proto.UserBitShared.QueryId; +import org.apache.drill.exec.proto.UserBitShared.QueryResult; +import org.apache.drill.exec.proto.helper.QueryIdHelper; + +import java.util.concurrent.CountDownLatch; +import java.util.concurrent.TimeUnit; + +public abstract class AbstractUserClientConnectionWrapper implements UserClientConnection { + private static final org.slf4j.Logger logger = + org.slf4j.LoggerFactory.getLogger(AbstractUserClientConnectionWrapper.class); + + protected final CountDownLatch latch = new CountDownLatch(1); + + protected volatile DrillPBError error; + + protected volatile UserException exception; + + /** + * Wait until the query has completed or timeout is passed. + * + * @throws InterruptedException + */ + public boolean await(final long timeoutMillis) throws InterruptedException { +return latch.await(timeoutMillis, TimeUnit.MILLISECONDS); + } + + /** + * Wait indefinitely until the query is completed. Used only in case of WebUser + * + * @throws Exception + */ + public void await() throws Exception { +latch.await(); +if (exception != null) { + throw exception; +} + } + + @Override + public void sendResult(RpcOutcomeListener listener, QueryResult result) { + +Preconditions.checkState(result.hasQueryState()); + +// Release the wait latch if the query is terminated. 
+final QueryResult.QueryState state = result.getQueryState(); +final QueryId queryId = result.getQueryId(); + +if (logger.isDebugEnabled()) { + logger.debug("Result arrived for QueryId: {} with QueryState: {}", QueryIdHelper.getQueryId(queryId), state); +} + +switch (state) { + case FAILED: +error = result.getError(0); +exception = new UserRemoteException(error); +latch.countDown(); +break; + case CANCELED: + case COMPLETED: +Preconditions.checkState(result.getErrorCount() == 0); +latch.countDown(); +break; + default: +logger.error("Query with QueryId: {} is in unexpected state: {}", queryId, state); --- End diff -- That may be an issue as well. AFAIK [DRILL-2498](https://github.com/apache/drill/commit/1d9d82b001810605e3f94ab3a5517dc0ed739715#diff-158c887d198393117d3a1bbc42114a8b) ensures that only the final state is sent to the client using `sendResult`; this is the terminal message from server to client for that query. So if that message is wrong, the query is in an illegal state.
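The contract discussed in this thread — `sendResult` delivers exactly one terminal message (COMPLETED, CANCELED, or FAILED), and a waiting thread is released by a latch — can be shown with a small self-contained sketch. Class and method names here are illustrative, not the actual `AbstractUserClientConnectionWrapper` source.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Minimal sketch (hypothetical names) of the latch-based wrapper pattern:
// the waiting thread blocks until the single terminal message arrives.
class QueryResultWaiter {
    enum State { COMPLETED, CANCELED, FAILED }

    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile State finalState;
    private volatile Exception error;

    // Called once by the delivery thread with the terminal message.
    void sendResult(State state, Exception e) {
        finalState = state;
        if (state == State.FAILED) {
            error = e;
        }
        latch.countDown();  // release any thread blocked in await()
    }

    // Wait up to timeoutMillis for the terminal state; false on timeout.
    boolean await(long timeoutMillis) {
        try {
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    State getFinalState() { return finalState; }
    Exception getError() { return error; }
}
```

Because the latch counts down exactly once, a second (illegal) terminal message would be observable as the review comment describes: the query is already in a terminal state when it arrives.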
[jira] [Updated] (DRILL-5229) Upgrade kudu client to org.apache.kudu:kudu-client:1.2.0
[ https://issues.apache.org/jira/browse/DRILL-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudheesh Katkam updated DRILL-5229: --- Labels: ready-to-commit (was: ) > Upgrade kudu client to org.apache.kudu:kudu-client:1.2.0 > - > > Key: DRILL-5229 > URL: https://issues.apache.org/jira/browse/DRILL-5229 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Other >Affects Versions: 1.8.0 >Reporter: Rahul Raj >Assignee: Sudheesh Katkam > Labels: ready-to-commit > > Getting an "out-of-order key" error for the query select v,count(k) from > kudu.test group by v, where k is the primary key. This happens only when the > aggregation is done on the primary key. Should Drill move to the latest kudu > client to investigate this further? > The current Drill kudu connector uses org.kududb:kudu-client:0.6.0 from the > Cloudera repository, whereas the latest released library, > org.apache.kudu:kudu-client:1.2.0, is hosted on Maven Central. There are a > few breaking changes with the new library: >1. TIMESTAMP renamed to UNIXTIME_MICROS >2. In KuduRecordReader#setup - >KuduScannerBuilder#lowerBoundPartitionKeyRaw renamed to lowerBoundRaw >and KuduScannerBuilder#exclusiveUpperBoundPartitionKeyRaw renamed to >exclusiveUpperBoundRaw. Both methods are deprecated. >3. In KuduRecordWriterImpl#updateSchema - client.createTable(name, >kuduSchema) requires CreateTableOptions as the third argument
[jira] [Updated] (DRILL-5229) Upgrade kudu client to org.apache.kudu:kudu-client:1.2.0
[ https://issues.apache.org/jira/browse/DRILL-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudheesh Katkam updated DRILL-5229: --- Fix Version/s: (was: 2.0.0)
[jira] [Assigned] (DRILL-5229) Upgrade kudu client to org.apache.kudu:kudu-client:1.2.0
[ https://issues.apache.org/jira/browse/DRILL-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudheesh Katkam reassigned DRILL-5229: -- Assignee: Sudheesh Katkam
[jira] [Commented] (DRILL-5229) Upgrade kudu client to org.apache.kudu:kudu-client:1.2.0
[ https://issues.apache.org/jira/browse/DRILL-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025336#comment-16025336 ] ASF GitHub Bot commented on DRILL-5229: --- Github user sudheeshkatkam commented on the issue: https://github.com/apache/drill/pull/828 +1 The error seems unrelated to the changes, and all tests pass. Thank you for the PR!
[jira] [Commented] (DRILL-4984) Limit 0 raises NullPointerException on JDBC storage sources
[ https://issues.apache.org/jira/browse/DRILL-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025177#comment-16025177 ] Holger Kiel commented on DRILL-4984: Also unable to use Drill as jdbc source in Spark/Scala because of this bug. > Limit 0 raises NullPointerException on JDBC storage sources > --- > > Key: DRILL-4984 > URL: https://issues.apache.org/jira/browse/DRILL-4984 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning & Optimization >Affects Versions: 1.8.0, 1.9.0, 1.10.0 > Environment: Latest 1.9 release also 1.8 release version, > mysql-connector-java-5.1.30, mysql-connector-java-5.1.40 >Reporter: Holger Kiel > > NullPointerExceptions occur when a query with 'limit 0' is executed on a jdbc > storage source (e.g. Mysql): > {code} > 0: jdbc:drill:zk=local> select * from mysql.sugarcrm.sales_person limit 0; > Error: SYSTEM ERROR: NullPointerException > [Error Id: 6cd676fc-6db9-40b3-81d5-c2db044aeb77 on localhost:31010] > (org.apache.drill.exec.work.foreman.ForemanException) Unexpected exception > during fragment initialization: null > org.apache.drill.exec.work.foreman.Foreman.run():281 > java.util.concurrent.ThreadPoolExecutor.runWorker():1142 > java.util.concurrent.ThreadPoolExecutor$Worker.run():617 > java.lang.Thread.run():745 > Caused By (java.lang.NullPointerException) null > > org.apache.drill.exec.planner.sql.handlers.FindHardDistributionScans.visit():55 > org.apache.calcite.rel.core.TableScan.accept():166 > org.apache.calcite.rel.RelShuttleImpl.visitChild():53 > org.apache.calcite.rel.RelShuttleImpl.visitChildren():68 > org.apache.calcite.rel.RelShuttleImpl.visit():126 > org.apache.calcite.rel.AbstractRelNode.accept():256 > org.apache.calcite.rel.RelShuttleImpl.visitChild():53 > org.apache.calcite.rel.RelShuttleImpl.visitChildren():68 > org.apache.calcite.rel.RelShuttleImpl.visit():126 > org.apache.calcite.rel.AbstractRelNode.accept():256 > > 
org.apache.drill.exec.planner.sql.handlers.FindHardDistributionScans.canForceSingleMode():45 > > org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel():262 > > org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel():290 > org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan():168 > org.apache.drill.exec.planner.sql.DrillSqlWorker.getPhysicalPlan():123 > org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan():97 > org.apache.drill.exec.work.foreman.Foreman.runSQL():1008 > org.apache.drill.exec.work.foreman.Foreman.run():264 > java.util.concurrent.ThreadPoolExecutor.runWorker():1142 > java.util.concurrent.ThreadPoolExecutor$Worker.run():617 > java.lang.Thread.run():745 (state=,code=0) > 0: jdbc:drill:zk=local> select * from mysql.sugarcrm.sales_person limit 1; > +-+-+++-+ > | id | first_name | last_name| full_name | manager_id | > +-+-+++-+ > | 1 | null| Administrator | admin | 0 | > +-+-+++-+ > 1 row selected (0,235 seconds) > {code} > Other datasources are okay: > {code} > 0: jdbc:drill:zk=local> SELECT * FROM cp.`employee.json` LIMIT 0; > +--+---+---+-+--++-++--+-+---++-++-++--+-+-+--+ > | fqn | filename | filepath | suffix | employee_id | full_name | > first_name | last_name | position_id | position_title | store_id | > department_id | birth_date | hire_date | salary | supervisor_id | > education_level | marital_status | gender | management_role | > +--+---+---+-+--++-++--+-+---++-++-++--+-+-+--+ > +--+---+---+-+--++-++--+-+---++-++-++--+-+-+--+ > No rows selected (0,309 seconds) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5457) Support Spill to Disk for the Hash Aggregate Operator
[ https://issues.apache.org/jira/browse/DRILL-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025124#comment-16025124 ] ASF GitHub Bot commented on DRILL-5457: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/822#discussion_r118545693 --- Diff: exec/java-exec/src/main/resources/drill-module.conf --- @@ -179,6 +179,26 @@ drill.exec: { // Use plain Java compilation where available prefer_plain_java: false }, + spill: { --- End diff -- Added "spill" and "hashagg" sections in the override example file, with some comments:

  spill: {
    # These options are common to all spilling operators.
    # They can be overridden per operator (but this is just for
    # backward compatibility, and may be deprecated in the future)
    directories : [ "/tmp/drill/spill" ],
    fs : "file:///"
  }
  hashagg: {
    # The partitions divide the work inside the hashagg, to ease
    # handling spilling. This initial figure is tuned down when
    # memory is limited.
    # Setting this option to 1 disables spilling !
    num_partitions: 32,
    spill: {
      # The 2 options below override the common ones;
      # they should be deprecated in the future
      directories : [ "/tmp/drill/spill" ],
      fs : "file:///"
    }
  },

> Support Spill to Disk for the Hash Aggregate Operator > - > > Key: DRILL-5457 > URL: https://issues.apache.org/jira/browse/DRILL-5457 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Relational Operators >Affects Versions: 1.10.0 >Reporter: Boaz Ben-Zvi >Assignee: Boaz Ben-Zvi > Fix For: 1.11.0 > > > Support gradual spilling memory to disk as the available memory gets too > small to allow in memory work for the Hash Aggregate Operator.
[jira] [Updated] (DRILL-5533) Fix flag assignment in FunctionInitializer.checkInit() method
[ https://issues.apache.org/jira/browse/DRILL-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudheesh Katkam updated DRILL-5533: --- Labels: ready-to-commit (was: ) > Fix flag assignment in FunctionInitializer.checkInit() method > - > > Key: DRILL-5533 > URL: https://issues.apache.org/jira/browse/DRILL-5533 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva >Priority: Minor > Labels: ready-to-commit > > The FunctionInitializer.checkInit() method uses double-checked locking (DCL) to ensure > that the function body is loaded only once. But the flag parameter is never updated, so > all threads enter the synchronized block. > Also, FunctionInitializer.getImports() always returns an empty list. > https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionInitializer.java > Changes: > 1. Fix DCL in the FunctionInitializer.checkInit() method (update the flag > once the function body is loaded). > 2. Fix the ImportGrabber.getImports() method to return the list of imports. > 3. Add unit tests for FunctionInitializer. > 4. Minor refactoring (rename methods, add javadoc).
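The bug described above is the classic double-checked-locking mistake: the guard flag is never assigned, so every caller takes the lock. A minimal corrected sketch (a hypothetical class, not the actual FunctionInitializer source) looks like this. The flag must be volatile, and must be set inside the synchronized block only after the load completes:

```java
// Sketch of the corrected double-checked locking (DCL) pattern described
// in the ticket. Illustrative only; not Drill's FunctionInitializer.
class LazyFunctionBody {
    private volatile boolean initialized;  // volatile is required for safe DCL
    private String functionBody;

    String getFunctionBody() {
        if (!initialized) {                  // first, unsynchronized check
            synchronized (this) {
                if (!initialized) {          // second check under the lock
                    functionBody = loadBody();
                    initialized = true;      // the fix: actually set the flag
                }
            }
        }
        return functionBody;
    }

    private String loadBody() {
        return "body";  // stands in for parsing the function source
    }
}
```

The volatile write to `initialized` after the assignment of `functionBody` is what makes the unsynchronized fast-path read safe under the Java Memory Model; without `volatile`, DCL is broken even with the flag set.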
[jira] [Commented] (DRILL-5533) Fix flag assignment in FunctionInitializer.checkInit() method
[ https://issues.apache.org/jira/browse/DRILL-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025064#comment-16025064 ] ASF GitHub Bot commented on DRILL-5533: --- Github user sudheeshkatkam commented on the issue: https://github.com/apache/drill/pull/843 +1
[jira] [Commented] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025045#comment-16025045 ] ASF GitHub Bot commented on DRILL-5356: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/789 Thanks. I'll clean up the messy commits today. Not sure how it picked up the other six commits...
[jira] [Commented] (DRILL-4824) Add not-provided and null states for map and list fields in JSON
[ https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025040#comment-16025040 ] Paul Rogers commented on DRILL-4824: The trick, of course, to adding the new null states is that the existing "bit" vector is used by all operators in code generation, and by Drill clients such as ODBC and JDBC drivers. Further, Apache Arrow is a fork of Drill, so improving our null support will drive the two projects further apart. Planning for all this stuff is required before we start writing code. For example, if we know that a client is a version before this fix, we can translate the new null vector into the "legacy" bit vector. But, Drill does not have a versioned client API, so we have no way to know the version of the client. So, we have to tackle that problem as well. In short, this is an important, but non-trivial, project. > Add not-provided and null states for map and list fields in JSON > > > Key: DRILL-4824 > URL: https://issues.apache.org/jira/browse/DRILL-4824 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - JSON >Affects Versions: 1.0.0 >Reporter: Roman >Assignee: Volodymyr Vysotskyi > > There is incorrect output in case of JSON file with complex nested data. 
> _JSON:_ > {code:none|title=example.json|borderStyle=solid} > { > "Field1" : { > } > } > { > "Field1" : { > "InnerField1": {"key1":"value1"}, > "InnerField2": {"key2":"value2"} > } > } > { > "Field1" : { > "InnerField3" : ["value3", "value4"], > "InnerField4" : ["value5", "value6"] > } > } > {code} > _Query:_ > {code:sql} > select Field1 from dfs.`/tmp/example.json` > {code} > _Incorrect result:_ > {code:none} > +---+ > | Field1 | > +---+ > {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]} > {"InnerField1":{"key1":"value1"},"InnerField2": > {"key2":"value2"},"InnerField3":[],"InnerField4":[]} > {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} > +--+ > {code} > There is no need to output missing fields. In case of a deeply nested > structure we will get an unreadable result for the user. > _Correct result:_ > {code:none} > +--+ > | Field1 | > +--+ > |{} > {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}} > {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} > +--+ > {code}
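The desired behavior in the ticket — render only the fields a record actually provided, instead of materializing every field seen anywhere in the file — amounts to tracking a per-field state (not provided, explicit null, or present). The sketch below is an illustrative stand-in with hypothetical names, not Drill's vector or JSON reader code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative sketch of the three per-field states the ticket asks for:
// NOT_PROVIDED fields are skipped entirely on output, while explicit nulls
// and concrete values are rendered.
class SparseMapRecord {
    enum State { NOT_PROVIDED, NULL_VALUE, PRESENT }

    private final Map<String, String> values = new LinkedHashMap<>();
    private final Map<String, State> states = new LinkedHashMap<>();

    // A field not passed to set() stays NOT_PROVIDED and is never rendered.
    void set(String field, String value) {
        states.put(field, value == null ? State.NULL_VALUE : State.PRESENT);
        values.put(field, value);
    }

    // Render only the fields this record actually provided.
    String render() {
        StringJoiner sj = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, State> e : states.entrySet()) {
            String v = e.getValue() == State.NULL_VALUE
                ? "null"
                : "\"" + values.get(e.getKey()) + "\"";
            sj.add("\"" + e.getKey() + "\":" + v);
        }
        return sj.toString();
    }
}
```

An empty record renders as `{}`, matching the first row of the ticket's expected output, rather than listing every field with empty placeholders.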
[jira] [Commented] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024996#comment-16024996 ] ASF GitHub Bot commented on DRILL-5356: --- Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/789 I took the entire patch and applied it to master (use git am -3). Git manages to figure out that the commits are already applied. One commit caused a merge conflict and I skipped it. In the end it left me with only the one commit.
[jira] [Commented] (DRILL-4824) Add not-provided and null states for map and list fields in JSON
[ https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024994#comment-16024994 ] Paul Rogers commented on DRILL-4824: Turns out there is a flaw in the value vector code that does not "back fill" missing offset vector values for repeated types. The logic works fine for Varchar columns, but not repeated columns. The repeated type problem will be fixed as part of the memory fragmentation work in which we are creating a new version of the "writers" used to move data into value vectors. Please don't spend time fixing this part of the current code as that existing code will be retired. > Add not-provided and null states for map and list fields in JSON > > > Key: DRILL-4824 > URL: https://issues.apache.org/jira/browse/DRILL-4824 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - JSON >Affects Versions: 1.0.0 >Reporter: Roman >Assignee: Volodymyr Vysotskyi > > There is incorrect output in the case of a JSON file with complex nested data. > _JSON:_ > {code:none|title=example.json|borderStyle=solid} > { > "Field1" : { > } > } > { > "Field1" : { > "InnerField1": {"key1":"value1"}, > "InnerField2": {"key2":"value2"} > } > } > { > "Field1" : { > "InnerField3" : ["value3", "value4"], > "InnerField4" : ["value5", "value6"] > } > } > {code} > _Query:_ > {code:sql} > select Field1 from dfs.`/tmp/example.json` > {code} > _Incorrect result:_ > {code:none} > +---+ > | Field1 | > +---+ > {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]} > {"InnerField1":{"key1":"value1"},"InnerField2" > {"key2":"value2"},"InnerField3":[],"InnerField4":[]} > {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} > +--+ > {code} > There is no need to output missing fields. In the case of a deeply nested > structure we will get an unreadable result for the user. 
> _Correct result:_ > {code:none} > +--+ > | Field1 | > +--+ > |{} > {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}} > {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} > +--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
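The "back fill" behavior Paul describes can be illustrated outside Drill. A repeated (array-valued) column keeps all values in one buffer plus an offset vector with one entry per row plus one; row i owns values[offsets[i] .. offsets[i+1]). A row for which the writer wrote nothing must still get its ending offset copied forward, otherwise later rows appear to own the skipped row's slots. A minimal sketch in plain Java — illustrative only, not Drill's actual ValueVector code (the class and method names here are invented):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of an offset vector for a repeated column: offsets has
// rowCount + 1 entries; row i owns values[offsets[i] .. offsets[i+1]).
public class OffsetBackfill {

    // Build offsets for rows, where a null entry means "row wrote nothing".
    // Rows that skipped writing are back-filled with the previous offset,
    // so an empty row contributes an empty (not shifted) range.
    public static int[] buildOffsets(List<int[]> rows) {
        int[] offsets = new int[rows.size() + 1];
        int end = 0;
        for (int i = 0; i < rows.size(); i++) {
            int[] row = rows.get(i);
            if (row != null) {
                end += row.length;
            }
            offsets[i + 1] = end;  // back-fill: empty row repeats previous offset
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<int[]> rows = new ArrayList<>();
        rows.add(new int[] {10, 20});  // row 0: two values
        rows.add(null);                // row 1: writer skipped this row
        rows.add(new int[] {30});      // row 2: one value
        // prints [0, 2, 2, 3]: row 1 is empty rather than stealing row 2's value
        System.out.println(Arrays.toString(buildOffsets(rows)));
    }
}
```

The bug described above is, in effect, the absence of the back-fill step for repeated types: without it, an offset entry for the empty row is never written and downstream readers mis-attribute values.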
[jira] [Assigned] (DRILL-5379) Set Hdfs Block Size based on Parquet Block Size
[ https://issues.apache.org/jira/browse/DRILL-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudheesh Katkam reassigned DRILL-5379: -- Assignee: Sudheesh Katkam > Set Hdfs Block Size based on Parquet Block Size > --- > > Key: DRILL-5379 > URL: https://issues.apache.org/jira/browse/DRILL-5379 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.9.0 >Reporter: F Méthot >Assignee: Sudheesh Katkam > Labels: ready-to-commit > Fix For: Future > > > It seems there is a way to force Drill to store CTAS-generated Parquet files as a > single block when using HDFS. The Java HDFS API allows this: files could be > created with the Parquet block size set in a session or system config, > since it is ideal to have a single Parquet file per HDFS block. > Here is the HDFS API that allows this: > http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long) > Drill uses the Hadoop ParquetFileWriter > (https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java). > This is where the file creation occurs, so it might be tricky. > However, ParquetRecordWriter.java > (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java) > in Drill creates the ParquetFileWriter with a Hadoop Configuration object. > Something to explore: could the block size be set as a property within the > Configuration object before passing it to the ParquetFileWriter constructor? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
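The `FileSystem.create` overload linked in the ticket takes the block size directly, so a writer can match the HDFS block size to the Parquet row-group size at file-creation time. A hedged sketch (assumes Hadoop's FileSystem API on the classpath; the class and method names are illustrative, not Drill's actual writer code):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: create a file whose HDFS block size equals the Parquet
// block (row-group) size, so each Parquet file fits in a single block.
public class SingleBlockCreate {
    public static FSDataOutputStream createWithBlockSize(
            Configuration conf, Path path, long parquetBlockSize)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        return fs.create(path,
            true,                                        // overwrite
            conf.getInt("io.file.buffer.size", 4096),    // write buffer size
            (short) conf.getInt("dfs.replication", 3),   // replication factor
            parquetBlockSize);                           // HDFS block size = row-group size
    }
}
```

The alternative the ticket flags as "something to explore" would be setting the block-size property (e.g. `dfs.blocksize`) on the Configuration handed to ParquetFileWriter, avoiding changes to the file-creation call itself; whether ParquetFileWriter honors it would need to be verified.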
[jira] [Assigned] (DRILL-5379) Set Hdfs Block Size based on Parquet Block Size
[ https://issues.apache.org/jira/browse/DRILL-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudheesh Katkam reassigned DRILL-5379: -- Assignee: Padma Penumarthy (was: Sudheesh Katkam) > Set Hdfs Block Size based on Parquet Block Size > --- > > Key: DRILL-5379 > URL: https://issues.apache.org/jira/browse/DRILL-5379 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.9.0 >Reporter: F Méthot >Assignee: Padma Penumarthy > Labels: ready-to-commit > Fix For: Future > > > It seems there is a way to force Drill to store CTAS-generated Parquet files as a > single block when using HDFS. The Java HDFS API allows this: files could be > created with the Parquet block size set in a session or system config, > since it is ideal to have a single Parquet file per HDFS block. > Here is the HDFS API that allows this: > http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long) > Drill uses the Hadoop ParquetFileWriter > (https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java). > This is where the file creation occurs, so it might be tricky. > However, ParquetRecordWriter.java > (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java) > in Drill creates the ParquetFileWriter with a Hadoop Configuration object. > Something to explore: could the block size be set as a property within the > Configuration object before passing it to the ParquetFileWriter constructor? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024983#comment-16024983 ] ASF GitHub Bot commented on DRILL-5356: --- Github user sudheeshkatkam commented on the issue: https://github.com/apache/drill/pull/789 Are the changes only in 1494915dbef5dbd5996c19d0a2e89ca450a8ae3a (to cherry pick)? > Refactor Parquet Record Reader > -- > > Key: DRILL-5356 > URL: https://issues.apache.org/jira/browse/DRILL-5356 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.10.0, 1.11.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.11.0 > > > The Parquet record reader class is a key part of Drill that has evolved over > time to become somewhat hard to follow. > A number of us are working on Parquet-related tasks and find we have to spend > an uncomfortable amount of time trying to understand the code. In particular, > this writer needs to figure out how to convince the reader to provide > higher-density record batches. > Rather than continue to decipher the complex code multiple times, this ticket > requests to refactor the code to make it functionally identical, but > structurally cleaner. The result will be faster time to value when working > with this code. > This is a lower-priority change and will be coordinated with others working > on this code base. This ticket is only for the record reader class itself; it > does not include the various readers and writers that Parquet uses since > another project is actively modifying those classes. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (DRILL-5356) Refactor Parquet Record Reader
[ https://issues.apache.org/jira/browse/DRILL-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudheesh Katkam updated DRILL-5356: --- Labels: ready-to-commit (was: ) > Refactor Parquet Record Reader > -- > > Key: DRILL-5356 > URL: https://issues.apache.org/jira/browse/DRILL-5356 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.10.0, 1.11.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.11.0 > > > The Parquet record reader class is a key part of Drill that has evolved over > time to become somewhat hard to follow. > A number of us are working on Parquet-related tasks and find we have to spend > an uncomfortable amount of time trying to understand the code. In particular, > this writer needs to figure out how to convince the reader to provide > higher-density record batches. > Rather than continue to decipher the complex code multiple times, this ticket > requests to refactor the code to make it functionally identical, but > structurally cleaner. The result will be faster time to value when working > with this code. > This is a lower-priority change and will be coordinated with others working > on this code base. This ticket is only for the record reader class itself; it > does not include the various readers and writers that Parquet uses since > another project is actively modifying those classes. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (DRILL-4824) Add not-provided and null states for map and list fields in JSON
[ https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Volodymyr Vysotskyi updated DRILL-4824: --- Summary: Add not-provided and null states for map and list fields in JSON (was: JSON with complex nested data produces incorrect output with missing fields) > Add not-provided and null states for map and list fields in JSON > > > Key: DRILL-4824 > URL: https://issues.apache.org/jira/browse/DRILL-4824 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - JSON >Affects Versions: 1.0.0 >Reporter: Roman >Assignee: Volodymyr Vysotskyi > > There is incorrect output in the case of a JSON file with complex nested data. > _JSON:_ > {code:none|title=example.json|borderStyle=solid} > { > "Field1" : { > } > } > { > "Field1" : { > "InnerField1": {"key1":"value1"}, > "InnerField2": {"key2":"value2"} > } > } > { > "Field1" : { > "InnerField3" : ["value3", "value4"], > "InnerField4" : ["value5", "value6"] > } > } > {code} > _Query:_ > {code:sql} > select Field1 from dfs.`/tmp/example.json` > {code} > _Incorrect result:_ > {code:none} > +---+ > | Field1 | > +---+ > {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]} > {"InnerField1":{"key1":"value1"},"InnerField2" > {"key2":"value2"},"InnerField3":[],"InnerField4":[]} > {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} > +--+ > {code} > There is no need to output missing fields. In the case of a deeply nested > structure we will get an unreadable result for the user. > _Correct result:_ > {code:none} > +--+ > | Field1 | > +--+ > |{} > {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}} > {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} > +--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5539) drillbit.sh script breaks if the working directory contains spaces
[ https://issues.apache.org/jira/browse/DRILL-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024975#comment-16024975 ] Paul Rogers commented on DRILL-5539: On the surface, this looks pretty easy: just put quotes where needed. As it turns out, {{drillbit.sh}} calls {{drill-config.sh}} to do all the heavy lifting, and {{drill-config.sh}} is itself called by many of our scripts. It does lots of path work to find the config files, find directories, find Java and so on. The scripts presently assume no spaces in directory names. Spaceless names are the general rule on Linux, but Windows often uses spaces, most notably in the {{C:\Program Files}} directory. Further, we have a unit test (not yet checked in) for the scripts that should be modified to test for the case you found. See DRILL-5540 for a request to check the shell script unit tests into Apache Drill master. > drillbit.sh script breaks if the working directory contains spaces > -- > > Key: DRILL-5539 > URL: https://issues.apache.org/jira/browse/DRILL-5539 > Project: Apache Drill > Issue Type: Bug > Environment: Linux >Reporter: Muhammad Gelbana > > The following output occurred when we tried running the drillbit.sh script in > a path that contains spaces: */home/folder1/Folder Name/drill/bin* > {noformat} > [mgelbana@regression-sysops bin]$ ./drillbit.sh start > ./drillbit.sh: line 114: [: /home/folder1/Folder: binary operator expected > Starting drillbit, logging to /home/folder1/Folder Name/drill/log/drillbit.out > ./drillbit.sh: line 147: $pid: ambiguous redirect > [mgelbana@regression-sysops bin]$ pwd > /home/folder1/Folder Name/drill/bin > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (DRILL-5540) Provide unit tests for the Drill shell scripts
Paul Rogers created DRILL-5540: -- Summary: Provide unit tests for the Drill shell scripts Key: DRILL-5540 URL: https://issues.apache.org/jira/browse/DRILL-5540 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.8.0 Reporter: Paul Rogers Assignee: Paul Rogers Priority: Minor Fix For: 1.11.0 The Drill-on-YARN project created a unit test that exercises the Drill shell scripts to ensure that they work as expected. (It is very hard to debug the scripts when launched under YARN, so we had to fully test them stand-alone to ensure that they work properly under YARN.) This ticket asks to commit those scripts to Drill separately from the large DoY commit, as the YARN dependencies can be easily removed. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (DRILL-5467) Issue with column alias for nested table calculated columns
[ https://issues.apache.org/jira/browse/DRILL-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva closed DRILL-5467. --- Resolution: Duplicate Duplicates DRILL-5537. > Issue with column alias for nested table calculated columns > --- > > Key: DRILL-5467 > URL: https://issues.apache.org/jira/browse/DRILL-5467 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Rakesh >Assignee: Vitalii Diravka > > The column alias is not always correctly used in output. When columns are > calculated in a nested table, the outermost project doesn't show the column > alias correctly: > SELECT `Custom_SQL_Query`.`Bucket` AS `Bucket`, > SUM(`Custom_SQL_Query`.`male`) AS `sum_male` > FROM (SELECT first_name as `Bucket`, salary as `num`, case when gender = 'M' > then 1 else 0 end as male, case when gender = 'F' then 1 else 0 end as female > FROM cp.`employee.json`) `Custom_SQL_Query` > GROUP BY `Custom_SQL_Query`.`Bucket` > Here 'sum_male' appears as $f1 instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5537) Display columns alias for queries with sum() when RDBMS storage plugin is enabled
[ https://issues.apache.org/jira/browse/DRILL-5537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024768#comment-16024768 ] ASF GitHub Bot commented on DRILL-5537: --- GitHub user arina-ielchiieva opened a pull request: https://github.com/apache/drill/pull/845 DRILL-5537: Display columns alias for queries with sum() when RDBMS s… …torage plugin is enabled For sum() queries the DrillConvertSumToSumZero rule is applied. But during conversion to the new aggregate call, the call was created with its name set to null, so the column alias was lost when the RDBMS storage plugin was enabled. The RDBMS storage plugin adds ProjectRemoveRule during the PHYSICAL phase; since the project stage was omitted, the column alias was lost. With this fix, even if the project stage is omitted, the column alias will still be shown. Changes: 1. Added the old aggregate call's name during new aggregate call creation in the DrillConvertSumToSumZero rule. 2. Replaced the deprecated AggregateCall constructor with `AggregateCall.create`. 3. Minor refactoring. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/arina-ielchiieva/drill DRILL-5537 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/845.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #845 commit 5e83d6d17232d4ddbff7e11eaadecad9ef992b10 Author: Arina Ielchiieva Date: 2017-05-25T13:23:43Z DRILL-5537: Display columns alias for queries with sum() when RDBMS storage plugin is enabled > Display columns alias for queries with sum() when RDBMS storage plugin is > enabled > - > > Key: DRILL-5537 > URL: https://issues.apache.org/jira/browse/DRILL-5537 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva > > When [RDBMS storage > plugin|https://drill.apache.org/docs/rdbms-storage-plugin/] is enabled, > alias is not displayed for column with sum function: > {noformat} > 0: jdbc:drill:zk=local> select version, sum(1) as s from sys.version group by > version; > +--+--+ > | version | $f1 | > +--+--+ > | 1.11.0-SNAPSHOT | 1| > +--+--+ > 1 row selected (0.444 seconds) > {noformat} > Other functions like avg, count are not affected. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
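For reference, the fix described in this pull request amounts to carrying the old call's name through when DrillConvertSumToSumZero rebuilds the SUM as SUM0. A hedged sketch, not the exact Drill code — it assumes the Calcite `AggregateCall.create(SqlAggFunction, boolean, List<Integer>, int, RelDataType, String)` overload available in the Calcite version Drill used at the time, and the variable names are illustrative:

```java
// Inside the rule: rebuild SUM as SUM0 ("sum, empty is zero")
// but keep the user's alias.
AggregateCall sumZeroCall = AggregateCall.create(
    new SqlSumEmptyIsZeroAggFunction(),
    oldCall.isDistinct(),
    oldCall.getArgList(),
    -1,                    // no FILTER clause
    oldCall.getType(),
    oldCall.getName());    // previously null, which dropped the alias ($f1)
```

Because the downstream Project that would normally restore the alias can be removed by ProjectRemoveRule (see DRILL-5538), the alias must survive on the aggregate call itself.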
[jira] [Commented] (DRILL-5538) Exclude ProjectRemoveRule during PHYSICAL phase if it comes from storage plugins
[ https://issues.apache.org/jira/browse/DRILL-5538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024727#comment-16024727 ] ASF GitHub Bot commented on DRILL-5538: --- GitHub user arina-ielchiieva opened a pull request: https://github.com/apache/drill/pull/844 DRILL-5538: Exclude ProjectRemoveRule during PHYSICAL phase if it com… …es from storage plugins Details in DRILL-5538 description. You can merge this pull request into a Git repository by running: $ git pull https://github.com/arina-ielchiieva/drill DRILL-5538 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/844.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #844 commit 874edfc86d4f69ecb917bd158b6afc1282ff34e7 Author: Arina Ielchiieva Date: 2017-05-25T11:34:31Z DRILL-5538: Exclude ProjectRemoveRule during PHYSICAL phase if it comes from storage plugins > Exclude ProjectRemoveRule during PHYSICAL phase if it comes from storage > plugins > > > Key: DRILL-5538 > URL: https://issues.apache.org/jira/browse/DRILL-5538 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning & Optimization >Affects Versions: 1.10.0 >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva > > When [RDBMS storage > plugin|https://drill.apache.org/docs/rdbms-storage-plugin/] is enabled, > during query execution certain JDBC rules are added. > One of the rules is > [ProjectRemoveRule|https://github.com/apache/drill/blob/master/contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/JdbcStoragePlugin.java#L140]. > Drill also uses this rule, but only during phases when it considers it useful, for > example, during LOGICAL and JOIN_PLANNING. On the contrary, storage plugin > rules are added to every phase of query planning. Thus the project stage can be > removed when it is actually needed. 
> Sometimes ProjectRemoveRule decides that a project is trivial and removes it, > even though during this stage Drill added a column alias or removed implicit > columns. > For example, with the RDBMS plugin enabled, the alias is not displayed for a simple > query: > {noformat} > 0: jdbc:drill:zk=local> create temporary table t as select * from sys.version; > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > +---++ > | Fragment | Number of records written | > +---++ > | 0_0 | 1 | > +---++ > 1 row selected (0.623 seconds) > 0: jdbc:drill:zk=local> select version as current_version from t; > +--+ > | version | > +--+ > | 1.11.0-SNAPSHOT | > +--+ > 1 row selected (0.28 seconds) > {noformat} > The proposed fix is to exclude ProjectRemoveRule during the PHYSICAL phase if it > comes from storage plugins, to prevent Drill from losing a column alias or displaying > implicit columns. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (DRILL-5539) drillbit.sh script breaks if the working directory contains spaces
Muhammad Gelbana created DRILL-5539: --- Summary: drillbit.sh script breaks if the working directory contains spaces Key: DRILL-5539 URL: https://issues.apache.org/jira/browse/DRILL-5539 Project: Apache Drill Issue Type: Bug Environment: Linux Reporter: Muhammad Gelbana The following output occurred when we tried running the drillbit.sh script in a path that contains spaces: */home/folder1/Folder Name/drill/bin* {noformat} [mgelbana@regression-sysops bin]$ ./drillbit.sh start ./drillbit.sh: line 114: [: /home/folder1/Folder: binary operator expected Starting drillbit, logging to /home/folder1/Folder Name/drill/log/drillbit.out ./drillbit.sh: line 147: $pid: ambiguous redirect [mgelbana@regression-sysops bin]$ pwd /home/folder1/Folder Name/drill/bin {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (DRILL-5538) Exclude ProjectRemoveRule during PHYSICAL phase if it comes from storage plugins
Arina Ielchiieva created DRILL-5538: --- Summary: Exclude ProjectRemoveRule during PHYSICAL phase if it comes from storage plugins Key: DRILL-5538 URL: https://issues.apache.org/jira/browse/DRILL-5538 Project: Apache Drill Issue Type: Bug Components: Query Planning & Optimization Affects Versions: 1.10.0 Reporter: Arina Ielchiieva Assignee: Arina Ielchiieva When [RDBMS storage plugin|https://drill.apache.org/docs/rdbms-storage-plugin/] is enabled, during query execution certain JDBC rules are added. One of the rules is [ProjectRemoveRule|https://github.com/apache/drill/blob/master/contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/JdbcStoragePlugin.java#L140]. Drill also uses this rule, but only during phases when it considers it useful, for example, during LOGICAL and JOIN_PLANNING. On the contrary, storage plugin rules are added to every phase of query planning. Thus the project stage can be removed when it is actually needed. Sometimes ProjectRemoveRule decides that a project is trivial and removes it, even though during this stage Drill added a column alias or removed implicit columns. For example, with the RDBMS plugin enabled, the alias is not displayed for a simple query: {noformat} 0: jdbc:drill:zk=local> create temporary table t as select * from sys.version; SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. +---++ | Fragment | Number of records written | +---++ | 0_0 | 1 | +---++ 1 row selected (0.623 seconds) 0: jdbc:drill:zk=local> select version as current_version from t; +--+ | version | +--+ | 1.11.0-SNAPSHOT | +--+ 1 row selected (0.28 seconds) {noformat} The proposed fix is to exclude ProjectRemoveRule during the PHYSICAL phase if it comes from storage plugins, to prevent Drill from losing a column alias or displaying implicit columns. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (DRILL-5537) Display columns alias for queries with sum() when RDBMS storage plugin is enabled
Arina Ielchiieva created DRILL-5537: --- Summary: Display columns alias for queries with sum() when RDBMS storage plugin is enabled Key: DRILL-5537 URL: https://issues.apache.org/jira/browse/DRILL-5537 Project: Apache Drill Issue Type: Bug Affects Versions: 1.10.0 Reporter: Arina Ielchiieva Assignee: Arina Ielchiieva When [RDBMS storage plugin|https://drill.apache.org/docs/rdbms-storage-plugin/] is enabled, alias is not displayed for column with sum function: {noformat} 0: jdbc:drill:zk=local> select version, sum(1) as s from sys.version group by version; +--+--+ | version | $f1 | +--+--+ | 1.11.0-SNAPSHOT | 1| +--+--+ 1 row selected (0.444 seconds) {noformat} Other functions like avg, count are not affected. -- This message was sent by Atlassian JIRA (v6.3.15#6346)