[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-07-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072906#comment-16072906
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/831
  
Squashed commits and committed to Apache master. Congratulations!


> Want a memory format for PCAP files
> ---
>
> Key: DRILL-5432
> URL: https://issues.apache.org/jira/browse/DRILL-5432
> Project: Apache Drill
> Issue Type: New Feature
> Reporter: Ted Dunning
>
> PCAP files [1] are the de facto standard for storing network capture data. In 
> security and protocol applications, it is very common to want to extract 
> particular packets from a capture for further analysis.
> At a first level, it is desirable to query and filter by source and 
> destination IP and port or by protocol. Beyond that, however, it would be 
> very useful to be able to group packets by TCP session and eventually to look 
> at packet contents. For now, however, the most critical requirement is that 
> we should be able to scan captures at very high speed.
> I previously wrote a (kind of working) proof of concept for a PCAP decoder 
> that did lazy deserialization and could traverse hundreds of MB of PCAP data 
> per second per core. This compares to roughly 2-3 MB/s for widely available 
> Apache-compatible open source PCAP decoders.
> This JIRA covers the integration and extension of that proof of concept as a 
> Drill file format.
> Initial work is available at https://github.com/mapr-demos/drill-pcap-format
> [1] https://en.wikipedia.org/wiki/Pcap
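
The lazy deserialization mentioned in the description can be pictured with a small sketch. This is not the decoder from the pull request: the class name `LazyPcapRecord` and its methods are hypothetical, and only the classic pcap layout is assumed (a 24-byte file header whose magic number is 0xa1b2c3d4, followed by records with a 16-byte header holding seconds, microseconds, captured length and original length). The nanosecond-resolution variant is ignored here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical lazy view over one pcap record; not the decoder from the PR. */
final class LazyPcapRecord {
  static final int GLOBAL_HEADER_LEN = 24;  // size of the pcap file header in bytes
  static final int RECORD_HEADER_LEN = 16;  // size of each per-packet header in bytes

  private final ByteBuffer buf;  // the whole capture, byte order already set
  private final int offset;      // start of this record's header

  LazyPcapRecord(ByteBuffer buf, int offset) {
    this.buf = buf;
    this.offset = offset;
  }

  /** Detects the byte order from the magic number and wraps the file bytes. */
  static ByteBuffer wrap(byte[] file) {
    ByteBuffer b = ByteBuffer.wrap(file).order(ByteOrder.BIG_ENDIAN);
    int magic = b.getInt(0);
    if (magic == 0xa1b2c3d4) {
      return b;                                  // written big-endian
    } else if (magic == 0xd4c3b2a1) {
      return b.order(ByteOrder.LITTLE_ENDIAN);   // written little-endian
    }
    throw new IllegalArgumentException("not a classic pcap file");
  }

  // Each accessor reads only the field it is asked for; nothing is parsed eagerly.
  long timestampSeconds() { return buf.getInt(offset) & 0xffffffffL; }
  long timestampMicros()  { return buf.getInt(offset + 4) & 0xffffffffL; }
  int capturedLength()    { return buf.getInt(offset + 8); }
  int originalLength()    { return buf.getInt(offset + 12); }

  /** Offset of the next record, so a scan can skip packet bodies entirely. */
  int nextOffset() { return offset + RECORD_HEADER_LEN + capturedLength(); }
}
```

A scan would start at `GLOBAL_HEADER_LEN` and follow `nextOffset()` from record to record, touching packet contents only when a query actually needs them.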



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-07-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072905#comment-16072905
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/831




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-07-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072788#comment-16072788
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user dmitriyHavrilovich commented on the issue:

https://github.com/apache/drill/pull/831
  
@paul-rogers, is there anything else we can do for this PR?




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070954#comment-16070954
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user tdunning commented on the issue:

https://github.com/apache/drill/pull/831
  
On Fri, Jun 30, 2017 at 5:34 PM, Paul Rogers wrote:

> That only works if Drill has an autoloading capability that allows storage
> formats to be loaded and authenticated easily.
>
> Why? To get pcap in 1.11, one must install a new Drill version, which
> requires a restart. Assuming that pcap were a separate project, you'd
> install a new jar, and restart. Neither provides auto loading.
>
No. But if I have 1.11 already and decide that I want pcap (or any similar
data format), it would be nice if I could do the equivalent of pip (for
python) or install.packages("...") (for R) or mvn test (for Java) and get
whatever cool capability I like.

This may be described as "just install a new jar" but the convenience level
is a proven Big Deal (tm). It would even be possible to leverage Maven
central to make it happen, but there needs to be convenience sugar around
the process to make it consumable.

> Not clear on the authentication issue. If a plugin were a separate
> project, wouldn't that project have a way of certifying its jars?
>
Dunno. Not some random project.

If Drill requires it, then yes.  And I am saying that Drill should require
resolution back to some level of trust.
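
One possible reading of "resolution back to some level of trust" is a pinned checksum that is verified before a downloaded plugin jar is loaded. The sketch below is only an illustration of that idea and does not describe anything Drill does: the class `JarTrustSketch` and both method names are hypothetical, and a real mechanism might rely on signed jars or a curated repository instead.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical checksum check for a plugin jar; not an existing Drill API. */
final class JarTrustSketch {

  /** Returns the lowercase hex SHA-256 digest of the given bytes. */
  static String sha256Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(data)) {
        sb.append(String.format("%02x", b & 0xff));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is a required algorithm on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  /** True only if the jar bytes hash to the checksum published by a trusted source. */
  static boolean isTrusted(byte[] jarBytes, String pinnedSha256Hex) {
    return sha256Hex(jarBytes).equalsIgnoreCase(pinnedSha256Hex);
  }
}
```

An installer in the spirit of pip would download the jar, call something like `isTrusted` against a checksum obtained from a source the project vouches for, and refuse to install on a mismatch.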





[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070901#comment-16070901
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/831
  
> That only works if Drill has an autoloading capability that allows storage
> formats to be loaded and authenticated easily.

Why? To get pcap in 1.11, one must install a new Drill version, which 
requires a restart. Assuming that pcap were a separate project, you'd install a 
new jar, and restart. Neither provides auto loading.

Not clear on the authentication issue. If a plugin were a separate project, 
wouldn't that project have a way of certifying its jars?

Understanding this will help us figure out how to handle the growing set of 
specialized storage plugins...




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070590#comment-16070590
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user tdunning commented on the issue:

https://github.com/apache/drill/pull/831
  
On Fri, Jun 30, 2017 at 9:55 AM, Paul Rogers wrote:

> Later, once Drill provides the correct framework, I'd suggest that this
> code move into a separate Github repo to be maintained by experts in pcap.
> Frankly, most Drill developers are familiar with query engines, not pcap
> (or other specialized formats.)
>
>
That only works if Drill has an autoloading capability that allows storage
formats to be loaded and authenticated easily.





[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070503#comment-16070503
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user dmitriyHavrilovich commented on the issue:

https://github.com/apache/drill/pull/831
  
This is really good. Will do this immediately.




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070396#comment-16070396
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/831
  
Looking at the big picture, Drill should allow specialized plugins such as 
this one to exist as independent projects. Users should be able to download the 
plugin jar, add it to Drill, and go.

As we've discussed, Drill has a bit of work before we get there. We can't 
hold up this work waiting for a better solution.

So, please fix the two minor issues you identified. The code will then be 
ready for a final quick review and approval.

Later, once Drill provides the correct framework, I'd suggest that this 
code move into a separate Github repo to be maintained by experts in pcap. 
Frankly, most Drill developers are familiar with query engines, not pcap (or 
other specialized formats.)

The same is true, for example, of the "indexr" and TSDB plugins which are 
(slowly) working their way through the review process.

Summary: please add the package-info file and the comments in utils. We can 
then give approval.

Can we do this by, say, July 10? If so, we can likely get this PR into 
1.11, if the Release Manager agrees.




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068537#comment-16068537
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user dmitriyHavrilovich commented on the issue:

https://github.com/apache/drill/pull/831
  
We will not move this to contrib because of the 
bootstrap-storage-plugin.json issue. We will add the requested 
package-info.java and the comments in utils. Unless I missed something, is 
that all?




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067459#comment-16067459
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/831
  
Filed DRILL-5618 to describe the `bootstrap-storage-plugin.json` issue.

There are one or two remaining open comments. Will those be addressed, or 
should we do a final review based on the code as it stands now?




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065708#comment-16065708
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r124427466
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map TYPES;
+
+  private static class ProjectedColumnInfo {
+    ValueVector vv;
+    ColumnDto pcapColumn;
+  }
+
+  static {
+    TYPES = ImmutableMap.builder()
+        .put(PcapTypes.STRING, MinorType.VARCHAR)
+        .put(PcapTypes.INTEGER, MinorType.INT)
+        .put(PcapTypes.LONG, MinorType.BIGINT)
+        .put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+        .build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+                          final List projectedColumns) {
+    try {
+      this.in = new FileInputStream(inputPath);
+      this.decoder = getPacketDecoder();
+      validBytes = in.read(buffer);
+    } catch (IOException e) {
+      throw new RuntimeException("File " + inputPath + " not Found");
+    }
+    setColumns(projectedColumns);
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+    this.output = output;
+  }
+
+  @Override
+  public int next() {
+    projectedCols = getProjectedColsIfItNull();
+    try {
+      return parsePcapFilesAndPutItToTable();
+    } catch (IOException io) {
+      throw new RuntimeException("Trouble with 

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065707#comment-16065707
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r124427218
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapDrillTable.java
 ---
@@ -0,0 +1,73 @@
+/*
--- End diff --

The JIRA link is also good. Once the code is into Drill, it will be hard 
for future users to locate the corresponding JIRA. But, if you place a comment 
into a file somewhere, then future developers can find the JIRA.




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065697#comment-16065697
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/831
  
Regarding bootstrap-storage-plugin.json, this is a generic issue. Drill 
must provide a solution that allows each module to provide its own file, as is 
done for drill-module.conf. I did not file a JIRA for this issue, feel free to 
file one.




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064601#comment-16064601
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user Vlad-Storona commented on the issue:

https://github.com/apache/drill/pull/831
  
While moving the project to contrib, we ran into a problem. The java-exec 
package contains the file bootstrap-storage-plugin.json, from which Drill 
reads information about supported file formats. The contrib package has no 
such file. If we move pcap-reader to the contrib package and leave the pcap 
format entry in bootstrap-storage-plugin.json, we get a 
JsonMappingException. If we remove the entry from the config file, Drill 
cannot find pcap files. Maybe I do not have enough information or experience 
with Drill. Can you suggest how to handle this?




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064515#comment-16064515
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user dmitriyHavrilovich commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r124220475
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapDrillTable.java ---
@@ -0,0 +1,73 @@
+/*
--- End diff --

I think a link to the JIRA can be added. All the needed information is
already listed there.




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064512#comment-16064512
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user dmitriyHavrilovich commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r124220049
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map TYPES;
+
+  private static class ProjectedColumnInfo {
+ValueVector vv;
+ColumnDto pcapColumn;
+  }
+
+  static {
+TYPES = ImmutableMap.builder()
+.put(PcapTypes.STRING, MinorType.VARCHAR)
+.put(PcapTypes.INTEGER, MinorType.INT)
+.put(PcapTypes.LONG, MinorType.BIGINT)
+.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+.build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+  final List projectedColumns) {
+try {
+  this.in = new FileInputStream(inputPath);
+  this.decoder = getPacketDecoder();
+  validBytes = in.read(buffer);
+} catch (IOException e) {
+  throw new RuntimeException("File " + inputPath + " not Found");
+}
+setColumns(projectedColumns);
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+this.output = output;
+  }
+
+  @Override
+  public int next() {
+projectedCols = getProjectedColsIfItNull();
+try {
+  return parsePcapFilesAndPutItToTable();
+} catch (IOException io) {
+  throw new RuntimeException("Trouble 

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064510#comment-16064510
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user dmitriyHavrilovich commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r124219513
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/Utils.java ---
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.primitives.Ints;
+import com.google.common.primitives.Shorts;
+
+public class Utils {
+
+  public static int getIntFileOrder(boolean byteOrder, final byte[] buf, final int offset) {
+if (byteOrder) {
--- End diff --

Comment will be added here, explaining this case
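
For readers following the review: the quoted diff does not show what the
boolean byteOrder flag stands for, so the following is only a minimal sketch
of reading a 32-bit value in file byte order. It assumes that true means
big-endian (most significant byte first). FileOrderInts and readInt are
hypothetical names, not the Utils code from the pull request.

```java
// Hypothetical helper for illustration; not the Utils class from the pull request.
final class FileOrderInts {
  private FileOrderInts() {}

  /**
   * Reads a 32-bit integer from buf starting at offset.
   * Assumption: bigEndian == true means the most significant byte comes first;
   * otherwise the least significant byte comes first.
   */
  static int readInt(boolean bigEndian, byte[] buf, int offset) {
    // Mask each byte so that negative byte values do not sign-extend.
    int b0 = buf[offset] & 0xFF;
    int b1 = buf[offset + 1] & 0xFF;
    int b2 = buf[offset + 2] & 0xFF;
    int b3 = buf[offset + 3] & 0xFF;
    if (bigEndian) {
      return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
    }
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
  }

  public static void main(String[] args) {
    byte[] bytes = {(byte) 0xA1, (byte) 0xB2, (byte) 0xC3, (byte) 0xD4};
    System.out.println(Integer.toHexString(readInt(true, bytes, 0)));   // a1b2c3d4
    System.out.println(Integer.toHexString(readInt(false, bytes, 0)));  // d4c3b2a1
  }
}
```

A pcap reader needs this kind of switch because a capture file is written in
the byte order of the machine that recorded it, and the magic number at the
start of the file reveals which order that was.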




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064505#comment-16064505
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user dmitriyHavrilovich commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r124218971
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Packet.java ---
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap.decoder;
+
+import com.google.common.base.Preconditions;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
+
+import static org.apache.drill.exec.store.pcap.Utils.convertInt;
+import static org.apache.drill.exec.store.pcap.Utils.convertShort;
+import static org.apache.drill.exec.store.pcap.Utils.getByte;
+import static org.apache.drill.exec.store.pcap.Utils.getIntFileOrder;
+import static org.apache.drill.exec.store.pcap.Utils.getShort;
+
+public class Packet {
--- End diff --

Those Java libraries use JNI to work with the existing libpcap C library, so
they are just wrappers around it. This plugin uses only Java code to process
pcap.
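
To illustrate what processing pcap in plain Java involves, without JNI or
libpcap, here is a minimal sketch that decodes the 24-byte global header of a
classic pcap file with java.nio.ByteBuffer. PcapGlobalHeader is a
hypothetical class for illustration, not the Packet or PacketDecoder from
this pull request, and it assumes only the microsecond-resolution magic
number 0xa1b2c3d4.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; not the decoder classes from the pull request.
final class PcapGlobalHeader {
  static final int LENGTH = 24;          // size of the classic pcap global header
  static final int MAGIC = 0xA1B2C3D4;   // microsecond-resolution magic number

  final ByteOrder order;
  final int versionMajor;
  final int versionMinor;
  final long snapLen;   // maximum captured length per packet
  final long linkType;  // link-layer type, e.g. 1 for Ethernet

  private PcapGlobalHeader(ByteOrder order, int versionMajor, int versionMinor,
                           long snapLen, long linkType) {
    this.order = order;
    this.versionMajor = versionMajor;
    this.versionMinor = versionMinor;
    this.snapLen = snapLen;
    this.linkType = linkType;
  }

  static PcapGlobalHeader parse(byte[] bytes) {
    if (bytes.length < LENGTH) {
      throw new IllegalArgumentException("need at least " + LENGTH + " bytes");
    }
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN);
    // The magic number tells us which byte order the writer used.
    int magic = buf.getInt(0);
    ByteOrder order;
    if (magic == MAGIC) {
      order = ByteOrder.BIG_ENDIAN;
    } else if (magic == Integer.reverseBytes(MAGIC)) {
      order = ByteOrder.LITTLE_ENDIAN;
    } else {
      throw new IllegalArgumentException(
          "unsupported magic number 0x" + Integer.toHexString(magic));
    }
    buf.order(order);
    int major = buf.getShort(4) & 0xFFFF;
    int minor = buf.getShort(6) & 0xFFFF;
    // Offsets 8 (thiszone) and 12 (sigfigs) are skipped in this sketch.
    long snapLen = buf.getInt(16) & 0xFFFFFFFFL;
    long linkType = buf.getInt(20) & 0xFFFFFFFFL;
    return new PcapGlobalHeader(order, major, minor, snapLen, linkType);
  }

  public static void main(String[] args) {
    // A little-endian header: version 2.4, snaplen 65535, link type 1.
    byte[] header = {
        (byte) 0xD4, (byte) 0xC3, (byte) 0xB2, (byte) 0xA1,
        2, 0, 4, 0,
        0, 0, 0, 0,
        0, 0, 0, 0,
        (byte) 0xFF, (byte) 0xFF, 0, 0,
        1, 0, 0, 0};
    PcapGlobalHeader h = parse(header);
    System.out.println(h.order + " " + h.versionMajor + "." + h.versionMinor
        + " snaplen=" + h.snapLen + " linktype=" + h.linkType);
  }
}
```

Reading fields by absolute offset keeps the sketch free of any native code;
a real reader would go on to decode the 16-byte per-packet record headers
that follow.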




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061367#comment-16061367
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/831
  
Can we move this to contrib? I think we should get this one into the next 
release even as the rough edges (if any) are being smoothed out.




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060109#comment-16060109
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/831
  
Trying to figure out where we are on this PR. Looking at the review 
comments, some were addressed (thanks!), others are open. The question is: what 
is the resolution of the open issues? Anything to fix? Is the comment based on 
a misunderstanding? Is the requested work beyond the scope of this particular 
PR? Is there a disagreement?

Perhaps respond to each open comment saying something like "will fix", 
"won't fix", "beyond scope" or something so we know if we're ready to move 
ahead with this PR, or if we are waiting for a few more improvements.

Thanks!




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-06-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16058525#comment-16058525
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r123397439
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java ---
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+  private static final Logger logger = LoggerFactory.getLogger(PcapRecordReader.class);
+
+  private static final int BATCH_SIZE = 40_000;
+
+  private OutputMutator output;
+
+  private PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer;
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private String inputPath;
+  private List projectedColumns;
+
+  private static final Map TYPES;
+
+  private static class ProjectedColumnInfo {
+ValueVector vv;
+ColumnDto pcapColumn;
+  }
+
+  static {
+TYPES = ImmutableMap.builder()
+.put(PcapTypes.STRING, MinorType.VARCHAR)
+.put(PcapTypes.INTEGER, MinorType.INT)
+.put(PcapTypes.LONG, MinorType.BIGINT)
+.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+.build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+  final List projectedColumns) {
+this.inputPath = inputPath;
+this.projectedColumns = projectedColumns;
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+try {
+
+  this.output = output;
+  this.buffer = new byte[10];
+  this.in = new FileInputStream(inputPath);
+  this.decoder = new 

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031564#comment-16031564
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/831
  
The contrib directory is where we have, in the past, added new storage and 
format plugins that are new and may not have been sufficiently tested.
For this plugin, I think testing with pcap files from different sources 
would be useful. [1,2] are useful sources for data that will test boundary 
conditions. I tried on a file from [2] and got an NPE (didn't investigate the 
cause). A random sample of files from [1] worked very nicely indeed, though I 
didn't validate the output.
You might have already done this level of testing; if so, I will withdraw 
the suggestion.

[1] https://wiki.wireshark.org/SampleCaptures#Captures_used_in_Wireshark_testing
[2] http://www.netresec.com/?page=PcapFiles





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030473#comment-16030473
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user tdunning commented on the issue:

https://github.com/apache/drill/pull/831
  
What is the effective difference between the contrib directory and where the
plugin already is?

What sort of testing do you think is necessary?



On Tue, May 30, 2017 at 7:08 PM, Parth Chandra wrote:

> I would recommend moving this format plugin to the contrib directory,
> since this is a new and untested implementation.





[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029750#comment-16029750
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/831
  
I would recommend moving this format plugin to the contrib directory, since 
this is a new and untested implementation.




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025664#comment-16025664
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118614384
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapDrillTable.java ---
@@ -0,0 +1,73 @@
+/*
--- End diff --

Would be very helpful if this PR can include a package-info.java file to 
describe this work. For example, what is pcap? Links to good sources? What 
features of Drill does it use (push-downs)? Etc.




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025656#comment-16025656
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118616554
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map TYPES;
+
+  private static class ProjectedColumnInfo {
+ValueVector vv;
+ColumnDto pcapColumn;
+  }
+
+  static {
+TYPES = ImmutableMap.builder()
+.put(PcapTypes.STRING, MinorType.VARCHAR)
+.put(PcapTypes.INTEGER, MinorType.INT)
+.put(PcapTypes.LONG, MinorType.BIGINT)
+.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+.build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+  final List projectedColumns) {
+try {
+  this.in = new FileInputStream(inputPath);
+  this.decoder = getPacketDecoder();
+  validBytes = in.read(buffer);
+} catch (IOException e) {
+  throw new RuntimeException("File " + inputPath + " not Found");
+}
+setColumns(projectedColumns);
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+this.output = output;
+  }
+
+  @Override
+  public int next() {
+projectedCols = getProjectedColsIfItNull();
+try {
+  return parsePcapFilesAndPutItToTable();
+} catch (IOException io) {
+  throw new RuntimeException("Trouble with 

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025663#comment-16025663
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118619851
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Packet.java
 ---
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap.decoder;
+
+import com.google.common.base.Preconditions;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
+
+import static org.apache.drill.exec.store.pcap.Utils.convertInt;
+import static org.apache.drill.exec.store.pcap.Utils.convertShort;
+import static org.apache.drill.exec.store.pcap.Utils.getByte;
+import static org.apache.drill.exec.store.pcap.Utils.getIntFileOrder;
+import static org.apache.drill.exec.store.pcap.Utils.getShort;
+
+public class Packet {
--- End diff --

Would it have been possible to use one of the existing pcap Java libraries 
here? Four are listed [here](https://en.wikipedia.org/wiki/Pcap).




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025667#comment-16025667
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118620276
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map TYPES;
+
+  private static class ProjectedColumnInfo {
+ValueVector vv;
+ColumnDto pcapColumn;
+  }
+
+  static {
+TYPES = ImmutableMap.builder()
+.put(PcapTypes.STRING, MinorType.VARCHAR)
+.put(PcapTypes.INTEGER, MinorType.INT)
+.put(PcapTypes.LONG, MinorType.BIGINT)
+.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+.build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+  final List projectedColumns) {
+try {
+  this.in = new FileInputStream(inputPath);
+  this.decoder = getPacketDecoder();
+  validBytes = in.read(buffer);
+} catch (IOException e) {
+  throw new RuntimeException("File " + inputPath + " not Found");
+}
+setColumns(projectedColumns);
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+this.output = output;
+  }
+
+  @Override
+  public int next() {
+projectedCols = getProjectedColsIfItNull();
+try {
+  return parsePcapFilesAndPutItToTable();
+} catch (IOException io) {
+  throw new RuntimeException("Trouble with 

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025661#comment-16025661
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118616240
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer = new byte[10];
--- End diff --

Do you want to do this at construction time? If you scan 1000 pcap files in a 
single fragment, Drill will create 1000 record readers at the start of 
execution. Each will allocate a 100K buffer, so you'll have 100MB of heap tied 
up in buffers, of which only one is in use at any given time.

Suggestion: allocate the buffer in setup, clear it in close, so that only 
one buffer is used per fragment.
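
A minimal sketch of this suggestion, under stated assumptions: `LazyBufferReader` and `LazyBufferDemo` are hypothetical stand-ins, not Drill's `AbstractRecordReader` API, and the 100K buffer size is taken from the comment above. It assumes the readers in a fragment are run one after another, and it counts how many buffers are live at the same time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a record reader; NOT Drill's AbstractRecordReader.
class LazyBufferReader implements AutoCloseable {
  // Assumed size, matching the 100K figure mentioned in the review comment.
  static final int BUFFER_SIZE = 100_000;

  private byte[] buffer; // stays null until setup() runs

  // Construction is cheap: no buffer is allocated here.
  LazyBufferReader() { }

  // Allocate the buffer only when this reader is about to do work.
  void setup() {
    buffer = new byte[BUFFER_SIZE];
  }

  boolean hasBuffer() {
    return buffer != null;
  }

  // Drop the reference so the buffer can be garbage collected.
  @Override
  public void close() {
    buffer = null;
  }
}

class LazyBufferDemo {
  // Creates all readers up front, runs them one at a time, and returns
  // the largest number of buffers that were allocated simultaneously.
  static int peakLiveBuffers(int readerCount) {
    List<LazyBufferReader> readers = new ArrayList<>();
    for (int i = 0; i < readerCount; i++) {
      readers.add(new LazyBufferReader());
    }
    int peak = 0;
    for (LazyBufferReader reader : readers) {
      reader.setup();
      int live = 0;
      for (LazyBufferReader other : readers) {
        if (other.hasBuffer()) {
          live++;
        }
      }
      peak = Math.max(peak, live);
      reader.close();
    }
    return peak;
  }

  public static void main(String[] args) {
    System.out.println("peak live buffers: " + peakLiveBuffers(1000));
  }
}
```

With allocation moved out of the constructor, creating 1000 readers up front costs no buffer memory; only the reader currently between `setup()` and `close()` holds one.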



[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025665#comment-16025665
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118619502
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map TYPES;
+
+  private static class ProjectedColumnInfo {
+ValueVector vv;
+ColumnDto pcapColumn;
+  }
+
+  static {
+TYPES = ImmutableMap.builder()
+.put(PcapTypes.STRING, MinorType.VARCHAR)
+.put(PcapTypes.INTEGER, MinorType.INT)
+.put(PcapTypes.LONG, MinorType.BIGINT)
+.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+.build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+  final List projectedColumns) {
+try {
+  this.in = new FileInputStream(inputPath);
+  this.decoder = getPacketDecoder();
+  validBytes = in.read(buffer);
+} catch (IOException e) {
+  throw new RuntimeException("File " + inputPath + " not Found");
+}
+setColumns(projectedColumns);
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+this.output = output;
+  }
+
+  @Override
+  public int next() {
+projectedCols = getProjectedColsIfItNull();
+try {
+  return parsePcapFilesAndPutItToTable();
+} catch (IOException io) {
+  throw new RuntimeException("Trouble with 

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025658#comment-16025658
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118616406
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map TYPES;
+
+  private static class ProjectedColumnInfo {
+ValueVector vv;
+ColumnDto pcapColumn;
+  }
+
+  static {
+TYPES = ImmutableMap.builder()
+.put(PcapTypes.STRING, MinorType.VARCHAR)
+.put(PcapTypes.INTEGER, MinorType.INT)
+.put(PcapTypes.LONG, MinorType.BIGINT)
+.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+.build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+  final List projectedColumns) {
+try {
+  this.in = new FileInputStream(inputPath);
--- End diff --

As noted above, by opening the file here, if you are scanning 1000 files, 
you'll have 1000 open file handles at the start of the fragment. Better to 
postpone opening files until setup.
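
A rough sketch of deferring the open, again with hypothetical names (`DeferredOpenReader` is illustrative and does not use Drill's reader or `DrillFileSystem` APIs): the constructor only remembers the path, `setup()` takes the file handle, and `close()` gives it back. Keeping the original `IOException` as the cause is an addition of this sketch, not something the review asked for.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical reader; NOT the PR's PcapRecordReader.
class DeferredOpenReader implements AutoCloseable {
  private final Path inputPath;
  private InputStream in; // null until setup() opens the file

  // Only records the path, so no file handle is held after construction.
  DeferredOpenReader(Path inputPath) {
    this.inputPath = inputPath;
  }

  // Opens the file when the reader is actually scheduled to run.
  void setup() {
    try {
      in = Files.newInputStream(inputPath);
    } catch (IOException e) {
      // Keep the original exception as the cause for easier diagnosis.
      throw new UncheckedIOException("Cannot open " + inputPath, e);
    }
  }

  boolean isOpen() {
    return in != null;
  }

  // Reads one byte; only valid between setup() and close().
  int readByte() throws IOException {
    return in.read();
  }

  // Releases the file handle as soon as this reader is done.
  @Override
  public void close() throws IOException {
    if (in != null) {
      in.close();
      in = null;
    }
  }
}

class DeferredOpenDemo {
  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("sketch", ".pcap");
    try (DeferredOpenReader reader = new DeferredOpenReader(tmp)) {
      System.out.println("open after construction: " + reader.isOpen());
      reader.setup();
      System.out.println("open after setup: " + reader.isOpen());
    } finally {
      Files.delete(tmp);
    }
  }
}
```

With this shape, 1000 constructed readers hold zero file handles until each one reaches `setup()`.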



[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025662#comment-16025662
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118617482
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map TYPES;
+
+  private static class ProjectedColumnInfo {
+ValueVector vv;
+ColumnDto pcapColumn;
+  }
+
+  static {
+TYPES = ImmutableMap.builder()
+.put(PcapTypes.STRING, MinorType.VARCHAR)
+.put(PcapTypes.INTEGER, MinorType.INT)
+.put(PcapTypes.LONG, MinorType.BIGINT)
+.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+.build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+  final List projectedColumns) {
+try {
+  this.in = new FileInputStream(inputPath);
+  this.decoder = getPacketDecoder();
+  validBytes = in.read(buffer);
+} catch (IOException e) {
+  throw new RuntimeException("File " + inputPath + " not Found");
+}
+setColumns(projectedColumns);
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+this.output = output;
+  }
+
+  @Override
+  public int next() {
+projectedCols = getProjectedColsIfItNull();
+try {
+  return parsePcapFilesAndPutItToTable();
--- End diff --

Drill has certain protocols that are not entirely obvious, but 

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025655#comment-16025655
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118615907
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapFormatPlugin.java
 ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Lists;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.logical.StoragePluginConfig;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.planner.logical.DrillTable;
+import org.apache.drill.exec.server.DrillbitContext;
+import org.apache.drill.exec.store.RecordReader;
+import org.apache.drill.exec.store.RecordWriter;
+import org.apache.drill.exec.store.dfs.BasicFormatMatcher;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.dfs.FileSelection;
+import org.apache.drill.exec.store.dfs.FileSystemPlugin;
+import org.apache.drill.exec.store.dfs.FormatMatcher;
+import org.apache.drill.exec.store.dfs.FormatSelection;
+import org.apache.drill.exec.store.dfs.MagicString;
+import org.apache.drill.exec.store.dfs.NamedFormatPluginConfig;
+import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
+import org.apache.drill.exec.store.dfs.easy.EasyWriter;
+import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.regex.Pattern;
+
+public class PcapFormatPlugin extends EasyFormatPlugin {
+
+  private final PcapFormatMatcher matcher;
+
+  public PcapFormatPlugin(String name, DrillbitContext context, Configuration fsConf,
+  StoragePluginConfig storagePluginConfig) {
+    this(name, context, fsConf, storagePluginConfig, new PcapFormatConfig());
+  }
+
+  public PcapFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, PcapFormatConfig formatPluginConfig) {
+    super(name, context, fsConf, config, formatPluginConfig, true, false, true, false, Lists.newArrayList("pcap"), "pcap");
+    this.matcher = new PcapFormatMatcher(this);
+  }
+
+  @Override
+  public boolean supportsPushDown() {
+    return true;
+  }
+
+  @Override
+  public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork, List columns, String userName) throws ExecutionSetupException {
+    String path = dfs.makeQualified(new Path(fileWork.getPath())).toUri().getPath();
+    return new PcapRecordReader(path, columns);
+  }
+
+  @Override
+  public RecordWriter getRecordWriter(FragmentContext context, EasyWriter writer) throws IOException {
+return null;
+  }
+
+  @Override
+  public int getReaderOperatorType() {
+return 0;
--- End diff --

Seems awkward, but it appears that other format plugins add a type to a 
protobuf, then return that here:

```
return CoreOperatorType.JSON_SUB_SCAN_VALUE;
```

And `UserBitShared.proto`:

```
  JSON_SUB_SCAN = 29;
```

The next available number is 37.

This seems rather brittle; we should have a more general solution. 
Until we do, I'd guess you'll need to add the enum value.

As an alternative, `SequenceFileFormatPlugin` just makes up a number:

```
  public int getReaderOperatorType() {
    return 4001;
  }
```
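
For the enum route, a minimal sketch of the idea. The local enum below is only a stand-in for the protobuf-generated `CoreOperatorType`, and `PCAP_SUB_SCAN` is a hypothetical name I made up for illustration; the real change would go into `UserBitShared.proto` and the generated code.

```java
// Sketch only: a local enum standing in for the protobuf-generated
// CoreOperatorType. PCAP_SUB_SCAN is a hypothetical name.
final class OperatorTypeSketch {

  enum CoreOperatorType {
    JSON_SUB_SCAN(29),
    PCAP_SUB_SCAN(37); // hypothetical entry using the next free number

    private final int number;

    CoreOperatorType(int number) {
      this.number = number;
    }

    int getNumber() {
      return number;
    }
  }

  // What the plugin's getReaderOperatorType() could return instead of 0.
  static int getReaderOperatorType() {
    return CoreOperatorType.PCAP_SUB_SCAN.getNumber();
  }
}
```

Compared with a made-up constant such as 4001, a named entry at least keeps the number allocation in one place where collisions are visible.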



[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025657#comment-16025657
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118619596
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/Utils.java ---
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.primitives.Ints;
+import com.google.common.primitives.Shorts;
+
+public class Utils {
+
+  public static int getIntFileOrder(boolean byteOrder, final byte[] buf, final int offset) {
+    if (byteOrder) {
--- End diff --

Maybe add an explanation of how byte order maps to the boolean: which 
endianness does `true` mean, and which does `false`?
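
One way to sidestep the question, as a sketch only: take a `java.nio.ByteOrder` instead of a boolean, so the call site says which endianness it means. The class name `ByteOrderSketch` and this signature are assumptions for illustration, not what the PR does, and wrapping a `ByteBuffer` per call may be slower than hand-rolled shifts.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch only: hypothetical alternative to a boolean byte-order flag.
final class ByteOrderSketch {

  // Reads a 32-bit int starting at offset, interpreting the four bytes
  // in the byte order the capture file was written in.
  static int getIntFileOrder(ByteOrder fileOrder, byte[] buf, int offset) {
    return ByteBuffer.wrap(buf, offset, Integer.BYTES)
        .order(fileOrder)
        .getInt();
  }
}
```

With bytes `01 02 03 04`, `ByteOrder.BIG_ENDIAN` yields `0x01020304` and `ByteOrder.LITTLE_ENDIAN` yields `0x04030201`.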




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025668#comment-16025668
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118616911
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map TYPES;
+
+  private static class ProjectedColumnInfo {
+    ValueVector vv;
+    ColumnDto pcapColumn;
+  }
+
+  static {
+    TYPES = ImmutableMap.builder()
+        .put(PcapTypes.STRING, MinorType.VARCHAR)
+        .put(PcapTypes.INTEGER, MinorType.INT)
+        .put(PcapTypes.LONG, MinorType.BIGINT)
+        .put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+        .build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+                          final List projectedColumns) {
+    try {
+      this.in = new FileInputStream(inputPath);
+      this.decoder = getPacketDecoder();
+      validBytes = in.read(buffer);
+    } catch (IOException e) {
+      throw new RuntimeException("File " + inputPath + " not Found");
+    }
+    setColumns(projectedColumns);
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+    this.output = output;
+  }
+
+  @Override
+  public int next() {
+    projectedCols = getProjectedColsIfItNull();
+    try {
+      return parsePcapFilesAndPutItToTable();
+    } catch (IOException io) {
+  throw new RuntimeException("Trouble with 

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025659#comment-16025659
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118615528
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapFormatPlugin.java
 ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Lists;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.logical.StoragePluginConfig;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.planner.logical.DrillTable;
+import org.apache.drill.exec.server.DrillbitContext;
+import org.apache.drill.exec.store.RecordReader;
+import org.apache.drill.exec.store.RecordWriter;
+import org.apache.drill.exec.store.dfs.BasicFormatMatcher;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.dfs.FileSelection;
+import org.apache.drill.exec.store.dfs.FileSystemPlugin;
+import org.apache.drill.exec.store.dfs.FormatMatcher;
+import org.apache.drill.exec.store.dfs.FormatSelection;
+import org.apache.drill.exec.store.dfs.MagicString;
+import org.apache.drill.exec.store.dfs.NamedFormatPluginConfig;
+import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
+import org.apache.drill.exec.store.dfs.easy.EasyWriter;
+import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.regex.Pattern;
+
+public class PcapFormatPlugin extends EasyFormatPlugin {
+
+  private final PcapFormatMatcher matcher;
+
+  public PcapFormatPlugin(String name, DrillbitContext context, Configuration fsConf,
+                          StoragePluginConfig storagePluginConfig) {
+    this(name, context, fsConf, storagePluginConfig, new PcapFormatConfig());
+  }
+
+  public PcapFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, PcapFormatConfig formatPluginConfig) {
+    super(name, context, fsConf, config, formatPluginConfig, true, false, true, false, Lists.newArrayList("pcap"), "pcap");
+    this.matcher = new PcapFormatMatcher(this);
+  }
+
+  @Override
+  public boolean supportsPushDown() {
+    return true;
+  }
+
+  @Override
+  public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork, List columns, String userName) throws ExecutionSetupException {
+    String path = dfs.makeQualified(new Path(fileWork.getPath())).toUri().getPath();
+    return new PcapRecordReader(path, columns);
+  }
+
+  @Override
+  public RecordWriter getRecordWriter(FragmentContext context, EasyWriter writer) throws IOException {
+    return null;
+  }
+
+  @Override
+  public int getReaderOperatorType() {
+    return 0;
+  }
+
+  @Override
+  public int getWriterOperatorType() {
+    return 0;
--- End diff --

Other format plugins do the following when a writer is not supported:

```
throw new UnsupportedOperationException("unimplemented");
```
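
Spelled out as a sketch, with the caveat that `RecordWriter` below is a local stand-in interface and the class name is hypothetical. Applying the same treatment to `getRecordWriter` (which currently returns `null`) is my assumption, not something I checked against the other plugins.

```java
// Sketch only: RecordWriter here is a hypothetical stand-in for Drill's interface.
final class ReadOnlyPluginSketch {

  interface RecordWriter {}

  // No writer exists for this format, so fail loudly rather than return null.
  static RecordWriter getRecordWriter() {
    throw new UnsupportedOperationException("unimplemented");
  }

  // Same for the writer operator type: there is no meaningful value to return.
  static int getWriterOperatorType() {
    throw new UnsupportedOperationException("unimplemented");
  }
}
```

Throwing makes an accidental write attempt fail at the call site instead of surfacing later as a `NullPointerException` or a bogus operator type of 0.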



[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025660#comment-16025660
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118617811
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024065#comment-16024065
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118406369
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Packet.java
 ---
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap.decoder;
+
+import com.google.common.base.Preconditions;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
+
+import static org.apache.drill.exec.store.pcap.Utils.convertInt;
+import static org.apache.drill.exec.store.pcap.Utils.convertShort;
+import static org.apache.drill.exec.store.pcap.Utils.getByte;
+import static org.apache.drill.exec.store.pcap.Utils.getIntFileOrder;
+import static org.apache.drill.exec.store.pcap.Utils.getShort;
+
+public class Packet {
+  // pcap header
+  //typedef struct pcaprec_hdr_s {
+  //guint32 ts_sec; // timestamp seconds
+  //guint32 ts_usec;// timestamp microseconds */
+  //guint32 incl_len;   // number of octets of packet saved in file */
+  //guint32 orig_len;   // actual length of packet */
+  //} pcaprec_hdr_t;
+  private long timestamp;
+  private int originalLength;
+
+  private byte[] raw;
+
+  private int etherOffset;
+  private int ipOffset;
+
+  private int packetLength;
+  private int etherProtocol;
+  private int protocol;
+
+  private boolean isRoutingV6;
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean readPcap(final InputStream in, final boolean byteOrder, final int maxLength) throws IOException {
+    byte[] pcapHeader = new byte[PacketConstants.PCAP_HEADER_SIZE];
+    int n = in.read(pcapHeader);
+    if (n < pcapHeader.length) {
+      return false;
+    }
+    decodePcapHeader(pcapHeader, byteOrder, maxLength, 0);
+
+    raw = new byte[originalLength];
+    n = in.read(raw);
+    if (n < 0) {
+      return false;
+    }
+    etherOffset = 0;
+
+    decodeEtherPacket();
+    return true;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public int decodePcap(final byte[] buffer, final int offset, final boolean byteOrder, final int maxLength) {
+    raw = buffer;
+    etherOffset = offset + PacketConstants.PCAP_HEADER_SIZE;
+    decodePcapHeader(raw, byteOrder, maxLength, offset);
+    decodeEtherPacket();
+    return offset + PacketConstants.PCAP_HEADER_SIZE + originalLength;
+  }
+
+  public String getPacketType() {
+    if (isTcpPacket()) {
+      return "TCP";
+    } else if (isUdpPacket()) {
+      return "UDP";
+    } else if (isArpPacket()) {
+      return "ARP";
+    } else if (isIcmpPacket()) {
+      return "ICMP";
+    } else {
+      return "unknown";
+    }
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isIpV4Packet() {
+    return etherProtocol == PacketConstants.IPv4_TYPE;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isIpV6Packet() {
+    return etherProtocol == PacketConstants.IPv6_TYPE;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isPPPoV6Packet() {
+    return etherProtocol == PacketConstants.PPPoV6_TYPE;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isTcpPacket() {
+    return protocol == PacketConstants.TCP_PROTOCOL;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isUdpPacket() {
+    return protocol == PacketConstants.UDP_PROTOCOL;
+  }
+
+  

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024075#comment-16024075
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118406754
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Murmur128.java
 ---
@@ -0,0 +1,161 @@
+/*
--- End diff --

I was thinking of simply moving the code to the existing implementation. 




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024073#comment-16024073
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118406626
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/dto/ColumnDto.java
 ---
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap.dto;
+
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+
+import java.util.Objects;
+
+public class ColumnDto {
+
+  private final String columnName;
+  private final PcapTypes columnType;
+
+  public ColumnDto(String columnName, PcapTypes columnType) {
+    this.columnName = columnName;
+    this.columnType = columnType;
+  }
+
+  public String getColumnName() {
+    return columnName;
+  }
+
+  public PcapTypes getColumnType() {
+    return columnType;
+  }
+
+  public boolean isNullable() {
+    return true;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+    if (this == o) {
+      return true;
+    }
+    if (o == null || getClass() != o.getClass()) {
+      return false;
+    }
+    ColumnDto columnDto = (ColumnDto) o;
+    return Objects.equals(columnName, columnDto.columnName) &&
--- End diff --

OK, my mistake. The code is `Objects.equals`, not `Object.equals`, so it is 
in fact correct.
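
For anyone who misreads it the same way, a small sketch of the difference. The static `java.util.Objects.equals(a, b)` is null-safe, whereas calling `a.equals(b)` on a null `a` throws. The class and method names below are hypothetical and exist only for illustration.

```java
import java.util.Objects;

// Sketch only: hypothetical helper showing the null-safe comparison.
final class ObjectsEqualsSketch {

  // True when both arguments are null or when a.equals(b);
  // false when exactly one of them is null.
  static boolean sameName(String a, String b) {
    return Objects.equals(a, b);
  }
}
```

So a `ColumnDto` with a null `columnName` can still be compared without a `NullPointerException`.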




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023992#comment-16023992
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118399899
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023991#comment-16023991
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118399867
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Packet.java
 ---
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap.decoder;
+
+import com.google.common.base.Preconditions;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
+
+import static org.apache.drill.exec.store.pcap.Utils.convertInt;
+import static org.apache.drill.exec.store.pcap.Utils.convertShort;
+import static org.apache.drill.exec.store.pcap.Utils.getByte;
+import static org.apache.drill.exec.store.pcap.Utils.getIntFileOrder;
+import static org.apache.drill.exec.store.pcap.Utils.getShort;
+
+public class Packet {
+  // pcap header
+  //typedef struct pcaprec_hdr_s {
+  //guint32 ts_sec; // timestamp seconds
+  //guint32 ts_usec;// timestamp microseconds */
+  //guint32 incl_len;   // number of octets of packet saved in file */
+  //guint32 orig_len;   // actual length of packet */
+  //} pcaprec_hdr_t;
+  private long timestamp;
+  private int originalLength;
+
+  private byte[] raw;
+
+  private int etherOffset;
+  private int ipOffset;
+
+  private int packetLength;
+  private int etherProtocol;
+  private int protocol;
+
+  private boolean isRoutingV6;
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean readPcap(final InputStream in, final boolean byteOrder, final int maxLength) throws IOException {
+byte[] pcapHeader = new byte[PacketConstants.PCAP_HEADER_SIZE];
+int n = in.read(pcapHeader);
+if (n < pcapHeader.length) {
+  return false;
+}
+decodePcapHeader(pcapHeader, byteOrder, maxLength, 0);
+
+raw = new byte[originalLength];
+n = in.read(raw);
+if (n < 0) {
+  return false;
+}
+etherOffset = 0;
+
+decodeEtherPacket();
+return true;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public int decodePcap(final byte[] buffer, final int offset, final boolean byteOrder, final int maxLength) {
+raw = buffer;
+etherOffset = offset + PacketConstants.PCAP_HEADER_SIZE;
+decodePcapHeader(raw, byteOrder, maxLength, offset);
+decodeEtherPacket();
+return offset + PacketConstants.PCAP_HEADER_SIZE + originalLength;
+  }
+
+  public String getPacketType() {
+if (isTcpPacket()) {
+  return "TCP";
+} else if (isUdpPacket()) {
+  return "UDP";
+} else if (isArpPacket()) {
+  return "ARP";
+} else if (isIcmpPacket()) {
+  return "ICMP";
+} else {
+  return "unknown";
+}
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isIpV4Packet() {
+return etherProtocol == PacketConstants.IPv4_TYPE;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isIpV6Packet() {
+return etherProtocol == PacketConstants.IPv6_TYPE;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isPPPoV6Packet() {
+return etherProtocol == PacketConstants.PPPoV6_TYPE;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isTcpPacket() {
+return protocol == PacketConstants.TCP_PROTOCOL;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isUdpPacket() {
+return protocol == PacketConstants.UDP_PROTOCOL;
+  }
+
+  

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023988#comment-16023988
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118399621
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Packet.java
 ---
@@ -0,0 +1,371 @@

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023986#comment-16023986
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118399563
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/pcap/TestPcapDecoder.java
 ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.io.Resources;
+import org.apache.drill.BaseTestQuery;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.BufferedInputStream;
+import java.io.DataOutputStream;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+public class TestPcapDecoder extends BaseTestQuery {
+  private static File bigFile;
+
+  /**
+   * Creates an ephemeral file of about a GB in size
+   *
+   * @throws IOException If input file can't be read or output can't be written.
+   */
+  @BeforeClass
+  public static void buildBigTcpFile() throws IOException {
+    bigFile = File.createTempFile("tcp", ".pcap");
+    bigFile.deleteOnExit();
+    boolean first = true;
+    System.out.printf("Building large test file\n");
+    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(bigFile))) {
+      for (int i = 0; i < 1000e6 / (29208 - 24) + 1; i++) {
+        // might be faster to keep this open and rewind each time, but
+        // that is hard to do with a resource, especially if it comes
+        // from the class path instead of files.
+        try (InputStream in = Resources.getResource("store/pcap/tcp-2.pcap").openStream()) {
+          ConcatPcap.copy(first, in, out);
+        }
+        first = false;
+      }
+      System.out.printf("Created file is %.1f MB\n", bigFile.length() / 1e6);
--- End diff --

I changed those methods to be called from a public static void main(). That 
allows them to be used to get information about speeds, but doesn't include 
their output in the test.

I think that addresses this comment.
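
For illustration only, here is a minimal sketch of what a speed measurement driven from a `main()` method, rather than from a JUnit test, might look like. Everything in it is an assumption made for the example: the class and method names (`PcapScanBenchmark`, `buildCapture`, `countRecords`) are hypothetical, the capture is synthetic and little-endian, the 24-byte global pcap header is omitted, and the scanner only walks the 16-byte per-record headers. It is not the `PacketDecoder` from this pull request.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class PcapScanBenchmark {
  // Per-record header: ts_sec, ts_usec, incl_len, orig_len (4 bytes each).
  static final int RECORD_HEADER_SIZE = 16;

  /** Builds `count` synthetic records, each carrying `payloadSize` zero bytes. */
  static byte[] buildCapture(int count, int payloadSize) {
    ByteBuffer buf = ByteBuffer.allocate(count * (RECORD_HEADER_SIZE + payloadSize))
        .order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 0; i < count; i++) {
      buf.putInt(i);            // ts_sec
      buf.putInt(0);            // ts_usec
      buf.putInt(payloadSize);  // incl_len
      buf.putInt(payloadSize);  // orig_len
      buf.position(buf.position() + payloadSize); // payload stays zero-filled
    }
    return buf.array();
  }

  /** Walks the record headers and returns the number of complete records. */
  static int countRecords(byte[] capture) {
    ByteBuffer buf = ByteBuffer.wrap(capture).order(ByteOrder.LITTLE_ENDIAN);
    int records = 0;
    int offset = 0;
    while (offset + RECORD_HEADER_SIZE <= capture.length) {
      int inclLen = buf.getInt(offset + 8);
      // Use long arithmetic so a corrupt length cannot overflow the bounds check.
      if (inclLen < 0 || (long) offset + RECORD_HEADER_SIZE + inclLen > capture.length) {
        break; // truncated or corrupt record
      }
      offset += RECORD_HEADER_SIZE + inclLen;
      records++;
    }
    return records;
  }

  // Timing output lives here, so it never shows up in the unit test log.
  public static void main(String[] args) {
    byte[] capture = buildCapture(50_000, 1_000);
    long start = System.nanoTime();
    int records = countRecords(capture);
    double seconds = (System.nanoTime() - start) / 1e9;
    System.out.printf("Scanned %d records, %.1f MB in %.3f s (%.1f MB/s)%n",
        records, capture.length / 1e6, seconds, capture.length / 1e6 / seconds);
  }
}
```

A single cold pass like this is only a rough indication; a serious measurement would need warm-up iterations and repeated runs.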




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023971#comment-16023971
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118397733
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/pcap/TestPcapDecoder.java
 ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.io.Resources;
+import org.apache.drill.BaseTestQuery;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.BufferedInputStream;
+import java.io.DataOutputStream;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+public class TestPcapDecoder extends BaseTestQuery {
+  private static File bigFile;
+
+  /**
+   * Creates an ephemeral file of about a GB in size
+   *
+   * @throws IOException If input file can't be read or output can't be written.
+   */
+  @BeforeClass
+  public static void buildBigTcpFile() throws IOException {
+    bigFile = File.createTempFile("tcp", ".pcap");
+    bigFile.deleteOnExit();
+    boolean first = true;
+    System.out.printf("Building large test file\n");
+    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(bigFile))) {
+      for (int i = 0; i < 1000e6 / (29208 - 24) + 1; i++) {
+        // might be faster to keep this open and rewind each time, but
+        // that is hard to do with a resource, especially if it comes
+        // from the class path instead of files.
+        try (InputStream in = Resources.getResource("store/pcap/tcp-2.pcap").openStream()) {
+          ConcatPcap.copy(first, in, out);
+        }
+        first = false;
+      }
+      System.out.printf("Created file is %.1f MB\n", bigFile.length() / 1e6);
--- End diff --

Sure. This is good and easy to delete.




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023970#comment-16023970
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118397652
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList<ProjectedColumnInfo> projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map<PcapTypes, MinorType> TYPES;
+
+  private static class ProjectedColumnInfo {
+    ValueVector vv;
+    ColumnDto pcapColumn;
+  }
+
+  static {
+    TYPES = ImmutableMap.<PcapTypes, MinorType>builder()
+        .put(PcapTypes.STRING, MinorType.VARCHAR)
+        .put(PcapTypes.INTEGER, MinorType.INT)
+        .put(PcapTypes.LONG, MinorType.BIGINT)
+        .put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+        .build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+                          final List<SchemaPath> projectedColumns) {
+    try {
+      this.in = new FileInputStream(inputPath);
+      this.decoder = getPacketDecoder();
+      validBytes = in.read(buffer);
+    } catch (IOException e) {
+      throw new RuntimeException("File " + inputPath + " not Found");
+    }
+    setColumns(projectedColumns);
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+    this.output = output;
+  }
+
+  @Override
+  public int next() {
+projectedCols = getProjectedColsIfItNull();
+try {
+  return parsePcapFilesAndPutItToTable();
+} catch (IOException io) {
+  throw new RuntimeException("Trouble with reading 

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023967#comment-16023967
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118397540
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Murmur128.java
 ---
@@ -0,0 +1,161 @@
+/*
--- End diff --

I tried to merge them, but the existing implementation's assumption that it
works on unsafe memory was very difficult to remove. I also had a difficult
time figuring out a useful API.

Happy to look at specific suggestions, but because the existing
implementation always iterates through unsafe pointers, there is likely to be
almost no shared code. Merging by simply moving my methods into the existing
implementation is certainly doable.
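
To make the API question concrete, here is a minimal sketch of a Murmur-style hash that takes a heap `byte[]` with an offset and length instead of an unsafe pointer. It is an illustration under stated assumptions only: it implements the 32-bit x86 variant of MurmurHash3 for brevity, not the 128-bit variant in `Murmur128.java`, and the class name `HeapMurmur3` and its method signature are hypothetical rather than anything in Drill.

```java
class HeapMurmur3 {
  private static final int C1 = 0xcc9e2d51;
  private static final int C2 = 0x1b873593;

  /** Hashes data[offset, offset + length) with 32-bit MurmurHash3 (x86 variant). */
  static int hash32(byte[] data, int offset, int length, int seed) {
    int h = seed;
    int end = offset + (length & ~3); // end of the whole 4-byte blocks
    for (int i = offset; i < end; i += 4) {
      // Assemble one little-endian 32-bit block from the array.
      int k = (data[i] & 0xff)
          | (data[i + 1] & 0xff) << 8
          | (data[i + 2] & 0xff) << 16
          | (data[i + 3] & 0xff) << 24;
      h ^= mixBlock(k);
      h = Integer.rotateLeft(h, 13) * 5 + 0xe6546b64;
    }
    int k = 0;
    switch (length & 3) { // remaining 1-3 tail bytes
      case 3: k ^= (data[end + 2] & 0xff) << 16; // fall through
      case 2: k ^= (data[end + 1] & 0xff) << 8;  // fall through
      case 1: k ^= (data[end] & 0xff);
              h ^= mixBlock(k);
      default: break;
    }
    h ^= length;
    // Final avalanche step.
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }

  private static int mixBlock(int k) {
    k *= C1;
    k = Integer.rotateLeft(k, 15);
    return k * C2;
  }
}
```

With an offset and length in the signature, a caller can hash a slice of a larger packet buffer in place without copying it first. The block loop and tail handling are where pointer-based and array-based versions differ, and they make up most of the function.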





[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022039#comment-16022039
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/831
  
It seems to me that others (Ted, Charles) also contributed to this. Would 
be nice if the PR acknowledged their contribution.
Also, we should document limitations and what is not (yet) supported. 
Perhaps a README?





[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021887#comment-16021887
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118102624
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Packet.java
 ---
@@ -0,0 +1,371 @@

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021884#comment-16021884
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118109637
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Murmur128.java
 ---
@@ -0,0 +1,161 @@
+/*
--- End diff --

We already have a Murmur hash implementation for use with direct buffers
(see `org.apache.drill.exec.expr.fn.impl.MurmurHash3`). Can we merge these two
and have only one copy?


> Want a memory format for PCAP files
> ---
>
> Key: DRILL-5432
> URL: https://issues.apache.org/jira/browse/DRILL-5432
> Project: Apache Drill
>  Issue Type: New Feature
>Reporter: Ted Dunning
>
> PCAP files [1] are the de facto standard for storing network capture data. In 
> security and protocol applications, it is very common to want to extract 
> particular packets from a capture for further analysis.
> At a first level, it is desirable to query and filter by source and 
> destination IP and port or by protocol. Beyond that, however, it would be 
> very useful to be able to group packets by TCP session and eventually to look 
> at packet contents. For now, however, the most critical requirement is that 
> we should be able to scan captures at very high speed.
> I previously wrote a (kind of working) proof of concept for a PCAP decoder 
> that did lazy deserialization and could traverse hundreds of MB of PCAP data 
> per second per core. This compares to roughly 2-3 MB/s for widely available 
> Apache-compatible open source PCAP decoders.
> This JIRA covers the integration and extension of that proof of concept as a 
> Drill file format.
> Initial work is available at https://github.com/mapr-demos/drill-pcap-format
> [1] https://en.wikipedia.org/wiki/Pcap
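
The "lazy deserialization" mentioned in the description can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the decoder from the pull request: the class `LazyPcapRecord` and its method names are hypothetical, and it assumes the classic 16-byte pcap record header (ts_sec, ts_usec, incl_len, orig_len) with the byte order already determined from the file's magic number. Fields are read from the backing array only when asked for, and nothing is copied.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; not part of the Drill pull request.
class LazyPcapRecord {
  // Assumed size of the per-record header: four unsigned 32-bit fields.
  static final int HEADER_SIZE = 16;

  private final ByteBuffer buf;
  private final int offset;

  LazyPcapRecord(byte[] data, int offset, ByteOrder order) {
    // wrap() does not copy; the record is a view onto the capture buffer.
    this.buf = ByteBuffer.wrap(data).order(order);
    this.offset = offset;
  }

  // Decoded on demand from ts_sec and ts_usec.
  long timestampMillis() {
    long sec = buf.getInt(offset) & 0xffffffffL;
    long usec = buf.getInt(offset + 4) & 0xffffffffL;
    return sec * 1000 + usec / 1000;
  }

  // incl_len: number of packet bytes actually stored in the file.
  int includedLength() {
    return buf.getInt(offset + 8);
  }

  // Offset of the next record, so a scan can skip without decoding the payload.
  int nextOffset() {
    return offset + HEADER_SIZE + includedLength();
  }
}
```

A scan that only needs timestamps can hop from record to record with `nextOffset()` and never touch the packet bodies, which is presumably where most of the speed comes from.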



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021885#comment-16021885
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118108814
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapRecordReader.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.apache.drill.exec.store.pcap.dto.ColumnDto;
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+import org.apache.drill.exec.store.pcap.schema.Schema;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableIntVector;
+import org.apache.drill.exec.vector.NullableTimeStampVector;
+import org.apache.drill.exec.vector.NullableVarCharVector;
+import org.apache.drill.exec.vector.ValueVector;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+
+import static java.nio.charset.StandardCharsets.UTF_8;
+import static org.apache.drill.exec.store.pcap.Utils.parseBytesToASCII;
+
+public class PcapRecordReader extends AbstractRecordReader {
+
+  private OutputMutator output;
+
+  private final PacketDecoder decoder;
+  private ImmutableList projectedCols;
+
+  private byte[] buffer = new byte[10];
+  private int offset = 0;
+  private InputStream in;
+  private int validBytes;
+
+  private static final Map<PcapTypes, MinorType> TYPES;
+
+  private static class ProjectedColumnInfo {
+ValueVector vv;
+ColumnDto pcapColumn;
+  }
+
+  static {
+TYPES = ImmutableMap.<PcapTypes, MinorType>builder()
+.put(PcapTypes.STRING, MinorType.VARCHAR)
+.put(PcapTypes.INTEGER, MinorType.INT)
+.put(PcapTypes.LONG, MinorType.BIGINT)
+.put(PcapTypes.TIMESTAMP, MinorType.TIMESTAMP)
+.build();
+  }
+
+  public PcapRecordReader(final String inputPath,
+  final List projectedColumns) {
+try {
+  this.in = new FileInputStream(inputPath);
+  this.decoder = getPacketDecoder();
+  validBytes = in.read(buffer);
+} catch (IOException e) {
+  throw new RuntimeException("File " + inputPath + " not Found");
+}
+setColumns(projectedColumns);
+  }
+
+  @Override
+  public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
+this.output = output;
+  }
+
+  @Override
+  public int next() {
+projectedCols = getProjectedColsIfItNull();
+try {
+  return parsePcapFilesAndPutItToTable();
+} catch (IOException io) {
+  throw new RuntimeException("Trouble with 

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021883#comment-16021883
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118097440
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/dto/ColumnDto.java
 ---
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap.dto;
+
+import org.apache.drill.exec.store.pcap.schema.PcapTypes;
+
+import java.util.Objects;
+
+public class ColumnDto {
+
+  private final String columnName;
+  private final PcapTypes columnType;
+
+  public ColumnDto(String columnName, PcapTypes columnType) {
+this.columnName = columnName;
+this.columnType = columnType;
+  }
+
+  public String getColumnName() {
+return columnName;
+  }
+
+  public PcapTypes getColumnType() {
+return columnType;
+  }
+
+  public boolean isNullable() {
+return true;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+if (this == o) {
+  return true;
+}
+if (o == null || getClass() != o.getClass()) {
+  return false;
+}
+ColumnDto columnDto = (ColumnDto) o;
+return Objects.equals(columnName, columnDto.columnName) &&
--- End diff --

Are you sure `Objects.equals` is what you want here, and not `String.equals`?
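
For context on the question: as far as I know, `java.util.Objects.equals(a, b)` returns true when both arguments are null, false when exactly one is null, and otherwise delegates to `a.equals(b)`, so for two `String` fields it ends up calling `String.equals` anyway. A minimal sketch (the class `EqualsDemo` and the helper `sameName` are hypothetical, not part of the PR):

```java
import java.util.Objects;

// Hypothetical illustration; not part of the Drill pull request.
class EqualsDemo {
  // Null-safe comparison: when a is non-null this calls a.equals(b),
  // which for Strings is String.equals (content comparison).
  static boolean sameName(String a, String b) {
    return Objects.equals(a, b);
  }
}
```

The practical difference from calling `columnName.equals(other.columnName)` directly is only the null handling, which matters if a column name could ever be null.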




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021886#comment-16021886
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118107316
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/decoder/Packet.java
 ---
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap.decoder;
+
+import com.google.common.base.Preconditions;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
+
+import static org.apache.drill.exec.store.pcap.Utils.convertInt;
+import static org.apache.drill.exec.store.pcap.Utils.convertShort;
+import static org.apache.drill.exec.store.pcap.Utils.getByte;
+import static org.apache.drill.exec.store.pcap.Utils.getIntFileOrder;
+import static org.apache.drill.exec.store.pcap.Utils.getShort;
+
+public class Packet {
+  // pcap header
+  //typedef struct pcaprec_hdr_s {
+  //guint32 ts_sec; // timestamp seconds
+  //guint32 ts_usec;// timestamp microseconds */
+  //guint32 incl_len;   // number of octets of packet saved in file */
+  //guint32 orig_len;   // actual length of packet */
+  //} pcaprec_hdr_t;
+  private long timestamp;
+  private int originalLength;
+
+  private byte[] raw;
+
+  private int etherOffset;
+  private int ipOffset;
+
+  private int packetLength;
+  private int etherProtocol;
+  private int protocol;
+
+  private boolean isRoutingV6;
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean readPcap(final InputStream in, final boolean byteOrder, final int maxLength) throws IOException {
+byte[] pcapHeader = new byte[PacketConstants.PCAP_HEADER_SIZE];
+int n = in.read(pcapHeader);
+if (n < pcapHeader.length) {
+  return false;
+}
+decodePcapHeader(pcapHeader, byteOrder, maxLength, 0);
+
+raw = new byte[originalLength];
+n = in.read(raw);
+if (n < 0) {
+  return false;
+}
+etherOffset = 0;
+
+decodeEtherPacket();
+return true;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public int decodePcap(final byte[] buffer, final int offset, final boolean byteOrder, final int maxLength) {
+raw = buffer;
+etherOffset = offset + PacketConstants.PCAP_HEADER_SIZE;
+decodePcapHeader(raw, byteOrder, maxLength, offset);
+decodeEtherPacket();
+return offset + PacketConstants.PCAP_HEADER_SIZE + originalLength;
+  }
+
+  public String getPacketType() {
+if (isTcpPacket()) {
+  return "TCP";
+} else if (isUdpPacket()) {
+  return "UDP";
+} else if (isArpPacket()) {
+  return "ARP";
+} else if (isIcmpPacket()) {
+  return "ICMP";
+} else {
+  return "unknown";
+}
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isIpV4Packet() {
+return etherProtocol == PacketConstants.IPv4_TYPE;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isIpV6Packet() {
+return etherProtocol == PacketConstants.IPv6_TYPE;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isPPPoV6Packet() {
+return etherProtocol == PacketConstants.PPPoV6_TYPE;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isTcpPacket() {
+return protocol == PacketConstants.TCP_PROTOCOL;
+  }
+
+  @SuppressWarnings("WeakerAccess")
+  public boolean isUdpPacket() {
+return protocol == PacketConstants.UDP_PROTOCOL;
+  }
+
+  

[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021888#comment-16021888
 ] 

ASF GitHub Bot commented on DRILL-5432:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/831#discussion_r118098082
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/pcap/TestPcapDecoder.java
 ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pcap;
+
+import com.google.common.io.Resources;
+import org.apache.drill.BaseTestQuery;
+import org.apache.drill.exec.store.pcap.decoder.Packet;
+import org.apache.drill.exec.store.pcap.decoder.PacketDecoder;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.BufferedInputStream;
+import java.io.DataOutputStream;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+public class TestPcapDecoder extends BaseTestQuery {
+  private static File bigFile;
+
+  /**
+   * Creates an ephemeral file of about a GB in size
+   *
+   * @throws IOException If input file can't be read or output can't be written.
+   */
+  @BeforeClass
+  public static void buildBigTcpFile() throws IOException {
+bigFile = File.createTempFile("tcp", ".pcap");
+bigFile.deleteOnExit();
+boolean first = true;
+System.out.printf("Building large test file\n");
+try (DataOutputStream out = new DataOutputStream(new FileOutputStream(bigFile))) {
+  for (int i = 0; i < 1000e6 / (29208 - 24) + 1; i++) {
+// might be faster to keep this open and rewind each time, but
+// that is hard to do with a resource, especially if it comes
+// from the class path instead of files.
+try (InputStream in = Resources.getResource("store/pcap/tcp-2.pcap").openStream()) {
+  ConcatPcap.copy(first, in, out);
+}
+first = false;
+  }
+  System.out.printf("Created file is %.1f MB\n", bigFile.length() / 1e6);
--- End diff --

Can we not use System.out in a test? We're trying to get the output from a 
full build to be smaller. 
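
One way to address this would be to route the message through a logger at a level that is off by default. The sketch below uses `java.util.logging` only so that it is self-contained; whichever logging facade the Drill tests actually use is not shown in this thread, and the class `QuietTestLogging` and its methods are hypothetical.

```java
import java.util.Locale;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration; not part of the Drill pull request.
class QuietTestLogging {
  private static final Logger LOG = Logger.getLogger(QuietTestLogging.class.getName());

  // Builds the message separately so it can be checked without printing.
  static String sizeMessage(long bytes) {
    return String.format(Locale.ROOT, "Created file is %.1f MB", bytes / 1e6);
  }

  // FINE is below the default INFO threshold, so a normal build stays quiet;
  // the message appears only when fine-grained logging is enabled.
  static void reportSize(long bytes) {
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine(sizeMessage(bytes));
    }
  }
}
```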




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-04-24 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981793#comment-15981793
 ] 

Ted Dunning commented on DRILL-5432:



The version on GitHub is now working. Thanks to Charles for the MAC address 
code.

{code}
0: jdbc:drill:zk=local> select src_ip, count(1), sum(packet_length) from dfs.`/Users/tdunning/Apache/drill-pcap-format/x.pcap` group by src_ip;
+------------------+---------+---------+
|      src_ip      | EXPR$1  | EXPR$2  |
+------------------+---------+---------+
| 10.0.1.5         | 24      | 3478    |
| 23.72.217.110    | 1       | 66      |
| 199.59.150.11    | 1       | 66      |
| 35.167.153.146   | 2       | 194     |
| 149.174.66.131   | 1       | 54      |
| 152.163.13.6     | 1       | 54      |
| 35.166.185.92    | 2       | 194     |
| 173.194.202.189  | 2       | 145     |
| 23.72.187.41     | 2       | 132     |
| 108.174.10.10    | 4       | 561     |
| 12.220.154.66    | 1       | 174     |
| 52.20.156.183    | 1       | 98      |
| 74.125.28.189    | 1       | 73      |
| 192.30.253.124   | 1       | 66      |
+------------------+---------+---------+
{code}

This now covers the basic functionality we want. The only major thing missing 
is the ability to group by TCP stream. You can emulate that by grouping by 
src_ip, dst_ip, src_port, and dst_port, but we want something better.
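
One limitation of the four-column GROUP BY is that the two directions of a connection have source and destination swapped, so they land in different groups. A session identifier would need to be direction-independent. The following is only a hypothetical sketch of that idea, not code from the repository; the class `TcpSessionKey` and its key format are made up for illustration.

```java
// Hypothetical illustration; not part of the drill-pcap-format code.
class TcpSessionKey {
  // Builds a key that is identical for both directions of a connection
  // by putting the two endpoints into a fixed (lexicographic) order.
  static String of(String srcIp, int srcPort, String dstIp, int dstPort) {
    String a = srcIp + ":" + srcPort;
    String b = dstIp + ":" + dstPort;
    return a.compareTo(b) <= 0 ? a + "<->" + b : b + "<->" + a;
  }
}
```

With a key like this, a request packet and its reply map to the same value, so grouping on the key collects both halves of the conversation.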

Can somebody take a look at the code?




[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-04-12 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967049#comment-15967049
 ] 

Ted Dunning commented on DRILL-5432:



Wow.  Missed that.

New URL: https://github.com/mapr-demos/drill-pcap-format

I will update the original comment so as to limit the number of people who are 
confused.


> Want a memory format for PCAP files
> ---
>
> Key: DRILL-5432
> URL: https://issues.apache.org/jira/browse/DRILL-5432
> Project: Apache Drill
>  Issue Type: New Feature
>Reporter: Ted Dunning
>
> PCAP files [1] are the de facto standard for storing network capture data. In 
> security and protocol applications, it is very common to want to extract 
> particular packets from a capture for further analysis.
> At a first level, it is desirable to query and filter by source and 
> destination IP and port or by protocol. Beyond that, however, it would be 
> very useful to be able to group packets by TCP session and eventually to look 
> at packet contents. For now, however, the most critical requirement is that 
> we should be able to scan captures at very high speed.
> I previously wrote a (kind of working) proof of concept for a PCAP decoder 
> that did lazy deserialization and could traverse hundreds of MB of PCAP data 
> per second per core. This compares to roughly 2-3 MB/s for widely available 
> Apache-compatible open source PCAP decoders.
> This JIRA covers the integration and extension of that proof of concept as a 
> Drill file format.
> Initial work is available at https://github.com/mapr-demos/pcap-query
> [1] https://en.wikipedia.org/wiki/Pcap



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-04-12 Thread Charles Givre (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967044#comment-15967044
 ] 

Charles Givre commented on DRILL-5432:
--

Never mind.   I misread the JIRA. 



[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-04-12 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967041#comment-15967041
 ] 

Ted Dunning commented on DRILL-5432:


Charles,

I don't understand your comment. Tug reported the following output from a 
sample file:
{code}
select *
from dfs.`data`.`airtunes.pcap`
limit 10

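+-------+----------+--------------------------+-----------------+-----------------+-----------+-----------+----------------+-------+
| Type  | Network  |        Timestamp         |     dst_ip      |     src_ip      | src_port  | dst_port  | packet_length  | data  |
+-------+----------+--------------------------+-----------------+-----------------+-----------+-----------+----------------+-------+
| TCP   | 1        | 2012-03-29 22:05:41.808  | /192.168.3.123  | /192.168.3.107  | 51594     | 5000      | 78             | []    |
| TCP   | 1        | 2012-03-29 22:05:41.808  | /192.168.3.107  | /192.168.3.123  | 5000      | 51594     | 78             | []    |
| TCP   | 1        | 2012-03-29 22:05:41.808  | /192.168.3.123  | /192.168.3.107  | 51594     | 5000      | 66             | []    |
+-------+----------+--------------------------+-----------------+-----------------+-----------+-----------+----------------+-------+
{code}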

What is your change going to do?



[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files

2017-04-12 Thread Charles Givre (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967021#comment-15967021
 ] 

Charles Givre commented on DRILL-5432:
--

Hi Ted, 
I started working on getting the fields from your PCAP decoder mapped to Drill. 
Do you want me to share that in this JIRA?
-- C
