Actually, some good news here…  
I ran some test queries on the corrupted file and it seemed to work pretty 
well.  I didn’t get any exceptions!

 jdbc:drill:zk=local> select src_ip, COUNT(*) as packet_count from 
dfs.test.`testv1.pcap`WHERE is_corrupt=1 GROUP BY src_ip ORDER BY packet_count 
DESC
. . . . . . .semicolon> LIMIT 10;
+-----------------------------------------+---------------+
|                 src_ip                  | packet_count  |
+-----------------------------------------+---------------+
| 150.249.255.161                         | 176           |
| 150.249.255.24                          | 28            |
| 131.38.3.15                             | 26            |
| 111.248.196.128                         | 25            |
| 202.13.230.242                          | 20            |
| 163.28.217.199                          | 19            |
| 27.18.36.151                            | 18            |
| 2001:320f:c2ed:8693:1dff:f8f8:500:f1ed  | 17            |
| 203.70.190.81                           | 16            |
| 203.70.182.104                          | 13            |
+-----------------------------------------+---------------+
10 rows selected (0.944 seconds)


select src_ip, dst_ip from dfs.test.`testv1.pcap`WHERE is_corrupt=1 LIMIT 10;
+------------------+------------------+
|      src_ip      |      dst_ip      |
+------------------+------------------+
| 118.233.244.60   | 150.249.255.161  |
| 150.249.255.161  | 165.63.110.188   |
| 150.249.255.161  | 165.63.110.188   |
| 172.40.96.180    | 131.39.133.22    |
| 150.249.255.161  | 165.63.110.188   |
| 150.249.255.161  | 165.63.110.188   |
| 150.249.255.161  | 165.63.110.188   |
| 150.249.255.161  | 165.63.110.188   |
| 150.249.162.60   | 180.32.119.25    |
| 150.249.255.161  | 165.63.110.188   |
+------------------+------------------+
10 rows selected (1.031 seconds)


0: jdbc:drill:zk=local> SELECT  src_port , dst_port , src_mac_address , 
dst_mac_address
. . . . . . .semicolon> FROM dfs.test.`testv1.pcap`
. . . . . . .semicolon> WHERE is_corrupt =1 LIMIT 10;
+-----------+-----------+--------------------+--------------------+
| src_port  | dst_port  |  src_mac_address   |  dst_mac_address   |
+-----------+-----------+--------------------+--------------------+
| 57058     | 443       | 00:0C:DB:1F:72:41  | 88:E0:F3:7A:66:F0  |
| 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
| 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
| 443       | 55972     | 00:0C:DB:1F:72:41  | CC:4E:24:1F:4E:00  |
| 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
| 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
| 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
| 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
| 4016      | 7699      | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
| 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
+-----------+-----------+--------------------+--------------------+
10 rows selected (0.751 seconds)

SELECT getCountryName(src_ip) AS country, COUNT(*) as packet_count FROM 
dfs.test.`testv1.pcap` WHERE is_corrupt=1  GROUP BY getCountryName(src_ip) 
ORDER BY packet_count DESC LIMIT 10;
+----------------+---------------+
|    country     | packet_count  |
+----------------+---------------+
| Japan          | 269           |
| Taiwan         | 124           |
| United States  | 105           |
| Unknown        | 49            |
| China          | 26            |
| South Korea    | 8             |
| Australia      | 4             |
| Germany        | 3             |
| Hong Kong      | 2             |
| Italy          | 1             |
+----------------+---------------+
10 rows selected (1.519 seconds)

SELECT is_corrupt, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap` GROUP 
BY is_corrupt;
+-------------+---------------+
| is_corrupt  | packet_count  |
+-------------+---------------+
| 0           | 6408          |
| 1           | 592           |
+-------------+---------------+
2 rows selected (0.931 seconds)


This PCAP file worked well with Superset also. 


> On Feb 10, 2019, at 10:59, Charles Givre <cgi...@gmail.com> wrote:
> 
> If I can get some more examples of corrupted files I’ll test more thoroughly. 
>  Also, we’ll need to apply the same methodology to PCAP-NG, so I’ll need some 
> examples there as well.  My strategy is going to be get as much data as 
> possible out of the corrupt packet. 
> — C
> 
> 
> 
>> On Feb 10, 2019, at 10:54, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> 
>> I think that accessing fields in corrupted packets will also cause
>> exceptions. But this is a great start. Conditionalizing field access on
>> !is_corrupt() might be sufficient for the next step.
>> 
>> 
>> 
>> On Sun, Feb 10, 2019 at 4:58 AM Charles Givre <cgi...@gmail.com> wrote:
>> 
>>> All,
>>> I posted the following PR for this issue:
>>> https://github.com/apache/drill/pull/1637 <
>>> https://github.com/apache/drill/pull/1637>
>>> 
>>> Basically this PR does two things.
>>> 1.  It creates a boolean column called is_corrupt and
>>> 2.  If the PCAP file has a corrupt row, it marks that row as corrupt by
>>> setting is_corrupt to true and keeps going
>>> 
>>> WIth the example from Giovanni, I was able to find 590 or so corrupt rows
>>> out of 7000 in that PCAP file.  It was late and I don’t know if that was
>>> what ti was supposed to find, but it worked and was able to query that.
>>> If you guys could send a few more examples, I’d like to test this on other
>>> files to make sure it works with them.  We’re also going to have to do the
>>> same thing for the PCAP-NG format I would assume.
>>> 
>>>> On Feb 10, 2019, at 03:07, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>> 
>>>> On Sat, Feb 9, 2019 at 2:25 PM Bob Rudis <b...@rud.is> wrote:
>>>> 
>>>>> ...
>>>>> And, I did indeed find a few and am just waiting for a formal review so
>>> I
>>>>> can submit them for the Drill dev & tests.
>>>>> 
>>>> 
>>>> Awesome!
>>> 
>>> 
> 

Reply via email to