Actually, some good news here… I ran some test queries on the corrupted file and it seemed to work pretty well. I didn’t get any exceptions!
jdbc:drill:zk=local> select src_ip, COUNT(*) as packet_count from dfs.test.`testv1.pcap`WHERE is_corrupt=1 GROUP BY src_ip ORDER BY packet_count DESC . . . . . . .semicolon> LIMIT 10; +-----------------------------------------+---------------+ | src_ip | packet_count | +-----------------------------------------+---------------+ | 150.249.255.161 | 176 | | 150.249.255.24 | 28 | | 131.38.3.15 | 26 | | 111.248.196.128 | 25 | | 202.13.230.242 | 20 | | 163.28.217.199 | 19 | | 27.18.36.151 | 18 | | 2001:320f:c2ed:8693:1dff:f8f8:500:f1ed | 17 | | 203.70.190.81 | 16 | | 203.70.182.104 | 13 | +-----------------------------------------+---------------+ 10 rows selected (0.944 seconds) select src_ip, dst_ip from dfs.test.`testv1.pcap`WHERE is_corrupt=1 LIMIT 10; +------------------+------------------+ | src_ip | dst_ip | +------------------+------------------+ | 118.233.244.60 | 150.249.255.161 | | 150.249.255.161 | 165.63.110.188 | | 150.249.255.161 | 165.63.110.188 | | 172.40.96.180 | 131.39.133.22 | | 150.249.255.161 | 165.63.110.188 | | 150.249.255.161 | 165.63.110.188 | | 150.249.255.161 | 165.63.110.188 | | 150.249.255.161 | 165.63.110.188 | | 150.249.162.60 | 180.32.119.25 | | 150.249.255.161 | 165.63.110.188 | +------------------+------------------+ 10 rows selected (1.031 seconds) 0: jdbc:drill:zk=local> SELECT src_port , dst_port , src_mac_address , dst_mac_address . . . . . . .semicolon> FROM dfs.test.`testv1.pcap` . . . . . . .semicolon> WHERE is_corrupt =1 LIMIT 10; +-----------+-----------+--------------------+--------------------+ | src_port | dst_port | src_mac_address | dst_mac_address | +-----------+-----------+--------------------+--------------------+ | 57058 | 443 | 00:0C:DB:1F:72:41 | 88:E0:F3:7A:66:F0 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 443 | 55972 | 00:0C:DB:1F:72:41 | CC:4E:24:1F:4E:00 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 4016 | 7699 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | +-----------+-----------+--------------------+--------------------+ 10 rows selected (0.751 seconds) SELECT getCountryName(src_ip) AS country, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap` WHERE is_corrupt=1 GROUP BY getCountryName(src_ip) ORDER BY packet_count DESC LIMIT 10; +----------------+---------------+ | country | packet_count | +----------------+---------------+ | Japan | 269 | | Taiwan | 124 | | United States | 105 | | Unknown | 49 | | China | 26 | | South Korea | 8 | | Australia | 4 | | Germany | 3 | | Hong Kong | 2 | | Italy | 1 | +----------------+---------------+ 10 rows selected (1.519 seconds) SELECT is_corrupt, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap` GROUP BY is_corrupt; +-------------+---------------+ | is_corrupt | packet_count | +-------------+---------------+ | 0 | 6408 | | 1 | 592 | +-------------+---------------+ 2 rows selected (0.931 seconds) This PCAP file worked well with Superset also. > On Feb 10, 2019, at 10:59, Charles Givre <cgi...@gmail.com> wrote: > > If I can get some more examples of corrupted files I’ll test more thoroughly. > Also, we’ll need to apply the same methodology to PCAP-NG, so I’ll need some > examples there as well. My strategy is going to be get as much data as > possible out of the corrupt packet. > — C > > > >> On Feb 10, 2019, at 10:54, Ted Dunning <ted.dunn...@gmail.com> wrote: >> >> I think that accessing fields in corrupted packets will also cause >> exceptions. But this is a great start. Conditionalizing field access on >> !is_corrupt() might be sufficient for the next step. >> >> >> >> On Sun, Feb 10, 2019 at 4:58 AM Charles Givre <cgi...@gmail.com> wrote: >> >>> All, >>> I posted the following PR for this issue: >>> https://github.com/apache/drill/pull/1637 < >>> https://github.com/apache/drill/pull/1637> >>> >>> Basically this PR does two things. >>> 1. It creates a boolean column called is_corrupt and >>> 2. If the PCAP file has a corrupt row, it marks that row as corrupt by >>> setting is_corrupt to true and keeps going >>> >>> WIth the example from Giovanni, I was able to find 590 or so corrupt rows >>> out of 7000 in that PCAP file. It was late and I don’t know if that was >>> what ti was supposed to find, but it worked and was able to query that. >>> If you guys could send a few more examples, I’d like to test this on other >>> files to make sure it works with them. We’re also going to have to do the >>> same thing for the PCAP-NG format I would assume. >>> >>>> On Feb 10, 2019, at 03:07, Ted Dunning <ted.dunn...@gmail.com> wrote: >>>> >>>> On Sat, Feb 9, 2019 at 2:25 PM Bob Rudis <b...@rud.is> wrote: >>>> >>>>> ... >>>>> And, I did indeed find a few and am just waiting for a formal review so >>> I >>>>> can submit them for the Drill dev & tests. >>>>> >>>> >>>> Awesome! >>> >>> >