Hi,

I'm not sure when the filtering system was last worked on in this depth, but I suspect it has been a while. Finding someone completely up to speed on this may be a challenge.
Thanks,
Jaap

> On 15 Jun 2020, at 05:38, Sidhant Bansal <sidhban...@gmail.com> wrote:
>
> Hi all,
>
> I want to propose an improvement that speeds up display filters by avoiding
> re-dissecting all the packets again and again when not required, and instead
> maintaining a cache of the fields that have been queried recently.
>
> Motivation: Benchmarking filtering on capture files > 100 MB shows that the
> re-dissection step, i.e. the time spent inside the dissectors, is the bulk of
> the work: roughly 40-50% of the total filtering time is consumed by
> re-dissection. I believe we can make large savings here.
>
> Example:
>
> 1st filter applied: tcp.srcport >= 1200 && tcp.dstport <= 1500
> This filter runs as it does now, AND stores tcp.srcport and tcp.dstport for
> all packets in memory in Wireshark.
>
> 2nd filter applied: tcp.srcport == 80
> We don't need to re-dissect any packets; we can simply refer to the stored
> information to apply the filter.
>
> 3rd filter applied: tcp.srcport == 120 || udp.srcport == 80
> Since "udp.srcport" is not in our cache, we need to re-dissect again, AND we
> also store udp.srcport for all packets (to speed up future filter queries).
>
> 4th filter applied: tcp.srcport == 40 || udp.srcport >= 1000 || tcp.dstport <= 500
> Since all of these fields are in the cache, we can refer to them directly
> from the in-memory information and don't need to re-dissect any packets.
>
> We can limit the number of fields kept in memory at any given time,
> depending on how many packets we have and how much memory we can afford to
> allocate. Fields can be deleted from the cache according to a specific cache
> replacement policy (I haven't decided which one would be the most apt; input
> is welcome).
>
> Most fields tend to be fixed-length and small, i.e. <= 8 bytes.
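For what it's worth, the per-field cache with a replacement policy described above could be sketched roughly as below. This is a minimal illustration in plain C, assuming whole field columns (one fixed-size value per packet) are cached and evicted LRU-wise; none of these names (field_cache, cache_insert, etc.) exist in the Wireshark tree.

```c
/* Minimal sketch of a per-field value cache with LRU eviction.
 * All names here are hypothetical, not part of Wireshark. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FIELD_CACHE_MAX 2   /* max cached field columns (tiny for demo) */
#define MAX_NAME 32

typedef struct {
    char      name[MAX_NAME]; /* e.g. "tcp.srcport" */
    uint64_t *values;         /* one value per packet */
    unsigned  last_used;      /* LRU timestamp */
    int       in_use;
} field_column;

typedef struct {
    field_column cols[FIELD_CACHE_MAX];
    unsigned     clock;       /* monotonically increasing use counter */
    size_t       n_packets;
} field_cache;

static void cache_init(field_cache *c, size_t n_packets) {
    memset(c, 0, sizeof *c);
    c->n_packets = n_packets;
}

/* Return the cached column for a field, or NULL on a miss
 * (a miss means the caller has to fall back to re-dissection). */
static field_column *cache_lookup(field_cache *c, const char *name) {
    for (int i = 0; i < FIELD_CACHE_MAX; i++) {
        if (c->cols[i].in_use && strcmp(c->cols[i].name, name) == 0) {
            c->cols[i].last_used = ++c->clock;  /* touch for LRU */
            return &c->cols[i];
        }
    }
    return NULL;
}

/* Insert a column of per-packet values, evicting the
 * least-recently-used column when the cache is full. */
static field_column *cache_insert(field_cache *c, const char *name,
                                  const uint64_t *values) {
    int victim = 0;
    for (int i = 0; i < FIELD_CACHE_MAX; i++) {
        if (!c->cols[i].in_use) { victim = i; break; }
        if (c->cols[i].last_used < c->cols[victim].last_used)
            victim = i;
    }
    field_column *col = &c->cols[victim];
    free(col->values);
    col->values = malloc(c->n_packets * sizeof *col->values);
    memcpy(col->values, values, c->n_packets * sizeof *col->values);
    snprintf(col->name, MAX_NAME, "%s", name);
    col->last_used = ++c->clock;
    col->in_use = 1;
    return col;
}
```

Replaying the four example filters against this sketch: the 1st filter inserts tcp.srcport and tcp.dstport; the 2nd filter hits on tcp.srcport; the 3rd filter misses on udp.srcport and inserts it, which (with LRU and a capacity of two columns) evicts tcp.dstport.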
> For fields such as strings, which are variable-length and can be arbitrarily
> large, we can skip the caching procedure and instead re-dissect all the
> packets whenever the filter expression contains such a field.
>
> From an implementation point of view: the cached field information can be
> stored inside the frame_data, since that remains persistent throughout
> Wireshark's execution for a single opened capture file. Whenever we encounter
> a new filter query, we check whether all of its fields are in the cache. If
> yes, then after converting the filter query's abstract syntax tree to DFVM
> code, we look values up in the cache instead of re-dissecting. If no, we do
> what we do currently, i.e. re-dissect, but we also store the new fields in
> the cache (according to the chosen replacement policy).
>
> I'd like to hear any feedback or objections to this optimization.
>
> ___________________________________________________________________________
> Sent via: Wireshark-dev mailing list <wireshark-dev@wireshark.org>
> Archives: https://www.wireshark.org/lists/wireshark-dev
> Unsubscribe: https://www.wireshark.org/mailman/options/wireshark-dev
>              mailto:wireshark-dev-requ...@wireshark.org?subject=unsubscribe
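The dispatch step described in the quoted proposal (all fields cached: evaluate from the cache; any field missing: re-dissect and populate) could be sketched as below. This is an illustrative C fragment, not Wireshark's dfilter/DFVM API; cached_col, eval_eq_from_cache and so on are invented names, and real DFVM evaluation handles far more than a single equality comparison.

```c
/* Hypothetical sketch: evaluate a "field == constant" filter over
 * cached per-packet columns, falling back when the field is missing.
 * Names are illustrative, not Wireshark APIs. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const char     *name;    /* e.g. "tcp.srcport" */
    const uint64_t *values;  /* one value per packet */
} cached_col;

static const cached_col *find_col(const cached_col *cols, size_t n_cols,
                                  const char *name) {
    for (size_t i = 0; i < n_cols; i++)
        if (strcmp(cols[i].name, name) == 0) return &cols[i];
    return NULL;
}

/* Fill 'match' (one flag per packet) for "field == constant".
 * Returns false when the field is not cached, i.e. the caller must
 * take the existing re-dissection path (and may then cache the field). */
static bool eval_eq_from_cache(const cached_col *cols, size_t n_cols,
                               const char *field, uint64_t constant,
                               size_t n_packets, bool *match) {
    const cached_col *col = find_col(cols, n_cols, field);
    if (!col)
        return false;                       /* cache miss: re-dissect */
    for (size_t p = 0; p < n_packets; p++)  /* cache hit: no dissection */
        match[p] = (col->values[p] == constant);
    return true;
}
```

With tcp.srcport cached, a query like "tcp.srcport == 80" (the 2nd filter in the quoted example) is answered entirely from the column, while "udp.srcport == 80" makes eval_eq_from_cache return false, signalling the caller to re-dissect and populate the cache.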