[tcpdump-workers] BPF Extended: addressing BPF's shortcomings

Darren Reed Wed, 10 Jun 2015 06:17:25 -0700

Extending BPF
=============

Introduction
------------
BPF was originally designed to provide very fast packet matching
capabilities for IPv4 but as a result of its generic nature, is
capable of being used for just about any protocol. With IPv6 the
limitations of BPF became apparent.


BPF & IPv6
----------
The problem with IPv6 and BPF is that the transport header (TCP,
UDP, etc) can have a number of extension headers between it and
the network header that is present for IPv6. There's no hints in
the IPv6 header as to how many of these extension headers there
are, or how many bytes the extension header(s) take up. This leaves
BPF in a precarious situation because it cannot be reliably used to
match on layer 4 packets. What's missing is the ability to either
find a specific header after the IPv6 network header or just to
determine what the last one is.

There is also the problem that with IPv6 a single BPF instruction
is no longer enough to perform mathematics on the address: IPv6
instructions are 128bits long and BPF is limited to 32bit instructions.
Only the bit operations such as OR, AND and XOR can be easily used.
The traditional BPF instruction has 3 bits to represent the size of
operands, allowing for "word" (32bit), "half-word" (16bit) and "byte"
(8 bit). This limitation also means that BPF is not capable of taking
full advantage of 64bit CPUs that are common today.

Other gaps
----------
One of the more common uses of BPF is to select a packet based on
a number of conditions, such as ports 80, 443, 8000 and 8080. To
do this requires 4 different comparisons when all that is really
required is to be able to do a search amongst a set of values.
This is also a problem when selection is done based on IP address(es).

The maximum size of a BPF program is effectively limited by the range
of a jump that is stored in an 8bit value. Whilst the current limit
on instructions is 512, correctly coding a program requires that
all blocks of code requiring a jump around are no larger than 255
instructions.

Looking at other work
---------------------
eBPF on Linux has evolved a long way from BPF in its original form
where "raw" instructions and has overloaded some bit definitions.
An example is BPF_ALU64 (eBPF) that is the same instruction class
as BPF_MISC. eBPF also introduces the concept of maps that are a
container of objects to match up against with a lookup. Whilst eBPF
adds some double word functionality, it doesn't provide single a
solution to providing a single instruction to work on a 128bit
address. The use of maps can be used to move looking for a match
in a set of numbers into a single instruction, however that isn't
implemented as a native operation, rather it is a function call
(BPF_CALL) with parameters to define which map to lookup and for
what.

In NetBSD, BPF_COP has been introduced that is somewhat similar
in implementation to eBPF's BPF_CALL whereby a more complex
function that is outside of BPF is available to be called with
some args passed/returned through the A/X registers. This has
the potential to solve the "lookup a value in a set" problem
as well as deal with IPv6 extension headers but does not address
the problem of instruction space crowding or provide for 64bit
native instructions.

Proposed Solution
-----------------
There is no capacity within the BPF instruction set as it exists
today to further expand enough to solve the above limitations with
IPv6 in a meaningful and elegant fashion. Thus any solution is forced
to redesign the instruction format so that operands with 64 and 128
bits are possible, along with new instructions that are capable of
filling the gap with IPv6 plus create room for further growth. In
recognition of the constraints now being put on BPF programs, the
maximum size is increased to 64k. The extra size is supported through
the use of a larger instruction format and larger operands, allowing
for jumps to the end.

Part of redefining the instruction set allows for more space to be
allowed for future growth - such as in the register set. At present
I've only defined 3 to match those that exist now, but there is room
to grow that as has Linux.

Note that I haven't designed this with JIT compilers in mind, rather
I have tried to think about what are the common operations that are
required and how do they fit into what an engine that matches packets
would be expected to do natively.

Is supporting something like eBPF's MAP instructions better than
doing a lookup to see if something is in a set of values? Or are
the two not exclusive? Does similar functionality turning up in
eBPF (BPF_CALL) and NetBSD (BPF_COP) suggest that this is a missing
feature from BPF itself? On the one hand extending the instruction
set to do advanced steps such as "find the last header" on an IPv6
packet is a step away from the more simple BPF instructions but on
the other, all behaviour is predefined

In terms of progress in implementing this, I'm working on the code
to generate the BPF instructions (the kernel matching is easy) but
I thought it prudent to seek feedback before going too far down
this path.

Yes, there is no native processor that supports 128bit operands but
BPF is just a virtual machine and in that context, there's no reason
why it can't support 128bit registers and hide the complexity of
dealing with IPv6 addresses due to their size. Thus it becomes a
requirement of the code supporting bpf_filter() to use 64bit native
code when available or 32bit when not.

Cheers,
Darren

/*
 * Extended BPF
 */
 /*
  * Size of args. Support 8, 16, 32, 64 and 128 bit operands.
  */
#define BPFX_WORD               0xf0000000
#define BPFX_SIZE(x)            ((x) & BPFX_WORD)
#define BPFX_B                  0x10000000    /* Byte (8bit) */
#define BPFX_H                  0x20000000    /* Half word (16bit) */
#define BPFX_W                  0x30000000    /* Word (32bit) */
#define BPFX_DW                 0x40000000    /* Double word (64bit) */
#define BPFX_QW                 0x50000000    /* Quad word (128bit) */

/*
 * What type of operation is this? Break it down into a few
 * groups: loads, stores, arithmetic, jumps, return, miscellanous
 * and "private" for vendor specific instructions.
 */
#define BPFX_OPCLASS            0x0f000000
#define BPFX_CLASS(code)        ((code) & BPFX_OPCLASS)
#define BPFX_LD                 0x01000000
#define BPFX_LDX                0x02000000
#define BPFX_ST                 0x03000000
#define BPFX_STX                0x04000000
#define BPFX_ALU                0x05000000
#define BPFX_JMP                0x06000000
#define BPFX_RET                0x07000000
#define BPFX_MISC               0x08000000
#define BPFX_PRIVATE            0x0f000000

/*
 * How does the instruction use its operands to
 * access memory? As an immediate address? Absolute value?
 * ...
 */
#define BPFX_ACCESS             0x00f00000
#define BPFX_MODE(x)            ((x) & BPFX_ACCESS)
#define BPFX_IMM                0x00100000
#define BPFX_ABS                0x00200000
#define BPFX_MEM                0x00300000
#define BPFX_LEN                0x00400000
#define BPFX_MSH                0x00500000

/*
 * Traditional BPF has had 3 registers, so
 * support just those for now.
 */
#define BPFX_SOURCE             0x000f0000
#define BPFX_SRC(x)             ((x) & BPFX_SOURCE)
#define BPFX_X                  0x00010000
#define BPFX_K                  0x00020000
#define BPFX_A                  0x00030000

/*
 * BPF Extended instructions are of variable length
 * where the length is not always predetermined by the
 * instruction itself. An example of where it is would
 * be a load or store.
 * The length is given in terms of 32bit words.
 */
#define BPFX_LENMASK            0x0000ff00
#define BPFX_BYTES(x)           (((x) & BPFX_LENMASK) >> 5)
#define BPFX_WORDS(x)           (((x) & BPFX_LENMASK) >> 8)

#define BPFX_INSTR              0x000000ff
#define BPFX_OP(x)              ((x) & BPFX_INSTR)

#define BPFX_OPCODE(x)          ((x) & ~BPFX_LENMASK)

/*
 * Instructions per class above.
 * Each class can have 256 instructions.
 */

/*
 * Arithmetic operations - BPFX_ALU
 */
#define BPFX_ADD        1
#define BPFX_SUB        2
#define BPFX_MUL        3
#define BPFX_DIV        4
#define BPFX_AND        5
#define BPFX_OR         6
#define BPFX_XOR        7
#define BPFX_LSH        8
#define BPFX_RSH        9
#define BPFX_NEG        10

/*
 * Jump operations - BPFX_JMP
 */
#define BPFX_JA         1    /* Jump Always */
#define BPFX_JEQ        2    /* Jump EQual */
#define BPFX_JGT        3    /* Jump Greater Than */
#define BPFX_JGE        4    /* Jump Greater than or Equal */
#define BPFX_JSET       5    /* Jump bit is SET */
#define BPFX_JINLIST    6    /* Jump valid IN LIST */

typedef union {
        uint128_t       q;
        uint64_t        d;
        uint32_t        w;
        uint16_t        h;
        uint8_t         b;
} bpfwords_t;

struct bpfx_jmp {
        uint32_t        bpfx_code;
        uint16_t        bpfx_jt;
        uint16_t        bpfx_jf;
        bpfwords_t      bpfx_k;
};
typedef struct bpfx_jmp bpfx_jmp_t;

struct bpfx_alu {
        uint32_t        bpfx_code;
        bpfwords_t      bpfx_k;
};
typedef struct bpfx_alu bpfx_alu_t;

/*
 * Miscellaneous operations - BPFX_MISC
 */
#define BPFX_TSX        1    /* Transfer Source into X */
#define BPFX_TXS        2    /* Transfer X into Source */
#define BPFX_FIND       3    /* IPv6: Find transport header */
#define BPFX_TPROTO     4    /* IPv6: Find last transport header */

#define BPFX_NOP        0x00000100

#define BPFX_MAXINSNS   65535

_______________________________________________
tcpdump-workers mailing list
tcpdump-workers@lists.tcpdump.org
https://lists.sandelman.ca/mailman/listinfo/tcpdump-workers

[tcpdump-workers] BPF Extended: addressing BPF's shortcomings

Reply via email to