Extending BPF ============= Introduction ------------ BPF was originally designed to provide very fast packet matching capabilities for IPv4 but as a result of its generic nature, is capable of being used for just about any protocol. With IPv6 the limitations of BPF became apparent.
BPF & IPv6 ---------- The problem with IPv6 and BPF is that the transport header (TCP, UDP, etc) can have a number of extension headers between it and the network header that is present for IPv6. There's no hints in the IPv6 header as to how many of these extension headers there are, or how many bytes the extension header(s) take up. This leaves BPF in a precarious situation because it cannot be reliably used to match on layer 4 packets. What's missing is the ability to either find a specific header after the IPv6 network header or just to determine what the last one is. There is also the problem that with IPv6 a single BPF instruction is no longer enough to perform mathematics on the address: IPv6 instructions are 128bits long and BPF is limited to 32bit instructions. Only the bit operations such as OR, AND and XOR can be easily used. The traditional BPF instruction has 3 bits to represent the size of operands, allowing for "word" (32bit), "half-word" (16bit) and "byte" (8 bit). This limitation also means that BPF is not capable of taking full advantage of 64bit CPUs that are common today. Other gaps ---------- One of the more common uses of BPF is to select a packet based on a number of conditions, such as ports 80, 443, 8000 and 8080. To do this requires 4 different comparisons when all that is really required is to be able to do a search amongst a set of values. This is also a problem when selection is done based on IP address(es). The maximum size of a BPF program is effectively limited by the range of a jump that is stored in an 8bit value. Whilst the current limit on instructions is 512, correctly coding a program requires that all blocks of code requiring a jump around are no larger than 255 instructions. Looking at other work --------------------- eBPF on Linux has evolved a long way from BPF in its original form where "raw" instructions and has overloaded some bit definitions. An example is BPF_ALU64 (eBPF) that is the same instruction class as BPF_MISC. eBPF also introduces the concept of maps that are a container of objects to match up against with a lookup. Whilst eBPF adds some double word functionality, it doesn't provide single a solution to providing a single instruction to work on a 128bit address. The use of maps can be used to move looking for a match in a set of numbers into a single instruction, however that isn't implemented as a native operation, rather it is a function call (BPF_CALL) with parameters to define which map to lookup and for what. In NetBSD, BPF_COP has been introduced that is somewhat similar in implementation to eBPF's BPF_CALL whereby a more complex function that is outside of BPF is available to be called with some args passed/returned through the A/X registers. This has the potential to solve the "lookup a value in a set" problem as well as deal with IPv6 extension headers but does not address the problem of instruction space crowding or provide for 64bit native instructions. Proposed Solution ----------------- There is no capacity within the BPF instruction set as it exists today to further expand enough to solve the above limitations with IPv6 in a meaningful and elegant fashion. Thus any solution is forced to redesign the instruction format so that operands with 64 and 128 bits are possible, along with new instructions that are capable of filling the gap with IPv6 plus create room for further growth. In recognition of the constraints now being put on BPF programs, the maximum size is increased to 64k. The extra size is supported through the use of a larger instruction format and larger operands, allowing for jumps to the end. Part of redefining the instruction set allows for more space to be allowed for future growth - such as in the register set. At present I've only defined 3 to match those that exist now, but there is room to grow that as has Linux. Note that I haven't designed this with JIT compilers in mind, rather I have tried to think about what are the common operations that are required and how do they fit into what an engine that matches packets would be expected to do natively. Is supporting something like eBPF's MAP instructions better than doing a lookup to see if something is in a set of values? Or are the two not exclusive? Does similar functionality turning up in eBPF (BPF_CALL) and NetBSD (BPF_COP) suggest that this is a missing feature from BPF itself? On the one hand extending the instruction set to do advanced steps such as "find the last header" on an IPv6 packet is a step away from the more simple BPF instructions but on the other, all behaviour is predefined In terms of progress in implementing this, I'm working on the code to generate the BPF instructions (the kernel matching is easy) but I thought it prudent to seek feedback before going too far down this path. Yes, there is no native processor that supports 128bit operands but BPF is just a virtual machine and in that context, there's no reason why it can't support 128bit registers and hide the complexity of dealing with IPv6 addresses due to their size. Thus it becomes a requirement of the code supporting bpf_filter() to use 64bit native code when available or 32bit when not. Cheers, Darren /* * Extended BPF */ /* * Size of args. Support 8, 16, 32, 64 and 128 bit operands. */ #define BPFX_WORD 0xf0000000 #define BPFX_SIZE(x) ((x) & BPFX_WORD) #define BPFX_B 0x10000000 /* Byte (8bit) */ #define BPFX_H 0x20000000 /* Half word (16bit) */ #define BPFX_W 0x30000000 /* Word (32bit) */ #define BPFX_DW 0x40000000 /* Double word (64bit) */ #define BPFX_QW 0x50000000 /* Quad word (128bit) */ /* * What type of operation is this? Break it down into a few * groups: loads, stores, arithmetic, jumps, return, miscellanous * and "private" for vendor specific instructions. */ #define BPFX_OPCLASS 0x0f000000 #define BPFX_CLASS(code) ((code) & BPFX_OPCLASS) #define BPFX_LD 0x01000000 #define BPFX_LDX 0x02000000 #define BPFX_ST 0x03000000 #define BPFX_STX 0x04000000 #define BPFX_ALU 0x05000000 #define BPFX_JMP 0x06000000 #define BPFX_RET 0x07000000 #define BPFX_MISC 0x08000000 #define BPFX_PRIVATE 0x0f000000 /* * How does the instruction use its operands to * access memory? As an immediate address? Absolute value? * ... */ #define BPFX_ACCESS 0x00f00000 #define BPFX_MODE(x) ((x) & BPFX_ACCESS) #define BPFX_IMM 0x00100000 #define BPFX_ABS 0x00200000 #define BPFX_MEM 0x00300000 #define BPFX_LEN 0x00400000 #define BPFX_MSH 0x00500000 /* * Traditional BPF has had 3 registers, so * support just those for now. */ #define BPFX_SOURCE 0x000f0000 #define BPFX_SRC(x) ((x) & BPFX_SOURCE) #define BPFX_X 0x00010000 #define BPFX_K 0x00020000 #define BPFX_A 0x00030000 /* * BPF Extended instructions are of variable length * where the length is not always predetermined by the * instruction itself. An example of where it is would * be a load or store. * The length is given in terms of 32bit words. */ #define BPFX_LENMASK 0x0000ff00 #define BPFX_BYTES(x) (((x) & BPFX_LENMASK) >> 5) #define BPFX_WORDS(x) (((x) & BPFX_LENMASK) >> 8) #define BPFX_INSTR 0x000000ff #define BPFX_OP(x) ((x) & BPFX_INSTR) #define BPFX_OPCODE(x) ((x) & ~BPFX_LENMASK) /* * Instructions per class above. * Each class can have 256 instructions. */ /* * Arithmetic operations - BPFX_ALU */ #define BPFX_ADD 1 #define BPFX_SUB 2 #define BPFX_MUL 3 #define BPFX_DIV 4 #define BPFX_AND 5 #define BPFX_OR 6 #define BPFX_XOR 7 #define BPFX_LSH 8 #define BPFX_RSH 9 #define BPFX_NEG 10 /* * Jump operations - BPFX_JMP */ #define BPFX_JA 1 /* Jump Always */ #define BPFX_JEQ 2 /* Jump EQual */ #define BPFX_JGT 3 /* Jump Greater Than */ #define BPFX_JGE 4 /* Jump Greater than or Equal */ #define BPFX_JSET 5 /* Jump bit is SET */ #define BPFX_JINLIST 6 /* Jump valid IN LIST */ typedef union { uint128_t q; uint64_t d; uint32_t w; uint16_t h; uint8_t b; } bpfwords_t; struct bpfx_jmp { uint32_t bpfx_code; uint16_t bpfx_jt; uint16_t bpfx_jf; bpfwords_t bpfx_k; }; typedef struct bpfx_jmp bpfx_jmp_t; struct bpfx_alu { uint32_t bpfx_code; bpfwords_t bpfx_k; }; typedef struct bpfx_alu bpfx_alu_t; /* * Miscellaneous operations - BPFX_MISC */ #define BPFX_TSX 1 /* Transfer Source into X */ #define BPFX_TXS 2 /* Transfer X into Source */ #define BPFX_FIND 3 /* IPv6: Find transport header */ #define BPFX_TPROTO 4 /* IPv6: Find last transport header */ #define BPFX_NOP 0x00000100 #define BPFX_MAXINSNS 65535 _______________________________________________ tcpdump-workers mailing list tcpdump-workers@lists.tcpdump.org https://lists.sandelman.ca/mailman/listinfo/tcpdump-workers