On Tue, Sep 9, 2014 at 10:09 PM, Alexei Starovoitov <a...@plumgrid.com> wrote:
> Hi David,
>
> I've managed to reduce this set to 12:
> Patches 1-4 establish BPF syscall shell for maps and programs.
> Patches 5-10 add verifier step by step
> Patch 11 exposes existing instruction macros to user space
> Patch 12 adds test stubs and verifier testsuite from user space
>
> I don't know how to reduce it further. Drop verifier and
> have programs loaded without verification? Sounds wrong.
> If anyone has other ideas, I'll gladly reduce it further.
>
> Note that patches 1,3,4,7 add commands and attributes to the syscall
> while being backwards compatible with each other, which should demonstrate
> how other commands can be added in the future.
>
> Daniel,
> bpf_common.h patch (that we discussed earlier) I didn't include here
> to reduce the number of patches. It can come next.
>
> For those who have looked at the last set of 28 patches, the difference is:
> - moved attaching to tracing and sockets to future patches
> - moved hash table map type implementation to future
> - split verifier further and moved LD_ABS checks and state pruning to future
> - instead of running verifier testsuite on real tracing programs added
>   test_stub.c with fake maps, context and helper functions to test verifier
>   only
> - rebased
>
> Note, after this set the programs can be loaded for testing only. They cannot
> be attached to any events. This will come in the next set.
>
> As requested by Andy and others, here is the man page:
>
> BPF(2)                     Linux Programmer's Manual                    BPF(2)
>
>
>
> NAME
>        bpf - perform a command on eBPF map or program
>
> SYNOPSIS
>        #include <linux/bpf.h>
>
>        int bpf(int cmd, union bpf_attr *attr, unsigned int size);
>
>
> DESCRIPTION
>        bpf()  syscall  is a multiplexor for a range of different operations on
>        eBPF  which  can  be  characterized  as  "universal  in-kernel  virtual
>        machine". eBPF is similar to original Berkeley Packet Filter (or "clas-
>        sic BPF") used to filter network packets. Both statically  analyze  the
>        programs  before  loading  them into the kernel to ensure that programs
>        cannot harm the running system.
>
>        eBPF extends classic BPF in multiple ways including ability to call in-
>        kernel  helper  functions  and  access shared data structures like eBPF
>        maps.  The programs can be written in a restricted C that  is  compiled
>        into  eBPF  bytecode  and executed on the eBPF virtual machine or JITed
>        into native instruction set.
>
>    eBPF Design/Architecture
>        eBPF maps is a generic storage of different types.   User  process  can
>        create  multiple  maps  (with key/value being opaque bytes of data) and
>        access them via file descriptor. In parallel eBPF programs  can  access
>        maps  from inside the kernel.  It's up to user process and eBPF program
>        to decide what they store inside maps.
>
>        eBPF programs are similar to kernel modules. They  are  loaded  by  the
>        user  process  and automatically unloaded when process exits. Each eBPF
>        program is a safe run-to-completion set of instructions. eBPF  verifier
>        statically  determines  that the program terminates and is safe to exe-
>        cute. During verification the program takes a  hold  of  maps  that  it
>        intends to use, so selected maps cannot be removed until the program is
>        unloaded. The program can be attached to different events. These events
>        can  be packets, tracepoint events and other types in the future. A new
>        event triggers execution of the program  which  may  store  information
>        about the event in the maps.  Beyond storing data the programs may call
>        into in-kernel helper functions which may, for example, dump stack,  do
>        trace_printk  or other forms of live kernel debugging. The same program
>        can be attached to multiple events. Different programs can  access  the
>        same map:
>          tracepoint  tracepoint  tracepoint    sk_buff    sk_buff
>           event A     event B     event C      on eth0    on eth1
>            |             |          |            |          |
>            |             |          |            |          |
>            --> tracing <--      tracing       socket      socket
>                 prog_1           prog_2       prog_3      prog_4
>                 |  |               |            |
>              |---  -----|  |-------|           map_3
>            map_1       map_2
>
>    Syscall Arguments
>        bpf()  syscall  operation  is determined by cmd which can be one of the
>        following:
>
>        BPF_MAP_CREATE
>               Create a map with given type and attributes and return map FD
>
>        BPF_MAP_LOOKUP_ELEM
>               Lookup element by key in a given map and return its value
>
>        BPF_MAP_UPDATE_ELEM
>               Create or update element (key/value pair) in a given map
>
>        BPF_MAP_DELETE_ELEM
>               Lookup and delete element by key in a given map
>
>        BPF_MAP_GET_NEXT_KEY
>               Lookup element by key in a given map and return key of next ele-
>               ment
>
>        BPF_PROG_LOAD
>               Verify and load eBPF program
>
>        attr   is a pointer to a union of type bpf_attr as defined below.
>
>        size   is the size of the union.

I find this strange.  Why not just make attr be a pointer to the
relevant struct for the operation being invoked?
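
To make the suggestion concrete, here is a rough sketch of per-command argument structs. Every name in it (sketch_map_create_args, sketch_map_elem_args, sketch_cmd, sketch_args_size) is made up for illustration and does not come from the patch set; it only assumes that each command could carry its own struct and that the size argument could be checked against that struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-command argument structs (not from the patch set). */
struct sketch_map_create_args {
    uint32_t map_type;
    uint32_t key_size;
    uint32_t value_size;
    uint32_t max_entries;
};

struct sketch_map_elem_args {
    int32_t  map_fd;
    uint32_t pad;      /* explicit padding keeps the layout unambiguous */
    uint64_t key;      /* user pointer carried as a 64-bit integer */
    uint64_t value;    /* user pointer carried as a 64-bit integer */
};

/* Hypothetical command numbers, used only by this sketch. */
enum sketch_cmd {
    SKETCH_MAP_CREATE,
    SKETCH_MAP_LOOKUP_ELEM,
    SKETCH_MAP_DELETE_ELEM,
};

/* Size of the struct a given command expects, or 0 for an unknown command. */
static size_t sketch_args_size(int cmd)
{
    switch (cmd) {
    case SKETCH_MAP_CREATE:
        return sizeof(struct sketch_map_create_args);
    case SKETCH_MAP_LOOKUP_ELEM:
    case SKETCH_MAP_DELETE_ELEM:
        return sizeof(struct sketch_map_elem_args);
    default:
        return 0;
    }
}
```

With something like this, the size passed to the syscall describes the one struct the command actually reads, rather than the size of a union that grows whenever any command gains a field.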


>
>        union bpf_attr {
>            struct { /* anonymous struct used by BPF_MAP_CREATE command */
>                enum bpf_map_type map_type;

Does this reliably generate the same type on compat systems?  C++11
has a fix for enum ABI compatibility, but this is plain C :(
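
One way to sidestep the question, sketched under the assumption that a fixed-width integer is acceptable in place of the enum. The struct name, the helper, and the exact field layout below are hypothetical, not taken from the patch:

```c
#include <stdint.h>

/* Hypothetical layout: every field has an explicit width, so the struct
 * has the same size and field offsets for 32-bit and 64-bit callers. */
struct sketch_map_create_attr {
    uint32_t map_type;     /* carries an enum bpf_map_type value */
    uint32_t key_size;
    uint32_t value_size;
    uint32_t max_entries;
};

/* Compilation fails if the layout ever drifts from 16 bytes. */
_Static_assert(sizeof(struct sketch_map_create_attr) == 16,
               "attr layout must not depend on the compiler's enum size");

/* Illustration only: report the struct size at run time. */
static unsigned int sketch_map_create_attr_size(void)
{
    return (unsigned int)sizeof(struct sketch_map_create_attr);
}
```

The enum can still exist in the header for readability; it just would not appear as a field type in anything that crosses the user/kernel boundary.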


>            struct { /* anonymous struct used by BPF_PROG_LOAD command */
>                enum bpf_prog_type    prog_type;
>                __u32                 insn_cnt;
>                const struct bpf_insn *insns;
>                const char            *license;
>                __u32                 log_level; /* verbosity level of eBPF verifier */
>                __u32                 log_size;  /* size of user buffer */
>                void                  *log_buf;  /* user supplied buffer */
>            };
>        };

It might be a bit nicer to have separate in and out arguments.


>
>        BPF_MAP_CREATE
>               int bpf_create_map(enum bpf_map_type map_type, int key_size,
>                                  int value_size, int max_entries)
>               {
>                   union bpf_attr attr = {
>                       .map_type = map_type,
>                       .key_size = key_size,
>                       .value_size = value_size,
>                       .max_entries = max_entries
>                   };

I feel like this is asking for trouble, or at least bizarre namespace
collisions in the anonymous struct members.  At least please give the
structs names.  (Also, the first time I read this, I assumed that
those were union members, which would have made the code nonsense.)
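
Roughly what I have in mind, as a sketch only. The union name, the member names map_create and map_elem, and the helper are invented here; the field lists are a guess at what the commands need, not a copy of the patch:

```c
#include <stdint.h>

/* Hypothetical variant of the attr union with named per-command structs. */
union sketch_bpf_attr {
    struct {
        uint32_t map_type;
        uint32_t key_size;
        uint32_t value_size;
        uint32_t max_entries;
    } map_create;              /* would be used by BPF_MAP_CREATE */

    struct {
        int32_t  map_fd;
        uint32_t pad;
        uint64_t key;          /* user pointer as a 64-bit integer */
        uint64_t value;        /* user pointer as a 64-bit integer */
    } map_elem;                /* would be used by the *_ELEM commands */
};

/* Build the attr for a create call; the initializer names the struct
 * explicitly, so it is obvious which command's fields are being set. */
static union sketch_bpf_attr
sketch_make_create_attr(uint32_t map_type, uint32_t key_size,
                        uint32_t value_size, uint32_t max_entries)
{
    union sketch_bpf_attr attr = {
        .map_create = {
            .map_type    = map_type,
            .key_size    = key_size,
            .value_size  = value_size,
            .max_entries = max_entries,
        },
    };

    return attr;
}
```

The extra level of naming costs a few characters per initializer, and in exchange two commands can both have a field called size or flags without stepping on each other.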

>
>        BPF_MAP_DELETE_ELEM
>               int bpf_delete_elem(int fd, void *key)
>               {
>                   union bpf_attr attr = {
>                       .map_fd = fd,
>                       .key = key,
>                   };
>
>                   return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
>               }
>               The call deletes an element in a map fd with given key.

What does it return?  (The same question goes for a bunch of the map ops.)

>
>    eBPF programs
>        BPF_PROG_LOAD
>               This cmd is used to load eBPF program into the kernel.
>
>               char bpf_log_buf[LOG_BUF_SIZE];

What happens if the size isn't LOG_BUF_SIZE?

>
>               int bpf_prog_load(enum bpf_prog_type prog_type,
>                                 const struct bpf_insn *insns, int insn_cnt,
>                                 const char *license)
>               {
>                   union bpf_attr attr = {
>                       .prog_type = prog_type,
>                       .insns = insns,
>                       .insn_cnt = insn_cnt,
>                       .license = license,
>                       .log_buf = bpf_log_buf,
>                       .log_size = LOG_BUF_SIZE,
>                       .log_level = 1,
>                   };
>
>                   return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
>               }
>               prog_type one of the available program types:
>               enum bpf_prog_type {
>                       BPF_PROG_TYPE_UNSPEC,
>                       BPF_PROG_TYPE_SOCKET_FILTER,
>                       BPF_PROG_TYPE_TRACING_FILTER,
>               };

Why does the type matter?



>        {
>            static struct bpf_insn prog[] = {
>                BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
>                BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),
>                BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
>                BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
>                BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
>                BPF_LD_MAP_FD(BPF_REG_1, 0),
>                BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
>                BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
>                BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
>                BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
>                BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
>                BPF_EXIT_INSN(),
>            };
>            int sock, map_fd, prog_fd, key;
>            long long value = 0, tcp_cnt, udp_cnt;
>
>            map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 2);
>            if (map_fd < 0) {
>                printf("failed to create map '%s'\n", strerror(errno));
>                /* likely not run as root */
>                return 1;
>            }
>
>            key = 6; /* tcp */
>            assert(bpf_update_elem(map_fd, &key, &value) == 0);
>
>            key = 17; /* udp */
>            assert(bpf_update_elem(map_fd, &key, &value) == 0);
>
>            prog[5].imm = map_fd;

This (the .imm thing) is sufficiently weird that I think it needs to
be mentioned in the main docs, not just in an example.  It's
especially odd since AFAIK essentially every other object format in
the world uses a separate relocation table instead of inline magic
opcodes like this.
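
For comparison, a separate relocation table could look something like the sketch below. Everything in it is hypothetical: struct sketch_insn models only the imm field, and struct sketch_reloc and sketch_apply_relocs() are names I made up to show the shape of the idea, not a proposal for the actual ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified instruction: only the fields this sketch needs. */
struct sketch_insn {
    uint8_t code;
    int32_t imm;
};

/* Hypothetical relocation entry: "store the fd found in slot map_slot
 * of the caller's fd array into insns[insn_idx].imm". */
struct sketch_reloc {
    uint32_t insn_idx;
    uint32_t map_slot;
};

/* Apply every relocation. Returns 0 on success, or -1 if any entry
 * points outside the program or the fd array; in that case the program
 * is left unmodified. */
static int sketch_apply_relocs(struct sketch_insn *insns, size_t insn_cnt,
                               const struct sketch_reloc *relocs,
                               size_t reloc_cnt,
                               const int *map_fds, size_t map_cnt)
{
    size_t i;

    /* Validate first so a bad entry cannot leave a half-patched program. */
    for (i = 0; i < reloc_cnt; i++)
        if (relocs[i].insn_idx >= insn_cnt || relocs[i].map_slot >= map_cnt)
            return -1;

    for (i = 0; i < reloc_cnt; i++)
        insns[relocs[i].insn_idx].imm = map_fds[relocs[i].map_slot];

    return 0;
}
```

The instruction stream then stays a plain array of instructions, and the fact that instruction 5 refers to a map is recorded in data next to the program instead of being implied by a special opcode.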

>
>        All other commands
>               Zero.

Shouldn't delete return different values depending on whether anything
was deleted?

>
>        ENOENT For  BPF_MAP_LOOKUP_ELEM  or BPF_MAP_DELETE_ELEM, indicates that
>               element with given key was not found.

Ah, here it is.  Please document this with the ops.
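
Something like the following next to the BPF_MAP_DELETE_ELEM text would have answered my question. It is a toy user-space stand-in, not the kernel code: struct sketch_map, sketch_map_add() and sketch_delete_elem() are invented names, and the only behavior borrowed from the man page is "zero on success, ENOENT when the key is not found".

```c
#include <errno.h>

#define SKETCH_MAX_ENTRIES 4

/* Toy stand-in for a map with int keys. */
struct sketch_map {
    int keys[SKETCH_MAX_ENTRIES];
    int used[SKETCH_MAX_ENTRIES];
};

/* Add a key. Returns 0 on success, -1 if the toy table is full. */
static int sketch_map_add(struct sketch_map *map, int key)
{
    int i;

    for (i = 0; i < SKETCH_MAX_ENTRIES; i++) {
        if (!map->used[i]) {
            map->keys[i] = key;
            map->used[i] = 1;
            return 0;
        }
    }
    return -1;
}

/* Delete a key. Returns 0 if an element was deleted; returns -1 and
 * sets errno to ENOENT if no element with that key exists. */
static int sketch_delete_elem(struct sketch_map *map, int key)
{
    int i;

    for (i = 0; i < SKETCH_MAX_ENTRIES; i++) {
        if (map->used[i] && map->keys[i] == key) {
            map->used[i] = 0;
            return 0;
        }
    }
    errno = ENOENT;
    return -1;
}
```

A caller can then tell "deleted" from "was never there" by checking for -1 and errno == ENOENT, which is exactly the detail that should sit beside each op's description.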

>
>        E2BIG  program is too large.
>
> NOTES
>        These commands may be used only by a privileged process (one having the
>        CAP_SYS_ADMIN capability).

I hope this goes away :)

I can't shake the feeling that the whole syscall map API is wrong and
that, instead, there should be a more general concept of objects
provided by the eBPF runtime.  Those objects could have methods that
are callable by the syscall and callable from eBPF code.
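
A very rough sketch of the shape I mean, with the caveat that every name here (sketch_obj, sketch_obj_ops, the single-slot implementation) is invented and nothing like it exists in the patch set. The point is only that one method table could sit behind both the syscall path and the helpers called from eBPF code:

```c
#include <errno.h>
#include <stddef.h>

struct sketch_obj;

/* Hypothetical method table; the syscall and eBPF helpers would both
 * dispatch through it. */
struct sketch_obj_ops {
    int (*lookup)(struct sketch_obj *obj, int key, long long *value);
    int (*update)(struct sketch_obj *obj, int key, long long value);
};

/* Toy object holding at most one key/value pair. */
struct sketch_obj {
    const struct sketch_obj_ops *ops;
    int       present;
    int       key;
    long long value;
};

/* Returns 0 and fills *value if the key is stored; otherwise returns -1
 * with errno set to ENOENT. */
static int slot_lookup(struct sketch_obj *obj, int key, long long *value)
{
    if (!obj->present || obj->key != key) {
        errno = ENOENT;
        return -1;
    }
    *value = obj->value;
    return 0;
}

/* Stores the pair, replacing whatever was there. Always returns 0. */
static int slot_update(struct sketch_obj *obj, int key, long long value)
{
    obj->key = key;
    obj->value = value;
    obj->present = 1;
    return 0;
}

static const struct sketch_obj_ops slot_ops = {
    .lookup = slot_lookup,
    .update = slot_update,
};

/* Initialize an empty single-slot object. */
static void sketch_slot_init(struct sketch_obj *obj)
{
    obj->ops = &slot_ops;
    obj->present = 0;
    obj->key = 0;
    obj->value = 0;
}
```

A map would then be one kind of object among several, and adding a new kind would mean adding a method table rather than a new set of syscall commands.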

--Andy