BPF memstore and bpf_validate_ext()

2013-12-09 Thread Mindaugas Rasiukevicius
Hello,

Now that the BPF memory store can be external and can be provided by the
caller of the BPF program, we can use to it pass some values or reuse it
as a cache.  However, this has to be supported by bpf_validate() - we have
to tell it which words are going to be initialised by the caller, since it
currently checks and prevents loads of the uninitialised memstore words.

Hence, here is the proposed API function for this:

int bpf_set_initwords(bpf_ctx_t *bc, uint8_t *wordmask, size_t len);

Note that we already have an extended bpf_validate() variation:

int bpf_validate_ext(bpf_ctx_t *bc, const struct bpf_insn *f, int len);

The original filter/validate API stays as is, only the extended routines
support the external memory store and its initialisation.

Thanks.

-- 
Mindaugas


Re: BPF memstore and bpf_validate_ext()

2013-12-09 Thread Alexander Nasonov
Mindaugas Rasiukevicius wrote:
> Now that the BPF memory store can be external and can be provided by the
> caller of the BPF program, we can use to it pass some values or reuse it
> as a cache.

I don't think that external memory is needed. You added an auxiliary
agrument, what is it for if it's not for passing some values?

External memory disables several optimizations in bpfjit for most
filter programs even if they don't use BPF_COP.

1. sljit has a limited number of registers especially memory addressable
   registers. I will have to allocate a register to a memstore pointer.
   Because all registers are already assigned, I will have to start
   moving data. The best I can think of is assigning BPFJIT_TMP2 to a
   new pointer but I will need to switch to EREG register in 32bit
   BPF_LD. This register is emulated on some arches, for instance on
   i386.
2. I have (or plan to have) some simple optimizations which aren't
   possible by rewriting a bpf program. For instance, if there are
   multiple loads in a linear block, I will only generate one index
   check.
   Here is an example: LD+W(0) ST(0) LD+W(4) ST(1) LD+W(8) ST(2) ...
   It loads packet words with increasing offsets and stores them
   in memwords for later processing. I generate only only one index
   check for the first LD instruction (return 0 is packet is shorter
   than 12 bytes). I can exit early without storing any word because
   I know that memory will not be available after that bpf program
   returns.
   If you make it external, I will have to generate three index
   checks to make sure that stores are visible to a caller for all
   possible packet lengths.

> However, this has to be supported by bpf_validate() - we have
> to tell it which words are going to be initialised by the caller, since it
> currently checks and prevents loads of the uninitialised memstore words.
> 
> Hence, here is the proposed API function for this:
> 
> int bpf_set_initwords(bpf_ctx_t *bc, uint8_t *wordmask, size_t len);

Your description of the new API is too terse to be absolutely
certain but it doesn't look like my proposal I sent to you privately.

I think you can add a mask to each copfunc which indicates which
of the words it loads and which words are stored by the copfunc.

For instance, your npf_cop_l3 will look like

{ .copfunc = &npf_cop_l3,
  .loads  = BPF_COP_LOAD_NONE,
  .stores = BPF_COP_STORE(BPF_MW_IPVER) |
BPF_COP_STORE(BPF_MW_L4OFF) |
BPF_COP_STORE(BPF_MW_L4PROTO)
},
/* ... other copfuncs ... */

Validation of BPF_COP instruction will look very natural:

/* inside BPF_COP in bpf_validate_ext */
if (bc->copfuncs) {
if (bc->copfuncs[i].inwords & invalid)
goto out;
invalid &= ~bc->copfuncs[i].outwords;
}

It's a bit more trickier for BPF_COPX because you need to pre-calculate
loads/stores masks but it's doable.

At the moment, bpf_filter_ext() resets the "invalid" mask. This
translates to .stores = BPF_COP_STORE_ALL.

Alex


Re: BPF memstore and bpf_validate_ext()

2013-12-09 Thread Mindaugas Rasiukevicius
Alexander Nasonov  wrote:
> Mindaugas Rasiukevicius wrote:
> > Now that the BPF memory store can be external and can be provided by the
> > caller of the BPF program, we can use to it pass some values or reuse it
> > as a cache.
> 
> I don't think that external memory is needed. You added an auxiliary
> agrument, what is it for if it's not for passing some values?

It is an argument/context to be passed for the BPF coprocessors (only).
We want to let the *caller* of bpf_filter() (or JIT-compiled BPF program)
pass some values to the BPF *program* using the memory store.  It allows
us to re-use the memstore as a cache.

> External memory disables several optimizations in bpfjit for most
> filter programs even if they don't use BPF_COP.
> 
> 1. sljit has a limited number of registers especially memory addressable
>registers. I will have to allocate a register to a memstore pointer.
>Because all registers are already assigned, I will have to start
>moving data. The best I can think of is assigning BPFJIT_TMP2 to a
>new pointer but I will need to switch to EREG register in 32bit
>BPF_LD. This register is emulated on some arches, for instance on
>i386.

Isn't it only i386?  On amd64 or RISC archs it would mean just an extra
fetch, right?  I am not sure what is (or would be) the current register
allocation in the latest bpfjit snapshot.  Still, having memstore as a
cache ought to win way more compared to the extra fetches or push/pops.

> 2. I have (or plan to have) some simple optimizations which aren't
>possible by rewriting a bpf program. For instance, if there are
>multiple loads in a linear block, I will only generate one index
>check.
>Here is an example: LD+W(0) ST(0) LD+W(4) ST(1) LD+W(8) ST(2) ...
>It loads packet words with increasing offsets and stores them
>in memwords for later processing. I generate only only one index
>check for the first LD instruction (return 0 is packet is shorter
>than 12 bytes). I can exit early without storing any word because
>I know that memory will not be available after that bpf program
>returns.
>If you make it external, I will have to generate three index
>checks to make sure that stores are visible to a caller for all
>possible packet lengths.

Yes, but as we already discussed, for tcpdump/pcap filtering patterns
such optimisation is of little relevance.  Alternatively, we may add a
flag to indicate whether the caller cares about the memstore once BPF
program finishes (because normally - it does not).

> > Hence, here is the proposed API function for this:
> > 
> > int bpf_set_initwords(bpf_ctx_t *bc, uint8_t *wordmask, size_t len);
> 
> Your description of the new API is too terse to be absolutely
> certain but it doesn't look like my proposal I sent to you privately.

The bitmask merely tells bpf_validate() which words of the memstore will
be initialised by the caller, so it could allow the BPF_LD calls on them
i.e. it would basically do invalid = ~wordmask.

> I think you can add a mask to each copfunc which indicates which
> of the words it loads and which words are stored by the copfunc.
> 
> For instance, your npf_cop_l3 will look like
> 
> { .copfunc = &npf_cop_l3,
>   .loads  = BPF_COP_LOAD_NONE,
>   .stores = BPF_COP_STORE(BPF_MW_IPVER) |
> BPF_COP_STORE(BPF_MW_L4OFF) |
> BPF_COP_STORE(BPF_MW_L4PROTO)
> },
> /* ... other copfuncs ... */
> 
> Validation of BPF_COP instruction will look very natural:
> 
> /* inside BPF_COP in bpf_validate_ext */
> if (bc->copfuncs) {
>   if (bc->copfuncs[i].inwords & invalid)
>   goto out;
>   invalid &= ~bc->copfuncs[i].outwords;
> }

Right, but this is for the coprocessors.  I am fine with it (as we talked
privately, I was considering to propose a very similar API), but it seems
that bpf_set_initwords() already *efficiently* covers all my cases.  With
this change, I will actually be able to remove NPF_COP_L3!

Generally, it might be rare that the coprocessor would set additional words
other than those already initialised by the caller.  In other words, while
this functionality would add additional granularity, there is no use case
for it yet.  So I would not hurry to add it.

> It's a bit more trickier for BPF_COPX because you need to pre-calculate
> loads/stores masks but it's doable.

Yes, it is run-time.  However, I would say let's not bother with it until
we actually have a need for an improvement (I doubt it will ever happen).

> At the moment, bpf_filter_ext() resets the "invalid" mask. This
> translates to .stores = BPF_COP_STORE_ALL.

Correct, that is why NPF is, temporarily, zeroing all words (until some
API gets adopted).

-- 
Mindaugas


Re: BPF memstore and bpf_validate_ext()

2013-12-10 Thread Alexander Nasonov
Mindaugas Rasiukevicius wrote:
> It is an argument/context to be passed for the BPF coprocessors (only).
> We want to let the *caller* of bpf_filter() (or JIT-compiled BPF program)
> pass some values to the BPF *program* using the memory store.  It allows
> us to re-use the memstore as a cache.

Are you saying that we want to pass some values to bpf programs that
don't use cop?

> Isn't it only i386?  On amd64 or RISC archs it would mean just an extra
> fetch, right?  I am not sure what is (or would be) the current register
> allocation in the latest bpfjit snapshot.  Still, having memstore as a
> cache ought to win way more compared to the extra fetches or push/pops.

You can win even more if you switch to manually crafted assembly or
build a sophisticated JIT language tailored for NPF. You don't have
to use bpfjit at all.

> Right, but this is for the coprocessors.  I am fine with it (as we talked
> privately, I was considering to propose a very similar API), but it seems
> that bpf_set_initwords() already *efficiently* covers all my cases.  With
> this change, I will actually be able to remove NPF_COP_L3!

I think I know what are you getting at.

In your case BPF_COP is always the first instruction (or in the
first linear block). Because it's the first and it's a function
call, it can be moved outside of bpf program and inlined.

If this is the case, you don't need BPF_COP at all.

Except of course you may run out or memwords one day and you want
to have BPF_COP as a fallback.

Is the above correct?

Alex


Re: BPF memstore and bpf_validate_ext()

2013-12-10 Thread Mindaugas Rasiukevicius
Alexander Nasonov  wrote:
> Mindaugas Rasiukevicius wrote:
> > It is an argument/context to be passed for the BPF coprocessors (only).
> > We want to let the *caller* of bpf_filter() (or JIT-compiled BPF
> > program) pass some values to the BPF *program* using the memory store.
> > It allows us to re-use the memstore as a cache.
> 
> Are you saying that we want to pass some values to bpf programs that
> don't use cop?

Including the BPF programs which do not use COP, yes.

> > Isn't it only i386?  On amd64 or RISC archs it would mean just an extra
> > fetch, right?  I am not sure what is (or would be) the current register
> > allocation in the latest bpfjit snapshot.  Still, having memstore as a
> > cache ought to win way more compared to the extra fetches or push/pops.
> 
> You can win even more if you switch to manually crafted assembly or
> build a sophisticated JIT language tailored for NPF. You don't have
> to use bpfjit at all.

Well, sljit is what we've got.  I would think it can do nearly as good
as manual crafting if we cover the main cases based on Pareto principle.

> > Right, but this is for the coprocessors.  I am fine with it (as we
> > talked privately, I was considering to propose a very similar API), but
> > it seems that bpf_set_initwords() already *efficiently* covers all my
> > cases.  With this change, I will actually be able to remove NPF_COP_L3!
> 
> I think I know what are you getting at.
> 
> In your case BPF_COP is always the first instruction (or in the
> first linear block). Because it's the first and it's a function
> call, it can be moved outside of bpf program and inlined.
> 
> If this is the case, you don't need BPF_COP at all.

For this particular case - yes, correct.

> Except of course you may run out or memwords one day and you want
> to have BPF_COP as a fallback.

I still need BPF_COP for other tasks (e.g. NPF_COP_TABLE).  Running out
of words is unlikely, but COP can certainly be used to handle that too.

-- 
Mindaugas


Re: BPF memstore and bpf_validate_ext()

2013-12-10 Thread Alexander Nasonov
Mindaugas Rasiukevicius wrote:
> Alexander Nasonov  wrote:
> > In your case BPF_COP is always the first instruction (or in the
> > first linear block). Because it's the first and it's a function
> > call, it can be moved outside of bpf program and inlined.
> > 
> > If this is the case, you don't need BPF_COP at all.
> 
> For this particular case - yes, correct.
> 
> > Except of course you may run out or memwords one day and you want
> > to have BPF_COP as a fallback.
> 
> I still need BPF_COP for other tasks (e.g. NPF_COP_TABLE).  Running out
> of words is unlikely, but COP can certainly be used to handle that too.

I wonder why do you need two different features when one is a
superset of the other if you could use BPF_COP? If the only reason
is performance, external memory is not a win-win proposition.

PS sljit has a non-standard fast call mechanism. If you're really
concerned about performance, you can generate copfuncs and I can
call them from bpfjit.

Alex


Re: BPF memstore and bpf_validate_ext()

2013-12-11 Thread Mindaugas Rasiukevicius
Alexander Nasonov  wrote:
> > > 
> > > If this is the case, you don't need BPF_COP at all.
> > 
> > For this particular case - yes, correct.
> > 
> > > Except of course you may run out or memwords one day and you want
> > > to have BPF_COP as a fallback.
> > 
> > I still need BPF_COP for other tasks (e.g. NPF_COP_TABLE).  Running out
> > of words is unlikely, but COP can certainly be used to handle that too.
> 
> I wonder why do you need two different features when one is a
> superset of the other if you could use BPF_COP?

The fact that you can abuse one feature to achieve the result of another
does not mean that it is a good solution.  Using the external memory store
and initialising/passing some values is just simple and straightforward,
besides being faster.

> <...> If the only reason
> is performance, external memory is not a win-win proposition.

You mean because using EREG penalises the access of the memstore?  I doubt
it is a big deal, but can we just benchmark and see?

-- 
Mindaugas


Re: BPF memstore and bpf_validate_ext()

2013-12-15 Thread Alexander Nasonov
Mindaugas Rasiukevicius wrote:
> Alexander Nasonov  wrote:
> > Are you saying that we want to pass some values to bpf programs that
> > don't use cop?
> 
> Including the BPF programs which do not use COP, yes.
> 
> > > Isn't it only i386?  On amd64 or RISC archs it would mean just an extra
> > > fetch, right?  I am not sure what is (or would be) the current register
> > > allocation in the latest bpfjit snapshot.  Still, having memstore as a
> > > cache ought to win way more compared to the extra fetches or push/pops.
> > 
> > You can win even more if you switch to manually crafted assembly or
> > build a sophisticated JIT language tailored for NPF. You don't have
> > to use bpfjit at all.
> 
> Well, sljit is what we've got.  I would think it can do nearly as good
> as manual crafting if we cover the main cases based on Pareto principle.

All your recent changes to adapt bpfjit for npf show that you're hitting
sljit limitation all the time. Your expectations of sljit are probably
too high.

Originally, there were three arguments (buf, wirelen and buflen)
passed to a generated function. You wanted to add two more: ctx
and an auxiliary pointer. However, sljit doesn't support functions
with more than three arguments. So, you packed the arguments into
a struct:

struct bpf_args {
const struct mbuf * pkt;
size_t  wirelen;
size_t  buflen;
uint32_tmem[BPF_MEMWORDS];
void *  arg;
};

But you also added non-argument mem[BPF_MEMWORDS] array. If you
wanted external memory, you'd better passed mem array separately.
But you were hitting three arguments limit of sljit when calling
a copfunc.

When I later was implementing copfunc calls in bpfjit, I din't know
how you gonna use external memory and I felt that it wasn't necessary
(and I still do), so I changed your structs.

I moved non-argument "mem" from the bpf_args argument pack:

struct bpf_args {
const struct mbuf * pkt;
size_t  wirelen;
size_t  buflen;
void *  arg;
};

to bpf_state struct:

struct bpf_state {
uint32_tmem[BPF_MEMWORDS];
uint32_tregA;
};

and I passed bpf_state to copfunc.

By not making memory external I avoided doing a lot of changes in
bpfjit to adapt to a new execution model.

Now you want to have *both* external memory and copfuncs for a faster
execution of your code without considering other uses of bpfjit.

To be honest, I'm not very intersted in implementing external memory
in bpfjit. Quite the opposite, after all these discussions I want
to improve my index check optimization and make filters like 'host xxx'
run a bit faster. Needless to say that this optimization is the most
effective if bpf memory isn't visible from outside.

This discussion is going to nowhere. At this point, someone from
outside should review our opinions and help us to make a decision.

Alex

> > > Right, but this is for the coprocessors.  I am fine with it (as we
> > > talked privately, I was considering to propose a very similar API), but
> > > it seems that bpf_set_initwords() already *efficiently* covers all my
> > > cases.  With this change, I will actually be able to remove NPF_COP_L3!
> > 
> > I think I know what are you getting at.
> > 
> > In your case BPF_COP is always the first instruction (or in the
> > first linear block). Because it's the first and it's a function
> > call, it can be moved outside of bpf program and inlined.
> > 
> > If this is the case, you don't need BPF_COP at all.
> 
> For this particular case - yes, correct.
> 
> > Except of course you may run out or memwords one day and you want
> > to have BPF_COP as a fallback.
> 
> I still need BPF_COP for other tasks (e.g. NPF_COP_TABLE).  Running out
> of words is unlikely, but COP can certainly be used to handle that too.
> 
> -- 
> Mindaugas
> 

-- 


Re: BPF memstore and bpf_validate_ext()

2013-12-15 Thread Mindaugas Rasiukevicius
Alexander Nasonov  wrote:
> All your recent changes to adapt bpfjit for npf show that you're hitting
> sljit limitation all the time. Your expectations of sljit are probably
> too high.

We are discussing the external memory store which is useful regardless
whether BPF program is JIT-compiled or not.  It is a win in both cases.

Also, it was you who proposed sljit.  It can optimise *most* practical
cases (80-20 rule) and I am happy with that.  I do not understand why
are you concerned about those rare/unusual cases.  Do you have some
particular application in mind?  Something else than in our tree?

> <...>
> 
> When I later was implementing copfunc calls in bpfjit, I din't know
> how you gonna use external memory and I felt that it wasn't necessary
> (and I still do), so I changed your structs.

We can pass the memstore pointer as a separate argument (it would be
three arguments, fine for sljit), but what's the point..

> <...>
> 
> By not making memory external I avoided doing a lot of changes in
> bpfjit to adapt to a new execution model.
> 
> Now you want to have *both* external memory and copfuncs for a faster
> execution of your code without considering other uses of bpfjit.
> 
> To be honest, I'm not very intersted in implementing external memory
> in bpfjit. Quite the opposite, after all these discussions I want
> to improve my index check optimization and make filters like 'host xxx'
> run a bit faster. Needless to say that this optimization is the most
> effective if bpf memory isn't visible from outside.

Why are you ignoring the fact that your optimisations can still be added
and be effective?  I already suggested - we can add a flag to indicate that
the caller does not care about the result in the memory store.  Moreover,
the usual byte-code produced by tcpdump/pcap does not even use the memory
store so you optimisations would most of the time be applicable anyway!

> This discussion is going to nowhere. At this point, someone from
> outside should review our opinions and help us to make a decision.
> 
> Alex

-- 
Mindaugas


Re: BPF memstore and bpf_validate_ext()

2013-12-15 Thread Alexander Nasonov
Mindaugas Rasiukevicius wrote:
> Also, it was you who proposed sljit.

Proposed for what? I implemented bpfjit using sljit if that's what
you mean. I offered you a help with implementing jit compiler for
npfcode. It was your idea to add COP/COPX and I agreed to implement
a support for it in bpfjit. I never agreed on implementing external
memory.

> It can optimise *most* practical
> cases (80-20 rule) and I am happy with that.  I do not understand why
> are you concerned about those rare/unusual cases.  Do you have some
> particular application in mind?  Something else than in our tree?

I don't have any application in mind but I don't understand why are you
pushing two extentions to bpf solely to get performance benefit for
your cases and you don't care that bpf looses performance even if there
are no cop instructions in a program at all.

> We can pass the memstore pointer as a separate argument (it would be
> three arguments, fine for sljit), but what's the point..

My point is that you mix "argument pack" with something else. They
should be separeted.

> Why are you ignoring the fact that your optimisations can still be added
> and be effective?  I already suggested - we can add a flag to indicate that
> the caller does not care about the result in the memory store.

I already offered to support SLJIT_FAST_CALL copfuncs in bpfjit.
They're much faster than regular copfuncs.  But that's mean you
will need to emit sljit code and you will have a limited number of
sljit registers and all other limitations of sljit. You still should
be able to copy data from auxiliary argument to memstore and you
can do it quite fast.

You didn't respond to me about it. If you ingored it because you don't
want to deal with sljit than you're pulling a blanket. It's easy to
suggest to have a flag but it's actually a lot or work. You need to
write several lines of C code to generate a single instruction.

I don't want to maintain two different modes of code generation.
If you want this flag, go ahead, write the code, write the tests
and everyone will be happy.


> Moreover,
> the usual byte-code produced by tcpdump/pcap does not even use the memory
> store so you optimisations would most of the time be applicable anyway!

Maybe in this case. I don't know all use cases. There are some
IDS/IPS that use bpf but I never looked at them.

In any case, this functionality will have to be tested.

Alex


Re: BPF memstore and bpf_validate_ext()

2013-12-16 Thread Darren Reed
On 16/12/2013 3:20 AM, Alexander Nasonov wrote:

> ..
> This discussion is going to nowhere. At this point, someone from
> outside should review our opinions and help us to make a decision.

It is one hack after another to support NPF using tables in BPF.

Darren



Re: BPF memstore and bpf_validate_ext()

2013-12-16 Thread Mindaugas Rasiukevicius
Alexander Nasonov  wrote:
> Mindaugas Rasiukevicius wrote:
> > Also, it was you who proposed sljit.
> 
> Proposed for what? I implemented bpfjit using sljit if that's what
> you mean. I offered you a help with implementing jit compiler for
> npfcode. It was your idea to add COP/COPX and I agreed to implement
> a support for it in bpfjit. I never agreed on implementing external
> memory.

Fair enough.  I still lack clear understanding why are you unhappy with
the external memory store..

> > It can optimise *most* practical
> > cases (80-20 rule) and I am happy with that.  I do not understand why
> > are you concerned about those rare/unusual cases.  Do you have some
> > particular application in mind?  Something else than in our tree?
> 
> I don't have any application in mind but I don't understand why are you
> pushing two extentions to bpf solely to get performance benefit for
> your cases and you don't care that bpf looses performance even if there
> are no cop instructions in a program at all.

I *do* care about performance of regular BPF usage i.e. tcpdump/libpcap.
However, I have been trying to explain that it is trivial to avoid the
performance penalty.. perhaps I miss something, but you did not explain
why is it problematic.

> > We can pass the memstore pointer as a separate argument (it would be
> > three arguments, fine for sljit), but what's the point..
> 
> My point is that you mix "argument pack" with something else. They
> should be separeted.

The external memory store can be used as an argument (and the initial
values determined as proposed in this thread).  If you want, we can pass
it as a third argument, I just think it is a pointless indirection level.
Would even need extra wrapping i.e. more work in bpfjit, but if you want
that separated - fine.

> > Why are you ignoring the fact that your optimisations can still be added
> > and be effective?  I already suggested - we can add a flag to indicate
> > that the caller does not care about the result in the memory store.
> 
> I already offered to support SLJIT_FAST_CALL copfuncs in bpfjit.
> They're much faster than regular copfuncs.  But that's mean you
> will need to emit sljit code and you will have a limited number of
> sljit registers and all other limitations of sljit. You still should
> be able to copy data from auxiliary argument to memstore and you
> can do it quite fast. <...>

That is great, but we are going circles here.  If a program just needs
some values stored in the memory store - why would you create a COP to
get few integers instead of simply letting the caller to pass them.  It
is the thing which actually has those numbers.

It straightforward, it is simpler, it is faster, and it does not make
the performance suffer for other programs.  BPF_COP optimisations are
brilliant, but.. we are talking about different issue.

> It's easy to
> suggest to have a flag but it's actually a lot or work. You need to
> write several lines of C code to generate a single instruction.

The flag would basically say "treat the memstore as internal i.e. just
do all the optimisations, because I assure there are no side effects".
It is a green light for what you said you already want to implement.
That would be one-liner check or I miss something?

> I don't want to maintain two different modes of code generation.
> If you want this flag, go ahead, write the code, write the tests
> and everyone will be happy.

I was simply trying to make suggestions which would make your life
easier, certainly not harder.  Alternatively, I can write patches.

-- 
Mindaugas


Re: BPF memstore and bpf_validate_ext()

2013-12-16 Thread Mindaugas Rasiukevicius
Darren Reed  wrote:
> On 16/12/2013 3:20 AM, Alexander Nasonov wrote:
> 
> > ..
> > This discussion is going to nowhere. At this point, someone from
> > outside should review our opinions and help us to make a decision.
> 
> It is one hack after another to support NPF using tables in BPF.
> 

Nice try, but you missed it.  NPF tables are already supported and have
been happily working for a good while.  The discussion is not really about
them.  Try again.

-- 
Mindaugas


Re: BPF memstore and bpf_validate_ext()

2013-12-18 Thread Alexander Nasonov
Mindaugas Rasiukevicius wrote:
> Alexander Nasonov  wrote:
> > My point is that you mix "argument pack" with something else. They
> > should be separeted.
> 
> The external memory store can be used as an argument (and the initial
> values determined as proposed in this thread).  If you want, we can pass
> it as a third argument, I just think it is a pointless indirection level.
> Would even need extra wrapping i.e. more work in bpfjit, but if you want
> that separated - fine.

I'll tell you more. Mixing mem with arguments will not only save
me one register/one indirection level but it will allow to use
memory addressing. It's very attractive solution from a performance
perspective. But it's clearly a hack and I don't want to have hacks
in public API.

> > I already offered to support SLJIT_FAST_CALL copfuncs in bpfjit.
> > They're much faster than regular copfuncs.  But that's mean you
> > will need to emit sljit code and you will have a limited number of
> > sljit registers and all other limitations of sljit. You still should
> > be able to copy data from auxiliary argument to memstore and you
> > can do it quite fast. <...>
> 
> That is great, but we are going circles here.  If a program just needs
> some values stored in the memory store - why would you create a COP to
> get few integers instead of simply letting the caller to pass them.  It
> is the thing which actually has those numbers.

What if your program "just needs" something else? Does it mean we have
to add a new feature to bpf? No. We have to stop somewhere. BPF_COP
is powerful enough for getting external data. It's a bit slower but I
can give you SLJIT_FAST_CALL interface. If it's still slow for you,
you should stop using sljit and consider other alternatives.

> > It's easy to
> > suggest to have a flag but it's actually a lot or work. You need to
> > write several lines of C code to generate a single instruction.
> 
> The flag would basically say "treat the memstore as internal i.e. just
> do all the optimisations, because I assure there are no side effects".
> It is a green light for what you said you already want to implement.
> That would be one-liner check or I miss something?

Ok, with my recent change on github, it's indeed one-line change
but having two different modes can introduce subtle bugs. If you
forget to set the mode properly, you may have a situation when your
memory is in a bad state only for packets of length 131 (or whatever)
and it's consistent for all other lengths.
I'm as a maintainer of bpfjit is responsible for a robust public
interface.

Alex


Re: BPF memstore and bpf_validate_ext()

2013-12-19 Thread Mindaugas Rasiukevicius
Alexander Nasonov  wrote:
> > The external memory store can be used as an argument (and the initial
> > values determined as proposed in this thread).  If you want, we can pass
> > it as a third argument, I just think it is a pointless indirection
> > level. Would even need extra wrapping i.e. more work in bpfjit, but if
> > you want that separated - fine.
> 
> I'll tell you more. Mixing mem with arguments will not only save
> me one register/one indirection level but it will allow to use
> memory addressing. It's very attractive solution from a performance
> perspective. But it's clearly a hack and I don't want to have hacks
> in public API.

We have discussed this before and I wrote quite long paragraph trying to
explain why I think such split does not really have real merits.  In a
nutshell: such separation does not solve any problem (does not reduce
complexity, does not help to honour the information hiding principle)
and the logical justification to have bpf_stack_t is a bit weak.

I am fine/neutral about adding a third uint32_t *mem argument.

> > That is great, but we are going circles here.  If a program just needs
> > some values stored in the memory store - why would you create a COP to
> > get few integers instead of simply letting the caller to pass them.  It
> > is the thing which actually has those numbers.
> 
> What if your program "just needs" something else? Does it mean we have
> to add a new feature to bpf? No. We have to stop somewhere. <...>

We should not stop because of the quantity.  We should consider whether it
*makes sense* to add such feature, looking at its virtues and implications.

I have already explained the benefits of the external memory store in this
thread (simple, straightforward way to pass values and have their cache).
Your main argument seems to be that the external memstore makes it more
difficult for bpfjit to optimise certain corner cases (which, as been
pointed out, are not common).

Given that BPF did not have JIT compiler for many years and even a very
simplistic JIT compilation is a huge win - I do not understand why do you
consider corner case optimisations as a higher value than the benefits
provided by the external memstore.

> BPF_COP
> is powerful enough for getting external data. It's a bit slower but I
> can give you SLJIT_FAST_CALL interface. If it's still slow for you,
> you should stop using sljit and consider other alternatives.

That is wonderful, but not the point..

> > > It's easy to
> > > suggest to have a flag but it's actually a lot or work. You need to
> > > write several lines of C code to generate a single instruction.
> > 
> > The flag would basically say "treat the memstore as internal i.e. just
> > do all the optimisations, because I assure there are no side effects".
> > It is a green light for what you said you already want to implement.
> > That would be one-liner check or I miss something?
> 
> Ok, with my recent change on github, it's indeed one-line change
> but having two different modes can introduce subtle bugs. If you
> forget to set the mode properly, you may have a situation when your
> memory is in a bad state only for packets of length 131 (or whatever)
> and it's consistent for all other lengths.
> I'm as a maintainer of bpfjit is responsible for a robust public
> interface.

Just keep the flag off by default?  Really, we have only few BPF users
in the NetBSD tree.  It is not rocket science to correctly set a flag.

-- 
Mindaugas


Re: BPF memstore and bpf_validate_ext()

2013-12-19 Thread Alexander Nasonov
Mindaugas Rasiukevicius wrote:
> > > That is great, but we are going circles here.  If a program just needs
> > > some values stored in the memory store - why would you create a COP to
> > > get few integers instead of simply letting the caller to pass them.  It
> > > is the thing which actually has those numbers.

Well, if it wasn't needed for many year in bpf, why do we need it now? ;-)

To answer your question about a program that needs some values stored,
you want a protocol to pass some values from the host C environment to
embedded BPF environment. You do it in a very ad-hoc manner. It's
not possible to say by looking at bpf program which memwords are passed
from outside and which are really internal.

BPF is a language and all its extensions should be designed with
this in mind.

> I have already explained the benefits of the external memory store in this
> thread (simple, straightforward way to pass values and have their cache).
> Your main argument seems to be that the external memstore makes it more
> difficult for bpfjit to optimise certain corner cases (which, as been
> pointed out, are not common).

Well, I was arguing about performance because you were concerned about
performance of your bpf programs.

Your solution lacks a concept. You're mixing things together without
bothering much about how will they interact to achieve a common
goal. I already pointed out that your COP is powerful enough to
copy external data to a memory store. To add to my other point
above, you're mixing local and global memory. Or, if you use an
analogy with Lua (that uses a byte code to interpret program),
there are local and global variables with a very clear distinction
between them. With your proposal, there is no clear distinction
inside bpf program which memory words are internal and which are
external.

Though, I don't think I want to see global memory in bpf.

> Given that BPF did not have JIT compiler for many years and even a very
> simplistic JIT compilation is a huge win - I do not understand why do you
> consider corner case optimisations as a higher value than the benefits
> provided by the external memstore.

I personally see no benefits in external memstore at all. You can copy
external data with a special copfunc.

> > BPF_COP
> > is powerful enough for getting external data. It's a bit slower but I
> > can give you SLJIT_FAST_CALL interface. If it's still slow for you,
> > you should stop using sljit and consider other alternatives.
> 
> That is wonderful, but not the point..

Ok, let me make a point them. If you want to make significant changes to
bpf, you'd better start from scrach and design whatever you want. I'd be
easier than arguing about bpf changes forever.

Alex


Re: BPF memstore and bpf_validate_ext()

2013-12-19 Thread Mindaugas Rasiukevicius
Alexander Nasonov  wrote:
> 
> Well, if it wasn't needed for many year in bpf, why do we need it now? ;-)
> 

Because it was decided to use BPF byte-code for more applications and that
meant there is a need for improvements.  It is called evolution. :)

> To answer your question about a program that needs some values stored,
> you want a protocol to pass some values from the host C environment to
> embedded BPF environment. You do it in a very ad-hoc manner.

We are talking about byte-code i.e. quite low level environment.  The
protocol (contract) is specified at the caller level.  Can you explain
what kind of generic protocol do you expect to see?  Let me put it this
way - what *problem* are you trying to solve?

> It's
> not possible to say by looking at bpf program which memwords are passed
> from outside and which are really internal.

How is it different when using BPF coprocessor?  It is as clear (or not)
as using the external memory store.  It is a contract between the caller
and the BPF program.

> BPF is a language and all its extensions should be designed with
> this in mind.
> 
> > I have already explained the benefits of the external memory store in
> > this thread (simple, straightforward way to pass values and have their
> > cache). Your main argument seems to be that the external memstore makes
> > it more difficult for bpfjit to optimise certain corner cases (which,
> > as been pointed out, are not common).
> 
> Well, I was arguing about performance because you were concerned about
> performance of your bpf programs.
> 
> Your solution lacks a concept. You're mixing things together without
> bothering much about how will they interact to achieve a common
> goal. I already pointed out that your COP is powerful enough to
> copy external data to a memory store. To add to my other point
> above, you're mixing local and global memory. Or, if you use an
> analogy with Lua (that uses a byte code to interpret program),
> there are local and global variables with a very clear distinction
> between them. With your proposal, there is no clear distinction
> inside bpf program which memory words are internal and which are
> external.

Huh?  What does it have to do with local and global variables?  We barely
have a concept of variable.  We are discussing very simple byte-code which
is not even Turing-complete.  It is not Lua, it is not Haskell, it is tiny
assembly language.  I am all for well-thought abstractions and interfacing,
but how far do you want to go in abstracting 16 words?

I have been trying to figure out what existing or future practical problems
you are trying to solve?  Not metaphysical use cases or fantasy languages
based on BPF byte-code.. they do not exist, they make no sense.  Sometimes
array is just an array, not an object.  The sword is double-edged and the
other edge is called over-engineering.

It is disappointing that we are going ad infinitum..

-- 
Mindaugas


Re: BPF memstore and bpf_validate_ext()

2013-12-19 Thread David Laight
On Fri, Dec 20, 2013 at 01:28:12AM +0200, Mindaugas Rasiukevicius wrote:
> Alexander Nasonov  wrote:
> > 
> > Well, if it wasn't needed for many year in bpf, why do we need it now? ;-)
> > 
> 
> Because it was decided to use BPF byte-code for more applications and that
> meant there is a need for improvements.  It is called evolution. :)

Has anyone here looked closely at the changes linux is making to bpf?

David

-- 
David Laight: da...@l8s.co.uk


Re: BPF memstore and bpf_validate_ext()

2013-12-20 Thread Alexander Nasonov
Sorry for top-posting. I'm replying from my phone.

I've not looked at linux bpf before. I remember taking a quick look at 
bpf_jit_compile function but I didn't like emitting binary machine code with 
macro commands.

I spent few minutes today looking at linux code and I noticed few interesting 
things:

- They use negative offsets to access auxiliary data. So, there is a clear 
distinction between local memory store and external data. I don't think it's a 
new addition, though.
- They have a big enum of commands. Many of them translate to bpf commands but 
there are also special commands like load protocol number into A. There is a 
decoder from bpf but I have no clue how it works.
- Those commands are adapted to work with skbuf data.

Alex


20.12.13, 04:16, "David Laight" ":
> 
> On Fri, Dec 20, 2013 at 01:28:12AM +0200, Mindaugas Rasiukevicius wrote:
> > Alexander Nasonov  wrote:
> > > 
> > > Well, if it wasn't needed for many year in bpf, why do we need it now? ;-)
> > > 
> > 
> > Because it was decided to use BPF byte-code for more applications and that
> > meant there is a need for improvements.  It is called evolution. :)
> 
> Has anyone here looked closely at the changes linux is making to bpf?
> 
>   David
> 
> -- 
> David Laight: da...@l8s.co.uk


-- 
Alex


Re: BPF memstore and bpf_validate_ext()

2013-12-20 Thread Justin Cormack
On Fri, Dec 20, 2013 at 1:13 PM, Alexander Nasonov  wrote:
> Sorry for top-posting. I'm replying from my phone.
>
> I've not looked at linux bpf before. I remember taking a quick look at 
> bpf_jit_compile function but I didn't like emitting binary machine code with 
> macro commands.
>
> I spent few minutes today looking at linux code and I noticed few interesting 
> things:
>
> - They use negative offsets to access auxiliary data. So, there is a clear 
> distinction between local memory store and external data. I don't think it's 
> a new addition, though.
> - They have a big enum of commands. Many of them translate to bpf commands 
> but there are also special commands like load protocol number into A. There 
> is a decoder from bpf but I have no clue how it works.
> - Those commands are adapted to work with skbuf data.

I have used some of the "non traditional" uses of bpf in Linux, in
particularly the syscall filtering code, which is designed to be a bit
like the packet filtering code. But I don't think it is a great model,
and I think the jit compiler is rather different, as it compiles to
asm. And the current route they are going seems to be validation
rather than in kernel jitting, see https://lwn.net/Articles/575531/

Justin


> Alex
>
>
> 20.12.13, 04:16, "David Laight" ":
>>
>> On Fri, Dec 20, 2013 at 01:28:12AM +0200, Mindaugas Rasiukevicius wrote:
>> > Alexander Nasonov  wrote:
>> > >
>> > > Well, if it wasn't needed for many year in bpf, why do we need it now? 
>> > > ;-)
>> > >
>> >
>> > Because it was decided to use BPF byte-code for more applications and that
>> > meant there is a need for improvements.  It is called evolution. :)
>>
>> Has anyone here looked closely at the changes linux is making to bpf?
>>
>>   David
>>
>> --
>> David Laight: da...@l8s.co.uk
>
>
> --
> Alex


Re: BPF memstore and bpf_validate_ext()

2013-12-29 Thread Alexander Nasonov
Mindaugas Rasiukevicius wrote:
> Moreover, the usual byte-code produced by tcpdump/pcap does not
> even use the memory store so you optimisations would most of the
> time be applicable anyway!

This is not always the case. For instance,

# tcpdump -y IEEE802_11 -i urtwn0 -d not tcp
tcpdump: data link type IEEE802_11
(000) ldx  #0x0
(001) txa
(002) add  #24
(003) st   M[0]
(004) ldb  [x + 0]
(005) jset #0x8 jt 6jf 11
(006) jset #0x4 jt 11   jf 7
(007) jset #0x80jt 8jf 11
(008) ld   M[0]
(009) add  #2
(010) st   M[0]
(011) ldb  [0]
(012) jset #0x4 jt 27   jf 13
(013) ldb  [0]
(014) jset #0x8 jt 15   jf 27
(015) ldx  M[0]
(016) ldh  [x + 6]
(017) jeq  #0x86dd  jt 18   jf 27
(018) ldx  M[0]
(019) ldb  [x + 14]
(020) jeq  #0x6 jt 37   jf 21
(021) ldx  M[0]
(022) ldb  [x + 14]
(023) jeq  #0x2cjt 24   jf 27
(024) ldx  M[0]
(025) ldb  [x + 48]
(026) jeq  #0x6 jt 37   jf 27
(027) ldb  [0]
(028) jset #0x4 jt 38   jf 29
(029) ldb  [0]
(030) jset #0x8 jt 31   jf 38
(031) ldx  M[0]
(032) ldh  [x + 6]
(033) jeq  #0x800   jt 34   jf 38
(034) ldx  M[0]
(035) ldb  [x + 17]
(036) jeq  #0x6 jt 37   jf 38
(037) ret  #0
(038) ret  #65535

Alex


Re: BPF memstore and bpf_validate_ext()

2014-01-01 Thread Mindaugas Rasiukevicius
Alexander Nasonov  wrote:
> Mindaugas Rasiukevicius wrote:
> > Moreover, the usual byte-code produced by tcpdump/pcap does not
> > even use the memory store so you optimisations would most of the
> > time be applicable anyway!
> 
> This is not always the case. For instance,
> 
> # tcpdump -y IEEE802_11 -i urtwn0 -d not tcp
> tcpdump: data link type IEEE802_11
> <...>

I am aware of these cases, but I optimise for 80%, not 20%.

-- 
Mindaugas