This may be better on the dev list.

I have a couple of questions about what's going on here, please...

On Tue, Jun 12, 2012 at 2:54 PM, Stephen Allen <[email protected]> wrote:
> ARQ currently materializes the RHS into an ArrayList which it then
> uses for a linear search on each LHS iteration.

It's materialized? Is there a streaming option if the RHS gets too
big? If not streaming, then perhaps a store-to-disk option? Better
yet, can the result of an index lookup be represented as a
QueryIterator? (I realize that last question may be hopelessly naive,
but I'm very new to this codebase).

> An easyish
> performance improvement would be to replace that with a HashSet
> containing bindings made up of the shared variables.

It would be helpful at this point to know all the variables that can
appear on either side. However, that's not possible here, is it? It
looks like that information is in bindingLeft and the iterator for
Bindings in tableRight, and you'd have to iterate exhaustively to find
them all. Or is there some other way to get this information? (The
query structure has this, but I can't see that it's readily accessible
from here).

Unless I'm mistaken, if I can't get hold of the variables that may
appear in the bindings, then a simple HashSet approach won't be
adequate.
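Thinking in code for a moment: here is a minimal sketch of the
projected-HashSet idea, *assuming* the shared variable set is known up
front, which is exactly the open question above. This is not Jena's
actual API; plain String maps stand in for Bindings, and MinusSketch,
project and minus are hypothetical names.

```java
import java.util.*;

public class MinusSketch {

    // Project a binding (variable -> value map) onto a set of variables.
    static Map<String, String> project(Map<String, String> binding,
                                       Set<String> vars) {
        Map<String, String> out = new HashMap<>();
        for (String v : vars) {
            if (binding.containsKey(v)) out.put(v, binding.get(v));
        }
        return out;
    }

    // MINUS via a hash of the RHS projected onto the shared variables.
    // Simplification: assumes RHS bindings bind all the shared variables,
    // so projecting both sides to the same key is enough for compatibility.
    static List<Map<String, String>> minus(List<Map<String, String>> left,
                                           List<Map<String, String>> right,
                                           Set<String> shared) {
        Set<Map<String, String>> rightKeys = new HashSet<>();
        for (Map<String, String> b : right) rightKeys.add(project(b, shared));

        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> b : left) {
            Map<String, String> key = project(b, shared);
            // Per SPARQL MINUS, a left binding survives if it shares no
            // variables with the RHS (empty key) or no compatible RHS
            // binding exists.
            if (key.isEmpty() || !rightKeys.contains(key)) result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> out = minus(
                List.of(Map.of("x", "a"), Map.of("x", "b")),
                List.of(Map.of("x", "a")),
                Set.of("x"));
        System.out.println(out);  // only {x=b} survives
    }
}
```

The whole approach stands or falls on computing `shared` without
exhausting both iterators first, which is the problem described above.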

> A more advanced change would be do what you've done and not
> materialize the RHS when it has a fast index search, taking scope
> differences and shared variables into account.

Does scope matter for Minus? The iterator for the right hand side
takes a null context. (That's one of the main differences from NOT
EXISTS, isn't it?)

> Although as you
> mention this becomes essentially the same query optimization decision
> as hash vs nested loop join at this point.

That's the tradeoff between working with what indexes you have, and
building a new one for the current situation. Of course, either
approach will be better than the current code.

Thinking out loud... I haven't tried it myself for this, but there's
also a hybrid approach where you use the existing index, and cache in
a hash as you go. That'd help for data that you revisit, but it may be
that most queries won't revisit things that get subtracted very often.
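As a sketch of that hybrid (again not Jena's API: `CachedIndexProbe` is
a hypothetical name, and the lambda stands in for whatever the real
index search would be), the idea is just to memoize probe results so
repeated keys only hit the index once:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hybrid probe: use the existing index lookup, but cache each answer
// so that revisited keys are served from the hash instead.
public class CachedIndexProbe<K> {
    private final Function<K, Boolean> indexLookup;  // the existing index search
    private final Map<K, Boolean> cache = new HashMap<>();
    private int indexHits = 0;

    public CachedIndexProbe(Function<K, Boolean> indexLookup) {
        this.indexLookup = indexLookup;
    }

    public boolean matches(K key) {
        // Only consult the underlying index on a cache miss.
        return cache.computeIfAbsent(key, k -> {
            indexHits++;
            return indexLookup.apply(k);
        });
    }

    public int indexHits() { return indexHits; }

    public static void main(String[] args) {
        CachedIndexProbe<String> probe =
                new CachedIndexProbe<>(k -> k.equals("a") || k.equals("b"));
        probe.matches("a");
        probe.matches("a");  // served from the cache, no second index hit
        probe.matches("z");
        System.out.println("index hits: " + probe.indexHits());  // 2
    }
}
```

Whether the caching pays for itself depends on exactly the revisit rate
in question: if most subtracted keys are seen once, the cache is pure
overhead.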

Paul
