hendrikmuhs commented on a change in pull request #460:
URL: https://github.com/apache/lucene/pull/460#discussion_r762005851
##########
File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java
##########
@@ -1000,6 +1027,98 @@ private void writePresenceBits(
assert bytePos - dest == numPresenceBytes;
}
+ private long estimateNodeAddress(
Review comment:
I agree that one pass would be easier. What makes it complicated is the
variable length encodings. In order to patch up a cell you still need to
reserve the right number of cells to write into. If you reserved 1 byte but
later realize you need 2 you have a problem. A buffer where you can insert a
cell might help, but than the whole calculation falls apart again, because the
cell that you patched a step ago might need another patch and might overflow.
If you look at the implementation below, this is what `tolerance` is about:
I have an estimate that says, I need to store `125`, that's `1` byte, but if
it the real value will be `128` I need `2` bytes, that means my `tolerance` is
`2`. When the address slides I am still fine as long as it only goes up by
maximum `2`. If above `2`, I stretched my budget. The algorithm returns `0` in
this case, which means it falls back to absolute coding. In the end it is a
heuristic which might fail, we could do 3rd pass to catch a couple more cases,
but the added benefit is probably so small that it isn't worth the extra loop
(I can test this later, right now it sounds premature).
^ fits with your question regarding always using relative offsets. It would
be quite hard to implement. The extra bit is for free at the moment, because
the flag has a fixed size. It might be a problem if you already have other uses
of flags in mind and want to reserve it.
##########
File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java
##########
@@ -720,9 +745,9 @@ long addNode(FSTCompiler<T> fstCompiler,
FSTCompiler.UnCompiledNode<T> nodeIn)
}
if (targetHasArcs && (flags & BIT_TARGET_NEXT) == 0) {
- assert target.node > 0;
+ assert targetNode > 0;
Review comment:
> also - if we were able to encode negative offsets we could always
apply this variable encoding
A negative offset would mean you target a node that has not been written
yet. That's not possible by design. In addition it is not necessary, the way
the fst is constructed is from the leafs to the root. A parent always has a
higher offset than its child. Therefore negative offsets never happen.
Absolute vs. relative coding: I wasn't able to use relative addresses all
times, if "fixed length arcs" are used I switch this optimization off
completely. To always use relative coding I must replace all absolute
addresses. I am not sure this is even possible.
When 1st experimenting with this technique years ago, I did some
measurements and calculations. In smaller fst's absolute coding is sufficient,
we have small addresses anyway. The larger the fst gets the more interesting
relative coding becomes. If the fst is highly compressible the absolute address
was often smaller than the relative one. E.g. leaf usually compress like that.
For larger fst's and fst's that do not compress well (because of values)
relative coding works nicely, because the construction yields good locality.
##########
File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java
##########
@@ -720,9 +745,9 @@ long addNode(FSTCompiler<T> fstCompiler,
FSTCompiler.UnCompiledNode<T> nodeIn)
}
if (targetHasArcs && (flags & BIT_TARGET_NEXT) == 0) {
- assert target.node > 0;
+ assert targetNode > 0;
Review comment:
An absolute node address targeting `0` is not possible due to the
reverse reading logic and that's the old assert anyway.
So the question remains, can a relative target be `0`?
In theory this would be a epsilon transition if I remember correctly. So the
arc would follow a label and target itself like a regexp with a `*`, e.g. `a*`.
In practice I think such a regex-like matching functionality is not what
this implementation is made for nor do I think it is possible at the moment.
The nodes are constructed as `UncompiledNode`, but in order to make such an
epsilon transition you need to know the target before you push it for
persistence. In simpler words, this is a chicken-egg problem.
I know about some experimentation to use fst's as regexp engine, but at the
moment I think this is not a problem. If it ever will be implemented other
things have to be solved and at that time this assert can be removed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]