[
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645973#action_12645973
]
Paul Elschot commented on LUCENE-1410:
--------------------------------------
I've been at this quite irregularly. I'm trying to give the PFor class a more
OO interface and to get exception patching working at more decent speeds. In
case someone else wants to move this forward faster than it is moving now,
please holler.
After rereading this, and also after reading up a bit on MonetDB performance
improvement techniques, I have a few more rants:
Taking another look at the decompression performance figures, and especially
the differences between native C++ and Java, it could become worthwhile to also
implement TermQuery in native code.
With the high decompression speeds of FOR/BITS at lower numbers of frame bits,
it might also become worthwhile to compress character data, for example numbers
written with only a few distinct characters.
Adding a dictionary as in PDICT might help compression even further.
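To make the character data idea concrete, here is a minimal sketch, assuming a per-value dictionary whose codes are packed at a fixed number of frame bits. The class CharDictSketch and all of its members are hypothetical names for illustration; none of this is in the attached patches, and it is not the PDICT layout itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Lucene code: dictionary codes packed at a fixed bit width. */
final class CharDictSketch {
    final char[] dictionary; // code -> character, in order of first appearance
    final int bitsPerCode;   // the "frame bits" used for every code
    final long[] packed;     // codes packed back to back, lowest bits first
    final int length;        // number of encoded characters

    CharDictSketch(String text) {
        Map<Character, Integer> codeOf = new HashMap<>();
        List<Character> dict = new ArrayList<>();
        int[] codes = new int[text.length()];
        for (int i = 0; i < codes.length; i++) {
            char c = text.charAt(i);
            Integer code = codeOf.get(c);
            if (code == null) {          // first time this character is seen
                code = dict.size();
                codeOf.put(c, code);
                dict.add(c);
            }
            codes[i] = code;
        }
        dictionary = new char[dict.size()];
        for (int i = 0; i < dictionary.length; i++) {
            dictionary[i] = dict.get(i);
        }
        // Bits needed for the largest code, with a minimum of one bit.
        int maxCode = Math.max(1, dict.size() - 1);
        bitsPerCode = 32 - Integer.numberOfLeadingZeros(maxCode);
        length = codes.length;
        packed = new long[(int) (((long) length * bitsPerCode + 63) / 64)];
        for (int i = 0; i < length; i++) {
            long bitPos = (long) i * bitsPerCode;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[word] |= ((long) codes[i]) << shift;
            if (shift + bitsPerCode > 64) { // code straddles two words
                packed[word + 1] |= ((long) codes[i]) >>> (64 - shift);
            }
        }
    }

    /** Direct addressing: every code sits at a computable bit position. */
    char charAt(int i) {
        long bitPos = (long) i * bitsPerCode;
        int word = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long bits = packed[word] >>> shift;
        if (shift + bitsPerCode > 64) {
            bits |= packed[word + 1] << (64 - shift);
        }
        return dictionary[(int) (bits & ((1L << bitsPerCode) - 1))];
    }

    String decode() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

For a value such as "2008-11-07 12:45:59", which uses 11 distinct characters, this packs each character into 4 bits rather than the 16 bits of a Java char. The dictionary itself is not counted in that figure.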
This was probably one of the reasons for the column storage discussed earlier,
I'm now sorry I ignored that discussion.
In the index itself, column storage is also useful. One example is the
splitting of document numbers and frequency into separate streams, another
example is various offsets for seeking in the index.
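As an illustration of the separate streams, here is a minimal sketch, assuming postings arrive as interleaved (document number, frequency) pairs. PostingColumnsSketch is a hypothetical name and this is not Lucene's actual file format; it only shows the split into two integer columns that could each be compressed on their own.

```java
/** Hypothetical sketch, not Lucene's file format: postings split into two columns. */
final class PostingColumnsSketch {
    final int[] docDeltas; // gaps between successive document numbers
    final int[] freqs;     // within-document term frequencies

    /** Expects pairs {doc0, freq0, doc1, freq1, ...} with increasing document numbers. */
    PostingColumnsSketch(int[] interleaved) {
        int n = interleaved.length / 2;
        docDeltas = new int[n];
        freqs = new int[n];
        int previousDoc = 0;
        for (int i = 0; i < n; i++) {
            int doc = interleaved[2 * i];
            docDeltas[i] = doc - previousDoc; // small gaps need few frame bits
            freqs[i] = interleaved[2 * i + 1];
            previousDoc = doc;
        }
    }

    /** Rebuilds the document numbers by summing the gaps. */
    int[] docs() {
        int[] docs = new int[docDeltas.length];
        int doc = 0;
        for (int i = 0; i < docs.length; i++) {
            doc += docDeltas[i];
            docs[i] = doc;
        }
        return docs;
    }
}
```

A query that only needs document numbers can then skip the frequency column entirely, and each column can get its own number of frame bits.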
I think it would be worthwhile to add a compressed integer array to the basic
types used in IndexInput and IndexOutput. I'm still struggling with the addition
of skip info into a tree of such compressed integer arrays (skip offsets
don't seem to fit naturally into a column, and I don't know whether the skip
size should be the same as the decompressed array size).
Placement of such compressed arrays in the index should also be aware of CPU
cache lines and of VM page (disk block) boundaries.
In higher levels of a tree of such compressed arrays, frame exceptions would be
best avoided to allow direct addressing, but the leaves could use frame
exceptions for better compression.
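The trade-off between exceptions and direct addressing can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in the patch: PatchedFrameSketch is a hypothetical name, the slots are left in a plain int array rather than bit packed, and exception positions are kept in a separate sorted array, which may differ from how the patch lays them out.

```java
import java.util.Arrays;

/** Hypothetical sketch of one patched frame; slots are left unpacked for readability. */
final class PatchedFrameSketch {
    final int base;               // frame of reference: the smallest value
    final int frameBits;          // every slot value is below 2^frameBits
    final int[] slots;            // value - base, or 0 at an exception position
    final int[] exceptionIndexes; // sorted positions of the values that did not fit
    final int[] exceptionValues;  // the original values at those positions

    /** Assumes a non-empty array of non-negative values and 1 <= frameBits <= 31. */
    PatchedFrameSketch(int[] values, int frameBits) {
        int min = values[0];
        for (int v : values) {
            min = Math.min(min, v);
        }
        long limit = 1L << frameBits;
        int[] idx = new int[values.length];
        int[] val = new int[values.length];
        int count = 0;
        slots = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            long offset = (long) values[i] - min;
            if (offset < limit) {
                slots[i] = (int) offset;  // fits in the frame
            } else {
                idx[count] = i;           // does not fit: record an exception
                val[count] = values[i];
                count++;
            }
        }
        this.base = min;
        this.frameBits = frameBits;
        exceptionIndexes = Arrays.copyOf(idx, count);
        exceptionValues = Arrays.copyOf(val, count);
    }

    /** Bulk decode: unpack every slot first, then patch the exceptions. */
    int[] decode() {
        int[] out = new int[slots.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = base + slots[i];
        }
        for (int e = 0; e < exceptionIndexes.length; e++) {
            out[exceptionIndexes[e]] = exceptionValues[e];
        }
        return out;
    }

    /** Single lookup: a plain slot read is only enough when there are no exceptions. */
    int get(int i) {
        if (exceptionIndexes.length > 0) {
            int e = Arrays.binarySearch(exceptionIndexes, i);
            if (e >= 0) {
                return exceptionValues[e];
            }
        }
        return base + slots[i];
    }
}
```

Without exceptions, get is a single slot read, which is what the higher tree levels would want. With exceptions, every lookup in this sketch pays for a search, while bulk decoding in the leaves only pays for one extra patching loop.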
For terms that will occur at most once in one document, more compression is
possible, so it might be worthwhile to add these as a key. At the moment I have
no idea how to enforce the restriction of at most once, though.
> PFOR implementation
> -------------------
>
> Key: LUCENE-1410
> URL: https://issues.apache.org/jira/browse/LUCENE-1410
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Other
> Reporter: Paul Elschot
> Priority: Minor
> Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch,
> LUCENE-1410d.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java,
> TestPFor2.java
>
> Original Estimate: 21840h
> Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.