[jira] Commented: (LUCENE-1410) PFOR implementation

Paul Elschot (JIRA) Sat, 08 Nov 2008 02:39:41 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645973#action_12645973
 ]


Paul Elschot commented on LUCENE-1410:
--------------------------------------

I've been at this quite irregularly. I'm trying to give the PFor class a more 
OO interface and to get exception patching working at more decent speeds. In 
case someone else wants to move this forward faster than it is moving now, 
please holler.

After rereading this, and also after reading up a bit on MonetDB performance 
improvement techniques, I have a few more rants:

Taking another look at the decompression performance figures, and especially 
the differences between native C++ and Java, it could become worthwhile to also 
implement TermQuery in native code.

With the high decompression speeds of FOR/BITS at lower numbers of frame bits 
it might also become worthwhile to compress character data, for example numbers 
written with only a few distinct characters.
Adding a dictionary as in PDICT might help compression even further.
This was probably one of the reasons for the column storage discussed earlier; 
I'm now sorry I ignored that discussion.
In the index itself, column storage is also useful. One example is the 
splitting of document numbers and frequency into separate streams, another 
example is various offsets for seeking in the index.
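
To make the character data idea concrete, here is a minimal Java sketch of a
per-column character dictionary. It is not PDICT itself and not part of any
patch on this issue; the class name CharDictionary and its methods are
hypothetical. It only shows that a column using few distinct characters needs
few bits per character once each character is replaced by a small code, after
which the codes could be handed to a FOR/BITS style bit packer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: maps each distinct character to a small integer code.
final class CharDictionary {
    private final Map<Character, Integer> codes = new LinkedHashMap<>();
    final char[] symbols;   // code -> character
    final int bitsPerCode;  // frame bits needed for the largest code

    CharDictionary(String sample) {
        if (sample.isEmpty()) {
            throw new IllegalArgumentException("sample must not be empty");
        }
        // Assign codes in order of first appearance.
        for (int i = 0; i < sample.length(); i++) {
            char c = sample.charAt(i);
            if (!codes.containsKey(c)) {
                codes.put(c, codes.size());
            }
        }
        symbols = new char[codes.size()];
        for (Map.Entry<Character, Integer> e : codes.entrySet()) {
            symbols[e.getValue()] = e.getKey();
        }
        // Bits needed to represent the largest code, with a minimum of 1.
        bitsPerCode = Math.max(1, 32 - Integer.numberOfLeadingZeros(codes.size() - 1));
    }

    // Replaces each character by its code; fails on characters not in the dictionary.
    int[] encode(String text) {
        int[] out = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Integer code = codes.get(text.charAt(i));
            if (code == null) {
                throw new IllegalArgumentException("character not in dictionary: " + text.charAt(i));
            }
            out[i] = code;
        }
        return out;
    }

    // Inverse of encode.
    String decode(int[] encoded) {
        StringBuilder sb = new StringBuilder(encoded.length);
        for (int code : encoded) {
            sb.append(symbols[code]);
        }
        return sb.toString();
    }
}
```

With a dictionary built from the ten digits and a decimal point, bitsPerCode is
4, compared to 16 bits for a Java char.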

I think it would be worthwhile to add a compressed integer array to the basic 
types used in IndexInput and IndexOutput. I'm still struggling with the addition 
of skip info into a tree of such compressed integer arrays (skip offsets
don't seem to fit naturally into a column, and I don't know whether the skip 
size should be the same as the decompressed array size).
Placement of such compressed arrays in the index should also be aware of CPU 
cache lines and of VM page (disk block) boundaries.
In higher levels of a tree of such compressed arrays, frame exceptions would be 
best avoided to allow direct addressing, but the leaves could use frame 
exceptions for better compression.
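
The direct addressing point can be sketched in Java as follows. This is only an
illustration under my own assumptions, not the PFor class from the patches: the
class name DirectFrame and its layout (64 bit words, minimum value as the frame
base) are hypothetical. Because every value occupies exactly frameBits bits and
there are no exceptions, value i always starts at bit i * frameBits and can be
read without decompressing anything before it.

```java
// Hypothetical sketch: an exception-free frame of reference with random access.
final class DirectFrame {
    private final long[] words;   // packed offsets, 64 bits per word
    private final int frameBits;  // bits per value, between 1 and 32
    private final int base;       // frame of reference: the minimum value
    private final int size;

    DirectFrame(int[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        int min = values[0];
        int max = values[0];
        for (int v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        long range = (long) max - min;
        // Bits needed for the largest offset from the base, with a minimum of 1.
        frameBits = Math.max(1, 64 - Long.numberOfLeadingZeros(range));
        base = min;
        size = values.length;
        words = new long[(int) (((long) size * frameBits + 63) / 64)];
        for (int i = 0; i < size; i++) {
            long offset = (long) values[i] - min;
            long bitPos = (long) i * frameBits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            words[word] |= offset << shift;
            if (shift + frameBits > 64) {
                // The value straddles two words: store the high bits in the next word.
                words[word + 1] |= offset >>> (64 - shift);
            }
        }
    }

    int frameBits() {
        return frameBits;
    }

    // Direct addressing: the position of value i follows from i alone.
    int get(int i) {
        if (i < 0 || i >= size) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        long bitPos = (long) i * frameBits;
        int word = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = words[word] >>> shift;
        if (shift + frameBits > 64) {
            v |= words[word + 1] << (64 - shift);
        }
        long mask = (1L << frameBits) - 1;
        return (int) ((v & mask) + base);
    }
}
```

With frame exceptions, a single large value no longer forces a larger
frameBits for the whole frame, but each exception has to be patched back in
before values can be read, which is why this kind of direct lookup suits the
higher levels of the tree better than the leaves.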

For terms that will occur at most once in one document, more compression is 
possible, so it might be worthwhile to add these as a key. At the moment I have 
no idea how to enforce the restriction of at most once though.



> PFOR implementation
> -------------------
>
>                 Key: LUCENE-1410
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1410
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Other
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch, 
> LUCENE-1410d.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, 
> TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
