[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2013-03-22 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610898#comment-13610898
 ] 

Commit Tag Bot commented on LUCENE-3030:


[branch_4x commit] Michael McCandless
http://svn.apache.org/viewvc?view=revision&revision=1382141

LUCENE-3030: add missing CHANGES entry


> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0-ALPHA
>
> Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088268#comment-13088268
 ] 

Uwe Schindler commented on LUCENE-3030:
---

Awesome!

> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088248#comment-13088248
 ] 

Michael McCandless commented on LUCENE-3030:


Thanks Robert -- looks good!  I'll commit shortly.

> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-19 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087962#comment-13087962
 ] 

Dawid Weiss commented on LUCENE-3030:
-

bq. awesome work mike!

I couldn't agree more, this is great stuff.

> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-19 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087943#comment-13087943
 ] 

Simon Willnauer commented on LUCENE-3030:
-

bq. There is a small hit to indexing perf here, somewhere between 0 - 5% or so 
depending on the run. Given the gains for MTQs I think this is an OK tradeoff. 
We can further optimize the BlockTreeTermsWriter later

I think 0 - 5 % is totally fine. if somebody is totally obsessed by indexing 
throughput they should overclock :)

awesome work mike!

> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087797#comment-13087797
 ] 

Michael McCandless commented on LUCENE-3030:


There is a small hit to indexing perf here, somewhere between 0 - 5% or so 
depending on the run.  Given the gains for MTQs I think this is an OK tradeoff. 
 We can further optimize the BlockTreeTermsWriter later

> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-02 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078400#comment-13078400
 ] 

Simon Willnauer commented on LUCENE-3030:
-

bq. These are huge speedups for the terms-dict intensive queries!
oh boy! This is awesome!

> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078337#comment-13078337
 ] 

Michael McCandless commented on LUCENE-3030:


Here's the graph of the results:

!BlockTree.png!


> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch, LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076199#comment-13076199
 ] 

Michael McCandless commented on LUCENE-3030:


bq. One trivial thing we might want to do is to add the logic currently in AQ's 
ctor to CA, so that you ask CA for its termsenum.

+1 -- I think CA should serve up a TermsEnum when provided a Terms?

> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076178#comment-13076178
 ] 

Robert Muir commented on LUCENE-3030:
-

Also, we should measure if a "prefix automaton" with intersect() is faster than 
PrefixTermsEnum (I suspect it could be!)

If this is true, we might want to not rewrite to prefixtermsenum anymore, 
instead changing PrefixQuery to extend AutomatonQuery too.

> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076177#comment-13076177
 ] 

Robert Muir commented on LUCENE-3030:
-

This is awesome, i really like adding the intersect() hook!

Thanks for making a branch, I will check it out and try to dive in to help with 
some of this  :)

One trivial thing we might want to do is to add the logic currently in AQ's 
ctor to CA, so that you ask CA for its termsenum.
this way, if it can be accomplished with a simpler enum like just 
terms.iterator() or prefixtermsenum etc etc we get that optimization always.

> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3030) Block tree terms dict & index

2011-08-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073800#comment-13073800
 ] 

Michael McCandless commented on LUCENE-3030:


I created a branch 
https://svn.apache.org/repos/asf/lucene/dev/branches/blocktree_3030 for 
iterating on this.

> Block tree terms dict & index
> -
>
> Key: LUCENE-3030
> URL: https://issues.apache.org/jira/browse/LUCENE-3030
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, 
> LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org