[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12977353#action_12977353 ] Michael McCandless commented on LUCENE-2829: I think we should commit this, and if/when LUCENE-2694 and/or LUCENE-2831 are committed to 3.x, we can revisit it. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974212#action_12974212 ] Yonik Seeley commented on LUCENE-2829: -- Why not keep the TermState cache and use it for all queries except MTQ, while using a different mechanism for MTQ to avoid trashing the cache? The cache has a number of advantages that may never be duplicated in a different type of API, including - actually cache frequently used terms across different requests - cache terms reused in the same request. term proximity boosting is an example: +united +states united states^10 improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974223#action_12974223 ] Robert Muir commented on LUCENE-2829: - bq. edit: and as robert previously pointed out, if we cached misses as well, then we could avoid needless seeks on segments that don't contain the term. True, this is a good idea, just a little tricker: * In trunk, we have TermsEnum.seek(BytesRef text, boolean useCache), defaulting to true. * FilteredTermsEnum passes false here, so the multitermqueries don't populate the cache with garbage while enumerating (eg foo*), only explicitly at the end with cacheTerm() (per-segment) for the ones that were actually accepted. They sum up their docFreq themselves to prevent the first wasted seek in TermQuery. * So this solution would make MTQ worse, as it would cause them to trash the caches in the second wasted seek (the docsenum) where they do not today, with negative entries for the segments where the term doesn't exist. Today they do this wasted seek, but they don't trash the cache here. The only solution to prevent that is the PerReaderTermState (or something equally complicated). * We would have to look at other places where negative entries would hurt, for example rebuilding spellcheck indexes uses this 'termExists()' method implemented with docFreq. So we would have to likely change spellcheck's code to use a TermsEnum and seek(term, false)... using a termsenum in parallel with the spellcheck dictionary would obviously be more efficient for the index-based spellcheck case (forget about caching) versus docFreq()'ing every term... *but* we cannot assume the spellcheck Dictionary is actually in term order, (imagine the File-based dictionary case), so we can't implement this today. On 3.x i think its slightly less complicated as there is already a hack in the cache to prevent sequential termsenums from trashing it (e.g. foo*), and pretty much all the MTQs just enumerate sequentially anyway... (except NRQ which doesn't enum many terms anyway, likely not a problem). But we would have to at least fix the spellcheck case there too I think. Not saying I don't like your idea... just saying there's more work to do it. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974229#action_12974229 ] Robert Muir commented on LUCENE-2829: - On further thought Yonik, your idea is really completely unrelated. We shouldn't be seeking to terms/relying upon the terms dictionary cache internally when we don't need to... whether or not its populated with negative entries for the more general case is unrelated, even if we go that route we shouldn't be lazy and rely upon that. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974262#action_12974262 ] Michael McCandless commented on LUCENE-2829: bq. The cache has a number of advantages that may never be duplicated in a different type of API +1 -- I agree we should keep the TermState cache. It has benefits outside of re-use win a single query. But allowing term-lookup-intensive clients like MTQ to do their own caching (ie pulling the TermState from the enum) is also important. I think we need both. On caching misses... that makes me nervous. If there are apps out there that do alot of checking for terms that don't exist that can destroy the cache. The cache is a great safety net but I think our core queries should be good consumers, when possible, and hold their own TermState. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974274#action_12974274 ] Earwin Burrfoot commented on LUCENE-2829: - Term lookup misses can be alleviated by a simple Bloom Filter. No caching misses required, helps both PK and near-PK queries. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974310#action_12974310 ] Robert Muir commented on LUCENE-2829: - Bloom filters and negative caches are nice, but please open separate issues! I am starting to feel like its mandatory to refactor the entirety of lucene to make a single incremental improvement. So, I'd like to proceed with this issue as-is, to make TermWeight explicitly do less seeks. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974350#action_12974350 ] Earwin Burrfoot commented on LUCENE-2829: - Nobody halts your progress, we're merely discussing. I, on the other hand, have a feeling that Lucene is overflowing with single incremental improvements aka hacks, as they are easier and faster to implement than trying to get a bigger picture, and, yes, rebuilding everything :) For example, better term dict code will make this issue (somewhat hackish, admit it?) irrelevant. Whether we implement bloom filters, or just guarantee to keep the whole term dict in memory with reasonable lookup routine (eg. as FST). Having said that, I reiterate, I'm not here to stop you or turn this issue into something else. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974354#action_12974354 ] Robert Muir commented on LUCENE-2829: - bq. For example, better term dict code will make this issue (somewhat hackish, admit it?) irrelevant. Right, it is hackish, but what is a worse hack is wasted seeks in our next 3.1 release because we can't keep scope under control and fix small problems without rewriting everything, which means less gets backported to our stable branch. Anyway, I'm just gonna mark this won't fix so I don't have to deal with it anymore. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12973869#action_12973869 ] Michael McCandless commented on LUCENE-2829: I made a random PK lookup tester (committed to luceneutil), to lookup by docid (unique key) from the luceneutil index. Pre-patch it's 53 usec per lookup and with this patch it's 31 usec -- ~42% faster! improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12973877#action_12973877 ] Robert Muir commented on LUCENE-2829: - right, we just have to not do stupid things like hash hashcodes to make it faster for when the data is hot... but as a start this is safe, hopefully we could do something non-invasive (and backportable) to make it faster. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12973882#action_12973882 ] Hoss Man commented on LUCENE-2829: -- The patch is over my head, but providing a super optimized solution to the primary key type lookup problem definitely seems worthwhile -- it has me wondering if a PrimaryKeyQuery that works like TermQuery bug quits collecting as soon as it finds one matching document would be a good idea? improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12973889#action_12973889 ] Robert Muir commented on LUCENE-2829: - Hoss Man, well I think if you surely know its a PK field you can definitely do something better, starting with a custom collector that does something like what you mentioned, with no PQ at all etc. But in this case, though i categorized it as PK, the general problem is this: * in lots of cases we do redundant seeks, like to get the docFreq, then to get the DocsEnum * in most cases the term dictionary cache helps here because the 2nd time (e.g. getting DocsEnum) is cached. Here's the problem with PK or PK-ish (low freq terms like what wildcards/fuzzies/range queries hit too): * our cache doesn't cache negative hits, the fact that a term *doesnt* exist in some segment. * For example in the PK case, if there are 15 segments we always get at most 1 cache hit and at least 14 misses when getting the DocsEnum, so we do at least 14 wasted seeks always. * For other low frequency terms that don't exist in all segments (very precise dates or what have you) the same idea applies, just to a lesser extent: the PK is the worst. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12973892#action_12973892 ] Robert Muir commented on LUCENE-2829: - bq. I think a cleaner interface may be for the Weight.scorer method to receive the ord of the sub reader in the parent? Yes, ideally with the actual df in there. This would save the third seek in the bulkpostings branch. But at the same time, i'm worried/don't want this issue to evolve into TermState (LUCENE-2694). I wasn't thinking that this was any kind of end-solution but just an approach we could take that would work against 3.1 improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12973908#action_12973908 ] Robert Muir commented on LUCENE-2829: - bq. But, then, passing a struct (parent/sub/ord) is a fairly small change, and, if it matches the change we will make on LUCENE-2694, then that's great. Ok, that might be a good approach, to fix the it this way in LUCENE-2694 (or actually, preferably add the parent/sub/ord in its own issue!), but in 3.1 we could use the struct to avoid wasted seeks on PK terms... Seems like backporting the entire termstate thing could be a little tricky/risky for 3.1, with not much to gain there except PK lookups anyway, since the multitermqueries there tend to be slow (dominated by term comparison) and don't even work per-segment anyway. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org