Re: How to use BitDocSet within a PostFilter
Hi,

    inStockSkusBitSet.get(currentChildDocNumber)

Is that child a Lucene doc ID? If yes, does it include the offset? Every index segment starts at a different point, but the docs within it are numbered from zero. So to check them against the full-index bitset, I'd be doing

    Bitset.exists(indexBase + docid)

Just one thing to check.

Roman

On Aug 3, 2015 1:24 AM, Stephen Weiss steve.we...@wgsn.com wrote:
> [...]
> I'm trying to use it either of these ways:
>
>     if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
>     if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;
>
> The currentChildDocNumber is simply the docNumber which is passed to
> DelegatingCollector.collect, decremented until I hit a document that
> doesn't belong to the parent document.
> [...]
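Roman's point about segment offsets can be illustrated with a small simulation. This is only a sketch: it uses java.util.BitSet as a stand-in for the FixedBitSet behind a BitDocSet, and the segment layout, the class name SegmentOffsetDemo and the helper toGlobal are all made up for illustration.

```java
import java.util.BitSet;

final class SegmentOffsetDemo {
    // Convert a segment-local doc ID to an index-wide doc ID.
    static int toGlobal(int docBase, int segmentDocId) {
        return docBase + segmentDocId;
    }

    public static void main(String[] args) {
        // Hypothetical index: segment 0 holds global docs 0..99,
        // segment 1 holds global docs 100..199.
        int[] docBases = {0, 100};

        // Index-wide bitset (assumed stand-in for the FixedBitSet from getBits()).
        BitSet inStock = new BitSet(200);
        inStock.set(toGlobal(docBases[1], 5)); // global doc 105 is in stock

        // What a collector would see while working through segment 1.
        int segmentDocId = 5;

        // Without the offset, the wrong bit is tested.
        System.out.println(inStock.get(segmentDocId));                        // false
        // With the offset, the intended bit is tested.
        System.out.println(inStock.get(toGlobal(docBases[1], segmentDocId))); // true
    }
}
```

The set has the right cardinality either way, which would be consistent with the size check in the question passing while individual lookups miss.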
Re: How to use BitDocSet within a PostFilter
Yes, that was it. Had no idea this was an issue!

On Monday, August 3, 2015, Roman Chyla roman.ch...@gmail.com wrote:
> Hi,
>
>     inStockSkusBitSet.get(currentChildDocNumber)
>
> Is that child a lucene id? If yes, does it include offset? Every index
> segment starts at a different point, but docs are numbered from zero.
> So to check them against the full index bitset, I'd be doing
>
>     Bitset.exists(indexBase + docid)
>
> Just one thing to check
>
> Roman
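For later readers, the shape of the fix confirmed above might look like the sketch below. It is a plain-Java simulation, not Solr code: OffsetAwareCollector, setNextSegment and demo are hypothetical names. In Solr 5.x, DelegatingCollector is understood to keep a protected docBase field that is updated in doSetNextReader, but treat that as an assumption to verify against the version in use.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for a collector; not a Solr class.
final class OffsetAwareCollector {
    private final BitSet globalFilter; // index-wide bits, one per global doc ID
    private int docBase;               // offset of the segment currently being collected
    final List<Integer> accepted = new ArrayList<>();

    OffsetAwareCollector(BitSet globalFilter) {
        this.globalFilter = globalFilter;
    }

    // Plays the role of doSetNextReader: called once per segment.
    void setNextSegment(int docBase) {
        this.docBase = docBase;
    }

    // Plays the role of collect: receives a segment-local doc ID.
    void collect(int segmentDocId) {
        int globalDocId = docBase + segmentDocId;
        if (globalFilter.get(globalDocId)) {
            accepted.add(globalDocId);
        }
    }

    static List<Integer> demo() {
        BitSet filter = new BitSet();
        filter.set(3);
        filter.set(12);

        OffsetAwareCollector collector = new OffsetAwareCollector(filter);
        collector.setNextSegment(0);
        collector.collect(3);  // global 3: kept
        collector.collect(4);  // global 4: dropped
        collector.setNextSegment(10);
        collector.collect(2);  // global 12: kept
        collector.collect(3);  // global 13: dropped, although local ID 3 is a set bit
        return collector.accepted;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [3, 12]
    }
}
```

The last collect call is the interesting one: testing the local ID 3 directly would have wrongly kept that document.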
How to use BitDocSet within a PostFilter
Hi everyone,

I'm trying to write a PostFilter for Solr 5.1.0, which is meant to crawl through grandchild documents during a search through the parents and filter out documents based on statistics gathered from aggregating the grandchildren together. I've been successful in getting the logic correct, but it does not perform well: I'm grabbing too many documents from the index along the way. To reduce the number of document objects pulled from the IndexReader, I'm trying to skip grandchild documents that are not relevant to the statistics I'm collecting.

I've implemented the following code in my DelegatingCollector.collect:

    if (inStockSkusBitSet == null) {
        SolrIndexSearcher SidxS = (SolrIndexSearcher) idxS; // type cast from IndexSearcher to expose getDocSet.
        inStockSkusDocSet = SidxS.getDocSet(inStockSkusQuery);
        inStockSkusBitDocSet = (BitDocSet) inStockSkusDocSet; // type cast from DocSet to expose getBits.
        inStockSkusBitSet = inStockSkusBitDocSet.getBits();
    }

My BitDocSet reports a size that matches a standard query for the more limited set of grandchildren, and the FixedBitSet (inStockSkusBitSet) reports the same cardinality. Based on that, the getDocSet call itself seems to be working properly and returning the right number of documents.

However, when I try to filter out grandchild documents using either BitDocSet.exists or BitSet.get (passing over any grandchild document that doesn't exist in the BitDocSet, or for which the bitset doesn't return true), I get about a third fewer results than I'm supposed to. Many documents that should match the filter are being excluded, and documents that should not match the filter are being included.

I'm trying to use it in either of these ways:

    if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
    if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;

The currentChildDocNumber is simply the docNumber passed to DelegatingCollector.collect, decremented until I hit a document that doesn't belong to the parent document.

I can't seem to figure out a way to actually use the BitDocSet (or its derivatives) to quickly eliminate document IDs. It seems like this is how it's supposed to be used. What am I getting wrong?

Sorry if this is a newbie question. I've never written a PostFilter before, and frankly, the documentation out there is a little sketchy (mostly for version 4). So many classes have changed names, and so many of the better-documented techniques are deprecated or removed now, that it's tough to follow what the current best practice actually is. I'm using the block join functionality heavily, so I'm trying to keep more current than that. I would be happy to send along the full source privately if it would help figure this out, and I plan to write up some more elaborate instructions (updated for Solr 5) for the next person who decides to write a PostFilter and work with block joins, if I ever manage to get this performing well enough.

Thanks for any pointers! I'm totally open to doing this an entirely different way. I read that DocValues might be a more elegant approach, but currently that would require reindexing, so I'm trying to avoid that.

Also, I've been wondering whether the query above would read from the filter cache or not. The query is constructed like this:

    private Term inStockTrueTerm = new Term("sku_history.is_in_stock", "T");
    private Term objectTypeSkuHistoryTerm = new Term("object_type", "sku_history");
    ...
    inStockTrueTermQuery = new TermQuery(inStockTrueTerm);
    objectTypeSkuHistoryTermQuery = new TermQuery(objectTypeSkuHistoryTerm);
    inStockSkusQuery = new BooleanQuery();
    inStockSkusQuery.add(inStockTrueTermQuery, BooleanClause.Occur.MUST);
    inStockSkusQuery.add(objectTypeSkuHistoryTermQuery, BooleanClause.Occur.MUST);

--
Steve
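The backward walk described in the message (decrementing from the collected parent until a document from another block is reached) can be sketched as follows. This is an illustration only: it assumes a bitset marking which segment-local doc IDs are parents, uses java.util.BitSet in place of any Lucene type, and BlockWalkDemo and blockMembers are hypothetical names. It relies on block-join indexing storing a parent's descendants immediately before it.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class BlockWalkDemo {
    // Returns the segment-local IDs of the documents in the block ending at
    // parentDoc, walking backward until the previous parent (or the start of
    // the segment) is reached.
    static List<Integer> blockMembers(BitSet parents, int parentDoc) {
        List<Integer> members = new ArrayList<>();
        // previousSetBit returns -1 when no earlier parent exists.
        int previousParent = parents.previousSetBit(parentDoc - 1);
        for (int doc = parentDoc - 1; doc > previousParent; doc--) {
            members.add(doc);
        }
        return members;
    }

    public static void main(String[] args) {
        // Hypothetical segment: docs 0..2 belong to parent 3, docs 4..6 to parent 7.
        BitSet parents = new BitSet();
        parents.set(3);
        parents.set(7);

        System.out.println(blockMembers(parents, 7)); // [6, 5, 4]
        System.out.println(blockMembers(parents, 3)); // [2, 1, 0]
    }
}
```

The IDs produced this way are segment-local, so they would still need the segment's offset added before being tested against an index-wide bitset.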