Re: How to use BitDocSet within a PostFilter

2015-08-03 Thread Roman Chyla
Hi,
inStockSkusBitSet.get(currentChildDocNumber)

Is that child number a Lucene doc ID? If so, does it include the segment
offset? Every index segment starts at a different offset in the full index,
but within each segment docs are numbered from zero. So to check them against
the full index bitset, I'd be doing
Bitset.exists(indexBase + docid)

Just one thing to check
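
A minimal sketch of that offset arithmetic, assuming java.util.BitSet as a
stand-in for the index-wide bitset. The class name, the docBase value and the
doc numbers are made up for illustration; in Lucene 5.x the per-segment offset
is exposed as LeafReaderContext.docBase.

```java
import java.util.BitSet;

// Illustrative stand-in only: java.util.BitSet plays the role of the
// index-wide bitset, and docBase the role of a segment's starting offset.
class SegmentOffsetSketch {

    // Converts a segment-local doc number to an index-wide doc ID.
    static int toGlobalDocId(int docBase, int segmentDocId) {
        return docBase + segmentDocId;
    }

    // Membership test against a bitset keyed by index-wide doc IDs.
    static boolean inSet(BitSet globalBits, int docBase, int segmentDocId) {
        return globalBits.get(toGlobalDocId(docBase, segmentDocId));
    }

    public static void main(String[] args) {
        BitSet globalBits = new BitSet();
        globalBits.set(5);    // a match in the first segment (docBase 0)
        globalBits.set(1003); // a match in the second segment (docBase 1000)

        int docBase = 1000;   // hypothetical offset of the second segment

        // Local doc 3 of the second segment is global doc 1003, a real match,
        // but a lookup without the offset tests global doc 3 instead.
        System.out.println(globalBits.get(3) + " vs " + inSet(globalBits, docBase, 3));

        // Local doc 5 of the second segment is global doc 1005, not a match,
        // but a lookup without the offset tests global doc 5 instead.
        System.out.println(globalBits.get(5) + " vs " + inSet(globalBits, docBase, 5));
    }
}
```

Skipping the offset can go wrong in both directions, which would fit the
symptom of some matching docs being dropped while non-matching ones slip
through.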

Roman

Re: How to use BitDocSet within a PostFilter

2015-08-03 Thread Stephen Weiss
Yes, that was it.  I had no idea this was an issue!


How to use BitDocSet within a PostFilter

2015-08-02 Thread Stephen Weiss
Hi everyone,

I'm trying to write a PostFilter for Solr 5.1.0, which is meant to crawl 
through grandchild documents during a search through the parents and filter out 
documents based on statistics gathered from aggregating the grandchildren 
together.  I've been successful in getting the logic correct, but it does not 
perform so well - I'm grabbing too many documents from the index along the way. 
 I'm trying to filter out grandchild documents which are not relevant to the 
statistics I'm collecting, in order to reduce the number of document objects 
pulled from the IndexReader.

I've implemented the following code in my DelegatingCollector.collect:

if (inStockSkusBitSet == null) {
    // Type cast from IndexSearcher to expose getDocSet.
    SolrIndexSearcher SidxS = (SolrIndexSearcher) idxS;
    inStockSkusDocSet = SidxS.getDocSet(inStockSkusQuery);
    // Type cast from DocSet to expose getBits.
    inStockSkusBitDocSet = (BitDocSet) inStockSkusDocSet;
    inStockSkusBitSet = inStockSkusBitDocSet.getBits();
}


My BitDocSet reports a size which matches a standard query for the more limited 
set of grandchildren, and the FixedBitSet (inStockSkusBitSet) reports the 
same cardinality.  Based on that, it seems that the getDocSet call itself 
is working properly and returning the right number of documents.  
However, when I try to filter out grandchild documents using either 
BitDocSet.exists or BitSet.get (skipping any grandchild document which 
doesn't exist in the BitDocSet, or for which the bitset doesn't return true), 
I get about a third fewer results than I'm supposed to.  It seems many 
documents that should match the filter are being excluded, and documents 
which should not match the filter are being included.

I'm trying to use it in either of these ways:

if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;

The currentChildDocNumber is simply the docNumber which is passed to 
DelegatingCollector.collect, decremented until I hit a document that doesn't 
belong to the parent document.
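
For illustration, the backward walk described above might look like the
following sketch. It is not the actual collector code: java.util.BitSet stands
in for both the parent bitset and the in-stock bitset, the names
(ChildWalkSketch, matchingDescendants, parentBits, globalFilterBits) are
hypothetical, and it assumes that parentBits marks top-level parents using
segment-local doc numbers while globalFilterBits is keyed by index-wide doc
IDs. It relies on the block join layout, where the children and grandchildren
of a parent are indexed immediately before it in the same segment.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative stand-in only; not Solr or Lucene API.
class ChildWalkSketch {

    // Returns the index-wide IDs of the docs in parentDoc's block that are
    // also set in globalFilterBits, in descending doc order.
    static List<Integer> matchingDescendants(BitSet parentBits,
                                             BitSet globalFilterBits,
                                             int docBase,
                                             int parentDoc) {
        List<Integer> matches = new ArrayList<>();
        // The previous parent marks where this block starts; previousSetBit
        // returns -1 when there is no earlier parent in the segment.
        int prevParent = parentBits.previousSetBit(parentDoc - 1);
        for (int doc = parentDoc - 1; doc > prevParent; doc--) {
            // Shift the segment-local number before consulting the bitset.
            int globalDoc = docBase + doc;
            if (!globalFilterBits.get(globalDoc)) continue;
            matches.add(globalDoc);
        }
        return matches;
    }

    public static void main(String[] args) {
        BitSet parentBits = new BitSet();
        parentBits.set(3); // first parent in the segment (docs 0-2 are its block)
        parentBits.set(6); // second parent (docs 4-5 are its block)

        BitSet globalFilterBits = new BitSet();
        globalFilterBits.set(101);
        globalFilterBits.set(104);

        int docBase = 100; // hypothetical offset of this segment
        System.out.println(matchingDescendants(parentBits, globalFilterBits, docBase, 6));
    }
}
```

Only the cheap bitset lookup runs for docs that are filtered out, so no
document has to be loaded from the reader for them.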

I can't seem to figure out a way to actually use the BitDocSet (or its 
derivatives) to quickly eliminate document IDs.  It seems like this is how it's 
supposed to be used.  What am I getting wrong?

Sorry if this is a newbie question.  I've never written a PostFilter before, 
and frankly, the documentation out there is a little sketchy (mostly for 
version 4).  So many classes have changed names, and so many of the more 
well-documented techniques are deprecated or removed now, that it's tough to 
follow what the current best practice actually is.  I'm using the block join 
functionality heavily, so I'm trying to keep more current than that.  I would 
be happy to send along the full source privately if it would help figure this 
out, and I plan to write up some more elaborate instructions (updated for 
Solr 5) for the next person who decides to write a PostFilter and work with 
block joins, if I ever manage to get this performing well enough.

Thanks for any pointers!  Totally open to doing this an entirely different way. 
 I read DocValues might be a more elegant approach but currently that would 
require reindexing, so trying to avoid that.

Also, I've been wondering if the query above would read from the filter cache 
or not.  The query is constructed like this:


private Term inStockTrueTerm = new Term("sku_history.is_in_stock", "T");
private Term objectTypeSkuHistoryTerm = new Term("object_type", "sku_history");
...

inStockTrueTermQuery = new TermQuery(inStockTrueTerm);
objectTypeSkuHistoryTermQuery = new TermQuery(objectTypeSkuHistoryTerm);
inStockSkusQuery = new BooleanQuery();
inStockSkusQuery.add(inStockTrueTermQuery, BooleanClause.Occur.MUST);
inStockSkusQuery.add(objectTypeSkuHistoryTermQuery, BooleanClause.Occur.MUST);
--
Steve



WGSN is a global foresight business. Our experts provide deep insight and 
analysis of consumer, fashion and design trends. We inspire our clients to plan 
and trade their range with unparalleled confidence and accuracy. Together, we 
Create Tomorrow.

WGSN <http://www.wgsn.com/> is part of WGSN Limited, comprising of 
market-leading products including WGSN.com <http://www.wgsn.com>, WGSN 
Lifestyle & Interiors <http://www.wgsn.com/en/lifestyle-interiors>, WGSN 
INstock <http://www.wgsninstock.com/>, WGSN StyleTrial 
<http://www.wgsn.com/en/styletrial/> and WGSN Mindset 
<http://www.wgsn.com/en/services/consultancy/>, our bespoke consultancy 
services.

The information in or attached to this email is confidential and may be legally 
privileged. If you are not the intended recipient of this message, any use, 
disclosure, copying, distribution or any action taken in reliance on it is 
prohibited and may be unlawful. If you have received this message in error, 
please notify the sender immediately by return email and delete this message 
and any copies from your computer and network. WGSN does not warrant that this