[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-19 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch

I reviewed the patch more closely before committing:

* Modified a few javadocs
* Removed needsSampling() since we don't offer an extension any access to e.g. 
totalHits. We can add it when there's demand.
* Fixed a bug in how carryOver was implemented: it is replaced by two members, 
{{leftoverBin}} and {{leftoverIndex}}. Now, if {{leftoverBin != -1}}, we know 
to skip that many leading documents in the next segment and, depending on 
{{leftoverIndex}}, whether we need to sample any of them. Before that, we 
didn't really skip over the leftover docs in the bin, but started to count a 
new bin.
* Added a CHANGES entry.
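
As an illustration of the leftover handling described above, here is a minimal, self-contained sketch. It is not the code from the patch: the class {{BinSampler}}, the method {{sampleSegment}} and the exact meaning given to {{leftoverBin}} and {{leftoverIndex}} are assumptions made for the example, and a segment's hits are modelled as a plain {{int[]}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sampler: keeps one hit per bin of binSize hits, across segment boundaries. */
final class BinSampler {
  private final int binSize;
  private final Random random;
  // Assumed meaning: number of hits still owed to the unfinished bin, or -1 if no bin is open.
  private int leftoverBin = -1;
  // Assumed meaning: offset within those owed hits of the hit to keep, or -1 if the bin was already sampled.
  private int leftoverIndex = -1;

  BinSampler(int binSize, long seed) {
    this.binSize = binSize;
    this.random = new Random(seed);
  }

  /** Returns the sampled hits of one segment, given that segment's hits in order. */
  List<Integer> sampleSegment(int[] hits) {
    List<Integer> sampled = new ArrayList<>();
    int pos = 0;
    if (leftoverBin != -1) {
      // Finish the bin left open by the previous segment instead of starting a new one.
      int take = Math.min(leftoverBin, hits.length);
      if (leftoverIndex != -1 && leftoverIndex < take) {
        sampled.add(hits[leftoverIndex]);
        leftoverIndex = -1;
      } else if (leftoverIndex != -1) {
        leftoverIndex -= take; // pick not reached yet, carry it into the next segment
      }
      leftoverBin -= take;
      pos = take;
      if (leftoverBin == 0) {
        leftoverBin = -1; // the open bin is now complete
      }
    }
    while (pos < hits.length) {
      int target = random.nextInt(binSize); // position in the bin of the hit to keep
      int remaining = hits.length - pos;
      if (remaining >= binSize) {
        sampled.add(hits[pos + target]);
        pos += binSize;
      } else {
        // The segment ends inside this bin: remember how much of the bin is still owed.
        if (target < remaining) {
          sampled.add(hits[pos + target]);
          leftoverIndex = -1;
        } else {
          leftoverIndex = target - remaining;
        }
        leftoverBin = binSize - remaining;
        pos = hits.length;
      }
    }
    return sampled;
  }
}
```

Under these assumptions, feeding the sampler segments of 7, 3, 25 and 5 hits with a bin size of 10 yields exactly four samples, one per completed bin, regardless of where the segment boundaries fall.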

I reviewed the test. It would be good to write a unit test which 
specifically matches only a few documents in one segment compared to the rest. I 
will look into it, perhaps later.

I think it's ready, but if anyone wants to give createSample() another look to 
make sure the leftover handling works well this time, I won't commit it before 
tomorrow anyway.

> Facet sampling
> --
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-17 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch

Removed scores. Added javadoc explaining what happens to scores. Removed the 
System.out.println.




[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-17 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch

New patch. I'm still not really sure about the scoring, but please take a look 
at it.




[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-11 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch




[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-06 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch




[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-06 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch

I have a patch ready that implements the random sampling using an override of 
{{.getMatchingDocs()}}. It passes the test, so it should be ok :). 

It is slower, however (only a 3x speedup), but maybe there is room for 
optimization?

Exact:   168 ms
Sampled:  55 ms





[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-05 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch

* Reverted move of XorShift64Random
* Added basic test for random sampling. Uses sampling ratios of 1.0d and 0.1d. 
1.0d should behave like no sampling at all. 
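
The property such a test relies on can be sketched in a self-contained way. The class {{RatioSamplingSketch}} below is hypothetical and uses {{java.util.BitSet}} as a stand-in for Lucene's bitsets; it assumes a ratio in (0, 1] and a bin size of round(1 / ratio), which may differ from what the patch does.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical stand-in for the collector's sampling step; not the patch's code. */
final class RatioSamplingSketch {
  /** Keeps one randomly chosen hit out of every bin of round(1 / ratio) consecutive hits. */
  static BitSet sample(BitSet hits, double ratio, long seed) {
    int binSize = (int) Math.round(1.0 / ratio); // assumes 0 < ratio <= 1
    Random random = new Random(seed);
    BitSet sampled = new BitSet(hits.length());
    int posInBin = 0;
    int target = random.nextInt(binSize); // position within the bin of the hit to keep
    for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
      if (posInBin == target) {
        sampled.set(doc);
      }
      if (++posInBin == binSize) { // bin complete: start the next one with a new pick
        posInBin = 0;
        target = random.nextInt(binSize);
      }
    }
    return sampled;
  }
}
```

With a ratio of 1.0 the bin size is 1, so every hit is its own bin and is always kept; with 0.1 and 1,000 hits, the sketch keeps exactly 100 of them.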




[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-04 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch

I have taken these points into account in the new patch.

* Separated {{XORShift64Random}} to o.a.l.util
* Moved SampledDocs and declared private
* Fixed some typos / redundancy in the javadocs.




[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-04 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch

New patch. 

My little benchmark results:

Exact:   176 ms
Sampled:  36 ms

* I removed the {{SamplingParams}}. 
* I use random sampling in bins in {{addDoc}}. This has almost no 
performance impact over fixed-interval sampling when running with a sampleRatio of 
0.001.
* Renamed the class to {{RandomSamplingFacetsCollector}}
* I decided to always add the first document of a segment. This counters the 
fact that the randomIndex in the last bin can be larger than the segment size. 
It gives slightly better results.


I do have some questions:

* Code style-wise: I see {{this.}} is sometimes used, and sometimes not. What 
do you prefer?
* It seems {{FixedBitSet}} contains a lot of empty space, as only 1 in 1000 
hits are preserved. I'm not that familiar with all the DocIdSet 
implementations, but maybe there are more suitable subclasses? 
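
To put a rough number on the empty space, here is a back-of-the-envelope sketch. It assumes a dense bitset stores one bit per document in 64-bit words and ignores object headers; the sorted {{int[]}} is only a hypothetical alternative for comparison, not a recommendation of a particular DocIdSet implementation. The class and method names are made up for the example.

```java
/** Hypothetical helper comparing two ways of storing the sampled doc ids. */
final class SampleMemorySketch {
  /** Payload bytes of a dense bitset with one bit per document, stored in 64-bit words. */
  static long denseBitSetBytes(int maxDoc) {
    return ((maxDoc + 63L) / 64L) * Long.BYTES;
  }

  /** Payload bytes of a sorted int[] that holds only the sampled doc ids. */
  static long sortedIntArrayBytes(int sampledDocs) {
    return (long) sampledDocs * Integer.BYTES;
  }
}
```

For the benchmark figures used in this thread (10M documents, 2,500 sampled hits), {{denseBitSetBytes(10_000_000)}} gives 1,250,000 bytes while {{sortedIntArrayBytes(2_500)}} gives 10,000 bytes.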




[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-03 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5476:
-

Attachment: SamplingComparison_SamplingFacetsCollector.java

Nice patch,

I was surprised by the gain: only a 3.5x speedup for 1/1000 of the 
documents... I think this is because the random generation takes too long?

Perhaps it is too heavy to call {{.get(docId)}} on the original/clone and 
{{.clear()}} for every bit.

I crafted some code that creates a new (cleared) {{FixedBitSet}} instead of a 
clone and sets only the sampled bits. With it, instead of going over every 
document and calling {{.get(docId)}}, I used the iterator, which may be more 
efficient when it comes to skipping cleared bits (deleted docs?).

Ran the code on a scenario as you described (bitset sized at 10M, 2.5M randomly 
selected bits set, and then sampling a thousandth of them, down to 2,500 
documents).
The results:
* non-cloned + iterator: ~20ms
* cloned + for loop on each docId: ~50ms
{code}
int countInBin = 0;
int randomIndex = random.nextInt(binsize);
final DocIdSetIterator it = docIdSet.iterator();

try {
  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
    if (++countInBin == binsize) {
      countInBin = 0;
      randomIndex = random.nextInt(binsize);
    }
    if (countInBin == randomIndex) {
      sampledBits.set(doc);
    }
  }
} catch (IOException e) {
  // should not happen
  throw new RuntimeException(e);
}
{code}

Also attaching the java file with a main() which compares the two 
implementations (the original is still named {{createSample()}} and the proposed 
one is {{createSample2()}}). 

Hopefully, if this proves consistent, the end-to-end sampling should get closer 
to the 10x gain that was hoped for.
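
For readers without the attachment, a comparison of this kind can be approximated with the standard library alone. The sketch below is not the attached file: it uses {{java.util.BitSet}} as a stand-in for {{FixedBitSet}}, the names {{cloneAndClear}}, {{freshAndSet}} and {{randomHits}} are hypothetical, and its bin bookkeeping differs slightly from the snippet above. Timings depend on the machine and JVM, so none are quoted.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical harness comparing two ways of building the sampled bitset. */
final class SamplingComparisonSketch {
  /** Clones the hits, visits every doc id, and clears each hit that is not its bin's pick. */
  static BitSet cloneAndClear(BitSet hits, int maxDoc, int binSize, long seed) {
    Random random = new Random(seed);
    BitSet sampled = (BitSet) hits.clone();
    int posInBin = 0;
    int target = random.nextInt(binSize);
    for (int doc = 0; doc < maxDoc; doc++) {
      if (!sampled.get(doc)) {
        continue; // not a hit
      }
      if (posInBin != target) {
        sampled.clear(doc);
      }
      if (++posInBin == binSize) {
        posInBin = 0;
        target = random.nextInt(binSize);
      }
    }
    return sampled;
  }

  /** Starts from an empty bitset, walks only the set bits, and sets each bin's pick. */
  static BitSet freshAndSet(BitSet hits, int maxDoc, int binSize, long seed) {
    Random random = new Random(seed);
    BitSet sampled = new BitSet(maxDoc);
    int posInBin = 0;
    int target = random.nextInt(binSize);
    for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
      if (posInBin == target) {
        sampled.set(doc);
      }
      if (++posInBin == binSize) {
        posInBin = 0;
        target = random.nextInt(binSize);
      }
    }
    return sampled;
  }

  /** Builds a bitset over maxDoc documents with exactly numHits randomly chosen bits set. */
  static BitSet randomHits(int maxDoc, int numHits, long seed) {
    Random random = new Random(seed);
    BitSet hits = new BitSet(maxDoc);
    int set = 0;
    while (set < numHits) {
      int doc = random.nextInt(maxDoc);
      if (!hits.get(doc)) {
        hits.set(doc);
        set++;
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    int maxDoc = 10_000_000, numHits = 2_500_000, binSize = 1_000;
    BitSet hits = randomHits(maxDoc, numHits, 1L);
    long t0 = System.nanoTime();
    BitSet a = cloneAndClear(hits, maxDoc, binSize, 2L);
    long t1 = System.nanoTime();
    BitSet b = freshAndSet(hits, maxDoc, binSize, 2L);
    long t2 = System.nanoTime();
    System.out.printf("clone + loop: %d ms, fresh + iterator: %d ms, same result: %b, sampled: %d%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a.equals(b), b.cardinality());
  }
}
```

Given the same seed, both strategies pick the same documents, so any timing difference comes from how the bitset is traversed and mutated rather than from the sample itself.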




[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-03 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch

Here is a patch. 

I'm not totally satisfied with it though. There are two loose ends:

* When a segment has many docids but very few hits, they might be skipped 
altogether. If this happens in many segments, the estimate gets more and more 
off
* When a segment is small, there might be a few hits that are never touched as 
the randomIndex in the 'bin-o-hits' is greater than the index of the hit.

Some small performance indicators:

Using an in-memory index with 10M documents, retrieving 25% of them and using a 
samplingRatio of 0.001 (skipping the first call, averaging over the next three):

Exact:  184 ms.
Sampled: 67 ms.

Not the 10-fold increase I had hoped for, but significantly faster.

Btw. for my use case the two problems I described above are not a real issue. I 
do a global count anyway, and use this to determine if the facets need to be 
sampled or not. When they need to be sampled, I know for sure that there are 
enough hits to give a proper estimate. The price of counting and sampling is 
lower than the price of retrieving exact facets.  
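
The count-first approach described above could look roughly like the following. This is only a sketch: the class {{SamplingDecisionSketch}}, the {{minSampleSize}} threshold and the scaling of sampled counts by 1 / ratio are assumptions for illustration, not something taken from the patch.

```java
/** Hypothetical helper for deciding when to sample and how to scale sampled counts. */
final class SamplingDecisionSketch {
  /** Assumed rule: sample only if at least minSampleSize hits are expected to survive sampling. */
  static boolean shouldSample(long totalHits, double samplingRatio, long minSampleSize) {
    return totalHits * samplingRatio >= minSampleSize;
  }

  /** Scales a count observed in the sample back to an estimate for the full hit set. */
  static long estimateCount(long sampledCount, double samplingRatio) {
    return Math.round(sampledCount / samplingRatio);
  }
}
```

With 2.5M hits, a ratio of 0.001 and a hypothetical minimum of 1,000 sampled hits, {{shouldSample}} returns true; with 50,000 hits it returns false and exact counting would be used instead.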






[jira] [Updated] (LUCENE-5476) Facet sampling

2014-02-28 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: LUCENE-5476.patch

Here is a patch (against 4.7) that covers some of the feedback. 




[jira] [Updated] (LUCENE-5476) Facet sampling

2014-02-28 Thread Rob Audenaerde (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Audenaerde updated LUCENE-5476:
---

Attachment: SamplingFacetsCollector.java
