[jira] [Commented] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937580#comment-13937580 ] Gilad Barkai commented on LUCENE-5476: -- About the scores (the only part I got to review thus far): the scores should be a non-sparse float array. E.g., if there are 1M documents and the original set contains 1000 documents, the scores[] array would be of length 1000. If the sampled set only has 10 documents, the scores[] array should have length 10. The relevant part:
{code}
if (getKeepScores()) {
  scores[doc] = docs.scores[doc];
}
{code}
should be changed so that the scores[] size and index are relative to the sampled set and not the original results. Also, could the size of the scores[] array be the number of bins?
> Facet sampling
> --
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents) counting facets is rather expensive, as all the hits are collected and processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?
-- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
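The suggested change could look roughly like this - a minimal sketch with hypothetical names ({{compact}}, {{sampledPositions}}), not the patch's actual code: the kept scores array is sized by the sampled set and filled by sample position rather than by docID.

```java
// Hypothetical sketch of the fix suggested above: the kept scores array is
// sized to the sampled set and indexed by sample position, not by docID.
public class SampledScores {

  /**
   * originalScores is dense over the original hits (one slot per hit, in order);
   * sampledPositions holds, for each sampled doc, its position in the original hit list.
   */
  static float[] compact(float[] originalScores, int[] sampledPositions) {
    // length = sample size, not original result size
    float[] sampled = new float[sampledPositions.length];
    for (int i = 0; i < sampledPositions.length; i++) {
      sampled[i] = originalScores[sampledPositions[i]];
    }
    return sampled;
  }

  public static void main(String[] args) {
    float[] scores = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f};
    float[] sampled = compact(scores, new int[] {1, 4});
    if (sampled.length != 2 || sampled[0] != 0.2f || sampled[1] != 0.5f) {
      throw new AssertionError("unexpected sampled scores");
    }
  }
}
```

This keeps the array non-sparse with respect to the sampled set, which is what score-consuming facet accumulators expect.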
[jira] [Commented] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925158#comment-13925158 ] Gilad Barkai commented on LUCENE-5476: --
{quote}
The limit should also take into account the total number of hits for the query, otherwise the estimate and the multiplication with the sampling factor may yield a larger number than the actual results.
{quote}
I understand this statement is confusing, so I'll try to elaborate. If the sample matched the sampling ratio *exactly*, this would not be a problem; but since the sample - being random as it is - may be a bit larger, adjusting according to the original sampling ratio (rather than the actual one) may yield larger counts than the actual results. This could be solved by either limiting to the number of results, or adjusting the {{samplingRate}} to be the exact, post-sampling, ratio.
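The two remedies mentioned above could be combined in a small helper - a sketch with hypothetical names, not the patch's code: scale by the actual post-sampling ratio, then clamp to the real hit count.

```java
// Sketch (hypothetical helper, not from the patch) of the two remedies above:
// scale by the *actual* post-sampling ratio, and never exceed the real hit count.
public class SampleCorrection {

  static int correctedCount(int sampledCount, int sampleSize, int totalHits) {
    if (sampleSize == 0) {
      return 0;
    }
    // exact ratio measured after sampling, not the configured samplingRate
    double actualRatio = (double) sampleSize / totalHits;
    int estimate = (int) Math.round(sampledCount / actualRatio);
    // clamp: an estimated count can never exceed the result set size
    return Math.min(estimate, totalHits);
  }

  public static void main(String[] args) {
    // 120 sampled docs matched the category, in a sample of 1000 out of 100000 hits
    if (correctedCount(120, 1000, 100000) != 12000) {
      throw new AssertionError();
    }
    // a category present in every sampled doc is clamped to totalHits
    if (correctedCount(1000, 1000, 100000) != 100000) {
      throw new AssertionError();
    }
  }
}
```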
[jira] [Commented] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924348#comment-13924348 ] Gilad Barkai commented on LUCENE-5476: --
{quote}
Btw. Is there an easy way to retrieve the total facet counts for an ordinal? When correcting facet counts it would be a quick win to limit the number of estimated documents to the actual number of documents in the index that match that facet. (And maybe use the distribution as well, to make better estimates)
{quote}
That's a great idea! The {{docFreq}} of the category drill-down term is an upper bound - and could be used as a limit. It's cheap, but might not be the exact number, as it also takes into account deleted documents. The limit should also take into account the total number of hits for the query, otherwise the estimate and the multiplication with the sampling factor may yield a larger number than the actual results.
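Both limits described above can be applied together - a sketch (hypothetical helper, not the patch's code): cap the estimate by the query's total hits and by the category's docFreq upper bound.

```java
// Hypothetical sketch of the tighter clamp discussed above: limit the estimate
// by both the query's total hits and the category's docFreq (an upper bound
// that may overcount because it includes deleted documents).
public class FacetClamp {

  static int clamp(int estimate, int totalHits, int docFreq) {
    return Math.min(estimate, Math.min(totalHits, docFreq));
  }

  public static void main(String[] args) {
    // totalHits is the tighter bound here
    if (clamp(12000, 10000, 11000) != 10000) {
      throw new AssertionError();
    }
    // docFreq is the tighter bound here
    if (clamp(12000, 50000, 9000) != 9000) {
      throw new AssertionError();
    }
  }
}
```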
[jira] [Commented] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922374#comment-13922374 ] Gilad Barkai commented on LUCENE-5476: -- Hi Rob, the patch looks great. A few comments:
* Some imports are not used (o.a.l.u.Bits, o.a.l.s.Collector & o.a.l.s.DocIdSet)
* Perhaps the parameters initialized in the RandomSamplingFacetsCollector c'tor could be made {{final}}
* XORShift64Random.XORShift64Random() (the default c'tor) is never used. Perhaps it was needed for usability when this was thought to be a core utility and was left in by mistake? Should it be called somewhere?
* {{getMatchingDocs()}}
** when {{!sampleNeeded()}} there's a call to {{super.getMatchingDocs()}}; this may be a redundant method call, as 5 lines above we call it, and the code always computes {{totalHits}} first. Perhaps the original matching docs could be stored as a member? This would also help some implementations of correcting the sampled facet results.
** {{totalHits}} is redundantly computed again in lines 147-152
* {{needsSampling()}} could perhaps be protected, allowing other criteria for sampling to be added
* {{createSample()}}
** {{randomIndex}} is initialized to {{0}}, effectively making the first document of every segment's first bin the representative of that bin, neglecting the rest of the bin (regardless of the seed). So if a bin is the size of 1000 documents, then there are 999 documents that would always be neglected, regardless of the seed. It may be better to initialize it as {{randomIndex = random.nextInt(binsize)}}, as happens for the 2nd bin and on.
** While creating a new {{MatchingDocs}} with the sampled set, the original {{totalHits}} and original {{scores}} are used. I'm not 100% sure the first is an issue, but any facet accumulation which relies on document scores would be hit by the second, as the {{scores}} (at least per the javadocs) are defined as non-sparse.
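The {{randomIndex}} fix suggested above can be illustrated with a standalone sketch (hypothetical names, not the patch's code): the representative of the *first* bin is drawn at random too, so every document has a chance of being selected regardless of its position in the bin.

```java
import java.util.Random;

// Sketch of the createSample() fix suggested above: pick the representative of
// the first bin at random as well, instead of always taking its first document.
public class BinSampler {

  static int[] sample(int[] docs, int binSize, long seed) {
    Random random = new Random(seed);
    int[] out = new int[(docs.length + binSize - 1) / binSize];
    int n = 0;
    int countInBin = 0;
    int randomIndex = random.nextInt(binSize); // NOT 0: first bin is random too
    for (int doc : docs) {
      if (countInBin == randomIndex) {
        out[n++] = doc; // one representative per bin
      }
      if (++countInBin == binSize) {
        countInBin = 0;
        randomIndex = random.nextInt(binSize);
      }
    }
    return java.util.Arrays.copyOf(out, n);
  }

  public static void main(String[] args) {
    int[] docs = new int[1000];
    for (int i = 0; i < 1000; i++) {
      docs[i] = i;
    }
    int[] s = sample(docs, 100, 42L);
    if (s.length != 10) {
      throw new AssertionError("expected one pick per bin, got " + s.length);
    }
  }
}
```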
[jira] [Commented] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919380#comment-13919380 ] Gilad Barkai commented on LUCENE-5476: -- Sorry if this was mentioned before and I missed it - how can sampling work correctly (correctness of the end result) if it's done 'on the fly'? Beforehand, one cannot know the number of documents that will match the query; as such, the sampling ratio is unknown, given that we can afford faceted search over N documents only. If the query yields 10K results and the sampling ratio is 0.001 - would 10 documents make a good sample? Same if the query yields 100M results - is a 10K sample good enough? Is it perhaps too much? I find it hard to figure out a pre-defined sampling ratio which would fit different cases.
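One possible answer to the question above - an assumption on my part, not the patch's behavior - is to derive the effective ratio from the hit count, bounded by a minimum and maximum sample size, instead of fixing the ratio up front:

```java
// One possible answer (an assumption, not the patch's behavior): derive the
// effective ratio from the hit count, bounded by a minimum and maximum sample
// size, so tiny result sets aren't under-sampled and huge ones aren't over-sampled.
public class AdaptiveRatio {

  static double effectiveRatio(int totalHits, double baseRatio, int minSample, int maxSample) {
    double target = totalHits * baseRatio;
    double bounded = Math.max(minSample, Math.min(maxSample, target));
    return Math.min(1.0, bounded / totalHits);
  }

  public static void main(String[] args) {
    // 10K hits at ratio 0.001 would sample only 10 docs; the floor lifts it to 1000
    double r = effectiveRatio(10_000, 0.001, 1000, 100_000);
    if (Math.abs(r - 0.1) > 1e-9) {
      throw new AssertionError("got " + r);
    }
    // 100M hits would sample 100K docs; within the cap, so the base ratio holds
    double r2 = effectiveRatio(100_000_000, 0.001, 1000, 100_000);
    if (Math.abs(r2 - 0.001) > 1e-9) {
      throw new AssertionError("got " + r2);
    }
  }
}
```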
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5476: - Attachment: SamplingComparison_SamplingFacetsCollector.java Nice patch. I was surprised by the gain - only 3.5 times the gain for 1/1000 of the documents... I think this is because the random generation takes too long? Perhaps it is too heavy to call {{.get(docId)}} on the original/clone and {{.clear()}} for every bit. I crafted some code for creating a new (cleared) {{FixedBitSet}} instead of a clone, and setting only the sampled bits. With it, instead of going over every document and calling {{.get(docId)}}, I used the iterator, which may be more efficient when it comes to skipping cleared bits (deleted docs?). Ran the code on a scenario as you described (bitset sized at 10M, randomly selected 2.5M of it to be set, and then sampled a thousandth of it down to 2,500 documents). The results:
* non-cloned + iterator: ~20ms
* cloned + for loop on each docId: ~50ms
{code}
int countInBin = 0;
int randomIndex = random.nextInt(binsize);
final DocIdSetIterator it = docIdSet.iterator();
try {
  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
    if (++countInBin == binsize) {
      countInBin = 0;
      randomIndex = random.nextInt(binsize);
    }
    if (countInBin == randomIndex) {
      sampledBits.set(doc);
    }
  }
} catch (IOException e) {
  // should not happen
  throw new RuntimeException(e);
}
{code}
Also attaching the java file with a main() which compares the two implementations (the original still named {{createSample()}} and the proposed one {{createSample2()}}). Hopefully, if this proves consistent, the end-to-end sampling should be closer to the 10x gain hoped for.
[jira] [Comment Edited] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917158#comment-13917158 ] Gilad Barkai edited comment on LUCENE-5476 at 3/1/14 7:03 PM: -- Great effort! I wish to throw in another point - the description of this issue is about sampling, but the implementation is about *random* sampling. This is not always the case, nor is it very fast (indeed, calling Random.nextInt 1M times would be measurable by itself, IMHO). A different sampler could be (pseudo-code):
{code}
int acceptedModulu = (int) (1 / sampleRatio);
int next() {
  do {
    nextDoc = inner.next();
  } while (nextDoc != NO_MORE_DOCS && nextDoc % acceptedModulu != 0);
  return nextDoc;
}
{code}
This should be faster as a sampler, and perhaps saves us from creating a new {{DocIdSet}}. One last thing - if I did the math right - the sample crafted by the code in the patch would be twice as large as the user may expect. For a sample ratio of 0.1, random.nextInt() would be called with 10, so the avg. "jump" is actually 5 - and every 5th document in the original set (again, on avg.) would be selected, not every 10th. I think random.nextInt should be called with twice the size it is called with now (e.g. 20, making the avg. random selection 10).
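The "twice as large" arithmetic above can be checked empirically - this is not the patch's code, just the expected value of the jump: Random.nextInt(n) averages (n-1)/2, so jumping by nextInt(10) skips ~4.5 docs on average (sampling ~2x the intended 1-in-10), while nextInt(20) brings the average jump to ~9.5.

```java
import java.util.Random;

// Empirical check of the arithmetic above: the mean of Random.nextInt(bound)
// over many draws is (bound - 1) / 2, not bound.
public class AvgJump {

  static double averageJump(int bound, int draws, long seed) {
    Random r = new Random(seed);
    long sum = 0;
    for (int i = 0; i < draws; i++) {
      sum += r.nextInt(bound);
    }
    return (double) sum / draws;
  }

  public static void main(String[] args) {
    double avg10 = averageJump(10, 1_000_000, 1L);
    double avg20 = averageJump(20, 1_000_000, 1L);
    // nextInt(10) averages ~4.5 (half the intended 10), nextInt(20) averages ~9.5
    if (Math.abs(avg10 - 4.5) > 0.1) {
      throw new AssertionError("got " + avg10);
    }
    if (Math.abs(avg20 - 9.5) > 0.1) {
      throw new AssertionError("got " + avg20);
    }
  }
}
```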
[jira] [Commented] (LUCENE-5457) Expose SloppyMath earth diameter table
[ https://issues.apache.org/jira/browse/LUCENE-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905736#comment-13905736 ] Gilad Barkai commented on LUCENE-5457: -- +1 Sorry for the confusion; indeed it should be diameter, as the multiplication (*2) was moved to the pre-computed table, hence saving the operation at runtime, as per Ryan's comment.
> Expose SloppyMath earth diameter table
> --
>
> Key: LUCENE-5457
> URL: https://issues.apache.org/jira/browse/LUCENE-5457
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Trivial
> Fix For: 4.7
>
> Attachments: LUCENE-5457.patch
>
> LUCENE-5271 introduced a table in order to get approximate values of the diameter of the earth given a latitude. This could be useful for other computations so I think it would be nice to have a method that exposes this table.
[jira] [Updated] (LUCENE-5271) A slightly more accurate SloppyMath distance
[ https://issues.apache.org/jira/browse/LUCENE-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5271: - Attachment: LUCENE-5271.patch Ryan, thanks for looking at this.
bq. If the lat/lon values are large, then the index would be out of bounds for the table
Nice catch! I did not check for values over 90 degs lat. Added a % with the table's size.
bq. Why was this test removed? assertEquals(314.40338, haversin(1, 2, 3, 4), 10e-5)
Well, the test's results were wrong :) The new, more accurate method returns different results. I added another test instead:
{code}
double earthRadiusKMs = 6378.137;
double halfCircle = earthRadiusKMs * Math.PI;
assertEquals(halfCircle, haversin(0, 0, 0, 180), 0D);
{code}
It computes half of earth's circumference on the equator, using both haversin and a simple circle equation with earth's equatorial radius. It differs by over 20 km from the old haversin result, btw.
bq. Could you move the 2 * radius computation into the table?
Awesome! Renamed the table to diameter rather than radius.
bq. I know this is an already existing problem, but could you move the division by 2 from h1/h2 to h?
Done.
> A slightly more accurate SloppyMath distance
> --
>
> Key: LUCENE-5271
> URL: https://issues.apache.org/jira/browse/LUCENE-5271
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/other
> Reporter: Gilad Barkai
> Priority: Minor
> Attachments: LUCENE-5271.patch, LUCENE-5271.patch, LUCENE-5271.patch
>
> SloppyMath, introduced in LUCENE-5258, uses earth's avg. (according to WGS84) ellipsoid radius as an approximation for computing the "spherical" distance (the TO_KILOMETERS constant).
> While this is pretty accurate for long distances (latitude-wise), it may introduce some small errors while computing distances close to the equator (as the earth radius there is larger than the avg.)
> A more accurate approximation would be taking the avg. earth radius at the source and destination points. But computing an ellipsoid radius at any given point is a heavy function, and this distance should be used in a scoring function. So two optimizations are possible:
> * Pre-compute a table with an earth radius per latitude (the longitude does not affect the radius)
> * Instead of using the two-point radius avg., figure out the avg. latitude (exactly between the src and dst points) and get its radius.
[jira] [Commented] (LUCENE-5339) Simplify the facet module APIs
[ https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833596#comment-13833596 ] Gilad Barkai commented on LUCENE-5339: -- Been away from the issue for some time, and it looks like major progress has been made; hats off!
{{LabelAndValue}} & {{FacetResult}} use {{instanceof}} checks in their {{equals}} method - is that a must?
{{FacetResult}} has a member called {{childCount}} - I think it's the number of categories/paths/labels that were encountered. The current jdocs, "How many labels were populated under the requested path", reveal implementation details (population). Perhaps exchange "populated" with "encountered"?
{{FloatRange}} and {{DoubleRange}} use {{Math.nextUp/Down}} for infinity, as the ranges are always inclusive. Perhaps these constants for float and double could be static final.
{{TaxonomyFacetSumFloatAssociations}} and {{TaxonomyFacetSumValueSource}} reuse a LOT of code; can they extend one another? Perhaps extract a common super class for both?
In {{TaxonomyFacets}} the parents array is saved; I could not see where it's being used (and I think it's not used even in the older taxonomy-facet implementation).
{{FacetConfig}} confuses me a bit: on one hand it's very much aware of the Taxonomy, on the other it handles all the kinds of facets. Perhaps {{FacetConfig.build()}} could be split up, allowing each {{FacetField.Type}} a build() method of its own, rather than every type's building being done in the same method. It would also bring a common parent class to all FacetField types, which I also like. As such, the taxonomy part, with {{processFacetFields()}}, could be moved to its respective Facet implementation.
> Simplify the facet module APIs > -- > > Key: LUCENE-5339 > URL: https://issues.apache.org/jira/browse/LUCENE-5339 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Michael McCandless >Assignee: Michael McCandless > Attachments: LUCENE-5339.patch, LUCENE-5339.patch > > > I'd like to explore simplifications to the facet module's APIs: I > think the current APIs are complex, and the addition of a new feature > (sparse faceting, LUCENE-5333) threatens to add even more classes > (e.g., FacetRequestBuilder). I think we can do better. > So, I've been prototyping some drastic changes; this is very > early/exploratory and I'm not sure where it'll wind up but I think the > new approach shows promise. > The big changes are: > * Instead of *FacetRequest/Params/Result, you directly instantiate > the classes that do facet counting (currently TaxonomyFacetCounts, > RangeFacetCounts or SortedSetDVFacetCounts), passing in the > SimpleFacetsCollector, and then you interact with those classes to > pull labels + values (topN under a path, sparse, specific labels). > * At index time, no more FacetIndexingParams/CategoryListParams; > instead, you make a new SimpleFacetFields and pass it the field it > should store facets + drill downs under. If you want more than > one CLI you create more than one instance of SimpleFacetFields. > * I added a simple schema, where you state which dimensions are > hierarchical or multi-valued. From this we decide how to index > the ordinals (no more OrdinalPolicy). > Sparse faceting is just another method (getAllDims), on both taxonomy > & ssdv facet classes. > I haven't created a common base class / interface for all of the > search-time facet classes, but I think this may be possible/clean, and > perhaps useful for drill sideways. > All the new classes are under oal.facet.simple.*. > Lots of things that don't work yet: drill sideways, complements, > associations, sampling, partitions, etc. 
This is just a start ...
[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830884#comment-13830884 ] Gilad Barkai commented on LUCENE-5316: -- I think we don't have to wait for LUCENE-5339; we could handle some of these issues now.
* {{if (kids == null)}} and hashing - there's a solution to that: don't hold it in a map. We could hold it in an {{int[][]}}, with NO_CHILDREN for the entries which have no children. So we'll have one array at the size of the taxonomy, and the sum of the others will be significantly smaller if we have flat dimensions. We'll spend some RAM compared to the Map, but it will speed things up, because we'll simply return what's in {{children\[ord\]}}.
* {{if (children == null)}} is done only because it's allocated lazily. Does it make sense to keep it that way? We could compute the children upon taxonomy opening and be done with it. Today (trunk), reopen is very costly, as it's O(taxonomy size), which affects NRT reopen (it's not guaranteed that for each reopen someone will actually need the new facet information); but if the computation of the children is done in O(new segments), then we're not wasting much.
> Taxonomy tree traversing improvement
> --
>
> Key: LUCENE-5316
> URL: https://issues.apache.org/jira/browse/LUCENE-5316
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Gilad Barkai
> Priority: Minor
> Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch
>
> The taxonomy traversing is done today utilizing the {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays which hold for each ordinal its (array #1) youngest child and (array #2) older sibling.
> This is a compact way of holding the tree information in memory, but it's not perfect:
> * Large (8 bytes per ordinal in memory)
> * Exposes internal implementation
> * Utilizing these arrays for tree traversing is not straightforward
> * Loses reference locality while traversing (the array is accessed in increasing-only entries, but they may be distant from one another)
> * In NRT, a reopen is always (not worst case) done at O(taxonomy size)
> This issue is about making the traversing easier, the code more readable, and opening it up for future improvements (i.e. memory footprint and NRT cost) - without changing any of the internals.
> A later issue(s?) could be opened to address the gaps once this one is done.
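The {{int[][]}}-with-NO_CHILDREN idea from the comment above can be sketched as follows (hypothetical class, not the patch): children are held per ordinal, flat entries share a single empty-array sentinel, and lookup is a plain array access with no hashing.

```java
// Sketch of the suggestion above (hypothetical, not the patch): children held
// in an int[][] indexed by ordinal, with a shared NO_CHILDREN sentinel for
// entries without children, so getChildren(ord) is a plain array lookup.
public class ChildrenTable {

  static final int[] NO_CHILDREN = new int[0]; // shared sentinel, costs nothing per entry

  final int[][] children;

  ChildrenTable(int taxonomySize) {
    children = new int[taxonomySize][];
    java.util.Arrays.fill(children, NO_CHILDREN);
  }

  int[] getChildren(int ord) {
    return children[ord]; // O(1), no hashing, no null checks for flat dimensions
  }

  public static void main(String[] args) {
    ChildrenTable t = new ChildrenTable(4);
    t.children[0] = new int[] {1, 2};
    if (t.getChildren(0).length != 2) {
      throw new AssertionError();
    }
    if (t.getChildren(3) != NO_CHILDREN) {
      throw new AssertionError();
    }
  }
}
```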
[jira] [Updated] (LUCENE-5271) A slightly more accurate SloppyMath distance
[ https://issues.apache.org/jira/browse/LUCENE-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5271: - Attachment: LUCENE-5271.patch Adapted tests to the more accurate distance method. Patch is ready.
[jira] [Updated] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5316: - Attachment: LUCENE-5316.patch Changes: * Fixed concurrency issues * Moved to an {{int[] getChildren(int ord)}} API which should perform at least as fast as the taxonomy-arrays way. The code is not ready for commit: * Children-map does not support fast reopen (using an older map and only updating with newly added ordinals) * Children-map initialization still uses the taxonomy arrays, so they are not completely removed. Should a benchmark show this change performs better, I'll make the extra changes to get it commit-ready. > Taxonomy tree traversing improvement > > > Key: LUCENE-5316 > URL: https://issues.apache.org/jira/browse/LUCENE-5316 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gilad Barkai >Priority: Minor > Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch, > LUCENE-5316.patch, LUCENE-5316.patch > > > The taxonomy traversing is done today utilizing the > {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays > which hold for each ordinal it's (array #1) youngest child and (array #2) > older sibling. > This is a compact way of holding the tree information in memory, but it's not > perfect: > * Large (8 bytes per ordinal in memory) > * Exposes internal implementation > * Utilizing these arrays for tree traversing is not straight forward > * Lose reference locality while traversing (the array is accessed in > increasing only entries, but they may be distant from one another) > * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size) > This issue is about making the traversing more easy, the code more readable, > and open it for future improvements (i.e memory footprint and NRT cost) - > without changing any of the internals. > A later issue(s?) could be opened to address the gaps once this one is done. 
[jira] [Updated] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5316: - Attachment: LUCENE-5316.patch Attaching a fixed patch with the concurrency issues resolved. This is not yet the patch with the {{int[]}} instead of an iterator. I'm not sure that following the int[] path is right; it would block all future extensions for using e.g. on-disk children. What do you think? Perhaps it is better to keep the iterator? > Taxonomy tree traversing improvement > > > Key: LUCENE-5316 > URL: https://issues.apache.org/jira/browse/LUCENE-5316 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gilad Barkai >Priority: Minor > Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch, > LUCENE-5316.patch > > > The taxonomy traversing is done today utilizing the > {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays > which hold for each ordinal it's (array #1) youngest child and (array #2) > older sibling. > This is a compact way of holding the tree information in memory, but it's not > perfect: > * Large (8 bytes per ordinal in memory) > * Exposes internal implementation > * Utilizing these arrays for tree traversing is not straight forward > * Lose reference locality while traversing (the array is accessed in > increasing only entries, but they may be distant from one another) > * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size) > This issue is about making the traversing more easy, the code more readable, > and open it for future improvements (i.e memory footprint and NRT cost) - > without changing any of the internals. > A later issue(s?) could be opened to address the gaps once this one is done. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5339) Simplify the facet module APIs
[ https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821142#comment-13821142 ] Gilad Barkai commented on LUCENE-5339: -- Mike, the idea of simplifying the API sounds great, but is it really that complicated now? Facet's {{Accumulator}} is similar to Lucene's {{Collector}}, the {{Aggregator}} is sort of a {{Scorer}}, and a {{FacetRequest}} is a sort of {{Query}}. Actually, the model after which the facets were designed was Lucene's. The optional {{IndexingParams}} came before the {{IndexWriterConfig}}, but these can be said to be similar as well. More low-level objects such as the {{CategoryListParams}} are not a must, and the user may never know about them (and btw, they are similar to {{Codecs}}). I reviewed the patch (mostly the taxonomy-related part) and I think that even without associations, counts only is a bit narrow. Especially with large counts (say many thousands), the count doesn't say much because of the "long tail" problem. When there's a large result set, all the categories will get high hit counts. And just as scoring by counting the number of query terms each document matches doesn't always make much sense (and I think all scoring functions do things a lot smarter), using counts for facets may at times yield irrelevant results. We found that for large result sets, an aggregation of Lucene's score (rather than {{+1}}), or even score^2, yields better results for the user. Also, arbitrary expressions which are corpus-specific (with or without associations) change the facets' usability dramatically. That's partially why the code was built to allow different "aggregation" techniques, allowing associations, numeric values etc. into each value for each category. 
As for the new API, it may be useful if there were a single "interface" - so all facet implementations could be switched easily, allowing users to experiment with the different implementations without writing a lot of code. Bottom line, I'm all for simplifying the API, but the current cost seems too great, and I'm not sure the benefits are proportional :) > Simplify the facet module APIs > -- > > Key: LUCENE-5339 > URL: https://issues.apache.org/jira/browse/LUCENE-5339 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Michael McCandless >Assignee: Michael McCandless > Attachments: LUCENE-5339.patch > > > I'd like to explore simplifications to the facet module's APIs: I > think the current APIs are complex, and the addition of a new feature > (sparse faceting, LUCENE-5333) threatens to add even more classes > (e.g., FacetRequestBuilder). I think we can do better. > So, I've been prototyping some drastic changes; this is very > early/exploratory and I'm not sure where it'll wind up but I think the > new approach shows promise. > The big changes are: > * Instead of *FacetRequest/Params/Result, you directly instantiate > the classes that do facet counting (currently TaxonomyFacetCounts, > RangeFacetCounts or SortedSetDVFacetCounts), passing in the > SimpleFacetsCollector, and then you interact with those classes to > pull labels + values (topN under a path, sparse, specific labels). > * At index time, no more FacetIndexingParams/CategoryListParams; > instead, you make a new SimpleFacetFields and pass it the field it > should store facets + drill downs under. If you want more than > one CLI you create more than one instance of SimpleFacetFields. > * I added a simple schema, where you state which dimensions are > hierarchical or multi-valued. From this we decide how to index > the ordinals (no more OrdinalPolicy). > Sparse faceting is just another method (getAllDims), on both taxonomy > & ssdv facet classes. 
> I haven't created a common base class / interface for all of the > search-time facet classes, but I think this may be possible/clean, and > perhaps useful for drill sideways. > All the new classes are under oal.facet.simple.*. > Lots of things that don't work yet: drill sideways, complements, > associations, sampling, partitions, etc. This is just a start ... -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
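The score-vs-count point in the comment above can be illustrated with a toy aggregator. The data and all names below are made up for illustration; this is not the facet module's API, just a sketch of "+1 per hit" versus "score per hit" binning:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration: for each hit, a count-based facet aggregator adds +1
// to the hit's category bin, while a score-based one adds the hit's
// relevance score. With a large result set of mostly low-relevance hits,
// the two orderings can disagree.
public class ScoreAggregationSketch {
    static Map<Integer, Float> aggregate(int[] categoryPerHit, float[] scorePerHit, boolean byScore) {
        Map<Integer, Float> bins = new HashMap<>();
        for (int i = 0; i < categoryPerHit.length; i++) {
            // byScore=false is the plain "+1" counting; byScore=true sums scores.
            bins.merge(categoryPerHit[i], byScore ? scorePerHit[i] : 1f, Float::sum);
        }
        return bins;
    }

    public static void main(String[] args) {
        int[] category = {0, 0, 0, 1, 1};                    // category ordinal per hit
        float[] score = {0.1f, 0.1f, 0.1f, 0.9f, 0.8f};      // relevance per hit
        // Category 0 wins on counts (3 vs 2), but category 1 wins on
        // aggregated score (~1.7 vs ~0.3) -- the "long tail" effect.
        System.out.println(aggregate(category, score, false));
        System.out.println(aggregate(category, score, true));
    }
}
```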
[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816183#comment-13816183 ] Gilad Barkai commented on LUCENE-5316: -- That's... not an optimistic result :) I think there should not be any MT issues? The initialization occurs while the taxonomy is being opened, and later on it's just read operations. Perhaps this API change should be modified, and we'd try Shai's approach - instead of a {{ChildrenIterator}} with a .next() method, return an {{int[]}}. The Map is still applicable for this approach, just that the int[] will not be wrapped with an iterator object. I'll also review the patch and try to figure out what MT issues I missed. > Taxonomy tree traversing improvement > > > Key: LUCENE-5316 > URL: https://issues.apache.org/jira/browse/LUCENE-5316 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gilad Barkai >Priority: Minor > Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch > > > The taxonomy traversing is done today utilizing the > {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays > which hold for each ordinal it's (array #1) youngest child and (array #2) > older sibling. > This is a compact way of holding the tree information in memory, but it's not > perfect: > * Large (8 bytes per ordinal in memory) > * Exposes internal implementation > * Utilizing these arrays for tree traversing is not straight forward > * Lose reference locality while traversing (the array is accessed in > increasing only entries, but they may be distant from one another) > * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size) > This issue is about making the traversing more easy, the code more readable, > and open it for future improvements (i.e memory footprint and NRT cost) - > without changing any of the internals. > A later issue(s?) could be opened to address the gaps once this one is done. 
[jira] [Updated] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5316: - Attachment: LUCENE-5316.patch Introducing a map-based children iterator, which holds for every ordinal (real parent) an {{int[]}} containing its direct children. Each such {{int[]}} has an extra last slot for {{TaxonomyReader.INVALID_ORDINAL}} which spares an {{if}} call for every {{ChildrenIterator.next()}} call. This is a quick and dirty patch, just so we could verify the penalty for moving to the API/map is not great. If it is great, the whole issue should be reconsidered, and perhaps a move to direct {{int[]}} for iterating over children should be implemented. > Taxonomy tree traversing improvement > > > Key: LUCENE-5316 > URL: https://issues.apache.org/jira/browse/LUCENE-5316 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gilad Barkai >Priority: Minor > Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch > > > The taxonomy traversing is done today utilizing the > {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays > which hold for each ordinal it's (array #1) youngest child and (array #2) > older sibling. > This is a compact way of holding the tree information in memory, but it's not > perfect: > * Large (8 bytes per ordinal in memory) > * Exposes internal implementation > * Utilizing these arrays for tree traversing is not straight forward > * Lose reference locality while traversing (the array is accessed in > increasing only entries, but they may be distant from one another) > * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size) > This issue is about making the traversing more easy, the code more readable, > and open it for future improvements (i.e memory footprint and NRT cost) - > without changing any of the internals. > A later issue(s?) could be opened to address the gaps once this one is done. 
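The sentinel trick described in the patch comment above - an extra last slot holding {{TaxonomyReader.INVALID_ORDINAL}} so that {{next()}} needs no bounds check - can be sketched roughly as follows. The class names and the INVALID_ORDINAL value of -1 are illustrative assumptions, not the patch's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a map-based children iterator: each parent ordinal maps to an
// int[] of its direct children, with one extra trailing slot holding
// INVALID_ORDINAL. The caller loops until it sees the sentinel, so next()
// itself never needs an 'if' bounds check.
public class ChildrenMapSketch {
    static final int INVALID_ORDINAL = -1;

    static class ChildrenIterator {
        private final int[] children; // last slot is always INVALID_ORDINAL
        private int pos = 0;
        ChildrenIterator(int[] children) { this.children = children; }
        int next() { return children[pos++]; } // sentinel spares the bounds check
    }

    public static void main(String[] args) {
        Map<Integer, int[]> childrenMap = new HashMap<>();
        childrenMap.put(0, new int[] {1, 2, 3, INVALID_ORDINAL}); // root's children

        ChildrenIterator it = new ChildrenIterator(childrenMap.get(0));
        for (int child = it.next(); child != INVALID_ORDINAL; child = it.next()) {
            System.out.println(child); // prints 1, 2, 3 on separate lines
        }
    }
}
```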
[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812297#comment-13812297 ] Gilad Barkai commented on LUCENE-5316: -- The patch is not ready for commit - I just got a stackoverflow in my head while chasing these NPEs in the loop-as-recursion thing... Shai, you are right in the sense that we could avoid an extra loop/recursion if the kids are null, but this now works, doesn't throw NPEs in weird places, and allows returning {{null}} while not calling the extra {{CI.next()}}. I'm still looking into that. I touched the recurring loop as little as possible, as I'm not 100% sure I understand it fully (also, changing a small innocent thing breaks things badly). In the old code there were hidden assumptions which are no longer true; I traced some but am not sure about others. bq. how about if we consolidate getPTA() and getChildren() into a single getTaxonomyTree() As for the API change - getPTA is being quietly - though with determination - swapped out. getChildren is all that is left. I'm not sure .getChildren() should be encapsulated in a TaxonomyTree object if it's the only API. Do you think other APIs should be put there as well? bq. I think that we should experiment here with a TaxoTree object which holds the children in a map The current issue is not yet done; there are still tests to consider and hiding PTA even further from the users' eyes. I feel that this issue is big enough to stand on its own, while the children map is also large enough? It would contain code for the iterations but also some code in the taxo-reopen phase, which will replace (hopefully) the PTA reopen code. 
> Taxonomy tree traversing improvement > > > Key: LUCENE-5316 > URL: https://issues.apache.org/jira/browse/LUCENE-5316 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gilad Barkai >Priority: Minor > Attachments: LUCENE-5316.patch, LUCENE-5316.patch > > > The taxonomy traversing is done today utilizing the > {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays > which hold for each ordinal it's (array #1) youngest child and (array #2) > older sibling. > This is a compact way of holding the tree information in memory, but it's not > perfect: > * Large (8 bytes per ordinal in memory) > * Exposes internal implementation > * Utilizing these arrays for tree traversing is not straight forward > * Lose reference locality while traversing (the array is accessed in > increasing only entries, but they may be distant from one another) > * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size) > This issue is about making the traversing more easy, the code more readable, > and open it for future improvements (i.e memory footprint and NRT cost) - > without changing any of the internals. > A later issue(s?) could be opened to address the gaps once this one is done. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5316: - Attachment: LUCENE-5316.patch Updated version; the iterator is now returned as {{null}} if the ordinal has no children. I think there are further improvements (a few {{if}}s and a loop check which could be avoided), but at least currently the .next() call is avoided as per Mike's suggestion. Hope this speeds things up a bit. > Taxonomy tree traversing improvement > > > Key: LUCENE-5316 > URL: https://issues.apache.org/jira/browse/LUCENE-5316 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gilad Barkai >Priority: Minor > Attachments: LUCENE-5316.patch, LUCENE-5316.patch > > > The taxonomy traversing is done today utilizing the > {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays > which hold for each ordinal it's (array #1) youngest child and (array #2) > older sibling. > This is a compact way of holding the tree information in memory, but it's not > perfect: > * Large (8 bytes per ordinal in memory) > * Exposes internal implementation > * Utilizing these arrays for tree traversing is not straight forward > * Lose reference locality while traversing (the array is accessed in > increasing only entries, but they may be distant from one another) > * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size) > This issue is about making the traversing more easy, the code more readable, > and open it for future improvements (i.e memory footprint and NRT cost) - > without changing any of the internals. > A later issue(s?) could be opened to address the gaps once this one is done. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810087#comment-13810087 ] Gilad Barkai commented on LUCENE-5316: -- I like the {{null}} for ordinals with no siblings. Made the change, and now I'm chasing all the NPEs that it caused; I hope to get a new patch up and about soon. As for allowing the taxo to say the depth for each dim - that's more tricky. Obviously, that's a temporary state, as every flat dimension can become non-flat. Also, figuring this out during search (more like, once per opening the taxo-reader) is O(taxoSize) in the current implementation. Perhaps, if we're willing to invest a little more time during indexing, we could "roll up" and tell the parents (say, in an incremental numeric field update) "how low can you go" with their children? In such a case, we could benefit not only from a flat dimension, but whenever an ordinal has no grandchildren. Investing during indexing will make it an O(1) operation. > Taxonomy tree traversing improvement > > > Key: LUCENE-5316 > URL: https://issues.apache.org/jira/browse/LUCENE-5316 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gilad Barkai >Priority: Minor > Attachments: LUCENE-5316.patch > > > The taxonomy traversing is done today utilizing the > {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays > which hold for each ordinal it's (array #1) youngest child and (array #2) > older sibling. 
> This is a compact way of holding the tree information in memory, but it's not > perfect: > * Large (8 bytes per ordinal in memory) > * Exposes internal implementation > * Utilizing these arrays for tree traversing is not straight forward > * Lose reference locality while traversing (the array is accessed in > increasing only entries, but they may be distant from one another) > * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size) > This issue is about making the traversing more easy, the code more readable, > and open it for future improvements (i.e memory footprint and NRT cost) - > without changing any of the internals. > A later issue(s?) could be opened to address the gaps once this one is done. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5316: - Attachment: LUCENE-5316.patch {{TaxonomyReader.getParallelTaxonomyArrays}} is now protected; the implementation is only on {{DirectoryTaxonomyReader}}, in which it is protected as well. Parallel arrays are only used in tests ATM - all tree traversing is done using {{TaxonomyReader.getChildren(int ordinal)}}, which is now abstract, and implemented in DirTaxoReader. Mike, if you could please run this patch against the benchmarking machine it would be awesome - as the direct array access is now replaced with a method call (the iterator's {{.next()}}), I hope we will not see any significant degradation. > Taxonomy tree traversing improvement > > > Key: LUCENE-5316 > URL: https://issues.apache.org/jira/browse/LUCENE-5316 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gilad Barkai >Priority: Minor > Attachments: LUCENE-5316.patch > > > The taxonomy traversing is done today utilizing the > {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays > which hold for each ordinal it's (array #1) youngest child and (array #2) > older sibling. > This is a compact way of holding the tree information in memory, but it's not > perfect: > * Large (8 bytes per ordinal in memory) > * Exposes internal implementation > * Utilizing these arrays for tree traversing is not straight forward > * Lose reference locality while traversing (the array is accessed in > increasing only entries, but they may be distant from one another) > * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size) > This issue is about making the traversing more easy, the code more readable, > and open it for future improvements (i.e memory footprint and NRT cost) - > without changing any of the internals. > A later issue(s?) could be opened to address the gaps once this one is done. 
[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808968#comment-13808968 ] Gilad Barkai commented on LUCENE-5316: -- Thanks Mike, you're right on the money. The RAM consumption is indeed an issue here. I'm not sure that the parent array is used during search at all... and perhaps it could be removed (looking into that one as well). The youngestChild/olderSibling arrays could and should be replaced with either a more compact RAM representation or, in extreme cases, even an on-disk one. For a better RAM representation, the idea of a map from ord -> int[] of its children is a start. In such a case, we benefit from not holding a 'youngestChild' int for each ordinal in a flat dimension - as they have no children. Second, we could benefit from locality of reference, as all the children are nearby and not spread over an array of millions. The non-huge flat dimensions will no longer suffer because of the other dimensions. Also, it would make the worst case of an NRT reopen the same as the current update (O(Taxo-size)), but it might be very small if only a few ordinals were added, as only their 'family' would be reallocated and managed, rather than the entire array. At a further phase, compression could be applied to that int[] of children - we know the children are in ascending order, so we could encode only the DGaps, figure out the largest DGap, and use packed ints instead of the int[]. It would add some (I hope) minor CPU consumption to the loop, but would benefit greatly when it comes to RAM consumption. I hope that all the logic could be encapsulated in the {{ChildrenIterator}} and the user will benefit from a clean API and better RAM utilization. I'll post a patch shortly, which covers the very first part - hiding the implementation detail of the children arrays (making TaxoReader.getParallelTaxoArrays protected to begin with), and moving {{TopKFacetResultHandler}} to use {{ChildrenIterator}}. 
Currently debugging some nasty loop-as-a-recursion related bug :) > Taxonomy tree traversing improvement > > > Key: LUCENE-5316 > URL: https://issues.apache.org/jira/browse/LUCENE-5316 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gilad Barkai >Priority: Minor > > The taxonomy traversing is done today utilizing the > {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays > which hold for each ordinal it's (array #1) youngest child and (array #2) > older sibling. > This is a compact way of holding the tree information in memory, but it's not > perfect: > * Large (8 bytes per ordinal in memory) > * Exposes internal implementation > * Utilizing these arrays for tree traversing is not straight forward > * Lose reference locality while traversing (the array is accessed in > increasing only entries, but they may be distant from one another) > * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size) > This issue is about making the traversing more easy, the code more readable, > and open it for future improvements (i.e memory footprint and NRT cost) - > without changing any of the internals. > A later issue(s?) could be opened to address the gaps once this one is done. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
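The DGap idea floated in the comment above can be sketched as follows: since a parent's children ordinals are ascending, storing gaps instead of absolute ordinals keeps every stored number small, so a packed-ints representation could use fewer bits per entry (sized by the largest gap rather than the largest ordinal). This is a plain-int illustration of the encoding step only, with made-up names, not the proposed implementation:

```java
import java.util.Arrays;

// Delta (DGap) encoding of an ascending children list: store the distance
// from the previous child instead of the absolute ordinal. Decoding is a
// running sum. With packed ints, bits-per-entry would be driven by the
// largest gap, which is typically much smaller than the largest ordinal.
public class DGapSketch {
    static int[] encodeGaps(int[] sortedChildren) {
        int[] gaps = new int[sortedChildren.length];
        int prev = 0;
        for (int i = 0; i < sortedChildren.length; i++) {
            gaps[i] = sortedChildren[i] - prev; // small numbers -> fewer bits
            prev = sortedChildren[i];
        }
        return gaps;
    }

    static int[] decodeGaps(int[] gaps) {
        int[] children = new int[gaps.length];
        int prev = 0;
        for (int i = 0; i < gaps.length; i++) {
            prev += gaps[i]; // running sum restores absolute ordinals
            children[i] = prev;
        }
        return children;
    }

    public static void main(String[] args) {
        int[] children = {100003, 100007, 100020};
        System.out.println(Arrays.toString(encodeGaps(children)));
        // gaps: [100003, 4, 13] -- after the first entry, the largest gap,
        // not the largest ordinal, determines the packed-ints bit width
    }
}
```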
[jira] [Created] (LUCENE-5316) Taxonomy tree traversing improvement
Gilad Barkai created LUCENE-5316: Summary: Taxonomy tree traversing improvement Key: LUCENE-5316 URL: https://issues.apache.org/jira/browse/LUCENE-5316 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Gilad Barkai Priority: Minor The taxonomy traversing is done today utilizing the {{ParallelTaxonomyArrays}} - in particular, two taxonomy-size {{int}} arrays which hold, for each ordinal, its (array #1) youngest child and (array #2) older sibling. This is a compact way of holding the tree information in memory, but it's not perfect: * Large (8 bytes per ordinal in memory) * Exposes internal implementation * Utilizing these arrays for tree traversing is not straightforward * Loses reference locality while traversing (the array is accessed at increasing-only entries, but they may be distant from one another) * In NRT, a reopen is always (not just in the worst case) done at O(Taxonomy-size) This issue is about making the traversing easier and the code more readable, and opening it up for future improvements (i.e. memory footprint and NRT cost) - without changing any of the internals. A later issue(s?) could be opened to address the gaps once this one is done. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
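For reference, traversing an ordinal's children via the two parallel arrays described in the issue works roughly like this; the array names and the INVALID_ORDINAL sentinel value of -1 are assumptions for illustration, not Lucene's exact fields:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of child traversal with the two parallel taxonomy arrays:
// children[ord] holds the youngest (last-added) child of ord, and
// siblings[child] holds that child's next-older sibling, so walking a
// parent's children is one array hop per child.
public class ParallelArraysTraversal {
    static final int INVALID_ORDINAL = -1;

    static List<Integer> childrenOf(int ord, int[] children, int[] siblings) {
        List<Integer> result = new ArrayList<>();
        // Start at the youngest child, then follow older-sibling links.
        for (int child = children[ord]; child != INVALID_ORDINAL; child = siblings[child]) {
            result.add(child);
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny taxonomy: root (ordinal 0) with children 1, 2, 3; 3 added last.
        int[] children = {3, INVALID_ORDINAL, INVALID_ORDINAL, INVALID_ORDINAL};
        int[] siblings = {INVALID_ORDINAL, INVALID_ORDINAL, 1, 2};
        System.out.println(childrenOf(0, children, siblings)); // [3, 2, 1]
    }
}
```

Note that the sibling hops can land far apart in the array, which is the reference-locality complaint the issue raises.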
[jira] [Comment Edited] (LUCENE-5271) A slightly more accurate SloppyMath distance
[ https://issues.apache.org/jira/browse/LUCENE-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791119#comment-13791119 ] Gilad Barkai edited comment on LUCENE-5271 at 10/10/13 2:30 AM: A proposed solution as per described. Please note it is _not_ ready for commit as it breaks one of the tests derived for ES/Solr (as a result of the improved accuracy). was (Author: gilad): A proposed solution as per described. Keep in mind that this is _not_ ready for commit as it breaks one of the tests derived for ES/Solr (as a result of the improved accuracy). > A slightly more accurate SloppyMath distance > > > Key: LUCENE-5271 > URL: https://issues.apache.org/jira/browse/LUCENE-5271 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/other >Reporter: Gilad Barkai >Priority: Minor > Attachments: LUCENE-5271.patch > > > SloppyMath, intriduced in LUCENE-5258, uses earth's avg. (according to WGS84) > ellipsoid radius as an approximation for computing the "spherical" distance. > (The TO_KILOMETERS constant). > While this is pretty accurate for long distances (latitude wise) this may > introduce some small errors while computing distances close to the equator > (as the earth radius there is larger than the avg.) > A more accurate approximation would be taking the avg. earth radius at the > source and destination points. But computing an ellipsoid radius at any given > point is a heavy function, and this distance should be used in a scoring > function.. So two optimizations are optional - > * Pre-compute a table with an earth radius per latitude (the longitude does > not affect the radius) > * Instead of using two point radius avg, figure out the avg. latitude > (exactly between the src and dst points) and get its radius. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5271) A slightly more accurate SloppyMath distance
[ https://issues.apache.org/jira/browse/LUCENE-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5271: - Attachment: LUCENE-5271.patch A proposed solution as per described. Keep in mind that this is _not_ ready for commit as it breaks one of the tests derived for ES/Solr (as a result of the improved accuracy). > A slightly more accurate SloppyMath distance > > > Key: LUCENE-5271 > URL: https://issues.apache.org/jira/browse/LUCENE-5271 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/other >Reporter: Gilad Barkai >Priority: Minor > Attachments: LUCENE-5271.patch > > > SloppyMath, intriduced in LUCENE-5258, uses earth's avg. (according to WGS84) > ellipsoid radius as an approximation for computing the "spherical" distance. > (The TO_KILOMETERS constant). > While this is pretty accurate for long distances (latitude wise) this may > introduce some small errors while computing distances close to the equator > (as the earth radius there is larger than the avg.) > A more accurate approximation would be taking the avg. earth radius at the > source and destination points. But computing an ellipsoid radius at any given > point is a heavy function, and this distance should be used in a scoring > function.. So two optimizations are optional - > * Pre-compute a table with an earth radius per latitude (the longitude does > not affect the radius) > * Instead of using two point radius avg, figure out the avg. latitude > (exactly between the src and dst points) and get its radius. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5271) A slightly more accurate SloppyMath distance
Gilad Barkai created LUCENE-5271: Summary: A slightly more accurate SloppyMath distance Key: LUCENE-5271 URL: https://issues.apache.org/jira/browse/LUCENE-5271 Project: Lucene - Core Issue Type: Improvement Components: modules/other Reporter: Gilad Barkai Priority: Minor SloppyMath, introduced in LUCENE-5258, uses earth's avg. (according to WGS84) ellipsoid radius as an approximation for computing the "spherical" distance (the TO_KILOMETERS constant). While this is pretty accurate for long distances (latitude-wise), it may introduce some small errors while computing distances close to the equator (as the earth radius there is larger than the avg.). A more accurate approximation would be taking the avg. earth radius at the source and destination points. But computing an ellipsoid radius at any given point is a heavy function, and this distance should be used in a scoring function. So two optimizations are possible: * Pre-compute a table with an earth radius per latitude (the longitude does not affect the radius) * Instead of using the two points' radius avg., figure out the avg. latitude (exactly between the src and dst points) and get its radius. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
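The two proposed optimizations can be sketched as follows, using the WGS84 semi-axes (a = 6378.137 km, b = 6356.752 km) and the standard geocentric-radius formula for an ellipsoid. The one-entry-per-degree table granularity and all names here are illustrative assumptions, not SloppyMath's code:

```java
// Sketch of both ideas from the issue: a precomputed radius-per-latitude
// table (longitude does not affect the radius), plus looking up the radius
// at the midpoint latitude between source and destination.
public class EarthRadiusTable {
    static final double A = 6378.137, B = 6356.752;   // WGS84 semi-axes, km
    static final double[] RADIUS = new double[91];    // one entry per degree of |latitude|

    static {
        for (int deg = 0; deg <= 90; deg++) {
            double phi = Math.toRadians(deg);
            double c = A * Math.cos(phi), s = B * Math.sin(phi);
            // Geocentric radius of the ellipsoid at latitude phi:
            // sqrt(((a^2 cos)^2 + (b^2 sin)^2) / ((a cos)^2 + (b sin)^2))
            RADIUS[deg] = Math.sqrt((A * A * c * c + B * B * s * s) / (c * c + s * s));
        }
    }

    // Radius at the latitude halfway between source and destination.
    static double midpointRadius(double srcLat, double dstLat) {
        int deg = (int) Math.round(Math.abs((srcLat + dstLat) / 2));
        return RADIUS[Math.min(deg, 90)];
    }

    public static void main(String[] args) {
        System.out.println(RADIUS[0]);              // ~6378.137 (equator)
        System.out.println(RADIUS[90]);             // ~6356.752 (poles)
        System.out.println(midpointRadius(10, 20)); // radius at 15 degrees
    }
}
```

The table costs 91 doubles once, so the per-distance work drops to a rounding and an array lookup instead of trigonometry per call.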
[jira] [Commented] (LUCENE-5155) Add OrdinalValueResolver in favor of FacetRequest.getValueOf
[ https://issues.apache.org/jira/browse/LUCENE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726262#comment-13726262 ] Gilad Barkai commented on LUCENE-5155: -- Patch looks good. +1 for commit. Perhaps also document that FRNode is now comparable? > Add OrdinalValueResolver in favor of FacetRequest.getValueOf > > > Key: LUCENE-5155 > URL: https://issues.apache.org/jira/browse/LUCENE-5155 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera > Attachments: LUCENE-5155.patch > > > FacetRequest.getValueOf is responsible for resolving an ordinal's value. It > is given FacetArrays, and typically does something like > {{arrays.getIntArray()[ord]}} -- for every ordinal! The purpose of this > method is to allow special requests, e.g. average, to do some post processing > on the values, that couldn't be done during aggregation. > I feel that getValueOf is in the wrong place -- the calls to > getInt/FloatArray are really redundant. Also, if an aggregator maintains some > statistics by which it needs to "correct" the aggregated values, it's not > trivial to pass it from the aggregator to the request. > Therefore I would like to make the following changes: > * Remove FacetRequest.getValueOf and .getFacetArraysSource > * Add FacetsAggregator.createOrdinalValueResolver which takes the FacetArrays > and has a simple API .valueOf(ordinal). > * Modify the FacetResultHandlers to use OrdValResolver. > This allows an OVR to initialize the right array instance(s) in the ctor, and > return the value of the requested ordinal, without doing arrays.getArray() > calls. > Will post a patch shortly. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5016) Sampling can break FacetResult labeling
[ https://issues.apache.org/jira/browse/LUCENE-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670200#comment-13670200 ] Gilad Barkai commented on LUCENE-5016: -- Patch looks good. +1 for commit > Sampling can break FacetResult labeling > > > Key: LUCENE-5016 > URL: https://issues.apache.org/jira/browse/LUCENE-5016 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 4.3 >Reporter: Rob Audenaerde >Assignee: Shai Erera >Priority: Minor > Attachments: LUCENE-5016.patch, test-labels.zip > > > When sampling FacetResults, the TopKInEachNodeHandler is used to get the > FacetResults. > This is my case: > A FacetResult is returned (which matches a FacetRequest) from the > StandardFacetAccumulator. The facet has 0 results. The labelling of the > root-node seems incorrect. I know, from the StandardFacetAccumulator, that > the rootnode has a label, so I can use that one. > Currently the recursivelyLabel method uses the taxonomyReader.getPath() to > retrieve the label. I think we can skip that for the rootNode when there are > no children (and gain a little performance on the way too?) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator
[ https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5015: - Attachment: LUCENE-5015.patch Shai, I think you're right, a null {{SampleFixer}} makes more sense. While working on a test which validates that a flow works with the {{null}} fixer, I found that it did not. The reason is Complements. By default, complements kick in when enough results are found. I think this may hold the key to the performance differences as well. Rob, could you please try the following code and report the results? {code} SamplingAccumulator accumulator = new SamplingAccumulator( new RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo); // Make sure no complements are in action accumulator.setComplementThreshold(StandardFacetsAccumulator.DISABLE_COMPLEMENT); facetsCollector = FacetsCollector.create(accumulator); {code} In the meantime, I made the changes to the patch and added a test for the {{null}} fixer. > Unexpected performance difference between SamplingAccumulator and > StandardFacetAccumulator > -- > > Key: LUCENE-5015 > URL: https://issues.apache.org/jira/browse/LUCENE-5015 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 4.3 >Reporter: Rob Audenaerde >Assignee: Shai Erera >Priority: Minor > Attachments: LUCENE-5015.patch, LUCENE-5015.patch, LUCENE-5015.patch, > LUCENE-5015.patch, LUCENE-5015.patch > > > I have an unexpected performance difference between the SamplingAccumulator > and the StandardFacetAccumulator. > The case is an index with about 5M documents and each document containing > about 10 fields. I created a facet on each of those fields. When searching to > retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is > about twice as fast as the StandardFacetAccumulator. This is expected and a > nice speed-up. 
> However, when I use more CountFacetRequests to retrieve facet-counts for more > than one field, the speeds of the SampingAccumulator decreases, to the point > where the StandardFacetAccumulator is faster. > {noformat} > FacetRequests SamplingStandard > 1 391 ms 1100 ms > 2 531 ms 1095 ms > 3 948 ms 1108 ms > 4 1400 ms 1110 ms > 5 1901 ms 1102 ms > {noformat} > Is this behaviour normal? I did not expect it, as the SamplingAccumulator > needs to do less work? > Some code to show what I do: > {code} > searcher.search( facetsQuery, facetsCollector ); > final List collectedFacets = > facetsCollector.getFacetResults(); > {code} > {code} > final FacetSearchParams facetSearchParams = new FacetSearchParams( > facetRequests ); > FacetsCollector facetsCollector; > if ( isSampled ) > { > facetsCollector = > FacetsCollector.create( new SamplingAccumulator( new > RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) ); > } > else > { > facetsCollector = FacetsCollector.create( FacetsAccumulator.create( > facetSearchParams, searcher.getIndexReader(), taxo ) ); > {code} > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator
[ https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5015: - Attachment: LUCENE-5015.patch True, looking at overSampleFactor is enough, but it's not obvious that TakmiFixer should be used with overSampleFactor > 1 to improve the chances of the resulting top-k being accurate. I'll add some documentation w.r.t. this issue; I hope it will do. The new patch defaults to {{NoopSampleFixer}}, which does not touch the results at all - if the need is only for a top-k and the counts do not matter, this is the least expensive one. Also, if instead of counts a percentage should be displayed (as in how much of the results match this category), the sampled value out of the sample size would yield the same result as the amortized fixed value out of the actual result set size. That might render the amortized fixer moot. The new patch also accounts for a {{SampleFixer}} being set in {{SamplingParams}}. > Unexpected performance difference between SamplingAccumulator and > StandardFacetAccumulator > -- > > Key: LUCENE-5015 > URL: https://issues.apache.org/jira/browse/LUCENE-5015 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 4.3 >Reporter: Rob Audenaerde >Assignee: Shai Erera >Priority: Minor > Attachments: LUCENE-5015.patch, LUCENE-5015.patch, LUCENE-5015.patch, > LUCENE-5015.patch > > > I have an unexpected performance difference between the SamplingAccumulator > and the StandardFacetAccumulator. > The case is an index with about 5M documents and each document containing > about 10 fields. I created a facet on each of those fields. When searching to > retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is > about twice as fast as the StandardFacetAccumulator. This is expected and a > nice speed-up. 
> However, when I use more CountFacetRequests to retrieve facet-counts for more > than one field, the speeds of the SampingAccumulator decreases, to the point > where the StandardFacetAccumulator is faster. > {noformat} > FacetRequests SamplingStandard > 1 391 ms 1100 ms > 2 531 ms 1095 ms > 3 948 ms 1108 ms > 4 1400 ms 1110 ms > 5 1901 ms 1102 ms > {noformat} > Is this behaviour normal? I did not expect it, as the SamplingAccumulator > needs to do less work? > Some code to show what I do: > {code} > searcher.search( facetsQuery, facetsCollector ); > final List collectedFacets = > facetsCollector.getFacetResults(); > {code} > {code} > final FacetSearchParams facetSearchParams = new FacetSearchParams( > facetRequests ); > FacetsCollector facetsCollector; > if ( isSampled ) > { > facetsCollector = > FacetsCollector.create( new SamplingAccumulator( new > RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) ); > } > else > { > facetsCollector = FacetsCollector.create( FacetsAccumulator.create( > facetSearchParams, searcher.getIndexReader(), taxo ) ); > {code} > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
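The percentage argument in the comment above can be checked with trivial arithmetic. The numbers and method names here are hypothetical, purely for illustration: dividing the sampled count by the sample size gives the same percentage as dividing the amortized-fixed count by the full result-set size, because the sampling ratio cancels out.

```java
// Hypothetical illustration: for percentage display, no count fixing is
// needed - the sampling ratio cancels out of the division.
class SampledPercentage {
  static double pctFromSample(long sampledCount, long sampleSize) {
    return 100.0 * sampledCount / sampleSize;
  }

  static double pctFromAmortized(long sampledCount, long sampleSize, long resultSetSize) {
    // Amortized fix: scale the sampled count by the sampling ratio...
    double fixedCount = sampledCount * ((double) resultSetSize / sampleSize);
    // ...then the share of the full result set is unchanged.
    return 100.0 * fixedCount / resultSetSize;
  }
}
```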
[jira] [Updated] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator
[ https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5015: - Attachment: LUCENE-5015.patch Shai, thanks for the comments. The first three points are taken care of in the new patch. As for SamplingParams taking a SampleFixer, it's a good idea, and I've been there, but it makes it harder for e.g. the {{SamplingAccumulator}} to figure out whether to oversample - and trim - for this SampleFixer. It would then move this logic into the SampleFixer. That's not bad, but it changes the API a bit more; also, the name SampleFixer would no longer match the functionality (perhaps it should oversample and trim itself?) > Unexpected performance difference between SamplingAccumulator and > StandardFacetAccumulator > -- > > Key: LUCENE-5015 > URL: https://issues.apache.org/jira/browse/LUCENE-5015 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 4.3 >Reporter: Rob Audenaerde >Assignee: Shai Erera >Priority: Minor > Attachments: LUCENE-5015.patch, LUCENE-5015.patch, LUCENE-5015.patch > > > I have an unexpected performance difference between the SamplingAccumulator > and the StandardFacetAccumulator. > The case is an index with about 5M documents and each document containing > about 10 fields. I created a facet on each of those fields. When searching to > retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is > about twice as fast as the StandardFacetAccumulator. This is expected and a > nice speed-up. > However, when I use more CountFacetRequests to retrieve facet-counts for more > than one field, the speed of the SamplingAccumulator decreases, to the point > where the StandardFacetAccumulator is faster. > {noformat} > FacetRequests Sampling Standard > 1 391 ms 1100 ms > 2 531 ms 1095 ms > 3 948 ms 1108 ms > 4 1400 ms 1110 ms > 5 1901 ms 1102 ms > {noformat} > Is this behaviour normal? 
I did not expect it, as the SamplingAccumulator > needs to do less work? > Some code to show what I do: > {code} > searcher.search( facetsQuery, facetsCollector ); > final List collectedFacets = > facetsCollector.getFacetResults(); > {code} > {code} > final FacetSearchParams facetSearchParams = new FacetSearchParams( > facetRequests ); > FacetsCollector facetsCollector; > if ( isSampled ) > { > facetsCollector = > FacetsCollector.create( new SamplingAccumulator( new > RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) ); > } > else > { > facetsCollector = FacetsCollector.create( FacetsAccumulator.create( > facetSearchParams, searcher.getIndexReader(), taxo ) ); > {code} > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator
[ https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5015: - Attachment: LUCENE-5015.patch Older patch was against trunk/lucene/facet. This one is rooted with trunk. > Unexpected performance difference between SamplingAccumulator and > StandardFacetAccumulator > -- > > Key: LUCENE-5015 > URL: https://issues.apache.org/jira/browse/LUCENE-5015 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 4.3 >Reporter: Rob Audenaerde >Priority: Minor > Attachments: LUCENE-5015.patch, LUCENE-5015.patch > > > I have an unexpected performance difference between the SamplingAccumulator > and the StandardFacetAccumulator. > The case is an index with about 5M documents and each document containing > about 10 fields. I created a facet on each of those fields. When searching to > retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is > about twice as fast as the StandardFacetAccumulator. This is expected and a > nice speed-up. > However, when I use more CountFacetRequests to retrieve facet-counts for more > than one field, the speeds of the SampingAccumulator decreases, to the point > where the StandardFacetAccumulator is faster. > {noformat} > FacetRequests SamplingStandard > 1 391 ms 1100 ms > 2 531 ms 1095 ms > 3 948 ms 1108 ms > 4 1400 ms 1110 ms > 5 1901 ms 1102 ms > {noformat} > Is this behaviour normal? I did not expect it, as the SamplingAccumulator > needs to do less work? 
> Some code to show what I do: > {code} > searcher.search( facetsQuery, facetsCollector ); > final List collectedFacets = > facetsCollector.getFacetResults(); > {code} > {code} > final FacetSearchParams facetSearchParams = new FacetSearchParams( > facetRequests ); > FacetsCollector facetsCollector; > if ( isSampled ) > { > facetsCollector = > FacetsCollector.create( new SamplingAccumulator( new > RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) ); > } > else > { > facetsCollector = FacetsCollector.create( FacetsAccumulator.create( > facetSearchParams, searcher.getIndexReader(), taxo ) ); > {code} > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator
[ https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-5015: - Attachment: LUCENE-5015.patch Added a parameter to {{SamplingParams}} named {{fixToExact}}, which defaults to {{false}}. I think it is probable that one who uses sampling may not be interested in exact results. In the proposed approach, the {{Sampler}} creates the old, slow, accurate {{TakmiSampleFixer}} if {{SamplingParams.shouldFixToExact()}} is {{true}}; otherwise the much (much!) faster {{AmortizedSampleFixer}} is used, which only takes the sampling ratio into account, assuming the sampled set represents the whole set with 100% accuracy. With these changes, the code above should already use the amortized fixer, as it is now the default. If the old fixer is to be used - for comparison - the code could look as follows: {code} final FacetSearchParams facetSearchParams = new FacetSearchParams( facetRequests ); FacetsCollector facetsCollector; if ( isSampled ) { // Create SamplingParams which denote fixing to exact counts SamplingParams samplingParams = new SamplingParams(); samplingParams.setFixToExact(true); // Use the custom sampling params while creating the RandomSampler facetsCollector = FacetsCollector.create( new SamplingAccumulator( new RandomSampler(samplingParams, new Random(someSeed)), facetSearchParams, searcher.getIndexReader(), taxo ) ); } else { facetsCollector = FacetsCollector.create( FacetsAccumulator.create( facetSearchParams, searcher.getIndexReader(), taxo ) ); } {code} The sampling tests still use the "exact" fixer, as it is not easy to assert against amortized results. I'm still looking into creating a complete faceted-search flow test with the amortized fixer. 
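Assuming the amortized fixer works as described - scaling each sampled count by the sampling ratio - its core could look roughly like this. A sketch only, with made-up names, not the attached patch; the cap guards against the case where a random sample comes out slightly larger than the requested rate implies, which would otherwise let adjusted counts exceed the true number of hits.

```java
// Rough sketch of an amortized fixer (class/method names are mine):
// scale each sampled count by the actual post-sampling ratio, capped so
// a fixed count can never exceed the true number of hits.
class AmortizedFixerSketch {
  static int[] fix(int[] sampledCounts, int sampleSize, int totalResults) {
    // Use the actual sample size, not the requested sampling rate, since
    // a random sample may be a bit larger than requested.
    double ratio = (double) totalResults / sampleSize;
    int[] fixed = new int[sampledCounts.length];
    for (int i = 0; i < sampledCounts.length; i++) {
      fixed[i] = (int) Math.min(Math.round(sampledCounts[i] * ratio), totalResults);
    }
    return fixed;
  }
}
```

This does no per-candidate drill-down counting at all, which is why it should be much cheaper than the exact Takmi fixing, at the price of amortized (approximate) counts.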
> Unexpected performance difference between SamplingAccumulator and > StandardFacetAccumulator > -- > > Key: LUCENE-5015 > URL: https://issues.apache.org/jira/browse/LUCENE-5015 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 4.3 >Reporter: Rob Audenaerde >Priority: Minor > Attachments: LUCENE-5015.patch > > > I have an unexpected performance difference between the SamplingAccumulator > and the StandardFacetAccumulator. > The case is an index with about 5M documents and each document containing > about 10 fields. I created a facet on each of those fields. When searching to > retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is > about twice as fast as the StandardFacetAccumulator. This is expected and a > nice speed-up. > However, when I use more CountFacetRequests to retrieve facet-counts for more > than one field, the speeds of the SampingAccumulator decreases, to the point > where the StandardFacetAccumulator is faster. > {noformat} > FacetRequests SamplingStandard > 1 391 ms 1100 ms > 2 531 ms 1095 ms > 3 948 ms 1108 ms > 4 1400 ms 1110 ms > 5 1901 ms 1102 ms > {noformat} > Is this behaviour normal? I did not expect it, as the SamplingAccumulator > needs to do less work? > Some code to show what I do: > {code} > searcher.search( facetsQuery, facetsCollector ); > final List collectedFacets = > facetsCollector.getFacetResults(); > {code} > {code} > final FacetSearchParams facetSearchParams = new FacetSearchParams( > facetRequests ); > FacetsCollector facetsCollector; > if ( isSampled ) > { > facetsCollector = > FacetsCollector.create( new SamplingAccumulator( new > RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) ); > } > else > { > facetsCollector = FacetsCollector.create( FacetsAccumulator.create( > facetSearchParams, searcher.getIndexReader(), taxo ) ); > {code} > -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator
[ https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665018#comment-13665018 ] Gilad Barkai commented on LUCENE-5015: -- Sampling, with its defaults, has its toll. By default, sampling aims to produce the exact top-K results for each request, as if a {{StandardFacetAccumulator}} had been used - meaning it aims at producing the same top-K with the same counts. The process begins with sampling the result set and computing the top-*cK* candidates for each of the *M* facet requests, producing amortized results. That part is faster than {{StandardFacetAccumulator}} because fewer documents' facet information gets processed. The next part is the "fixing", using a {{SampleFixer}} retrieved from a {{Sampler}}, in which "fixed" counts are produced that correlate with the original document result set rather than the sampled one. The default (and currently only) implementation of such a fixer is {{TakmiSampleFixer}}, which produces _exact_ counts for each of the *cK* candidates of each of the *M* facet requests. The counts are not computed against the facet information of each document, but rather by matching the skiplist of each candidate category's drill-down term against the bitset of the (actual) document results; the number of matches is the count. This is equivalent to a total-hit collector over a drill-down query for the candidate category on top of the original query. There's a tipping point at which not sampling is faster than sampling and then fixing using *c* x *K* x *M* skiplist matches against the bitset representing the document results. *c* defaults to 2 (see overSampleFactor in SamplingParams); over-sampling (a.k.a *c*) is important for exact counts, as it is conceivable that the accuracy of a sampled top-K is not 100%, but according to some measures we once ran it is very likely that the true top-K results are within the sampled *2K* results. 
Fixing those 2K with their actual counts and re-sorting them accordingly yields a much more accurate top-K. E.g. requesting 5 count requests for top-10 with an overSampleFactor of 2 results in 100 skiplist matches against the document-results bitset. If amortized results suffice, a different {{SampleFixer}} could be coded which, e.g., amortizes the true count from the sampling ratio: if category C got a count of 3, and the sample was of 1,000 results out of 1,000,000, then the "AmortizedSampleFixer" would fix the count of C to be 3,000. Such fixing is very fast, and the overSampleFactor should be set to 1.0. Edit: I now see that it is not that easy to code a different SampleFixer, nor to get it the information needed for the amortized result fixing suggested above. I'll try to open up the API some and make it more convenient. > Unexpected performance difference between SamplingAccumulator and > StandardFacetAccumulator > -- > > Key: LUCENE-5015 > URL: https://issues.apache.org/jira/browse/LUCENE-5015 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 4.3 >Reporter: Rob Audenaerde >Priority: Minor > > I have an unexpected performance difference between the SamplingAccumulator > and the StandardFacetAccumulator. > The case is an index with about 5M documents and each document containing > about 10 fields. I created a facet on each of those fields. When searching to > retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is > about twice as fast as the StandardFacetAccumulator. This is expected and a > nice speed-up. > However, when I use more CountFacetRequests to retrieve facet-counts for more > than one field, the speed of the SamplingAccumulator decreases, to the point > where the StandardFacetAccumulator is faster. > {noformat} > FacetRequests Sampling Standard > 1 391 ms 1100 ms > 2 531 ms 1095 ms > 3 948 ms 1108 ms > 4 1400 ms 1110 ms > 5 1901 ms 1102 ms > {noformat} > Is this behaviour normal? 
I did not expect it, as the SamplingAccumulator > needs to do less work? > Some code to show what I do: > {code} > searcher.search( facetsQuery, facetsCollector ); > final List collectedFacets = > facetsCollector.getFacetResults(); > {code} > {code} > final FacetSearchParams facetSearchParams = new FacetSearchParams( > facetRequests ); > FacetsCollector facetsCollector; > if ( isSampled ) > { > facetsCollector = > FacetsCollector.create( new SamplingAccumulator( new > RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) ); > } > else > { > facetsCollector = FacetsCollector.create
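The cost of the exact fixing step described above is simple to work out: *c* x *K* x *M* drill-down matches against the result bitset, and that cost grows linearly with the number of facet requests - which fits the reported timings. A tiny illustration (names are mine, not the Lucene API):

```java
// The exact ("Takmi") fixing step costs c x K x M drill-down term matches
// against the result bitset: overSampleFactor x top-k size x facet requests.
class FixingCost {
  static int drillDownMatches(double overSampleFactor, int topK, int numRequests) {
    return (int) (overSampleFactor * topK * numRequests); // c * K * M
  }
}
```

With the defaults from the comment (c = 2, top-10, 5 requests) this is 100 drill-down counts per query, which explains why sampling loses its advantage as requests are added.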
[jira] [Commented] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator
[ https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664976#comment-13664976 ] Gilad Barkai commented on LUCENE-5015: -- Hello Rob, Indeed that looks unexpected. The immediate suspect is the "fixing" part of the sampling, where after sampled top-cK are computed for each facet request, each of the candidates for top-K gets a real count computation, rather than a count over the sampled set of results. How many results are in the result set? All the documents? > Unexpected performance difference between SamplingAccumulator and > StandardFacetAccumulator > -- > > Key: LUCENE-5015 > URL: https://issues.apache.org/jira/browse/LUCENE-5015 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 4.3 >Reporter: Rob Audenaerde >Priority: Minor > > I have an unexpected performance difference between the SamplingAccumulator > and the StandardFacetAccumulator. > The case is an index with about 5M documents and each document containing > about 10 fields. I created a facet on each of those fields. When searching to > retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is > about twice as fast as the StandardFacetAccumulator. This is expected and a > nice speed-up. > However, when I use more CountFacetRequests to retrieve facet-counts for more > than one field, the speeds of the SampingAccumulator decreases, to the point > where the StandardFacetAccumulator is faster. > {noformat} > FacetRequests SamplingStandard > 1 391 ms 1100 ms > 2 531 ms 1095 ms > 3 948 ms 1108 ms > 4 1400 ms 1110 ms > 5 1901 ms 1102 ms > {noformat} > Is this behaviour normal? I did not expect it, as the SamplingAccumulator > needs to do less work? 
> Some code to show what I do: > {code} > searcher.search( facetsQuery, facetsCollector ); > final List collectedFacets = > facetsCollector.getFacetResults(); > {code} > {code} > final FacetSearchParams facetSearchParams = new FacetSearchParams( > facetRequests ); > FacetsCollector facetsCollector; > if ( isSampled ) > { > facetsCollector = > FacetsCollector.create( new SamplingAccumulator( new > RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) ); > } > else > { > facetsCollector = FacetsCollector.create( FacetsAccumulator.create( > facetSearchParams, searcher.getIndexReader(), taxo ) ); > {code} > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573590#comment-13573590 ] Gilad Barkai edited comment on LUCENE-4609 at 2/7/13 3:30 PM: -- Finally figured out I was doing things completely wrong.. instead of having super smart optimizing code for a semi-packed encoder - there's now a straightforward semi-packed encoder: values smaller than 256 use only one byte (the value itself), and larger values are encoded as a VInt plus a leading zero byte. Worst case it can be 6 bytes per value (zero marker + 5 of VInt). The idea is to pay the variable-length-encoding penalty only for large values, which should be less common in a sort-uniq-dgap scenario. Wrote two versions, with and without dgap specialization, though I'm not sure how useful the non-specialized code is. I do not currently have the means to run the LuceneUtil (nor the wikipedia index with the categories) - but I ran the {{EncodingSpeed}} test - and was surprised. While the encoding is a little worse than (or on par with) dgap-vint, the decoding speed is significantly faster. The new encoder is the only (?!) encoder to beat {{SimpleIntEncoder}} (which writes plain 32 bits per value) in decoding time. The values used in EncodingSpeed are from a real scenario, but I'm not sure how well they represent a "common" case (e.g. wikipedia). Mike - could you please try this encoder? I guess it only makes sense to run the specialized {{DGapSemiPackedEncoder}}. Also, I'm not sure {{SimpleIntEncoder}} was ever used (without any sorting, or unique). It would be interesting to test it as well. We would pay in more I/O and a much larger file size (~4 times larger..) but it doesn't mean it will be any slower. Here are the results of the EncodingSpeed test: {noformat} Estimating ~1 Integers compression time by Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times. 
Encoder                                          Bits/Int   Encode Time   Encode Time        Decode Time   Decode Time
                                                            [millisec]    [microsec / int]   [millisec]    [microsec / int]
----------------------------------------------------------------------------------------------------------------------
Simple                                           32.0000    190           1.9000             165           1.6500
VInt8                                            18.4955    436           4.3600             359           3.5900
Sorting(Unique(VInt8))                           18.4955    3557          35.5702            314           3.1400
Sorting(Unique(DGap(VInt8)))                     8.5597     3485          34.8502            270           2.7000
Sorting(Unique(DGapVInt8))                       8.5597     3434          34.3402            192           1.9200
Sorting(Unique(DGap(SemiPacked)))                8.6453     3386          33.8602            156           1.5600
Sorting(Unique(DGapSemiPacked))                  8.6453     3397          33.9702            99            0.9900
Sorting(Unique(DGap(EightFlags(VInt))))          4.9679     4002          40.0203            381           3.8100
Sorting(Unique(DGap(FourFlags(VInt))))           4.8198     3972          39.7203            399           3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))  4.5794   4448          44.4803            645           6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))  4.5794   4461          44.6103            641           6.4100

Estimating ~1 Integers compression time by Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.

Encoder                                          Bits/Int   Encode Time   Encode Time        Decode Time   Decode Time
                                                            [millisec]    [microsec / int]   [millisec]    [microsec / int]
---
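From the description above (one byte for values below 256, a leading zero byte plus VInt otherwise), the scheme can be sketched as follows. This is an assumed reconstruction, not the attached patch: after sort-unique-dgap every value is >= 1, which is what frees the zero byte to act as the marker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Assumed reconstruction of the semi-packed scheme: gaps in 1..255 take a
// single byte holding the value itself; larger values get a 0x00 marker
// byte followed by a plain VInt (worst case: marker + 5 VInt bytes = 6).
class SemiPackedSketch {
  static void encode(int value, ByteArrayOutputStream out) {
    if (value > 0 && value < 256) {
      out.write(value);                    // common case: one byte
    } else {
      out.write(0);                        // marker: a VInt follows
      while ((value & ~0x7F) != 0) {       // 7 data bits per byte,
        out.write((value & 0x7F) | 0x80);  // high bit = more bytes follow
        value >>>= 7;
      }
      out.write(value);
    }
  }

  static int decode(ByteArrayInputStream in) {
    int first = in.read();
    if (first != 0) {
      return first;                        // single-byte case
    }
    int b = in.read();
    int value = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = in.read();
      value |= (b & 0x7F) << shift;
    }
    return value;
  }
}
```

Decoding the one-byte common case is a single read with no bit twiddling, which matches the measured decode speedup over the pure-VInt encoders.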
[jira] [Comment Edited] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573590#comment-13573590 ] Gilad Barkai edited comment on LUCENE-4609 at 2/7/13 3:29 PM: -- Finally figured out I was doing things completely wrong.. instead of having a super smart optimizing code for semi-packed encoder - there's now a strait forward semi-packed encoder: Values smaller than 256 use only one byte (the value itself) and larger values are encoded as VInt plus a leading zero byte. Worst case it can be 6 bytes per value (zero marker + 5 of VInt). The idea, is to pay the penalty for variable length encoding for large values which should be less common in a sort-uniq-dgap scenario. Wrote two versions w and w/o dgap specialization, though I'm not sure how useful is the non-specialized code. I do not currently have the means to run the LuceneUtil (nor the wikipedia index with the categories) - but I ran the {EncodingSpeed} test - and was surprised. While the encoding is on a little worse (or on par) with dgap-vint, the decoding speed is significantly faster. The new encode is the only (?!) encoder to beat {SimpleIntEncoder} (which writes plain 32 bits per value) in decoding time. Those values being used in the EncodingSpeed are real scenario, but I'm not sure how much they represent a "common" case (e.g wikipedia). Mike - could you please try this encoder? I guess it only makes sense to run the specialized {DGapSemiPackedEncoder}. Also, I'm not sure SimpleIntEncoder was ever used (without any sorting, or unique). It would be interesting to test it as well. We will pay in more I/O and much larger file size (~4 times larger..) but it doesn't mean it will be any slower. Here are the results of the EncodingSpeed test: {noformat} Estimating ~1 Integers compression time by Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times. 
Encoder                                            Bits/Int   Encode Time [ms]  Encode Time [us/int]  Decode Time [ms]  Decode Time [us/int]
--------------------------------------------------------------------------------------------------------------------------------------------
Simple                                             32.        190               1.9000                165               1.6500
VInt8                                              18.4955    436               4.3600                359               3.5900
Sorting(Unique(VInt8))                             18.4955    3557              35.5702               314               3.1400
Sorting(Unique(DGap(VInt8)))                       8.5597     3485              34.8502               270               2.7000
Sorting(Unique(DGapVInt8))                         8.5597     3434              34.3402               192               1.9200
Sorting(Unique(DGap(SemiPacked)))                  8.6453     3386              33.8602               156               1.5600
Sorting(Unique(DGapSemiPacked))                    8.6453     3397              33.9702               99                0.9900
Sorting(Unique(DGap(EightFlags(VInt))))            4.9679     4002              40.0203               381               3.8100
Sorting(Unique(DGap(FourFlags(VInt))))             4.8198     3972              39.7203               399               3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))  4.5794     4448              44.4803               645               6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))  4.5794     4461              44.6103               641               6.4100

Estimating ~1 Integers compression time by Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.

Encoder                                            Bits/Int   Encode Time [ms]  Encode Time [us/int]  Decode Time [ms]  Decode Time [us/int]
--------------------------------------------------------------------------------------------------------------------------------------------
{noformat}
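The byte-level scheme described in the comment - one byte for values under 256, a zero marker byte plus a VInt for larger ones - can be sketched as follows. This is an illustrative reconstruction, not the attached patch: {{SemiPackedSketch}}, {{encode}} and {{decode}} are hypothetical names, and it assumes the input stream contains no zeros (as in a sort-uniq-dgap stream).

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

public class SemiPackedSketch {

  // Standard 7-bits-per-byte VInt; the high bit marks continuation.
  static void writeVInt(ByteArrayOutputStream out, int v) {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }

  public static byte[] encode(int[] values) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int v : values) {
      if (v > 0 && v < 256) {
        out.write(v);   // small value: one byte, the value itself
      } else {
        out.write(0);   // zero marker byte: a VInt follows
        writeVInt(out, v);
      }
    }
    return out.toByteArray();
  }

  public static int[] decode(byte[] buf) {
    List<Integer> values = new ArrayList<>();
    int i = 0;
    while (i < buf.length) {
      int b = buf[i++] & 0xFF;
      if (b != 0) {
        values.add(b);  // small value stored directly
      } else {
        int v = 0, shift = 0, next;
        do {            // read the VInt that follows the marker
          next = buf[i++] & 0xFF;
          v |= (next & 0x7F) << shift;
          shift += 7;
        } while ((next & 0x80) != 0);
        values.add(v);
      }
    }
    return values.stream().mapToInt(Integer::intValue).toArray();
  }
}
```

The worst case matches the comment's arithmetic: a value needing a 5-byte VInt costs 6 bytes in total, marker included.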
[jira] [Updated] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4609:
-
Attachment: SemiPackedEncoder.patch
[jira] [Commented] (LUCENE-4748) Add DrillSideways helper class to Lucene facets module
[ https://issues.apache.org/jira/browse/LUCENE-4748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569780#comment-13569780 ] Gilad Barkai commented on LUCENE-4748: -- Great idea! Since drill-down followed by drill-sideways is a sort of (re)filtering over the original result set, perhaps the query result (say ScoredDocIds) could be passed through rather than re-evaluating the Query? IIRC the scores should not change during drill-down (and sideways as well), so without re-evaluating this could perhaps save some juice? > Add DrillSideways helper class to Lucene facets module > -- > > Key: LUCENE-4748 > URL: https://issues.apache.org/jira/browse/LUCENE-4748 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.2, 5.0 > > Attachments: LUCENE-4748.patch, LUCENE-4748.patch > > > This came out of a discussion on the java-user list with subject > "Faceted search in OR": http://markmail.org/thread/jmnq6z2x7ayzci5k > The basic idea is to count "near misses" during collection, ie > documents that matched the main query and also all except one of the > drill down filters. > Drill sideways makes for a very nice faceted search UI because you > don't "lose" the facet counts after drilling in. Eg maybe you do a > search for "cameras", and you see facets for the manufacturer, so you > drill into "Nikon". > With drill sideways, even after drilling down, you'll still get the > counts for all the other brands, where each count tells you how many > hits you'd get if you changed to a different manufacturer. > This becomes more fun if you add further drill-downs, eg maybe I next drill > down into Resolution=10 megapixels", and then I can see how many 10 > megapixel cameras all other manufacturers, and what other resolutions > Nikon cameras offer. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
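The "near miss" rule described in the issue - a document counts toward dimension d's sideways facets when d's drill-down filter is the only one it fails - can be sketched with a toy classifier (not Lucene's actual DrillSideways collector; the class and method names are hypothetical):

```java
public class DrillSidewaysSketch {

  /**
   * Classify a document that matched the main query, given which
   * drill-down filters it passes:
   *   -1  -> full hit (passes every drill-down filter)
   *    d  -> near miss on dimension d (the only filter it fails),
   *          so it contributes to d's sideways facet counts
   *   -2  -> fails two or more filters: discarded entirely
   */
  public static int classify(boolean[] passes) {
    int failed = -1;
    for (int d = 0; d < passes.length; d++) {
      if (!passes[d]) {
        if (failed != -1) {
          return -2;  // second miss: not a hit and not a near miss
        }
        failed = d;
      }
    }
    return failed;
  }
}
```

This is why the sideways counts survive drilling in: a camera that is not a Nikon but matches everything else still increments the manufacturer dimension's counts.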
[jira] [Commented] (LUCENE-4715) Add OrdinalPolicy.ALL_BUT_DIMENSION
[ https://issues.apache.org/jira/browse/LUCENE-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565570#comment-13565570 ] Gilad Barkai commented on LUCENE-4715:
--
How can a mess be avoided when allowing different OrdinalPolicies in the same CLP? There would be ordinals which have their parents, and ordinals that don't. How can the collector or aggregator know which ordinals should be dealt with as having no parents, and which should not?
> Add OrdinalPolicy.ALL_BUT_DIMENSION > --- > > Key: LUCENE-4715 > URL: https://issues.apache.org/jira/browse/LUCENE-4715 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera > Attachments: LUCENE-4715.patch > > > With the move of OrdinalPolicy to CategoryListParams, > NonTopLevelOrdinalPolicy was nuked. It might be good to restore it, as > another enum value of OrdinalPolicy. > It's the same like ALL_PARENTS, only doesn't add the dimension ordinal, which > could save space as well as computation time. It's good for when you don't > care about the count of Date/, but only about its children counts.
[jira] [Commented] (LUCENE-4715) Add OrdinalPolicy.NO_DIMENSION
[ https://issues.apache.org/jira/browse/LUCENE-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565359#comment-13565359 ] Gilad Barkai commented on LUCENE-4715:
--
Looking at the patch, I think I might have misunderstood something - in the build method the right policy is checked for every category, but the build itself is per CategoryListParams - so why can't the policy be the same for each CLP? If one wishes different policies etc., I think it would be logical to separate them into different CLPs, and then this check need not be performed for each category?
> Add OrdinalPolicy.NO_DIMENSION > -- > > Key: LUCENE-4715 > URL: https://issues.apache.org/jira/browse/LUCENE-4715 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera > Attachments: LUCENE-4715.patch > > > With the move of OrdinalPolicy to CategoryListParams, > NonTopLevelOrdinalPolicy was nuked. It might be good to restore it, as > another enum value of OrdinalPolicy. > It's the same like ALL_PARENTS, only doesn't add the dimension ordinal, which > could save space as well as computation time. It's good for when you don't > care about the count of Date/, but only about its children counts.
[jira] [Commented] (LUCENE-4659) Cleanup CategoryPath
[ https://issues.apache.org/jira/browse/LUCENE-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545370#comment-13545370 ] Gilad Barkai commented on LUCENE-4659: -- Magnificent, +1 for commit. > Cleanup CategoryPath > > > Key: LUCENE-4659 > URL: https://issues.apache.org/jira/browse/LUCENE-4659 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera > Attachments: LUCENE-4659.patch, LUCENE-4659.patch, LUCENE-4659.patch > > > CategoryPath is supposed to be a simple object which holds a category path's > components, and offers some utility methods that can be used during indexing > and search. > Currently, it exposes lots of methods which aren't used, unless by tests - I > want to get rid of them. Also, the internal implementation manages 3 char[] > for holding the path components, while I think it would have been simpler if > it maintained a String[]. I'd like to explore that option too (the input is > anyway String, so why copy char[]?). > Ultimately, I'd like CategoryPath to be immutable. I was able to get rid most > of the mutable methods. The ones that remain will probably go away when I > move from char[] to String[]. Immuntability is important because in various > places in the code we convert a CategoryPath back and forth to String, with > TODOs to stop doing that if CP was immutable. > Will attach a patch that covers the first step - get rid of unneeded methods > and beginning to make it immutable. > Perhaps this can be done in multiple commits? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4659) Cleanup CategoryPath
[ https://issues.apache.org/jira/browse/LUCENE-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545363#comment-13545363 ] Gilad Barkai commented on LUCENE-4659:
--
Patch looks really good. I'm not concerned about the new objects for subPath. Actually, since the {{HashMap}}s are now keyed by {{CategoryPath}} and it is not constantly being translated to/from {{String}}, I'm looking forward to better performance than before. Nice job - and a nasty one it must have been... Hats off!
A few (minor) comments:
* {{copyFullPath()}} - {{numCharsCopied}} is redundant? The return value could have been {{(idx + component[upto].length() - start)}}
* {{equals()}} line 143 - perhaps use only one index rather than both {{j}} and {{i}}? Also, CPs are more likely to differ near their end than at their start (i.e. farther from the root and the dimension) - perhaps iterate in reverse (up the tree)?
> Cleanup CategoryPath > > > Key: LUCENE-4659 > URL: https://issues.apache.org/jira/browse/LUCENE-4659 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera > Attachments: LUCENE-4659.patch, LUCENE-4659.patch > > > CategoryPath is supposed to be a simple object which holds a category path's > components, and offers some utility methods that can be used during indexing > and search. > Currently, it exposes lots of methods which aren't used, unless by tests - I > want to get rid of them. Also, the internal implementation manages 3 char[] > for holding the path components, while I think it would have been simpler if > it maintained a String[]. I'd like to explore that option too (the input is > anyway String, so why copy char[]?). > Ultimately, I'd like CategoryPath to be immutable. I was able to get rid of most > of the mutable methods. The ones that remain will probably go away when I > move from char[] to String[].
Immuntability is important because in various > places in the code we convert a CategoryPath back and forth to String, with > TODOs to stop doing that if CP was immutable. > Will attach a patch that covers the first step - get rid of unneeded methods > and beginning to make it immutable. > Perhaps this can be done in multiple commits? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4659) Cleanup CategoryPath
[ https://issues.apache.org/jira/browse/LUCENE-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13544103#comment-13544103 ] Gilad Barkai commented on LUCENE-4659:
--
I think I wished for CP to become immutable as one of my birthday wishes (while blowing out the candles)... these do come true! That's a very promising start, way to go!
CP's char[] internals were there for performance and reusability; I thought they were used internally in the CL2O cache - but now I see that they are not? Hmm...
Other methods which could be killed, and are in the way of immutability: {{.setFromSerializable()}} and {{.add()}}, which are only used in tests. Then the only method which breaks immutability is {{.trim()}}, and it could be managed with a C'tor.
> Cleanup CategoryPath > > > Key: LUCENE-4659 > URL: https://issues.apache.org/jira/browse/LUCENE-4659 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera > Attachments: LUCENE-4659.patch > > > CategoryPath is supposed to be a simple object which holds a category path's > components, and offers some utility methods that can be used during indexing > and search. > Currently, it exposes lots of methods which aren't used, unless by tests - I > want to get rid of them. Also, the internal implementation manages 3 char[] > for holding the path components, while I think it would have been simpler if > it maintained a String[]. I'd like to explore that option too (the input is > anyway String, so why copy char[]?). > Ultimately, I'd like CategoryPath to be immutable. I was able to get rid of most > of the mutable methods. The ones that remain will probably go away when I > move from char[] to String[]. Immutability is important because in various > places in the code we convert a CategoryPath back and forth to String, with > TODOs to stop doing that if CP was immutable.
> Will attach a patch that covers the first step - get rid of unneeded methods > and beginning to make it immutable. > Perhaps this can be done in multiple commits? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538193#comment-13538193 ] Gilad Barkai commented on LUCENE-4609:
--
Thank you Adrian! I'll look into it.
> Write a PackedIntsEncoder/Decoder for facets > > > Key: LUCENE-4609 > URL: https://issues.apache.org/jira/browse/LUCENE-4609 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > Attachments: LUCENE-4609.patch > > > Today the facets API lets you write IntEncoder/Decoder to encode/decode the > category ordinals. We have several such encoders, including VInt (default), > and block encoders. > It would be interesting to implement and benchmark a > PackedIntsEncoder/Decoder, with potentially two variants: (1) receives > bitsPerValue up front, when you e.g. know that you have a small taxonomy and > the max value you can see and (2) one that decides for each doc on the > optimal bitsPerValue, writes it as a header in the byte[] or something.
[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536391#comment-13536391 ] Gilad Barkai commented on LUCENE-4609:
--
In unique d-gap encoding there's no zero gap, so in order to save that little extra bit, every gap could be encoded as (gap - 1). Tricky is the word :)
> Write a PackedIntsEncoder/Decoder for facets > > > Key: LUCENE-4609 > URL: https://issues.apache.org/jira/browse/LUCENE-4609 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > Attachments: LUCENE-4609.patch > > > Today the facets API lets you write IntEncoder/Decoder to encode/decode the > category ordinals. We have several such encoders, including VInt (default), > and block encoders. > It would be interesting to implement and benchmark a > PackedIntsEncoder/Decoder, with potentially two variants: (1) receives > bitsPerValue up front, when you e.g. know that you have a small taxonomy and > the max value you can see and (2) one that decides for each doc on the > optimal bitsPerValue, writes it as a header in the byte[] or something.
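A minimal sketch of the (gap - 1) trick: in a sorted, unique, strictly increasing ordinal stream every d-gap is at least 1, so storing gap - 1 reclaims the otherwise-unused zero. The class and method names here are hypothetical, and ordinals are assumed to start at 1 or above:

```java
public class DGapMinusOne {

  /** Turn strictly increasing ordinals into (gap - 1) values, all >= 0. */
  public static int[] encode(int[] sortedUnique) {
    int[] gaps = new int[sortedUnique.length];
    int prev = 0;
    for (int i = 0; i < gaps.length; i++) {
      gaps[i] = sortedUnique[i] - prev - 1;  // the gap is at least 1
      prev = sortedUnique[i];
    }
    return gaps;
  }

  /** Inverse: add back the implicit +1 while accumulating the gaps. */
  public static int[] decode(int[] gaps) {
    int[] values = new int[gaps.length];
    int prev = 0;
    for (int i = 0; i < gaps.length; i++) {
      prev += gaps[i] + 1;
      values[i] = prev;
    }
    return values;
  }
}
```

The payoff is at the bit level: adjacent ordinals now encode as 0 rather than 1, which matters when a downstream encoder charges per significant bit.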
[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536384#comment-13536384 ] Gilad Barkai commented on LUCENE-4609:
--
bq. Hopefully you don't need to separately encode "leftover unused bits" ... ie byte[].length (which is "free" here, since codec already stores this) should suffice.
I'm missing something: if there are 2 bits per value, and the codec knows it's only 1 byte, there could be either 1, 2, 3 or 4 values in that single byte. How could the decoder know when to stop, without knowing how many bits at the end should be ignored?
> Write a PackedIntsEncoder/Decoder for facets > > > Key: LUCENE-4609 > URL: https://issues.apache.org/jira/browse/LUCENE-4609 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > Attachments: LUCENE-4609.patch > > > Today the facets API lets you write IntEncoder/Decoder to encode/decode the > category ordinals. We have several such encoders, including VInt (default), > and block encoders. > It would be interesting to implement and benchmark a > PackedIntsEncoder/Decoder, with potentially two variants: (1) receives > bitsPerValue up front, when you e.g. know that you have a small taxonomy and > the max value you can see and (2) one that decides for each doc on the > optimal bitsPerValue, writes it as a header in the byte[] or something. 
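The ambiguity in the comment follows from the byte-length arithmetic alone: at 2 bits per value, 1, 2, 3 and 4 values all round up to the same single byte, so byte[].length by itself cannot tell the decoder where to stop. A hypothetical helper, shown only to make the point concrete:

```java
public class PackedLengthAmbiguity {

  /** Bytes needed to pack n values at bitsPerValue bits each (rounded up). */
  public static int packedBytes(int n, int bitsPerValue) {
    return (n * bitsPerValue + 7) / 8;
  }
}
```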
[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536347#comment-13536347 ] Gilad Barkai commented on LUCENE-4609:
--
bq. Do you encode the gaps or the straight up ords?
Well, it's an 'end point' encoder, meaning it encodes whatever values it receives directly to the output. One could create an encoder as {{new SortingIntEncoder(new UniqueValuesIntEncoder(new DGapIntEncoder(new PackedEncoder(}}, so the values the packed encoder receives are already sorted, unique and d-gapped.
{quote} This is PForDelta compression (the outliers are encoded separately) I think? We can test it and see if it helps ... but we weren't so happy with it for encoding postings (it adds complexity, slows down decode, and didn't seem to help that much in reducing the size). {quote}
PForDelta is indeed slower. But we've met scenarios in which most d-gaps are small - hence the NOnes and the Four/Eight Flags encoders. If indeed most values are small - say, they fit in 4 bits - but there are also one or two larger values which would require 12 or 14 bits, we could benefit here greatly. This is only relevant where there is a large amount of categories per document.
bq. it seems like you are writing the full header per field
That is right. To be frank, I'm not 100% sure what {{PackedInts}} does, nor how large its header is. But I think some header per doc is required anyway? For bits-per-value smaller than the size of a byte, there's a need to know how many bits should be left out of the last read byte. I started writing my own version as a first step toward the 'mixed' version, in which a 1-byte header is written that contains both the 'bits per value' in the first 5 bits and the amount of extra bits in the last 3 bits. I'm still playing with it, hope to share it soon.
> Write a PackedIntsEncoder/Decoder for facets > > > Key: LUCENE-4609 > URL: https://issues.apache.org/jira/browse/LUCENE-4609 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > Attachments: LUCENE-4609.patch > > > Today the facets API lets you write IntEncoder/Decoder to encode/decode the > category ordinals. We have several such encoders, including VInt (default), > and block encoders. > It would be interesting to implement and benchmark a > PackedIntsEncoder/Decoder, with potentially two variants: (1) receives > bitsPerValue up front, when you e.g. know that you have a small taxonomy and > the max value you can see and (2) one that decides for each doc on the > optimal bitsPerValue, writes it as a header in the byte[] or something. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
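The 1-byte header described in the comment could look like the following. The exact bit layout is an assumption for illustration (it stores bits-per-value minus 1 so that the range 1..32 fits in 5 bits), not the format of any attached patch:

```java
public class PackedHeader {

  /**
   * Pack a header byte: bits-per-value (1..32, stored as bpv - 1) in the
   * low 5 bits, and the count of padding bits in the last byte (0..7) in
   * the high 3 bits.
   */
  public static int pack(int bitsPerValue, int extraBits) {
    return ((extraBits & 0x7) << 5) | ((bitsPerValue - 1) & 0x1F);
  }

  public static int bitsPerValue(int header) {
    return (header & 0x1F) + 1;
  }

  public static int extraBits(int header) {
    return (header >>> 5) & 0x7;
  }
}
```

With such a header, the decoder knows both how wide each value is and how many trailing bits of the final byte to ignore - the two pieces of information the comment argues cannot be recovered from byte[].length alone.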
[jira] [Updated] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4609:
-
Attachment: LUCENE-4609.patch
Attached a PackedEncoder, which is based on {{PackedInts}}. Currently only the 'per-document' bits-per-value approach is implemented. I'm not convinced the header can be spared, since at the very least the number of bits to neglect at the end of the stream must be written. E.g. if there are 2 bits per value and there are 17 values, 34 bits are needed, but everything is written in (at least) bytes, so 6 bits should be neglected.
Updated EncodingTest and EncodingSpeed, and found out that the compression factor is not that good, probably due to large numbers which bump the amount of required bits to a higher value.
Started to look into a semi-packed encoder, which could encode most values in a packed manner but could also add large values as, e.g., VInts. Example: with 6 bits per value, all values 0-62 are packed, while a packed value of 63 (all 1s) is a marker that the next value is written in a non-packed manner (say VInt, Elias delta, whole 32 bits...). This should improve the compression factor when most ints are small and only a few are large. The impact on encoding/decoding speed remains to be seen.
> Write a PackedIntsEncoder/Decoder for facets > > > Key: LUCENE-4609 > URL: https://issues.apache.org/jira/browse/LUCENE-4609 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > Attachments: LUCENE-4609.patch > > > Today the facets API lets you write IntEncoder/Decoder to encode/decode the > category ordinals. We have several such encoders, including VInt (default), > and block encoders. > It would be interesting to implement and benchmark a > PackedIntsEncoder/Decoder, with potentially two variants: (1) receives > bitsPerValue up front, when you e.g. 
know that you have a small taxonomy and > the max value you can see and (2) one that decides for each doc on the > optimal bitsPerValue, writes it as a header in the byte[] or something. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4633) DirectoryTaxonomyWriter.replaceTaxonomy should refresh the reader
[ https://issues.apache.org/jira/browse/LUCENE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13533324#comment-13533324 ] Gilad Barkai commented on LUCENE-4633: -- +1 patch looks good. > DirectoryTaxonomyWriter.replaceTaxonomy should refresh the reader > - > > Key: LUCENE-4633 > URL: https://issues.apache.org/jira/browse/LUCENE-4633 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 4.0 >Reporter: Shai Erera >Assignee: Shai Erera > Fix For: 4.1, 5.0 > > Attachments: LUCENE-4633.patch > > > While migrating code to Lucene 4.0 I tripped it. If you call > replaceTaxonomy() with e.g. a taxonomy index that contains category "a", and > then you try to add category "a" to the new taxonomy, it receives a new > ordinal! > The reason is that replaceTaxo doesn't refresh the internal IndexReader, but > does clear the cache (as it should). This causes the next addCategory to not > find category "a" in the cache, and not in the reader instance at hand. > Simple fix, I'll attach a patch with it and a test exposing the bug. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4619) Create a specialized path for facets counting
[ https://issues.apache.org/jira/browse/LUCENE-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530307#comment-13530307 ] Gilad Barkai commented on LUCENE-4619:
--
Throwing in a crazy idea... can the facetIndexingParams be part of IndexWriterConfig?
> Create a specialized path for facets counting > - > > Key: LUCENE-4619 > URL: https://issues.apache.org/jira/browse/LUCENE-4619 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera > Attachments: LUCENE-4619.patch > > > Mike and I have been discussing that on several issues (LUCENE-4600, > LUCENE-4602) and on GTalk ... it looks like the current API abstractions may > be responsible for some of the performance loss that we see, compared to > specialized code. > During our discussion, we've decided to target a specific use case - facets > counting and work on it, top-to-bottom by reusing as much code as possible. > Specifically, we'd like to implement a FacetsCollector/Accumulator which can > do only counting (i.e. respects only CountFacetRequest), no sampling, > partitions and complements. The API allows us to do so very cleanly, and in > the context of that issue, we'd like to do the following: > * Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams > and CategoryPath (List, Iterable, whatever) and adds the needed information > to both the taxonomy index as well as the search index. > ** That API is similar in nature to CategoryDocumentBuilder, only easier to > consume -- it's just another field that you add to the Document. > ** We'll have two extensions for it: PayloadFacetsField and > DocValuesFacetsField, so that we can benchmark the two approaches. > Eventually, one of them we believe, will be eliminated, and we'll remain w/ > just one (hopefully the DV one). > * Implement either a FacetsAccumulator/Collector which takes a bunch of > CountFacetRequests and returns the top-counts. 
> ** Aggregations are done in-collection, rather than post. Note that we have > LUCENE-4600 open for exploring that. Either we finish this exploration here, > or do it there. Just FYI that the issue exists. > ** Reuses the CategoryListIterator, IntDecoder and Aggregator code. I'll open > a separate issue to explore improving that API to be bulk, and then we can > decide if this specialized Collector should use those abstractions, or be > really optimized for the facet counting case. > * At the moment, this path will assume that a document holds multiple > dimensions, but only one value from each (i.e. no Author/Shai, Author/Mike > for a document), and therefore use OrdPolicy.NO_PARENTS. > ** Later, we'd like to explore how to have this specialized path handle the > ALL_PARENTS case too, as it shouldn't be so hard to do. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4622) TopKFacetsResultHandler should tie break sort by label not ord?
[ https://issues.apache.org/jira/browse/LUCENE-4622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529988#comment-13529988 ] Gilad Barkai commented on LUCENE-4622: -- I'm not fond of the post processing... Today the sort is consistent. Lucene breaks ties on doc ids, whose order may not be consistent due to out-of-order merges. This is not the case with category ordinals. If one wishes to post process they should be able to do so quite easily? But as pointed out, it might not produce the results as intended due to a lot of categories which scored the same and were left out. > TopKFacetsResultHandler should tie break sort by label not ord? > --- > > Key: LUCENE-4622 > URL: https://issues.apache.org/jira/browse/LUCENE-4622 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Reporter: Michael McCandless > > EG I now get these facets: > {noformat} > Author (5) > Lisa (2) > Frank (1) > Susan (1) > Bob (1) > {noformat} > The primary sort is by count, but secondary is by ord (= order in which they > were indexed), which is not really understandable/transparent to the end > user. I think it'd be best if we could do tie-break sort by label ... > But talking to Shai, this seems hard/costly to fix, because when visiting the > facet ords to collect the top K, we don't currently resolve to label, and in > the worst case (say my example had a million labels with count 1) that's a > lot of extra label lookups ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
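The post-processing option, for whoever wants it, could look roughly like this. `FacetEntry` is a hypothetical stand-in for the module's result node; as the comment notes, this cannot recover equal-count labels that were already dropped outside the top K:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public final class LabelTieBreak {
    // Hypothetical stand-in for the facet module's result node type.
    record FacetEntry(String label, int count) {}

    // Re-sort an already-collected top-K list: count descending, and on ties
    // label ascending instead of indexing-order ordinal.
    public static List<FacetEntry> sortByCountThenLabel(List<FacetEntry> topK) {
        List<FacetEntry> sorted = new ArrayList<>(topK);
        sorted.sort(Comparator.<FacetEntry>comparingInt(FacetEntry::count)
                .reversed()
                .thenComparing(FacetEntry::label));
        return sorted;
    }
}
```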
[jira] [Commented] (LUCENE-4610) Implement a NoParentsAccumulator
[ https://issues.apache.org/jira/browse/LUCENE-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529716#comment-13529716 ] Gilad Barkai commented on LUCENE-4610: -- Since taxonomies do not tend to change in depth (i.e., if FileType has only depth 1, it is not likely it will suddenly grow to depth 3) - then perhaps the definition of "these N dimensions are single-valued" should be set in a CategoryListParam? Shai mentioned yesterday that adding the parents during aggregation is problematic when partitions are in place. Parents might not be on the same partition as their children, nor is the aggregator aware of the partition it is currently aggregating upon. For partitions it seems adding the parents in a post process - as was first suggested - is the right approach. > Implement a NoParentsAccumulator > > > Key: LUCENE-4610 > URL: https://issues.apache.org/jira/browse/LUCENE-4610 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Shai Erera > > Mike experimented with encoding just the exact categories ordinals on > LUCENE-4602, and I added OrdinalPolicy.NO_PARENTS, with a comment saying that > this requires a special FacetsAccumulator. > The idea is to write the exact categories only for each document, and then at > search time count up the parents chain to compute requested facets (I say > count, but it can be any weight). > One limitation of such accumulator is that it cannot be used when e.g. a > document is associated with two categories who share the same parent, because > that may result in incorrect weights computed (e.g. a document might have > several Authors, and so counting the Author facet may yield wrong counts). So > it can be used only when the app knows it doesn't add such facets, or that it > always asks to aggregate a 'root' that in its path this criteria doesn't hold > (no categories share the same parent). -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
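The "count up the parents chain at search time" step the issue describes can be sketched over a plain parents array. The parents-first ordinal assignment and the `rollup(...)` helper are illustrative assumptions, not the accumulator's real API; note it also exhibits the stated limitation (a doc contributing two children of the same parent would be double-counted at the parent):

```java
// Sketch of a NO_PARENTS rollup: only exact categories were counted per doc;
// ancestors get their totals by pushing counts up a parents array.
// Ordinal 0 is the taxonomy root (parent -1); names here are hypothetical.
public final class NoParentsRollup {
    public static int[] rollup(int[] parents, int[] counts) {
        int[] result = counts.clone();
        // Taxonomy ordinals are assigned parents-first, so iterating from the
        // highest ordinal down pushes each node's total into its parent once.
        for (int ord = parents.length - 1; ord > 0; ord--) {
            int parent = parents[ord];
            if (parent >= 0) {
                result[parent] += result[ord];
            }
        }
        return result;
    }
}
```

For example, with root(0) -> Author(1) -> {Shai(2), Mike(3)} and exact counts {2, 3} on the leaves, Author ends up with 5 without ever being written to the index.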
[jira] [Updated] (LUCENE-4461) Multiple FacetRequest with the same path creates inconsistent results
[ https://issues.apache.org/jira/browse/LUCENE-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4461: - Attachment: LUCENE-4461.patch Proposed fix - In {{StandardFacetsAccumulator}}, guard against handling and merging the same request more than once. Also a matching test is introduced, inspired by previous patch. Thanks Rodrigo! > Multiple FacetRequest with the same path creates inconsistent results > - > > Key: LUCENE-4461 > URL: https://issues.apache.org/jira/browse/LUCENE-4461 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6 >Reporter: Rodrigo Vega > Labels: facet, faceted-search > Attachments: LUCENE-4461.patch, LuceneFacetTest.java > > > Multiple FacetRequest are getting merged into one creating wrong results in > this case: > FacetSearchParams facetSearchParams = new FacetSearchParams(); > facetSearchParams.addFacetRequest(new CountFacetRequest(new > CategoryPath("author"), 10)); > facetSearchParams.addFacetRequest(new CountFacetRequest(new > CategoryPath("author"), 10)); > Problem can be fixed by defining hashcode and equals in certain way that > Lucene recognize we are talking about different requests. > Attached test case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
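A minimal sketch of the guard idea, assuming requests carry value-based `equals()`/`hashCode()` (which the issue description asks for). `FacetRequestKey` and `unique(...)` are hypothetical; the real fix lives in `StandardFacetsAccumulator`:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public final class RequestDedup {
    // Hypothetical value object; a record supplies the value-based
    // equals()/hashCode() needed to recognize duplicate requests.
    record FacetRequestKey(String path, int numResults) {}

    // Drop exact duplicates while keeping first-seen order, so each distinct
    // request is handled and merged exactly once.
    public static List<FacetRequestKey> unique(List<FacetRequestKey> requests) {
        return new ArrayList<>(new LinkedHashSet<>(requests));
    }
}
```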
[jira] [Commented] (LUCENE-4600) Facets should aggregate during collection, not at the end
[ https://issues.apache.org/jira/browse/LUCENE-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527077#comment-13527077 ] Gilad Barkai commented on LUCENE-4600: -- Aggregating all doc ids first also makes it easier to compute actual results after sampling. That is done by taking the sampling result top-(c)K and calculating their true value over all matching documents, giving the benefit of sampling and results which could make sense to the user (e.g. in counting, the end number would actually be the number of matching documents for this category). As for aggregating 'on the fly' it has some other issues: * It (was?) believed that accessing the counting array during query execution may lead to memory cache issues. The entire counting array could be accessed for every document over and over, and it's not guaranteed it would fit into the cache (that's the CPU's one). That might not be a problem on modern hardware * While the OS can cache all payload data itself, it gets difficult as the index grows. If the OS fails to cache the file, it is (again, was?) believed that going over the file in a sequential manner once without seeks (at least by the current thread) would make it faster. It's sort of become a religion with all those "beliefs", as some scenarios used to make sense a few years ago. I'm not sure they still do. Can't wait to see how some of these co-exist with the benchmark results. If all religions could have been benchmarked... 
;) > Facets should aggregate during collection, not at the end > - > > Key: LUCENE-4600 > URL: https://issues.apache.org/jira/browse/LUCENE-4600 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless > > Today the facet module simply gathers all hits (as a bitset, optionally with > a float[] to hold scores as well, if you will aggregate them) during > collection, and then at the end when you call getFacetsResults(), it makes a > 2nd pass over all those hits doing the actual aggregation. > We should investigate just aggregating as we collect instead, so we don't > have to tie up transient RAM (fairly small for the bit set but possibly big > for the float[]). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
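The "calculate their true value over all matching documents" step from the comment above can be sketched as a second, exact pass restricted to the candidate categories that survived the sampled pass. The single-valued `docCategory` mapping is a deliberate simplification for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public final class SampledThenExact {
    // Exact second pass: count only the candidate categories picked by the
    // sampled pass, over all matching documents. docCategory[i] holds doc i's
    // (single) category ordinal - a hypothetical simplification.
    public static Map<Integer, Integer> exactCountsFor(Set<Integer> candidates,
                                                       int[] docCategory) {
        Map<Integer, Integer> exact = new HashMap<>();
        for (int cat : docCategory) {
            if (candidates.contains(cat)) {
                exact.merge(cat, 1, Integer::sum);
            }
        }
        return exact;
    }
}
```

This keeps the cheap sampled pass for selecting the top-(c)K, while the numbers shown to the user are true counts over the full result set.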
[jira] [Commented] (LUCENE-4565) Simplify TaxoReader ParentArray/ChildrenArrays
[ https://issues.apache.org/jira/browse/LUCENE-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511331#comment-13511331 ] Gilad Barkai commented on LUCENE-4565: -- bq. I thought that it's over-verbosing to put them in the method names. Rather, I think that good javadocs are what's needed here. It's an API that one reads one time usually. {{.children()}} looks like it's getting all children, it does not. {{.siblings()}} looks like it returns all siblings, which it does not. I think a good JavaDoc is a blessing, but it's not a penalty in making the code document itself - which it did till now. I see no reason to change that. bq. They are reused My bad! I need coffee... > Simplify TaxoReader ParentArray/ChildrenArrays > -- > > Key: LUCENE-4565 > URL: https://issues.apache.org/jira/browse/LUCENE-4565 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Attachments: LUCENE-4565.patch > > > TaxoReader exposes two structures which provide information about a > categories parent/childs/siblings: ParentArray and ChildrenArrays. > ChildrenArrays are derived (i.e. created) from ParentArray. > I propose to consolidate all that into one API ParentInfo, or > CategoryTreeInfo (a better name?) which will provide the same information, > only from one object. So instead of making these calls: > {code} > int[] parents = taxoReader.getParentArray(); > int[] youngestChilds = taxoReader.getChildrenArrays().getYoungestChildArray(); > int[] olderSiblings = taxoReader.getChildrenArrays().getOlderSiblingArray(); > {code} > one would make these calls: > {code} > int[] parents = taxoReader.getParentInfo().parents(); > int[] youngestChilds = taxoReader.getParentInfo().youngestChilds(); > int[] olderSiblings = taxoReader.getParentInfo().olderSiblings(); > {code} > Not a big change, just consolidate more code into one logical place. 
All of > these arrays will continue to be lazily allocated. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4565) Simplify TaxoReader ParentArray/ChildrenArrays
[ https://issues.apache.org/jira/browse/LUCENE-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511315#comment-13511315 ] Gilad Barkai commented on LUCENE-4565: -- Patch's OK, some comments: * The notion of youngestChild and olderSibling is important. I would not have removed the 'older/youngest' parts. Especially not from the TaxoArray's API. But would rather have them everywhere. * In DirectoryTaxonomyWriter, the loop over the termEnums had been changed to "reuse" the TermsEnum and DocsEnum - only they are not reused. I got confused trying to find the reusability. Is there an internal java expert optimization for such cases? > Simplify TaxoReader ParentArray/ChildrenArrays > -- > > Key: LUCENE-4565 > URL: https://issues.apache.org/jira/browse/LUCENE-4565 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Attachments: LUCENE-4565.patch > > > TaxoReader exposes two structures which provide information about a > categories parent/childs/siblings: ParentArray and ChildrenArrays. > ChildrenArrays are derived (i.e. created) from ParentArray. > I propose to consolidate all that into one API ParentInfo, or > CategoryTreeInfo (a better name?) which will provide the same information, > only from one object. So instead of making these calls: > {code} > int[] parents = taxoReader.getParentArray(); > int[] youngestChilds = taxoReader.getChildrenArrays().getYoungestChildArray(); > int[] olderSiblings = taxoReader.getChildrenArrays().getOlderSiblingArray(); > {code} > one would make these calls: > {code} > int[] parents = taxoReader.getParentInfo().parents(); > int[] youngestChilds = taxoReader.getParentInfo().youngestChilds(); > int[] olderSiblings = taxoReader.getParentInfo().olderSiblings(); > {code} > Not a big change, just consolidate more code into one logical place. All of > these arrays will continue to be lazily allocated. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
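For reference, the ChildrenArrays discussed in this issue can be derived from the ParentArray alone; a sketch of that derivation, where "youngest" child means the child with the highest ordinal (added last). The `derive(...)` helper is illustrative, not the taxonomy's actual implementation:

```java
import java.util.Arrays;

public final class ChildrenArraysSketch {
    // Derive youngest-child and older-sibling arrays from a parents array,
    // where parents[ord] is ord's parent and ordinal 0 is the root (parent -1).
    public static int[][] derive(int[] parents) {
        int n = parents.length;
        int[] youngestChild = new int[n];
        int[] olderSibling = new int[n];
        Arrays.fill(youngestChild, -1);
        Arrays.fill(olderSibling, -1);
        for (int ord = 1; ord < n; ord++) {
            int parent = parents[ord];
            // The previous youngest child of this parent becomes ord's older
            // sibling; ord itself becomes the new youngest child.
            olderSibling[ord] = youngestChild[parent];
            youngestChild[parent] = ord;
        }
        return new int[][] { youngestChild, olderSibling };
    }
}
```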
[jira] [Commented] (LUCENE-4565) Simplify TaxoReader ParentArray/ChildrenArrays
[ https://issues.apache.org/jira/browse/LUCENE-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510603#comment-13510603 ] Gilad Barkai commented on LUCENE-4565: -- FamilyTree? Genealogy? How about Stemma ? > Simplify TaxoReader ParentArray/ChildrenArrays > -- > > Key: LUCENE-4565 > URL: https://issues.apache.org/jira/browse/LUCENE-4565 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > > TaxoReader exposes two structures which provide information about a > categories parent/childs/siblings: ParentArray and ChildrenArrays. > ChildrenArrays are derived (i.e. created) from ParentArray. > I propose to consolidate all that into one API ParentInfo, or > CategoryTreeInfo (a better name?) which will provide the same information, > only from one object. So instead of making these calls: > {code} > int[] parents = taxoReader.getParentArray(); > int[] youngestChilds = taxoReader.getChildrenArrays().getYoungestChildArray(); > int[] olderSiblings = taxoReader.getChildrenArrays().getOlderSiblingArray(); > {code} > one would make these calls: > {code} > int[] parents = taxoReader.getParentInfo().parents(); > int[] youngestChilds = taxoReader.getParentInfo().youngestChilds(); > int[] olderSiblings = taxoReader.getParentInfo().olderSiblings(); > {code} > Not a big change, just consolidate more code into one logical place. All of > these arrays will continue to be lazily allocated. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4586) Change default ResultMode of FacetRequest to PER_NODE_IN_TREE
[ https://issues.apache.org/jira/browse/LUCENE-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509847#comment-13509847 ] Gilad Barkai commented on LUCENE-4586: -- Patch looks good, ready for a commit. +1 > Change default ResultMode of FacetRequest to PER_NODE_IN_TREE > - > > Key: LUCENE-4586 > URL: https://issues.apache.org/jira/browse/LUCENE-4586 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 4.1, 5.0 > > Attachments: LUCENE-4586.patch, LUCENE-4586.patch > > > Today the default ResultMode is GLOBAL_FLAT, but it should be > PER_NODE_IN_TREE. ResultMode is being used whenever you set the depth of > FacetRequest to greater than 1. The difference between the two is: > * PER_NODE_IN_TREE would then compute the top-K categories recursively, for > every top category at every level (up to depth). The results are returned in > a tree structure as well. For instance: > {noformat} > Date > 2010 > March > February > 2011 > April > May > {noformat} > * GLOBAL_FLAT computes the top categories among all the nodes up to depth, > and returns a flat list of categories. > GLOBAL_FLAT is faster to compute than PER_NODE_IN_TREE (it just computes > top-K among N total categories), however I think that it's less intuitive, > and therefore should not be used as a default. In fact, I think this is kind > of an expert usage. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4580) Facet DrillDown should return a ConstantScoreQuery
[ https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509628#comment-13509628 ] Gilad Barkai commented on LUCENE-4580: -- +1 > Facet DrillDown should return a ConstantScoreQuery > -- > > Key: LUCENE-4580 > URL: https://issues.apache.org/jira/browse/LUCENE-4580 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > Attachments: LUCENE-4580.patch, LUCENE-4580.patch > > > DrillDown is a helper class which the user can use to convert a facet value > that a user selected into a Query for performing drill-down or narrowing the > results. The API has several static methods that create e.g. a Term or Query. > Rather than creating a Query, it would make more sense to create a Filter I > think. In most cases, the clicked facets should not affect the scoring of > documents. Anyway, even if it turns out that it must return a Query (which I > doubt), we should at least modify the impl to return a ConstantScoreQuery. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4580) Facet DrillDown should return a ConstantScoreQuery
[ https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509527#comment-13509527 ] Gilad Barkai commented on LUCENE-4580: -- Patch looks good. Two comments: * In {{SimpleSearcher}} the code is changed as such the {{DrillDown.query()}} always takes a {{new FacetSearchParams()}} (no default) - but that's not obvious it is the same one used for the search, or that it contains the {{FacetIndexingParams}} used for indexing. The defaults made sure that if none was specified along the way - the user should not bother himself with it. It is a small piece of code, but in an example might confuse the reader. I'm somewhat leaning toward allowing a default query() call without specifying a FSP. * In the tests, perhaps use Float.MIN_VALUE or 0f when asserting equality of floats instead against 0.0. > Facet DrillDown should return a ConstantScoreQuery > -- > > Key: LUCENE-4580 > URL: https://issues.apache.org/jira/browse/LUCENE-4580 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > Attachments: LUCENE-4580.patch > > > DrillDown is a helper class which the user can use to convert a facet value > that a user selected into a Query for performing drill-down or narrowing the > results. The API has several static methods that create e.g. a Term or Query. > Rather than creating a Query, it would make more sense to create a Filter I > think. In most cases, the clicked facets should not affect the scoring of > documents. Anyway, even if it turns out that it must return a Query (which I > doubt), we should at least modify the impl to return a ConstantScoreQuery. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
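The float-assertion comment above amounts to comparing with an explicit tolerance rather than exact equality against 0.0; a minimal sketch, where `approxEquals` is a hypothetical stand-in for a JUnit-style `assertEquals(expected, actual, delta)`:

```java
public final class FloatAssert {
    // Compare floats with an explicit tolerance instead of exact equality,
    // avoiding spurious failures from rounding in score computations.
    public static boolean approxEquals(float expected, float actual, float delta) {
        return Math.abs(expected - actual) <= delta;
    }
}
```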
[jira] [Comment Edited] (LUCENE-4580) Facet DrillDown should return a ConstantScoreQuery
[ https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509527#comment-13509527 ] Gilad Barkai edited comment on LUCENE-4580 at 12/4/12 5:47 AM: --- Patch looks good. Two comments: * In {{SimpleSearcher}} the code is changed as such the {{DrillDown.query()}} always takes a {{new FacetSearchParams()}} (no default) - but that's not obvious it is the same one used for the search, or that it contains the {{FacetIndexingParams}} used for indexing. The defaults made sure that if none was specified along the way - the user should not bother himself with it. It is a small piece of code, but in an example might confuse the reader. I'm somewhat leaning toward allowing a default query() call without specifying a FSP. * In the tests, perhaps use Float.MIN_VALUE or 0f when asserting equality of floats instead against 0.0. was (Author: gilad): Patch looks good. Two comments: * In {{SimpleSearcher}} the code is change as such the {{DrillDown.query()}} always takes a {{new FacetSearchParams()}} (no default) - but that's not obvious it is the same one used for the search, or that it contains the {{FacetIndexingParams}} used for indexing. The defaults made sure that if none was specified along the way - the user should not bother himself with it. It is a small piece of code, but in an example might confuse the reader. I'm somewhat leaning toward allowing a default query() call without specifying a FSP. * In the tests, perhaps use Float.MIN_VALUE or 0f when asserting equality of floats instead against 0.0. 
> Facet DrillDown should return a ConstantScoreQuery > -- > > Key: LUCENE-4580 > URL: https://issues.apache.org/jira/browse/LUCENE-4580 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > Attachments: LUCENE-4580.patch > > > DrillDown is a helper class which the user can use to convert a facet value > that a user selected into a Query for performing drill-down or narrowing the > results. The API has several static methods that create e.g. a Term or Query. > Rather than creating a Query, it would make more sense to create a Filter I > think. In most cases, the clicked facets should not affect the scoring of > documents. Anyway, even if it turns out that it must return a Query (which I > doubt), we should at least modify the impl to return a ConstantScoreQuery. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4580) Facet DrillDown should return a Filter not Query
[ https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508202#comment-13508202 ] Gilad Barkai commented on LUCENE-4580: -- bq. it should be a combination of TermsFilter and BooleanFilter. So in fact if we want to keep DrillDown behave like today, we should use BooleanFilter and TermsFilter. +1 > Facet DrillDown should return a Filter not Query > > > Key: LUCENE-4580 > URL: https://issues.apache.org/jira/browse/LUCENE-4580 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > > DrillDown is a helper class which the user can use to convert a facet value > that a user selected into a Query for performing drill-down or narrowing the > results. The API has several static methods that create e.g. a Term or Query. > Rather than creating a Query, it would make more sense to create a Filter I > think. In most cases, the clicked facets should not affect the scoring of > documents. Anyway, even if it turns out that it must return a Query (which I > doubt), we should at least modify the impl to return a ConstantScoreQuery. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4580) Facet DrillDown should return a Filter not Query
[ https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508195#comment-13508195 ] Gilad Barkai commented on LUCENE-4580: -- {{DrillDown}} is a useful class with a straightforward API, which makes the life of basic users simpler. As Shai pointed out, today there is no dependency on the Queries module, but the code contains a hidden bug in which a 'drill down' operation may change the score of the results. And adding a Filter or a {{ConstantScoreQuery}} looks like the right way to go. That sort of fix is possible, while keeping the usefulness of the DrillDown class, only if the code becomes dependent on the queries module. On the other hand, removing the dependency would force most faceted users to write that exact extra code as mentioned. Preventing such cases was the reason that utility class was created. 'Drilling Down' is a basic feature of a faceted search application, and the DrillDown class provides an easy way of invoking it. Having a faceted search application without utilizing the queries module (e.g. filtering) seems remote - is there any such scenario? Module dependency may result in a user loading jars they do not need or care about, but the queries module jar is likely to be found on any faceted search application. Modules should be independent, but I see enough gain here. It would not bother me if the faceted module would depend on the query module. I find it logical. 
-1 for forcing users to write same code over and over to keep facet module independent of the queries module +1 for adding {{DrillDown.filter(CategoryPath...)}} - That looks like the way to go > Facet DrillDown should return a Filter not Query > > > Key: LUCENE-4580 > URL: https://issues.apache.org/jira/browse/LUCENE-4580 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > > DrillDown is a helper class which the user can use to convert a facet value > that a user selected into a Query for performing drill-down or narrowing the > results. The API has several static methods that create e.g. a Term or Query. > Rather than creating a Query, it would make more sense to create a Filter I > think. In most cases, the clicked facets should not affect the scoring of > documents. Anyway, even if it turns out that it must return a Query (which I > doubt), we should at least modify the impl to return a ConstantScoreQuery. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
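The behavior being argued for - narrowing the result set without rescoring it - can be sketched independently of Lucene's API. `Hit` and `drillDown(...)` are illustrative names, not the module's types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public final class DrillDownFilterSketch {
    record Hit(int doc, float score) {}

    // Keep only hits matching the drill-down category; each surviving hit's
    // score is untouched, which is the Filter/ConstantScoreQuery behavior:
    // the clicked facet narrows the results but does not affect ranking.
    public static List<Hit> drillDown(List<Hit> hits, Set<Integer> matchingDocs) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            if (matchingDocs.contains(h.doc())) {
                out.add(h);
            }
        }
        return out;
    }
}
```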
[jira] [Commented] (LUCENE-4580) Facet DrillDown should return a Filter not Query
[ https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507982#comment-13507982 ] Gilad Barkai commented on LUCENE-4580: -- +1 > Facet DrillDown should return a Filter not Query > > > Key: LUCENE-4580 > URL: https://issues.apache.org/jira/browse/LUCENE-4580 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Shai Erera >Priority: Minor > > DrillDown is a helper class which the user can use to convert a facet value > that a user selected into a Query for performing drill-down or narrowing the > results. The API has several static methods that create e.g. a Term or Query. > Rather than creating a Query, it would make more sense to create a Filter I > think. In most cases, the clicked facets should not affect the scoring of > documents. Anyway, even if it turns out that it must return a Query (which I > doubt), we should at least modify the impl to return a ConstantScoreQuery. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3441) Add NRT support to LuceneTaxonomyReader
[ https://issues.apache.org/jira/browse/LUCENE-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501966#comment-13501966 ] Gilad Barkai commented on LUCENE-3441: -- Patch looks very good. +1. > Add NRT support to LuceneTaxonomyReader > --- > > Key: LUCENE-3441 > URL: https://issues.apache.org/jira/browse/LUCENE-3441 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Attachments: LUCENE-3441.patch, LUCENE-3441.patch > > > Currently LuceneTaxonomyReader does not support NRT - i.e., on changes to > LuceneTaxonomyWriter, you cannot have the reader updated, like > IndexReader/Writer. In order to do that we need to do the following: > # Add ctor to LuceneTaxonomyReader to allow you to instantiate it with > LuceneTaxonomyWriter. > # Add API to LuceneTaxonomyWriter to expose its internal IndexReader > # Change LTR.refresh() to return an LTR, rather than void. This is actually > not strictly related to that issue, but since we'll need to modify refresh() > impl, I think it'll be good to change its API as well. Since all of facet API > is @lucene.experimental, no backwards issues here (and the sooner we do it, > the better). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4532) TestDirectoryTaxonomyReader.testRefreshReadRecreatedTaxonomy failure
[ https://issues.apache.org/jira/browse/LUCENE-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491466#comment-13491466 ] Gilad Barkai commented on LUCENE-4532: -- Reviewed the patch - and it looks very good. A few comments: 1. in TestDirectoryTaxonomyWriter.java, the error string _"index.create.time not found in commitData"_ should be updated. 2. if the index creation time is in the commit data, it will not be removed - as the epoch is added to whatever commit data was read from the index. I think perhaps it should be removed? 3. since the members related to the old 'timestamp' method are removed - no test could check the migration from old to new methods. Might be a good idea to add one with a comment to remove it when backward compatibility is no longer required (Lucene 6?). 4. 'Epoch' is usually in the context of time, or in relation of a period. Perhaps the name 'version' is more closely related to the implementation? > TestDirectoryTaxonomyReader.testRefreshReadRecreatedTaxonomy failure > > > Key: LUCENE-4532 > URL: https://issues.apache.org/jira/browse/LUCENE-4532 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera > Attachments: LUCENE-4532.patch > > > The following failure on Jenkins: > {noformat} > > Build: http://jenkins.sd-datasolutions.de/job/Lucene-Solr-4.x-Windows/1404/ > > Java: 32bit/jdk1.6.0_37 -client -XX:+UseConcMarkSweepGC > > > > 1 tests failed. 
> > REGRESSION: > > org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.testRefreshReadRecreatedTaxonomy > > > > Error Message: > > > > > > Stack Trace: > > java.lang.ArrayIndexOutOfBoundsException > > at > > __randomizedtesting.SeedInfo.seed([6AB10D3E4E956CFA:BFB2863DB7E077E0]:0) > > at java.lang.System.arraycopy(Native Method) > > at > > org.apache.lucene.facet.taxonomy.directory.ParentArray.refresh(ParentArray.java:99) > > at > > org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.refresh(DirectoryTaxonomyReader.java:407) > > at > > org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.doTestReadRecreatedTaxono(TestDirectoryTaxonomyReader.java:167) > > at > > org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.testRefreshReadRecreatedTaxonomy(TestDirectoryTaxonomyReader.java:130) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > > at > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > > at java.lang.reflect.Method.invoke(Method.java:597) > > at > > com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1559) > > at > > com.carrotsearch.randomizedtesting.RandomizedRunner.access$600(RandomizedRunner.java:79) > > at > > com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:737) > > at > > com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:773) > > at > > com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:787) > > at > > org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50) > > at > > org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:51) > > at > > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) > > at 
> > com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55) > > at > > org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48) > > at > > org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70) > > at > > org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48) > > at > > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > > at > > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358) > > at > > com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:782) > > at > > com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:442) > > at > > com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:746) > > at > > com.carrotsearch.randomizedte
[jira] [Commented] (LUCENE-4461) Multiple FacetRequest with the same path creates inconsistent results
[ https://issues.apache.org/jira/browse/LUCENE-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13472437#comment-13472437 ] Gilad Barkai commented on LUCENE-4461: -- It will solve your code issues now :) But the current code is indeed broken by the issue you raised - and that should be fixed. I re-examined the code, and I think the different hashcode you presented will work - though please note it will consume some extra CPU, as the same request will be handled twice (that is, the heap that figures out the top-k of the request will run twice) to create separate FacetResults for each request. > Multiple FacetRequest with the same path creates inconsistent results > - > > Key: LUCENE-4461 > URL: https://issues.apache.org/jira/browse/LUCENE-4461 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6 >Reporter: Rodrigo Vega > Labels: facet, faceted-search > Attachments: LuceneFacetTest.java > > > Multiple FacetRequest are getting merged into one creating wrong results in > this case: > FacetSearchParams facetSearchParams = new FacetSearchParams(); > facetSearchParams.addFacetRequest(new CountFacetRequest(new > CategoryPath("author"), 10)); > facetSearchParams.addFacetRequest(new CountFacetRequest(new > CategoryPath("author"), 10)); > Problem can be fixed by defining hashcode and equals in certain way that > Lucene recognize we are talking about different requests. > Attached test case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-4461) Multiple FacetRequest with the same path creates inconsistent results
[ https://issues.apache.org/jira/browse/LUCENE-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13472416#comment-13472416 ] Gilad Barkai edited comment on LUCENE-4461 at 10/9/12 2:16 PM: --- The same category can be set as a filter and as a request without them colliding - a filter is not correlated or dependent on a facet request. facets filters are done at the query level which affects the result set, while the facetRequest defines which categories to retrieve out of the result set. I probably miss something here :) was (Author: gilad): The same category can be set as a filter and as a request without them colliding - a filter is correlated or dependent on a facet request. facets filters are done at the query level which affects the result set, while the facetRequest defines which categories to retrieve out of the result set. I probably miss something here :) > Multiple FacetRequest with the same path creates inconsistent results > - > > Key: LUCENE-4461 > URL: https://issues.apache.org/jira/browse/LUCENE-4461 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6 >Reporter: Rodrigo Vega > Labels: facet, faceted-search > Attachments: LuceneFacetTest.java > > > Multiple FacetRequest are getting merged into one creating wrong results in > this case: > FacetSearchParams facetSearchParams = new FacetSearchParams(); > facetSearchParams.addFacetRequest(new CountFacetRequest(new > CategoryPath("author"), 10)); > facetSearchParams.addFacetRequest(new CountFacetRequest(new > CategoryPath("author"), 10)); > Problem can be fixed by defining hashcode and equals in certain way that > Lucene recognize we are talking about different requests. > Attached test case. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4461) Multiple FacetRequest with the same path creates inconsistent results
[ https://issues.apache.org/jira/browse/LUCENE-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13472416#comment-13472416 ] Gilad Barkai commented on LUCENE-4461: -- The same category can be set as a filter and as a request without them colliding - a filter is correlated or dependent on a facet request. facets filters are done at the query level which affects the result set, while the facetRequest defines which categories to retrieve out of the result set. I probably miss something here :) > Multiple FacetRequest with the same path creates inconsistent results > - > > Key: LUCENE-4461 > URL: https://issues.apache.org/jira/browse/LUCENE-4461 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6 >Reporter: Rodrigo Vega > Labels: facet, faceted-search > Attachments: LuceneFacetTest.java > > > Multiple FacetRequest are getting merged into one creating wrong results in > this case: > FacetSearchParams facetSearchParams = new FacetSearchParams(); > facetSearchParams.addFacetRequest(new CountFacetRequest(new > CategoryPath("author"), 10)); > facetSearchParams.addFacetRequest(new CountFacetRequest(new > CategoryPath("author"), 10)); > Problem can be fixed by defining hashcode and equals in certain way that > Lucene recognize we are talking about different requests. > Attached test case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4461) Multiple FacetRequest with the same path creates inconsistent results
[ https://issues.apache.org/jira/browse/LUCENE-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13472164#comment-13472164 ] Gilad Barkai commented on LUCENE-4461: -- Nice catch! Took a while to pinpoint the reason - lines 173-181 of StandardFacetsAccumulator. In the mentioned lines, a 'merge' is performed over categories which matched the request, but reside on different partitions. bq. Partitions are an optimization which limits the RAM requirements per query to a constant, rather than linear in the taxonomy size (which could be millions of categories). The taxonomy is virtually "split" into partitions of constant size, a top-k is heaped from each partition, and all those top-k results are merged into a global top-k list. The proposed solution of changing the hashCode and equals so that the same request will have two hashCodes and will not be equal to itself is very likely to break other parts of the code. Perhaps such cases could be prevented altogether? E.g. throwing an exception when the (exact) same request is added twice. Is that a reasonable solution? Are there cases where it is necessary to request the same path twice? Please note that a different count, depth, path etc. makes a different request, so requesting "author" with count 10 and count 11 makes different requests - which are handled correctly (and simultaneously) in current versions. 
> Multiple FacetRequest with the same path creates inconsistent results > - > > Key: LUCENE-4461 > URL: https://issues.apache.org/jira/browse/LUCENE-4461 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6 >Reporter: Rodrigo Vega > Labels: facet, faceted-search > Attachments: LuceneFacetTest.java > > > Multiple FacetRequest are getting merged into one creating wrong results in > this case: > FacetSearchParams facetSearchParams = new FacetSearchParams(); > facetSearchParams.addFacetRequest(new CountFacetRequest(new > CategoryPath("author"), 10)); > facetSearchParams.addFacetRequest(new CountFacetRequest(new > CategoryPath("author"), 10)); > Problem can be fixed by defining hashcode and equals in certain way that > Lucene recognize we are talking about different requests. > Attached test case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
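The exception-throwing alternative suggested above can be sketched as follows. This is a hypothetical simplification, not the actual Lucene API: the class name and the use of plain Objects stand in for FacetSearchParams and FacetRequest, whose equals() compares path, count, depth, sort order, etc., so "author" with count 10 and "author" with count 11 remain distinct requests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reject the exact same facet request instead of
// bending the equals/hashCode contract so a request is unequal to itself.
class FacetSearchParamsSketch {
    private final List<Object> requests = new ArrayList<Object>();

    void addFacetRequest(Object request) {
        // relies on a sane equals(): identical requests collide,
        // requests differing in count/depth/path do not
        if (requests.contains(request)) {
            throw new IllegalArgumentException(
                "the exact same facet request was added twice: " + request);
        }
        requests.add(request);
    }

    int size() { return requests.size(); }
}
```

With this approach the hashCode/equals contract stays intact, and the duplicate is surfaced to the caller at setup time rather than producing silently merged counts.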
[jira] [Updated] (LUCENE-4411) Depth requested in a facetRequest is reset when Sampling is in effect
[ https://issues.apache.org/jira/browse/LUCENE-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4411: - Attachment: LUCENE-4411.patch Attached a proposed fix + test. Delegating is impossible as {{FacetRequest}}'s getters are {{final}}. The only way to 'delegate' the information is using the setters in the wrapping class (e.g. setDepth(original.getDepth())). This solution does not seem the right one, but other approaches involve changing much more code and reducing the amount of protection the public API offers (e.g. the user will find it easier to break something). The patch also introduces (the missing) delegation of the {{SortBy}}, {{SortOrder}}, {{numResultsToLable}} and {{ResultMode}} methods. Somewhat out of scope of the issue - I tried to figure out why the wrapping and keeping the ??original?? request is important: The ??count?? (number of categories to return) is {{final}}, set at construction. While Sampling is in effect, in order to better the chances of 'hitting' the true top-k results, the notion of oversampling is introduced, which asks for more than just K (e.g. 3 * K results) - so another request should be made. The 'original' request is saved so the end result holds the original request, and not the over-sampled one (every {{FacetResult}} has its originating {{FacetRequest}}). > Depth requested in a facetRequest is reset when Sampling is in effect > - > > Key: LUCENE-4411 > URL: https://issues.apache.org/jira/browse/LUCENE-4411 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6.1, 5.0, 4.0 >Reporter: Gilad Barkai > Attachments: LUCENE-4411.patch, OversampleWithDepthTest.java, > OversampleWithDepthTest.java > > > FacetRequest can be set a Depth parameter, which controls the depth of the > result tree to be returned. > When Sampling is enabled (and actually used) the Depth parameter gets reset > to its default (1). 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
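The oversampling idea described above can be sketched as follows; the names are illustrative, not the actual Lucene API. Under sampling, more than K categories are requested (e.g. 3 * K) to improve the odds that the true top-K survive the sample, and the result is then trimmed back to the original K.

```java
import java.util.Arrays;

// Hypothetical sketch of oversampling: widen the request while sampling
// is in effect, then trim the merged result back to the original K.
class OversampleSketch {
    static int oversampledCount(int k, double oversampleFactor) {
        return (int) Math.ceil(k * oversampleFactor);
    }

    // topCounts is assumed to be already sorted by count, descending
    static int[] trimToOriginal(int[] topCounts, int k) {
        return Arrays.copyOf(topCounts, Math.min(k, topCounts.length));
    }
}
```

This also shows why the original request must be kept around: the end result should report the caller's K, not the inflated oversampled count.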
[jira] [Commented] (LUCENE-4411) Depth requested in a facetRequest is reset when Sampling is in effect
[ https://issues.apache.org/jira/browse/LUCENE-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459543#comment-13459543 ] Gilad Barkai commented on LUCENE-4411: -- When sampling is in effect, the original FacetRequest is replaced with a wrapper named OverSampledFacetRequest, which takes into account different sampling-related parameters. This "wrapping" class modifies only a small set of parameters in the request and should otherwise delegate everything to the original one - but it does not. Some of the information that is lost from the original request: SortOrder, SortBy, number of results to label, ResultMode and Depth. A patch will be available shortly. > Depth requested in a facetRequest is reset when Sampling is in effect > - > > Key: LUCENE-4411 > URL: https://issues.apache.org/jira/browse/LUCENE-4411 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6.1, 5.0, 4.0 >Reporter: Gilad Barkai > Attachments: OversampleWithDepthTest.java, > OversampleWithDepthTest.java > > > FacetRequest can be set a Depth parameter, which controls the depth of the > result tree to be returned. > When Sampling is enabled (and actually used) the Depth parameter gets reset > to its default (1). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
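The delegation bug described above can be sketched with simplified, hypothetical classes (the real FacetRequest has final getters and many more settings): a wrapper that overrides the count must explicitly copy every other setting, and forgetting one silently resets it to the default.

```java
// Hypothetical simplification of the wrapping pattern. Without the
// explicit setDepth() copy in the wrapper's constructor, the wrapper
// reports the default depth of 1 -- the reported bug.
class RequestSketch {
    private int depth = 1; // default depth
    private final int numResults;

    RequestSketch(int numResults) { this.numResults = numResults; }

    void setDepth(int depth) { this.depth = depth; }
    int getDepth() { return depth; }
    int getNumResults() { return numResults; }
}

class OverSampledRequestSketch extends RequestSketch {
    OverSampledRequestSketch(RequestSketch original, int oversampledCount) {
        super(oversampledCount);
        // the fix: carry over the original's settings
        setDepth(original.getDepth());
    }
}
```

The same copy-over is needed for every wrapped setting (sort order, sort by, result mode, etc.), which is why a missed one is so easy to introduce.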
[jira] [Updated] (LUCENE-4411) Depth requested in a facetRequest is reset when Sampling is in effect
[ https://issues.apache.org/jira/browse/LUCENE-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4411: - Attachment: OversampleWithDepthTest.java A Test revealing the bug for 3.6 > Depth requested in a facetRequest is reset when Sampling is in effect > - > > Key: LUCENE-4411 > URL: https://issues.apache.org/jira/browse/LUCENE-4411 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6.1, 5.0, 4.0 >Reporter: Gilad Barkai > Attachments: OversampleWithDepthTest.java, > OversampleWithDepthTest.java > > > FacetRequest can be set a Depth parameter, which controls the depth of the > result tree to be returned. > When Sampling is enabled (and actually used) the Depth parameter gets reset > to its default (1). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4411) Depth requested in a facetRequest is reset when Sampling is in effect
[ https://issues.apache.org/jira/browse/LUCENE-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4411: - Attachment: OversampleWithDepthTest.java A Test revealing the bug for Trunk > Depth requested in a facetRequest is reset when Sampling is in effect > - > > Key: LUCENE-4411 > URL: https://issues.apache.org/jira/browse/LUCENE-4411 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6.1, 5.0, 4.0 >Reporter: Gilad Barkai > Attachments: OversampleWithDepthTest.java > > > FacetRequest can be set a Depth parameter, which controls the depth of the > result tree to be returned. > When Sampling is enabled (and actually used) the Depth parameter gets reset > to its default (1). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4411) Depth requested in a facetRequest is reset when Sampling is in effect
Gilad Barkai created LUCENE-4411: Summary: Depth requested in a facetRequest is reset when Sampling is in effect Key: LUCENE-4411 URL: https://issues.apache.org/jira/browse/LUCENE-4411 Project: Lucene - Core Issue Type: Bug Components: modules/facet Affects Versions: 3.6.1, 5.0, 4.0 Reporter: Gilad Barkai Attachments: OversampleWithDepthTest.java FacetRequest can be set a Depth parameter, which controls the depth of the result tree to be returned. When Sampling is enabled (and actually used) the Depth parameter gets reset to its default (1). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4302) Javadoc for facet User Guide does not display because of SAXParseException (Eclipse, Maven)
[ https://issues.apache.org/jira/browse/LUCENE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433404#comment-13433404 ] Gilad Barkai commented on LUCENE-4302: -- I'm with Shai on this - according to w3schools this is actually not a problem. Please see: http://www.w3schools.com/tags/tag_br.asp In HTML, <br> has no closing tag. > Javadoc for facet User Guide does not display because of SAXParseException > (Eclipse, Maven) > --- > > Key: LUCENE-4302 > URL: https://issues.apache.org/jira/browse/LUCENE-4302 > Project: Lucene - Core > Issue Type: Bug > Components: general/javadocs >Affects Versions: 4.0-ALPHA > Environment: Windows 7-64bit/Eclipse Juno (4.2)/Maven m2e > plugin/firefox latest >Reporter: Karl Nicholas >Priority: Minor > Labels: documentation > Original Estimate: 1h > Remaining Estimate: 1h > > I have opened javadoc for Facet API while using Eclipse, which downloaded the > javadocs using Maven m2e plugin. When I click on facet User Guide on the > overview page I get the following exception in FireFox: > http://127.0.0.1:49231/help/nftopic/jar:file:/C:/Users/karl/.m2/repository/org/apache/lucene/lucene-facet/4.0.0-ALPHA/ > lucene-facet-4.0.0-ALPHA-javadoc.jar!/org/apache/lucene/facet/doc-files/userguide.html > An error occured while processing the requested document: > org.xml.sax.SAXParseException; lineNumber: 121; columnNumber: 16; The element > type "br" must be terminated by the matching end-tag "</br>". > at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown > Source) > at > com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown > Source) > The link, or requested document is: > http://127.0.0.1:49231/help/nftopic/jar:file:/C:/Users/karl/.m2/repository/org/apache/lucene/lucene-facet/4.0.0-ALPHA/ > lucene-facet-4.0.0-ALPHA-javadoc.jar!/org/apache/lucene/facet/doc-files/userguide.html -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest
[ https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420144#comment-13420144 ] Gilad Barkai commented on LUCENE-4234: -- Thank you for looking into this and making the appropriate changes. Patch looks good. > Exception when FacetsCollector is used with ScoreFacetRequest > - > > Key: LUCENE-4234 > URL: https://issues.apache.org/jira/browse/LUCENE-4234 > Project: Lucene - Java > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6, 4.0-ALPHA >Reporter: Gilad Barkai >Assignee: Shai Erera > Fix For: 4.0, 5.0, 3.6.2 > > Attachments: LUCENE-4234.patch, LUCENE-4234.patch > > > Aggregating facets with Lucene's Score using FacetsCollector results in an > Exception (assertion when enabled). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest
[ https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420125#comment-13420125 ] Gilad Barkai commented on LUCENE-4234: -- Hi Shai, thank you for reviewing. 1. Merged 2. Using FBS is possible, though it's a larger change than the fix, perhaps handle this in a separate issue? 3. IMHO that's a waste for a large index. While allocating a large array is faster, it takes a lot of memory - probably for no reason. Each query will take longer, but concurrent queries will not hurt the GC that much. If the index is large, concurrent queries might hit the GC hard, perhaps even OOM? I must admit I never measured the reallocation penalty. > Exception when FacetsCollector is used with ScoreFacetRequest > - > > Key: LUCENE-4234 > URL: https://issues.apache.org/jira/browse/LUCENE-4234 > Project: Lucene - Java > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6, 4.0-ALPHA >Reporter: Gilad Barkai >Assignee: Shai Erera > Fix For: 4.0, 5.0, 3.6.2 > > Attachments: LUCENE-4234.patch > > > Aggregating facets with Lucene's Score using FacetsCollector results in an > Exception (assertion when enabled). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
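The reallocation approach defended in point 3 above can be sketched as amortized doubling (illustrative code, not the actual collector): start the scores array small and grow it on demand, so per-query memory is proportional to the number of hits rather than to maxDoc.

```java
import java.util.Arrays;

// Hypothetical sketch: grow-on-demand scores buffer. Doubling keeps the
// amortized cost per added score constant, at the price of occasional
// reallocation -- the trade-off discussed in the comment above.
class ScoresBufferSketch {
    private float[] scores = new float[64]; // small initial capacity
    private int size = 0;

    void add(float score) {
        if (size == scores.length) {
            scores = Arrays.copyOf(scores, scores.length * 2);
        }
        scores[size++] = score;
    }

    int size() { return size; }
    int capacity() { return scores.length; }
}
```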
[jira] [Updated] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest
[ https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4234: - Attachment: LUCENE-4234.patch One patch (to rule them all) instead of the two. No changes were made thus far. > Exception when FacetsCollector is used with ScoreFacetRequest > - > > Key: LUCENE-4234 > URL: https://issues.apache.org/jira/browse/LUCENE-4234 > Project: Lucene - Java > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6, 4.0-ALPHA >Reporter: Gilad Barkai >Assignee: Shai Erera > Fix For: 4.0, 5.0, 3.6.2 > > Attachments: LUCENE-4234.patch > > > Aggregating facets with Lucene's Score using FacetsCollector results in an > Exception (assertion when enabled). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest
[ https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4234: - Attachment: (was: LUCENE-4234-test.patch) > Exception when FacetsCollector is used with ScoreFacetRequest > - > > Key: LUCENE-4234 > URL: https://issues.apache.org/jira/browse/LUCENE-4234 > Project: Lucene - Java > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6, 4.0-ALPHA >Reporter: Gilad Barkai >Assignee: Shai Erera > Fix For: 4.0, 5.0, 3.6.2 > > Attachments: LUCENE-4234.patch > > > Aggregating facets with Lucene's Score using FacetsCollector results in an > Exception (assertion when enabled). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest
[ https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4234: - Attachment: (was: LUCENE-4234-fix.patch) > Exception when FacetsCollector is used with ScoreFacetRequest > - > > Key: LUCENE-4234 > URL: https://issues.apache.org/jira/browse/LUCENE-4234 > Project: Lucene - Java > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6, 4.0-ALPHA >Reporter: Gilad Barkai >Assignee: Shai Erera > Fix For: 4.0, 5.0, 3.6.2 > > Attachments: LUCENE-4234.patch > > > Aggregating facets with Lucene's Score using FacetsCollector results in an > Exception (assertion when enabled). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest
[ https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4234: - Attachment: LUCENE-4234-fix.patch Proposed fix: the code was optimized for a capacity of 1000 documents, but the BitSet was initialized with the same value of 1000, causing .fastSet() to fail. The fix initializes the scores array to a size of 64, and the bitset to .maxDoc() as in the non-score-keeping version. > Exception when FacetsCollector is used with ScoreFacetRequest > - > > Key: LUCENE-4234 > URL: https://issues.apache.org/jira/browse/LUCENE-4234 > Project: Lucene - Java > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6, 4.0-ALPHA >Reporter: Gilad Barkai > Fix For: 3.6.1, 4.0 > > Attachments: LUCENE-4234-fix.patch, LUCENE-4234-test.patch > > > Aggregating facets with Lucene's Score using FacetsCollector results in an > Exception (assertion when enabled). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
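The failure mode can be sketched with a hypothetical stand-in for an unchecked bitset (OpenBitSet's fastSet() performs no bounds check): sized to the expected hit count instead of maxDoc, any docID beyond that bound blows up, which is the ArrayIndexOutOfBoundsException seen in the stack trace.

```java
// Hypothetical stand-in for an unchecked bitset: like fastSet() on a
// real unchecked implementation, set() does no bounds checking, so the
// backing storage must cover the full docID range [0, maxDoc) -- not
// just the expected number of hits.
class UncheckedBitsSketch {
    private final boolean[] bits;

    UncheckedBitsSketch(int numBits) { bits = new boolean[numBits]; }

    void fastSet(int docId) { bits[docId] = true; } // no bounds check
    boolean get(int docId) { return bits[docId]; }
}
```

Sizing the bitset to maxDoc (as the fix does) makes every legal docID safe to set, regardless of how many hits the query produces.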
[jira] [Updated] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest
[ https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-4234: - Attachment: LUCENE-4234-test.patch A test revealing the bug. > Exception when FacetsCollector is used with ScoreFacetRequest > - > > Key: LUCENE-4234 > URL: https://issues.apache.org/jira/browse/LUCENE-4234 > Project: Lucene - Java > Issue Type: Bug > Components: modules/facet >Affects Versions: 3.6, 4.0-ALPHA >Reporter: Gilad Barkai > Fix For: 3.6.1, 4.0 > > Attachments: LUCENE-4234-test.patch > > > Aggregating facets with Lucene's Score using FacetsCollector results in an > Exception (assertion when enabled). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest
Gilad Barkai created LUCENE-4234: Summary: Exception when FacetsCollector is used with ScoreFacetRequest Key: LUCENE-4234 URL: https://issues.apache.org/jira/browse/LUCENE-4234 Project: Lucene - Java Issue Type: Bug Components: modules/facet Affects Versions: 4.0-ALPHA, 3.6 Reporter: Gilad Barkai Fix For: 3.6.1, 4.0 Aggregating facets with Lucene's Score using FacetsCollector results in an Exception (assertion when enabled). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files
[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407086#comment-13407086 ] Gilad Barkai commented on LUCENE-4190: -- {quote} So if we insist on a perfect solution: then fine, the perfect solution I accept is for lucene to totally own the directory, don't put files in there! Then the behavior is clear, no bugs, we delete everything. {quote} But then we're left with the original problem - should a poor user (say, me) accidentally put an index in an already-filled directory (say /tmp), the price to pay is great. Too great IMHO. > IndexWriter deletes non-Lucene files > > > Key: LUCENE-4190 > URL: https://issues.apache.org/jira/browse/LUCENE-4190 > Project: Lucene - Java > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Robert Muir > Fix For: 4.0, 5.0 > > Attachments: LUCENE-4190.patch, LUCENE-4190.patch > > > Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog > post: > http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html > IndexWriter will now (as of 4.0) delete all foreign files from the index > directory. We made this change because Codecs are free to write to any files > now, so the space of filenames is hard to "bound". > But if the user accidentally uses the wrong directory (eg c:/) then we will > in fact delete important stuff. > I think we can at least use some simple criteria (must start with _, maybe > must fit certain pattern eg _(_X).Y), so we are much less likely to > delete a non-Lucene file -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files
[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406877#comment-13406877 ] Gilad Barkai commented on LUCENE-4190: -- Perhaps out of context, but here goes... Users sometimes do stupid things, me included, such as putting the index in a non-dedicated directory. But should they pay such a penalty just to keep the code from getting overly complicated? Codecs create their own files, and no one seems able to control which files they create (other than via asserts?). Then, is it possible for the codec to handle the removal of the files it created? That would make codecs work the same way the index handles the 'core' index files - each codec would be able to erase its own files. Another closely related option - let IW consult the codecs about 'non-core' files and see which ones should/could be removed. I only suggest this because I fear for users' files which might get erased. Disclosure: It'll be ages before I understand Lucene 4 half as much as I do Lucene 3.6 (not that that's much), so forgive me if I stepped on anyone's toes, or just described how to implement a time machine :) > IndexWriter deletes non-Lucene files > > > Key: LUCENE-4190 > URL: https://issues.apache.org/jira/browse/LUCENE-4190 > Project: Lucene - Java > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Robert Muir > Fix For: 4.0, 5.0 > > Attachments: LUCENE-4190.patch, LUCENE-4190.patch > > > Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog > post: > http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html > IndexWriter will now (as of 4.0) delete all foreign files from the index > directory. We made this change because Codecs are free to write to any files > now, so the space of filenames is hard to "bound". > But if the user accidentally uses the wrong directory (eg c:/) then we will > in fact delete important stuff. 
> I think we can at least use some simple criteria (must start with _, maybe > must fit certain pattern eg _(_X).Y), so we are much less likely to > delete a non-Lucene file -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
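One hypothetical shape for the codec-ownership suggestion in the comment above: each codec reports which files it owns, and IndexWriter deletes only files that some codec claims. The interface and all names below are invented for illustration and are not an actual Lucene API:

```java
// Invented interface: a codec declares which files it owns.
interface CodecFileOwnership {
    boolean ownsFile(String fileName);
}

public class OwnershipCheck {
    // A toy codec that owns only files with its own (made-up) extension.
    static final CodecFileOwnership TOY_CODEC = name -> name.endsWith(".toy");

    // IW-side check: delete only if some codec claims the file;
    // unknown files are left alone, protecting users' data.
    public static boolean safeToDelete(String fileName, CodecFileOwnership... codecs) {
        for (CodecFileOwnership codec : codecs) {
            if (codec.ownsFile(fileName)) {
                return true; // a codec claims it, so it is index-owned
            }
        }
        return false; // foreign file: keep it
    }

    public static void main(String[] args) {
        System.out.println(safeToDelete("_0.toy", TOY_CODEC));    // claimed
        System.out.println(safeToDelete("notes.txt", TOY_CODEC)); // foreign
    }
}
```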
[jira] [Commented] (LUCENE-3409) NRT reader/writer over RAMDirectory memory leak
[ https://issues.apache.org/jira/browse/LUCENE-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094388#comment-13094388 ] Gilad Barkai commented on LUCENE-3409: -- This issue is relevant for trunk as well. Please update the Affected versions accordingly. > NRT reader/writer over RAMDirectory memory leak > --- > > Key: LUCENE-3409 > URL: https://issues.apache.org/jira/browse/LUCENE-3409 > Project: Lucene - Java > Issue Type: Bug > Components: core/index >Affects Versions: 3.0.2, 3.3 >Reporter: tal steier > > with NRT reader/writer, emptying an index using: > writer.deleteAll() > writer.commit() > doesn't release all allocated memory. > for example the following code will generate a memory leak: > /** >* Reveals a memory leak in NRT reader/writer >* >* The following main() does 10K cycles of: >* >* Add 10K empty documents to index writer >* commit() >* open NRT reader over the writer, and immediately close it >* delete all documents from the writer >* commit changes to the writer >* >* >* Running with -Xmx256M results in an OOME after ~2600 cycles >*/ > public static void main(String[] args) throws Exception { > RAMDirectory d = new RAMDirectory(); > IndexWriter w = new IndexWriter(d, new > IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer())); > Document doc = new Document(); > > for(int i = 0; i < 10000; i++) { > for(int j = 0; j < 10000; ++j) { > w.addDocument(doc); > } > w.commit(); > IndexReader.open(w, true).close(); > w.deleteAll(); > w.commit(); > } > > w.close(); > d.close(); > } -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
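A generic way to watch for this kind of steady heap growth without waiting for an OOME is to sample used heap every few cycles. The sketch below is Lucene-free; the workload is a stand-in that retains memory each cycle, and the class and method names are invented for this example:

```java
// Generic heap-growth probe: sample used heap periodically inside a loop.
// The byte-array list below deliberately simulates a leak (nothing is
// released), standing in for the index/delete/commit cycle in the report.
public class HeapProbe {
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        java.util.List<byte[]> retained = new java.util.ArrayList<>();
        for (int cycle = 0; cycle < 50; cycle++) {
            retained.add(new byte[64 * 1024]); // simulated leak per cycle
            if (cycle % 10 == 0) {
                System.out.println("cycle " + cycle + ": used=" + usedHeapBytes() + " bytes");
            }
        }
    }
}
```

If the sampled figure climbs monotonically across cycles that should be net-zero (add, then deleteAll and commit), something is being retained.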
[jira] [Updated] (LUCENE-3390) Incorrect sort by Numeric (double) values for documents missing the sorting field
[ https://issues.apache.org/jira/browse/LUCENE-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-3390: - Attachment: SortByDouble.java example code > Incorrect sort by Numeric (double) values for documents missing the sorting > field > - > > Key: LUCENE-3390 > URL: https://issues.apache.org/jira/browse/LUCENE-3390 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Affects Versions: 3.3 >Reporter: Gilad Barkai >Priority: Minor > Labels: double, numeric, sort > Attachments: SortByDouble.java > > > While sorting results over a numeric double field, documents which do not > contain a value for the sorting field seem to get 0 (ZERO) value in the sort. > This behavior is unexpected, as zero is "comparable" to the rest of the > values. A better solution would either be allowing the user to define such a > "non-value" default, or always bring those document results as the last ones. > Example scenario: > Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any > value. > Searching with MatchAllDocsQuery, with sort over that field in descending > order yields the docid results of 0, 2, 1. 
> Example code: > public static void main(String[] args) throws Exception { > RAMDirectory d = new RAMDirectory(); > IndexWriter w = new IndexWriter(d, new > IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer())); > > // 1st doc, value 3.5d > Document doc = new Document(); > doc.add(new NumericField("f", Store.YES, true).setDoubleValue(3.5d)); > w.addDocument(doc); > > // 2nd doc, value of -10d > doc = new Document(); > doc.add(new NumericField("f", Store.YES, true).setDoubleValue(-10d)); > w.addDocument(doc); > > // 3rd doc, no value at all > w.addDocument(new Document()); > w.close(); > IndexSearcher s = new IndexSearcher(d); > Sort sort = new Sort(new SortField("f", SortField.DOUBLE, true)); > TopDocs td = s.search(new MatchAllDocsQuery(), 10, sort); > for (ScoreDoc sd : td.scoreDocs) { > System.out.println(sd.doc + ": " + s.doc(sd.doc).get("f")); > } > s.close(); > d.close(); > } > -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3390) Incorrect sort by Numeric (double) values for documents missing the sorting field
[ https://issues.apache.org/jira/browse/LUCENE-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-3390: - Description: While sorting results over a numeric double field, documents which do not contain a value for the sorting field seem to get 0 (ZERO) value in the sort. This behavior is unexpected, as zero is "comparable" to the rest of the values. A better solution would either be allowing the user to define such a "non-value" default, or always bring those document results as the last ones. Example scenario: Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any value. Searching with MatchAllDocsQuery, with sort over that field in descending order yields the docid results of 0, 2, 1. While the document with the missing value does match the query, I would expect it to come last, as it is not comparable to the other documents. For example, asking for the top 2 documents brings the document without any value, which seems like a bug? was: While sorting results over a numeric double field, documents which do not contain a value for the sorting field seem to get 0 (ZERO) value in the sort. This behavior is unexpected, as zero is "comparable" to the rest of the values. A better solution would either be allowing the user to define such a "non-value" default, or always bring those document results as the last ones. Example scenario: Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any value. Searching with MatchAllDocsQuery, with sort over that field in descending order yields the docid results of 0, 2, 1. 
Example code: public static void main(String[] args) throws Exception { RAMDirectory d = new RAMDirectory(); IndexWriter w = new IndexWriter(d, new IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer())); // 1st doc, value 3.5d Document doc = new Document(); doc.add(new NumericField("f", Store.YES, true).setDoubleValue(3.5d)); w.addDocument(doc); // 2nd doc, value of -10d doc = new Document(); doc.add(new NumericField("f", Store.YES, true).setDoubleValue(-10d)); w.addDocument(doc); // 3rd doc, no value at all w.addDocument(new Document()); w.close(); IndexSearcher s = new IndexSearcher(d); Sort sort = new Sort(new SortField("f", SortField.DOUBLE, true)); TopDocs td = s.search(new MatchAllDocsQuery(), 10, sort); for (ScoreDoc sd : td.scoreDocs) { System.out.println(sd.doc + ": " + s.doc(sd.doc).get("f")); } s.close(); d.close(); } > Incorrect sort by Numeric (double) values for documents missing the sorting > field > - > > Key: LUCENE-3390 > URL: https://issues.apache.org/jira/browse/LUCENE-3390 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Affects Versions: 3.3 >Reporter: Gilad Barkai >Priority: Minor > Labels: double, numeric, sort > Attachments: SortByDouble.java > > > While sorting results over a numeric double field, documents which do not > contain a value for the sorting field seem to get 0 (ZERO) value in the sort. > This behavior is unexpected, as zero is "comparable" to the rest of the > values. A better solution would either be allowing the user to define such a > "non-value" default, or always bring those document results as the last ones. > Example scenario: > Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any > value. > Searching with MatchAllDocsQuery, with sort over that field in descending > order yields the docid results of 0, 2, 1. > While the document with the missing value does match the query, I would > expect it to come last, as it is not comparable to the other documents. 
For > example, asking for the top 2 documents brings the document without any value > which seems like a bug? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3390) Incorrect sort by Numeric values for documents missing the sorting field
[ https://issues.apache.org/jira/browse/LUCENE-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilad Barkai updated LUCENE-3390: - Labels: double float int long numeric sort (was: double numeric sort) Description: While sorting results over a numeric field, documents which do not contain a value for the sorting field seem to get 0 (ZERO) value in the sort. (Tested against Double, Float, Int & Long numeric fields ascending and descending order). This behavior is unexpected, as zero is "comparable" to the rest of the values. A better solution would either be allowing the user to define such a "non-value" default, or always bring those document results as the last ones. Example scenario: Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any value. Searching with MatchAllDocsQuery, with sort over that field in descending order yields the docid results of 0, 2, 1. Asking for the top 2 documents brings the document without any value as the 2nd result - which seems like a bug? was: While sorting results over a numeric double field, documents which do not contain a value for the sorting field seem to get 0 (ZERO) value in the sort. This behavior is unexpected, as zero is "comparable" to the rest of the values. A better solution would either be allowing the user to define such a "non-value" default, or always bring those document results as the last ones. Example scenario: Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any value. Searching with MatchAllDocsQuery, with sort over that field in descending order yields the docid results of 0, 2, 1. While the document with the missing value does match the query, I would expect it to come last, as it is not comparable by the other documents. For example, asking for the top 2 documents brings the document without any value which seems as a bug? 
Summary: Incorrect sort by Numeric values for documents missing the sorting field (was: Incorrect sort by Numeric (double) values for documents missing the sorting field) > Incorrect sort by Numeric values for documents missing the sorting field > > > Key: LUCENE-3390 > URL: https://issues.apache.org/jira/browse/LUCENE-3390 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Affects Versions: 3.3 >Reporter: Gilad Barkai >Priority: Minor > Labels: double, float, int, long, numeric, sort > Attachments: SortByDouble.java > > > While sorting results over a numeric field, documents which do not contain a > value for the sorting field seem to get 0 (ZERO) value in the sort. (Tested > against Double, Float, Int & Long numeric fields ascending and descending > order). > This behavior is unexpected, as zero is "comparable" to the rest of the > values. A better solution would either be allowing the user to define such a > "non-value" default, or always bring those document results as the last ones. > Example scenario: > Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any > value. > Searching with MatchAllDocsQuery, with sort over that field in descending > order yields the docid results of 0, 2, 1. > Asking for the top 2 documents brings the document without any value as the > 2nd result - which seems like a bug? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3390) Incorrect sort by Numeric (double) values for documents missing the sorting field
Incorrect sort by Numeric (double) values for documents missing the sorting field - Key: LUCENE-3390 URL: https://issues.apache.org/jira/browse/LUCENE-3390 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.3 Reporter: Gilad Barkai Priority: Minor While sorting results over a numeric double field, documents which do not contain a value for the sorting field seem to get 0 (ZERO) value in the sort. This behavior is unexpected, as zero is "comparable" to the rest of the values. A better solution would either be allowing the user to define such a "non-value" default, or always bring those document results as the last ones. Example scenario: Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any value. Searching with MatchAllDocsQuery, with sort over that field in descending order yields the docid results of 0, 2, 1. Example code: public static void main(String[] args) throws Exception { RAMDirectory d = new RAMDirectory(); IndexWriter w = new IndexWriter(d, new IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer())); // 1st doc, value 3.5d Document doc = new Document(); doc.add(new NumericField("f", Store.YES, true).setDoubleValue(3.5d)); w.addDocument(doc); // 2nd doc, value of -10d doc = new Document(); doc.add(new NumericField("f", Store.YES, true).setDoubleValue(-10d)); w.addDocument(doc); // 3rd doc, no value at all w.addDocument(new Document()); w.close(); IndexSearcher s = new IndexSearcher(d); Sort sort = new Sort(new SortField("f", SortField.DOUBLE, true)); TopDocs td = s.search(new MatchAllDocsQuery(), 10, sort); for (ScoreDoc sd : td.scoreDocs) { System.out.println(sd.doc + ": " + s.doc(sd.doc).get("f")); } s.close(); d.close(); } -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
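The "bring missing-value documents last" behavior the description asks for can be illustrated in plain Java (no Lucene involved) by modeling a missing value as null and sorting nulls last, here in descending order as in the example scenario. The class and method names are invented for this sketch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class MissingLastSort {
    // Sort descending, but place null (missing) values last - the behavior
    // the issue requests instead of silently treating missing as 0.
    public static List<Double> sortDescMissingLast(List<Double> values) {
        List<Double> copy = new ArrayList<>(values);
        copy.sort(Comparator.nullsLast(Comparator.<Double>reverseOrder()));
        return copy;
    }

    public static void main(String[] args) {
        // Mirrors the scenario: 3.5d, -10d, and one document with no value.
        List<Double> vals = Arrays.asList(3.5, -10.0, null);
        System.out.println(sortDescMissingLast(vals)); // [3.5, -10.0, null]
    }
}
```

With this ordering, a top-2 request would return 3.5 and -10.0, and the valueless document would never displace a real value - which is the outcome the reporter expected.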